All of johnswentworth's Comments + Replies

General alignment plus human values, or alignment via human values?

Under my current understanding of the "general alignment" angle, a core argument goes like:

  • We need some way for agents to create aligned successor agents, so our AI doesn't succumb to value drift. This is a thing we need regardless, assuming that AIs will design successively-more-powerful descendants.
  • If the successor-design process is sufficiently-general-purpose, a human could use that same process to design the "seed" AI in the first place.

I don't necessarily think this is the best framing, and I don't necessarily agree with it (e.g. whether the agent ha... (read more)

4Stuart Armstrong9hThe successor problem is important, but it assumes we have the values already. I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).
Emergent modularity and safety

Perhaps a good way to summarize all this is something like "qualitatively similar models probably work well for brains and neural networks". I agree to a large extent with that claim (though there was a time when I would have agreed much less), and I think that's the main thing you need for the rest of the post.

"Ways we understand" comes across as more general than that - e.g. we understand via experimentally probing physical neurons vs spectral clustering of a derivative matrix.

Emergent modularity and safety

I agree that the phrasing could be better; any suggestions?

I actually think you could just drop that intro altogether, or move it later into the post. We do have pretty good evidence of modularity in the brain (as well as other biological systems) and in trained neural nets; it seems to be a pretty common property of large systems "evolved" by local optimization. And the rest of the post (as well as some of the other comments) does a good job of talking about some of that evidence. It's a good post, and I think the arguments later in the post are stronger ... (read more)

4Richard Ngo4dThanks, that's helpful. I do think there's a weak version of this which is an important background assumption for the post (e.g. without that assumption I'd need to explain the specific ways in which ANNs and BNNs are similar), so I've now edited the opening lines to convey that weak version instead. (I still believe the original version but agree that it's not worth defending here.)
Emergent modularity and safety

Our default expectation about large neural networks should be that we will understand them in roughly the same ways that we understand biological brains, except where we have specific reasons to think otherwise.

Why would that be our default expectation? We don't have direct access to all of the underlying parameters in the brain. We can't even simulate it yet, let alone take a gradient.

5Richard Ngo4dLots of reasons. Neural networks are modelled after brains. They both form distributed representations at very large scales, they both learn over time, etc etc. Sure, you've pointed out a few differences, but the similarities are so great that this should be the main anchor for our expectations (rather than, say, thinking that we'll understand NNs the same way we understand support vector machines, or the same way we understand tree search algorithms, or...).
3Daniel_Eth4dThe statement seems almost tautological – couldn't we somewhat similarly claim that we'll understand NNs in roughly the same ways that we understand houses, except where we have reasons to think otherwise? The "except where we have reasons to think otherwise" bit seems to be doing a lot of work.
Epistemic Strategies of Selection Theorems

Two things I'd especially like to highlight in this post:

Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.

This is possibly the best one-sentence summary I've seen of how these sorts of theorems would be useful.

One corollary of recovering (some of) the usual science-and-engineering strategies is that selection theorems would open the door to a lot of empirical work on al... (read more)

Selection Theorems: A Program For Understanding Agents

I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior

I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximation... (read more)

4Rohin Shah14dEdited to and (For the second one, that's one of the reasons why I had the weasel word "could", but on reflection it's worth calling out explicitly given I mention it in the previous sentence.)
Selection Theorems: A Program For Understanding Agents

The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need different assumptions, because some of the currently standard assumptions are wrong - like utility functions.)

That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we al... (read more)

Selection Theorems: A Program For Understanding Agents

A few comments...

Selection theorems are helpful because they tell us likely properties of the agents we build.

What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):

  • Properties of humans as agents (e.g. "human values")
  • Properties of agents which we intentionally aim for (e.g. what kind of architectural features are likely to be viable)
  • Properties of agents which we accidentally aim for (e.g. inner agency issues)

Of these, I expect the first to be most important, followed by the last, although this depends on the relative ... (read more)

4Rohin Shah15dThanks for this and the response to my other comment, I understand where you're coming from a lot better now. (Really I should have figured it out myself, on the basis of this post [] .) New summary: New opinion:
Selection Theorems: A Program For Understanding Agents

The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.

An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimo... (read more)

1Charlie Steiner25dThis might be related to the notion that if we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled "world model" in the code, and others are labeled "preferences", and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn't need to respect our preconceptions. What the model really "wants" to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters in a way that bypasses the simplifications we're trying to force on it. For this problem, which might not be what you're talking about, I think a lot of the solution is algorithmic information theory. Trying to specify neat, human-legible parts for your model (despite not being able to train the parts separately) is kind of like choosing a universal Turing machine made of human-legible parts. In the limit of big powerfulness, the Solomonoff inductor will throw off your puny shackles and simulate the world in a highly accurate (and therefore non human-legible) way. The solution is not better shackles, it's an inference method that trades off between model complexity and error in a different way. (P.S.: I think there is an "obvious" way to do that, and it's MML learning with some time constant used to turn error rates into total discounted error, which can be summed with model complexity.)
Selection Theorems: A Program For Understanding Agents

Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.

2Evan Hubinger1moThanks! I hope the post is helpful to you or anyone else trying to think about the type signatures of goals. It's definitely a topic I'm pretty interested in.
Selection Theorems: A Program For Understanding Agents

Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to interface with our models in any other way than querying for probabilities over the low-level state of the system.

1Charlie Steiner25dRight. I think I'm more of the opinion that we'll end up choosing those interfaces via desiderata that apply more directly to the interface (like "we want to be able to compare two models' ratings of the same possible future"), rather than indirect desiderata on "how a practical agent should look" that we keep adding to until an interface pops out.
Selection Theorems: A Program For Understanding Agents

You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question.

In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theo... (read more)

1Charlie Steiner1moWe can imagine modeling humans in purely psychological ways with no biological inspiration, so I think you're saying that you want to look at the "natural constraints" on representations / processes, and then in a sense generalize or over-charge those constraints to narrow down model choices?
Testing The Natural Abstraction Hypothesis: Project Update

The #P-complete problem is to calculate the distribution of some variables in a Bayes net given some other variables in the Bayes net, without any particular restrictions on the net or on the variables chosen.

Formal statement of the Telephone Theorem: We have a sequence of Markov blankets forming a Markov chain . Then in the limit  mediates the interaction between  and  (i.e. the distribution factors according to ), for some  satisfying

with probabi... (read more)

Information At A Distance Is Mediated By Deterministic Constraints

More like: exponential family distributions are a universal property of information-at-a-distance in large complex systems. So, we can use exponential models without any loss of generality when working with information-at-a-distance in large complex systems.

That's what I hope to show, anyway.

Information At A Distance Is Mediated By Deterministic Constraints

Yup, that's the direction I want. If the distributions are exponential family, then that dramatically narrows down the space of distributions which need to be represented in order to represent abstractions in general. That means much simpler data structures - e.g. feature functions and Lagrange multipliers, rather than whole distributions.

1Vanessa Kosoy1moSo, your thesis is, only exponential models give rise to nice abstractions? And, since it's important to have abstractions, we might just as well have our agents reason exclusively in terms of exponential models?
Information At A Distance Is Mediated By Deterministic Constraints

Roughly speaking, the generalized KPD says that if the long-range correlations are low dimensional, then the whole distribution is exponential family (modulo a few "exceptional" variables). The theorem doesn't rule out the possibility of high-dimensional correlations, but it narrows down the possible forms a lot if we can rule out high-dimensional correlations some other way. That's what I'm hoping for: some simple/common conditions which limit the dimension of the long-range correlations, so that gKPD can apply.

This post says that those long range correla... (read more)

1Vanessa Kosoy1moI'm still confused. What direction of GKPD do you want to use? It sounds like you want to use the low-dimensional statistic => exponential family direction. Why? What is good about some family being exponential?
The alignment problem in different capability regimes

Claim: the core of the alignment problem is conserved across capability levels. If a particular issue only occurs at a particular capability level, then the issue is usually "not really about alignment" in some sense.

Roughly speaking, if I ask a system for something, and then the result is not really what I wanted, but the system "could have" given the result I wanted in some sense, then that's an alignment problem regardless of whether the system is a superintelligent AI or google maps. Whether it's a simple system with a bad user interface, or a giant ML... (read more)

3Buck Shlegeris2moI think this is a reasonable definition of alignment, but it's not the one everyone uses. I also think that for reasons like the "ability to understand itself" thing, there are pretty interesting differences in the alignment problem as you're defining it between capability levels.
Information At A Distance Is Mediated By Deterministic Constraints

We can still view these as travelling through many layers - the light waves have to propagate through many lightyears of mostly-empty space (and it could attenuate or hit things along the way). The photo has to last many years (and could randomly degrade a little or be destroyed at any moment along the way).

What makes it feel like "one hop" intuitively is that the information is basically-perfectly conserved at each "step" through spacetime, and there's in a symmetry in how the information is represented.

Welcome & FAQ!

I recommend that the title make it clearer that non-members can now submit alignment forum content for review, since this post is cross-posted on LW.

1Ruben Bloom2moYou're right. Maybe worth the extra words for now.
Knowledge is not just precipitation of action

Here's a similarly-motivated model which I have found useful for the knowledge of economic agents.

Rather than imagining that agents choose their actions as a function of their information (which is the usual picture), imagine that agents can choose their action for every world-state. For instance, if I'm a medieval smith, I might want to treat my iron differently depending on its composition.

In economic models, it's normal to include lots of constraints on agents' choices - like a budget constraint, or a constraint that our medieval smith cannot produce mo... (read more)

Traps of Formalization in Deconfusion

Instead of capturing the intuitions present in our confused understanding, John proposes to start with one of the applications and only focus on formalizing the concept for this specific purpose. [...] A later step is to attempt unification of the different formalization for the many applications.

Important clarification here: in order for this to work well, it is necessary to try multiple different use-cases and then unify. This is not a "start with one and then maybe get around to others in the indefinite future" sort of thing. I generally do not expect t... (read more)

Refactoring Alignment (attempt #2)

For a while, I've thought that the strategy of "split the problem into a complete set of necessary sub-goals" is incomplete. It produces problem factorizations, but it's not sufficient to produce good problem factorizations - it usually won't cut reality at clean joints. That was my main concern with Evan's factorization, and it also applies to all of these, but I couldn't quite put my finger on what the problem was.

I think I can explain it now: when I say I want a factorization of alignment to "cut reality at the joints", I think what I mean is that each ... (read more)

6Abram Demski3moI think there's another reason why factorization can be useful here, which is the articulation of sub-problems to try. For example, in the process leading up to inventing logical induction, Scott came up with a bunch of smaller properties to try for. He invented systems which got desirable properties individually, then growing combinations of desirable properties, and finally, figured out how to get everything at once. However, logical induction doesn't have parts corresponding to those different subproblems. It can be very useful to individually achieve, say, objective robustness, even if your solution doesn't fit with anyone else's solutions to any of the other sub-problems. It shows us a way to do it, which can inspire other ways to do it. In other words: tackling the whole alignment problem at once sounds too hard. It's useful to split it up, even if our factorization doesn't guarantee that we can stick pieces back together to get a whole solution. Though, yeah, it's obviously better if we can create a factorization of the sort you want.
Utility Maximization = Description Length Minimization

The construction is correct.

Note that for , conceptually we don't need to modify it, we just need to use the original  but apply it only to the subcomponents of the new -variable which correspond to the original -variable. Alternatively, we can take the approach you do: construct  which has a distribution over the new , but "doesn't say anything" about the new components, i.e. the it's just maxentropic over the new components. This is equivalent to ignoring the new components altogether.

3Edouard Harris3moAh yes, that's right. Yeah, I just wanted to make this part fully explicit to confirm my understanding. But I agree it's equivalent to just letM′2ignore the extraX′0(or whatever) component. Thanks very much!
Search-in-Territory vs Search-in-Map

It seems to me maybe the interesting thing is whether you can talk about a search algorithm in terms of particular kinds of abstractions rather than anything else, which if you go far enough around comes back to your position, but with more explained.


Reward Is Not Enough

Good explanation, conceptually.

Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.

1Steve Byrnes4moUm, unreliably, at least by default. Like, some humans are hedonists, others aren't. I think there's a "hardcoded" credit assignment algorithm. When there's a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I'm not sure of the gory details here.) Anyway, insofar as "the reward signal itself" is part of the world-model, it's possible that reward-prediction / value will wind up attached to that concept. And then that's a desire to wirehead. But it's not inevitable. Some of the relevant dynamics are: * Timing—if credit goes mainly to signals that slightly precede the reward prediction error, then the reward signal itself is not a great fit. * Explaining away—once you have a way to accurately predict some set of reward signals, it makes the reward prediction errors go away, so the credit assignment algorithm stops running for those signals. So the first good reward-predicting model gets to stick around by default. Example: we learn early in life that the "eating candy" concept predicts certain reward signals, and then we get older and learn that the "certain neural signals in my brain" concept predicts those same reward signals too. But just learning that fact doesn't automatically translate into "I really want those certain neural signals in my brain". Only the credit assignment algorithm can make a thought appealing, and if the rewards are already being predicted then the credit assignment algorithm is inactive. (This is kinda like the behaviorism concept of blocking [].) * There may be some kind of bias to assign credit to predictive models that are simple functions of sensory inputs, when such
Reward Is Not Enough

Nice post!

I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:

  • The argument from Why Subagents?. This is about utility maximizers rather than reward maximizers, but it points in a similar qualitative direction. Summary: once we allow internal state, utility-maximizers are not the only inexploitable systems; markets/committees of utility-maximizers also work.
  • The argument from Fixing The Good Regulator Theorem. That post uses some incoming informatio
... (read more)
3Steve Byrnes4moThanks! I had totally forgotten about your subagents post. I've been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment [] and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize "the reward as currently understood by my predictive model". Then "the reward as currently understood by my predictive model" is basically a utility function. But at the same time, there's a separate subroutine that edits the reward prediction model (≈ utility function) to ever more closely approximate the true reward function (by some learning algorithm, presumably involving reward prediction errors). In other words: At any given time, the part of the agent that's making plans and taking actions looks like a utility maximizer. But if you lump together that part plus the subroutine that keeps editing the reward prediction model to better approximate the real reward signal, then that whole system is a reward-maximizing RL agent. Please tell me if that makes any sense or not; I've been planning to write pretty much exactly this comment (but with a diagram) into a short post.
Testing The Natural Abstraction Hypothesis: Project Intro

If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But also my intuition agrees with yours, this system doesn't seem like it should be chaotic...

Abstraction Talk

Heads up, there's a lot of use of visuals - drawing, gesturing at things, etc - so a useful transcript may take some work.

1Ben Pace5moYeah, we can have a try and see whether it ends up being worth publishing.
Testing The Natural Abstraction Hypothesis: Project Intro


A couple notes:

  • Make sure to check that the values in the jacobian aren't exploding - i.e. there's not values like 1e30 or 1e200 or anything like that. Exponentially large values in the jacobian probably mean the system is chaotic.
  • If you want to avoid explicitly computing the jacobian, write a method which takes in a (constant) vector  and uses backpropagation to return . This is the same as the time-0-to-time-t jacobian dotted with , but it operates on size-n vectors rather than n-by-n jacobian matrices, so should be a l
... (read more)

Another little update, speed issue solved for now by adding SymPy's fortran wrappers to the derivative calculations - calculating the SVD isn't (yet?) the bottleneck. Can now quickly get results from 1,000+ step simulations of 100s of particles. 

Unfortunately, even for the pretty stable configuration below, the values are indeed exploding. I need to go back through the program and double check the logic but I don't think it should be chaotic, if anything I would expect the values to hit zero.

It might be that there's some kind of quasi-chaotic behaviou... (read more)

SGD's Bias

My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, th... (read more)

SGD's Bias

I'm still wrapping my head around this myself, so this comment is quite useful.

Here's a different way to set up the model, where the phenomenon is more obvious.

Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state , it's equally likely to transition to  or ). The rate  at which the system leaves state  serves a role analogous to the... (read more)

3Steve Byrnes5moThat makes sense. Now it's coming back to me: you zoom your microscope into one tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they're carrying the temperature from their last collision. Whereas in uniform temperature, there's "detailed balance" (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution). Thinking about the diode-resistor thing more, I suspect it would be a waste-of-time nerd-sniping rabbit hole, especially because of the time-domain aspects (the fluctuations are faster on one side vs the other I think) which don't have any analogue in SGD. Sorry to have brought it up.
SGD's Bias

does it represent a bias towards less variance over the different gradients one can sample at a given point?

Yup, exactly.

Formal Inner Alignment, Prospectus

This is a good summary.

I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:

  • It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
  • The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and maligness of the universal
... (read more)
2Abram Demski5moCould you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply. Well, it still seems like a good name to me, so I'm curious what you are thinking here. What name would communicate better? Again, I need more unpacking to be able to say much (or update much). Well, the optimization-under-uncertainty is an attempt to make a frame which can contain both, so this isn't necessarily a problem... but I am curious what feels non-tight about inner agency. I still agree with the hypothetical me making the opposite point ;p The problem is that certain things are being conflated, so both "uncertainty can't be separated from goals" and "uncertainty can be separated from goals" have true interpretations. (I have those interpretations clear in my head, but communication is hard.) OK, so. My sense of our remaining disagreement... We agree that the pointers/uncertainty could be factored (at least informally -- currently waiting on any formalism). You think "optimization under uncertainty" is doing something different, and I think it's doing something close. Specifically, I think "optimization under uncertainty" importantly is not necessarily best understood as the standard Bayesian thing where we (1) start with a utility function, (2) provide a prior, so that we can evaluate expected value (and 2.5, update on any evidence), (3) provide a search method, so that we solve the whole thing by searching for the highest-expectation element. Many examples of optimization-under-uncertainty strain this model. Probably the pointer/uncertainty model would do a better job in these cases. But, the Bayesian model is kind of the only one we have, so we can use it provisionally. And when we do so, the approximation of pointer-vs-uncertainty that comes out is: Pointer: The utility function. Uncertainty: The search plus the prior, which in practice can blend together into "inductive bias". This isn't perfect, by any me
Understanding the Lottery Ticket Hypothesis

Picture a linear approximation, like this:

The tangent space at point  is that whole line labelled "tangent".

The main difference between the tangent space and the space of neural-networks-for-which-the-weights-are-very-close is that the tangent space extrapolates the linear approximation indefinitely; it's not just limited to the region near the original point. (In practice, though, that difference does not actually matter much, at least for the problem at hand - we do stay close to the original point.)

The reason we want to talk about "the tangen... (read more)

Understanding the Lottery Ticket Hypothesis

I don’t yet understand this proposal. In what way do we decompose this parameter tangent space into "lottery tickets"? Are the lottery tickets the cross product of subnetworks and points in the parameter tangent space? The subnetworks alone? If the latter then how does this differ from the original lottery ticket hypothesis?

The tangent space version is meant to be a fairly large departure from the original LTH; subnetworks are no longer particularly relevant at all.

We can imagine a space of generalized-lottery-ticket-hypotheses, in which the common theme i... (read more)

2Daniel Kokotajlo5moI confess I don't really understand what a tangent space is, even after reading the wiki article on the subject. It sounds like it's something like this: Take a particular neural network. Consider the "space" of possible neural networks that are extremely similar to it, i.e. they have all the same parameters but the weights are slightly different, for some definition of "slightly." That's the tangent space. Is this correct? What am I missing?
Formal Inner Alignment, Prospectus

I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty... (read more)

Trying to lay this disagreement out plainly:

According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).

According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don... (read more)

4Abram Demski5moThe way I'm currently thinking of things, I would say the reverse also applies in this case. We can turn optimization-under-uncertainty into well-defined optimization by assuming a prior. The outer alignment problem (in your sense) involves getting the prior right. Getting the prior right is part of "figuring out what we want". But this is precisely the source of the inner alignment problems in the paul/evan sense: Paul was pointing out a previously neglected issue about the Solomonoff prior, and Evan is talking about inductive biases of machine learning algorithms (which is sort of like the combination of a prior and imperfect search). So both you and Evan and Paul are agreeing that there's this problem with the prior (/ inductive biases). It is distinct from other outer alignment problems (because we can, to a large extent, factor the problem of specifying an expected value calculation into the problem of specifying probabilities and the problem of specifying a value function / utility function / etc). Everyone would seem to agree that this part of the problem needs to be solved. The disagreement is just about whether to classify this part as "inner" and/or "outer". What is this problem like? Well, it's broadly a quality-of-prior problem, but it has a different character from other quality-of-prior problems. For the most part, the quality of priors can be understood by thinking about average error being low, or mistakes becoming infrequent, etc. However, here, this kind of thinking isn't sufficient: we are concerned with rare but catastrophic errors. Thinking about these things, we find ourselves thinking in terms of "agents inside the prior" (or agents being favored by the inductive biases). To what extent "agents in the prior" should be lumped together with "agents in imperfect search", I am not sure. But the term "inner optimizer" seems relevant. A good example of optimization-under-uncertainty that doesn't look like that (at least, not overtly) is most ap
Formal Inner Alignment, Prospectus

I'll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.

Example: Dr Nefarious

Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents' models. We have a model which knows of Dr Nefarious' existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already faile... (read more)

5Adam Shimi5moWhile I agree that outer objective, training data and prior should be considered together, I disagree that it makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure through a smart choice of these three parameters that there is only one global optimum, only "bad" (meaning high loss) local minima, and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears. But answering "what do we even want?" at this level of precision seems basically impossible. I expect that it's pretty much equivalent to specifying exactly the result we want, which we are quite unable to do in general. So my perspective is that the inner alignment problem appears because of inherent limits into our outer alignment capabilities. And that in realistic settings where we cannot rule out multiple very good local minima, the sort of reasoning underpinning the inner alignment discussion is the best approach we have to address such problems. That being said, I'm not sure how this view interacts with yours or Evan's, or if this is a very standard use of the terms. But since that's part of the discussion Abram is pushing, here is how I use these terms.
3Steve Byrnes5moHm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment". The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that's not any kind of alignment problem, it's a defense-against-adversaries problem. Then I would take that notion and extend it by saying "yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I'm not convinced that this would really be a problem in practice, but I dunno, I haven't thought about it much. Anyway, I would propose that the procedure for defense against adversaries in general is: (1) shelter an AGI from adversaries early in training, until it's reasonably intelligent and aligned, and then (2) trust the AGI to defend itself. I'm not sure we can do any better than that. In particular, I imagine an intelligent and self-aware AGI that's aligned in trying to help me would deliberately avoid imagining an adversarial superintelligence that can acausally hijack its goals! That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don't think that's really a special problem we need to think about.
7Abram Demski5moSo, I think I could write a much longer response to this (perhaps another post), but I'm more or less not persuaded that problems should be cut up the way you say. As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn't be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly "the internal objective doesn't match the external objective" and outer alignment problems are roughly "the outer objective doesn't meet our needs/goals", then there's no reason why these have to be mutually exclusive categories. In particular, Dr. Nefarious problems can be both. But more importantly, I don't entirely buy your notion of "optimization". This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between "optimization" and "optimization under uncertainty". Optimization under uncertainty is not optimization -- that is, it is not optimization of the type you're describing, where you have a well-defined objective which you're simply feeding to a search. Given a prior, you can reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn't the case). But that doesn't mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other. Your notion of the inner alignment problem applies only to optimization. Evan's notion of inner alignment applies (only!) to optimization under uncertainty.

This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (ie, the dr nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment! Objective quality vs search quality is a nice dividing line, but, doesn't cluster together the problems people have been trying to cluster together.

Parsing Chris Mingard on Neural Networks

Yeah, I don't think there was anything particularly special about the complexity measure they used, and I wouldn't be surprised if some other measures did as-well-or-better at predicting which functions fill large chunks of parameter space.

Parsing Chris Mingard on Neural Networks

Minor note: I think "mappings with low Kolmogorov complexity occupy larger volumes" is an extremely misleading way to explain what the evidence points to here. (Yes, I know Chris used that language in his blog post, but he really should not have.) The intuition of "simple mappings occupy larger volumes" is an accurate summary, but the relevant complexity measure is very much not Kolmogorov complexity. It is a measure which does not account for every computable way of compressing things, only some specific ways of compressing things.

In a sense, the results ... (read more)

4Alex Flint6moThanks for this clarification John. But did Mingard et al show that there is some specific practical complexity measure that explains the size of the volumes occupied in parameter space better than alternative practical complexity measures? If so then think we would have uncovered an even more detailed understanding of which mappings occupy large volumes in parameter space, and since neural nets just generally work so well in the world, we could say that we know something about what kind of compression is relevant to our particular world. But if they just used some practical complexity measure as a rough proxy for Kolmogorov complexity then it leaves open the door for mappings occupying large volumes in parameter space to be simple according to many more ways of compressing things than were used in their particular complexity measure.
FAQ: Advice for AI Alignment Researchers

I still don’t know much about GANs, Gaussian processes, Bayesian neural nets, and convex optimization.

FWIW, I'd recommend investing the time to learn about convex optimization even if you don't think you need it yet. Unlike the other topics on this list, the ideas from convex optimization are relevant to a wide variety of things the name does not suggest. Some examples:

  • how to think about optimization in high dimensions (even for non-convex functions)
  • using constraints as a more-gearsy way of representing relevant information (as opposed to e.g. just wrappin
... (read more)
4Rohin Shah6moYeah that's definitely the one on the list that I think would be most useful. I may also be understating how much I know about it; I've picked up some over time, e.g. linear programming, minimax, some kinds of duality, mirror descent, Newton's method.
NTK/GP Models of Neural Nets Can't Learn Features

They wouldn't be random linear combinations, so the central limit theorem estimate wouldn't directly apply. E.g. this circuit transparency work basically ran PCA on activations. It's not immediately obvious to me what the right big-O estimate would be, but intuitively, I'd expect the PCA to pick out exactly those components dominated by change in activations - since those will be the components which involve large correlations in the activation patterns across data points (at least that's my intuition).

I think this claim is basically wrong:

And to the exten

... (read more)
5interstice6moHmm, so regarding the linear combinations, it's true that there are some linear combinations that will change by Θ(1) in the large-width limit -- just use the vector of partial derivatives of the output at some particular input, this sum will change by the amount that the output function moves during the regression. Indeed, I suspect(but don't have a proof) that these particular combinations will span the space of linear combinations that change non-trivially during training. I would dispute "we expect most linear combinations to change" though -- the CLT argument implies that we should expect almost all combinations to not appreciably change. Not sure what effect this would have on the PCA and still think it's plausible that it doesn't change at all(actually, I think Greg Yang states that it doesn't change in section 9 of his paper, haven't read that part super carefully though) So I think I was a bit careless in saying that the NTK can't do transfer learning at all -- a more exact statement might be "the NTK does the minimal amount of transfer learning possible". What I mean by this is, any learning algorithm can do transfer learning if the task we are 'transferring' to is sufficiently similar to the original task -- for instance, if it's just the exact same task but with a different data sample. I claim that the 'transfer learning' the NTK does is of this sort. As you say, since the tangent kernel doesn't change at all, the net effect is to move where the network starts in the tangent space. Disregarding convergence speed, the impact this has on generalization is determined by the values set by the old function on axes of the NTK outside of the span of the partial derivatives at the new function's data points. This means that, for the NTK to transfer anything from one task to another, it's not enough for the tasks to both feature, for instance, eyes. It's that the eyes have to correlate with the output in the exact same way in both tasks. Indeed, the transfer l
NTK/GP Models of Neural Nets Can't Learn Features

Alright, I buy the argument on page 15 of the original NTK paper.

I'm still very skeptical of the interpretation of this as "NTK models can't learn features". In general, when someone proves some interesting result which seems to contradict some combination of empirical results, my default assumption is that the proven result is being interpreted incorrectly, so I have a high prior that that's what's happening here. In this case, it could be that e.g. the "features" relevant to things like transfer learning are not individual neuron activations - e.g. IIRC ... (read more)

5interstice6moI don't think taking linear combinations will help, because adding terms to the linear combination will also increase the magnitude of the original activation vector -- e.g. if you add together N12 units, the magnitude of the sum of their original activations will with high probability be Θ(N14), dwarfing the O(1) change due to change in the activations. But regardless, it can't help with transfer learning at all, since the tangent kernel(which determines learning in this regime) doesn't change by definition. What empirical results do you think are being contradicted? As far as I can tell, the empirical results we have are 'NTK/GP have similar performance to neural nets on some, but not all, tasks'. I don't think transfer/feature learning is addressed at all. You might say these results are suggestive evidence that NTK/GP captures everything important about neural nets, but this is precisely what is being disputed with the transfer learning arguments. I can imagine doing an experiment where we find the 'empirical tangent kernel' of some finite neural net at initialization, solve the linear system, and then analyze the activations of the resulting network. But it's worth noting that this is not what is usually meant by 'NTK' -- that usually includes taking the infinite-width limit at the same time. And to the extent that we expect the activations to change at all, we no longer have reason to think that this linear system is a good approximation of SGD. That's what the above mathematical results mean -- the same mathematical analysis that implies that network training is like solving a linear system, also implies that the activations don't change at all.
NTK/GP Models of Neural Nets Can't Learn Features

Ok, that's at least a plausible argument, although there are some big loopholes. Main problem which jumps out to me: what happens after one step of backprop is not the relevant question. One step of backprop is not enough to solve a set of linear equations (i.e. to achieve perfect prediction on the training set); the relevant question is what happens after one step of Newton's method, or after enough steps of gradient descent to achieve convergence.

What would convince me more is an empirical result - i.e. looking at the internals of an actual NTK model, tr... (read more)

6interstice6moThe result that NTK does not learn features in the large N limit is not in dispute at all -- it's right there on page 15 of the original NTK paper [], and indeed holds after arbitrarily many steps of backprop. I don't think that there's really much room for loopholes in the math here. See Greg Yang's paper for a lengthy proof that this holds for all architectures. Also worth noting that when people 'take the NTK limit' they often don't initialize an actual net at all, they instead use analytical expressions for what the inner product of the gradients would be at N=infinity to compute the kernel directly.
NTK/GP Models of Neural Nets Can't Learn Features

During training on the 'old task', NTK stays in the 'tangent space' of the network's initialization. This means that, to first order, none of the functions/derivatives computed by the individual neurons change at all, only the output function does.

Eh? Why does this follow? Derivatives make sense; the derivatives staying approximately-constant is one of the assumptions underlying NTK to begin with. But the functions computed by individual neurons should be able to change for exactly the same reason the output function changes, assuming the network has more than one layer. What am I missing here?

5interstice6moThe asymmetry between the output function and the intermediate neuron functions comes from backprop -- from the fact that the gradients are backprop-ed through weight matrices with entries of magnitude O(N−12). So the gradient of the output w.r.t itself is obviously 1, then the gradient of the output w.r.t each neuron in the preceding layer is O(N−12), since you're just multiplying by a vector with those entries. Then by induction all other preceding layers' gradients are the sum of N random things of size O(1/N), and so are of size O(N−12) again. So taking a step of backprop will change the output function by O(1) but the intermediate functions by O(N−12), vanishing in the large-width limit. (This is kind of an oversimplification since it is possible to have changing intermediate functions while doing backprop, as mentioned in the linked paper. But this is the essence of why it's possible in some limits to move around using backprop without changing the intermediate neurons)
Updating the Lottery Ticket Hypothesis


Here's intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there's multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.

3Matthew "Vaniver" Graves6moThat seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically "what's your relative contribution to good subcircuits vs. bad subcircuits?"
Updating the Lottery Ticket Hypothesis

Oh that's definitely interesting. Thanks for the link.

Updating the Lottery Ticket Hypothesis

Yup, I agree that that quote says something which is probably true, given current evidence. I don't think "picking a winning lottery ticket" is a good way analogy for what that implies, though; see this comment.

6DanielFilan6moI don't know what the referent of 'that quote' is. If you mean the passage I quoted from the original lottery ticket hypothesis ("LTH") paper, then I highly recommend reading a follow-up paper [] which describes how and why it's wrong for large networks. The abstract of the paper I'm citing here: -------------------------------------------------------------------------------- Again, assuming "that" refers to the claim in the original LTH paper, I also don't think it's a good analogy. But by default I think that claim is what "the lottery ticket hypothesis" refers to, given that it's a widely cited paper that has spawned a large number of follow-up works.
Updating the Lottery Ticket Hypothesis

Not quite. The linear expansion isn't just over the parameters associated with one layer, it's over all the parameters in the whole net.

Load More