Emergent modularity and safety

Perhaps a good way to summarize all this is something like "qualitatively similar models probably work well for brains and neural networks". I agree to a large extent with that claim (though there was a time when I would have agreed much less), and I think that's the main thing you need for the rest of the post.

"Ways we understand" comes across as more general than that - e.g. we understand via experimentally probing physical neurons vs spectral clustering of a derivative matrix.

Emergent modularity and safety

I agree that the phrasing could be better; any suggestions?

I actually think you could just drop that intro altogether, or move it later into the post. We do have pretty good evidence of modularity in the brain (as well as other biological systems) and in trained neural nets; it seems to be a pretty common property of large systems "evolved" by local optimization. And the rest of the post (as well as some of the other comments) does a good job of talking about some of that evidence. It's a good post, and I think the arguments later in the post are stronger...

Thanks, that's helpful. I do think there's a weak version of this which is an
important background assumption for the post (e.g. without that assumption I'd
need to explain the specific ways in which ANNs and BNNs are similar), so I've
now edited the opening lines to convey that weak version instead. (I still
believe the original version but agree that it's not worth defending here.)

Emergent modularity and safety

Our default expectation about large neural networks should be that we will understand them in roughly the same ways that we understand biological brains, except where we have specific reasons to think otherwise.

Why would that be our default expectation? We don't have direct access to all of the underlying parameters in the brain. We can't even simulate it yet, let alone take a gradient.

Lots of reasons. Neural networks are modelled after brains. They both form
distributed representations at very large scales, they both learn over time, etc
etc. Sure, you've pointed out a few differences, but the similarities are so
great that this should be the main anchor for our expectations (rather than,
say, thinking that we'll understand NNs the same way we understand support
vector machines, or the same way we understand tree search algorithms, or...).

The statement seems almost tautological – couldn't we somewhat similarly claim
that we'll understand NNs in roughly the same ways that we understand houses,
except where we have reasons to think otherwise? The "except where we have
reasons to think otherwise" bit seems to be doing a lot of work.

Epistemic Strategies of Selection Theorems

Two things I'd especially like to highlight in this post:

Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that

get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.

This is possibly the best one-sentence summary I've seen of how these sorts of theorems would be useful.

One corollary of recovering (some of) the usual science-and-engineering strategies is that selection theorems would open the door to a lot of empirical work on alignment...

Selection Theorems: A Program For Understanding Agents

Cool, looks good.

Selection Theorems: A Program For Understanding Agents

I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior

I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximation...

Edited to
and
(For the second one, that's one of the reasons why I had the weasel word
"could", but on reflection it's worth calling out explicitly given I mention it
in the previous sentence.)

Selection Theorems: A Program For Understanding Agents

The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need *different* assumptions, because some of the currently standard assumptions are wrong - like utility functions.)

That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we al...

Selection Theorems: A Program For Understanding Agents

A few comments...

Selection theorems are helpful because they tell us likely properties of the agents we build.

What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):

- Properties of humans as agents (e.g. "human values")
- Properties of agents which we intentionally aim for (e.g. what kind of architectural features are likely to be viable)
- Properties of agents which we accidentally aim for (e.g. inner agency issues)

Of these, I expect the first to be most important, followed by the last, although this depends on the relative...

Thanks for this and the response to my other comment, I understand where you're
coming from a lot better now. (Really I should have figured it out myself, on
the basis of this post
[https://www.alignmentforum.org/posts/zQZcWkvEA8DLjKR7C/theory-of-ideal-agents-or-of-existing-agents]
.) New summary:
New opinion:

Selection Theorems: A Program For Understanding Agents

The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.

An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimodal...

This might be related to the notion that if we try to dictate the form of a
model ahead of time (i.e. some of the parameters are labeled "world model" in
the code, and others are labeled "preferences", and inference is done by
optimizing the latter over the former), but then just train it to minimize
error, the actual content of the parameters after training doesn't need to
respect our preconceptions. What the model really "wants" to do in the limit of
lots of compute is find a way to encode an accurate simulation of the human in
the parameters in a way that bypasses the simplifications we're trying to force
on it.
For this problem, which might not be what you're talking about, I think a lot of
the solution is algorithmic information theory. Trying to specify neat,
human-legible parts for your model (despite not being able to train the parts
separately) is kind of like choosing a universal Turing machine made of
human-legible parts. In the limit of big powerfulness, the Solomonoff inductor
will throw off your puny shackles and simulate the world in a highly accurate
(and therefore non human-legible) way. The solution is not better shackles, it's
an inference method that trades off between model complexity and error in a
different way.
(P.S.: I think there is an "obvious" way to do that, and it's MML learning with
some time constant used to turn error rates into total discounted error, which
can be summed with model complexity.)
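The P.S. can be made concrete with a small sketch (all numbers and the geometric-discounting scheme here are illustrative assumptions, not from the comment): score each candidate model by its description length plus a time-discounted total error, and pick the minimizer. The point is that a simple model with modest error can beat a maximally accurate but enormous "simulate everything" model, depending on the time constant.

```python
# Sketch of an MML-style tradeoff (hypothetical numbers; the geometric
# discounting derived from a time constant is an assumed formalization).

def discounted_total_error(error_rate, time_constant):
    # Geometric discounting: sum over t of error_rate * gamma^t
    # = error_rate / (1 - gamma), with gamma = 1 - 1/time_constant.
    gamma = 1.0 - 1.0 / time_constant
    return error_rate / (1.0 - gamma)

def mml_score(model_complexity_bits, error_rate, time_constant):
    # Description length of the model plus (discounted) description
    # length of the residual errors; lower is better.
    return model_complexity_bits + discounted_total_error(error_rate, time_constant)

# A simple, somewhat-wrong model vs. a huge near-perfect simulator:
simple = mml_score(model_complexity_bits=100, error_rate=1.0, time_constant=50)
simulator = mml_score(model_complexity_bits=1e6, error_rate=0.01, time_constant=50)
assert simple < simulator  # the tradeoff favors the legible, simple model here
```

With a long enough time constant the balance flips, which is exactly the knob the comment is pointing at: the inference method, not "better shackles", decides which model wins.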

Selection Theorems: A Program For Understanding Agents

Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.

Thanks! I hope the post is helpful to you or anyone else trying to think about
the type signatures of goals. It's definitely a topic I'm pretty interested in.

Selection Theorems: A Program For Understanding Agents

Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to *interface* with our models in any other way than querying for probabilities over the low-level state of the system.

Right. I think I'm more of the opinion that we'll end up choosing those
interfaces via desiderata that apply more directly to the interface (like "we
want to be able to compare two models' ratings of the same possible future"),
rather than indirect desiderata on "how a practical agent should look" that we
keep adding to until an interface pops out.

Selection Theorems: A Program For Understanding Agents

You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question.

In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theorems...

We can imagine modeling humans in purely psychological ways with no biological
inspiration, so I think you're saying that you want to look at the "natural
constraints" on representations / processes, and then in a sense generalize or
over-charge those constraints to narrow down model choices?

Testing The Natural Abstraction Hypothesis: Project Update

The #P-complete problem is to calculate the distribution of some variables in a Bayes net given some other variables in the Bayes net, without any particular restrictions on the net or on the variables chosen.

Formal statement of the Telephone Theorem: we have a sequence of Markov blankets $M_1, M_2, \dots$ forming a Markov chain $X \to M_1 \to M_2 \to \dots$. Then in the limit $t \to \infty$, $F_t(M_t)$ mediates the interaction between $X$ and $M_t$ (i.e. the distribution factors according to $X \to F_t(M_t) \to M_t$), for some $F_t$ satisfying

$$F_t(M_t) = \mathbb{E}\left[F_{t+1}(M_{t+1}) \mid M_t\right]$$

with probability 1...
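The qualitative flavor of the theorem can be checked numerically in a toy setup (the binary-symmetric-channel chain below is an assumed example, not from the post): pass one bit through a chain of noisy channels and watch the mutual information $I(X; M_t)$ fall toward zero. Information about $X$ at each successive blanket is progressively lost unless it is (near-)perfectly conserved.

```python
import numpy as np

def mutual_information(joint):
    # Mutual information (in bits) of a 2D joint distribution table.
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz])).sum())

flip = 0.1  # per-hop bit-flip probability (assumed)
channel = np.array([[1 - flip, flip], [flip, 1 - flip]])

joint = np.diag([0.5, 0.5]) @ channel  # joint distribution of (X, M_1)
infos = []
for _ in range(30):
    infos.append(mutual_information(joint))
    joint = joint @ channel  # one more noisy hop: (X, M_{t+1})

# Data processing inequality: information can only decay along the chain...
assert all(a >= b for a, b in zip(infos, infos[1:]))
# ...and here it decays essentially to zero.
assert infos[-1] < 0.01
```

Nothing in this sketch exhibits the conserved quantity $F_t$ itself; it only shows the complementary half of the picture, that un-conserved information dies off exponentially with distance.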

Information At A Distance Is Mediated By Deterministic Constraints

More like: exponential family distributions are a universal property of information-at-a-distance in large complex systems. So, we can use exponential models without any loss of generality when working with information-at-a-distance in large complex systems.

That's what I hope to show, anyway.

Information At A Distance Is Mediated By Deterministic Constraints

Yup, that's the direction I want. If the distributions are exponential family, then that dramatically narrows down the space of distributions which need to be represented in order to represent abstractions in general. That means much simpler data structures - e.g. feature functions and Lagrange multipliers, rather than whole distributions.
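A minimal sketch of the proposed data structure (the particular features and multipliers below are arbitrary assumptions for illustration): an exponential-family distribution is fully specified by feature functions plus Lagrange multipliers, $p(x) \propto \exp(\lambda \cdot f(x))$, which is far smaller than a full table over states.

```python
import numpy as np

states = np.arange(6)                       # tiny discrete state space
features = np.stack([states, states ** 2])  # feature functions f(x) = (x, x^2)
lambdas = np.array([0.9, -0.2])             # Lagrange multipliers (assumed values)

# p(x) proportional to exp(lambda . f(x)), computed stably:
logits = lambdas @ features
p = np.exp(logits - logits.max())
p /= p.sum()

assert abs(p.sum() - 1.0) < 1e-12
# The representation costs O(#features) numbers, not O(#states):
assert lambdas.size < states.size
```

On a six-state space the saving is trivial, but on an exponentially large state space the gap between "store $\lambda$ and $f$" and "store the whole distribution" is the entire point of the simpler-data-structures claim.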

So, your thesis is, only exponential models give rise to nice abstractions? And,
since it's important to have abstractions, we might just as well have our agents
reason exclusively in terms of exponential models?

Information At A Distance Is Mediated By Deterministic Constraints

Roughly speaking, the generalized KPD says that if the long-range correlations are low dimensional, then the whole distribution is exponential family (modulo a few "exceptional" variables). The theorem doesn't rule out the possibility of high-dimensional correlations, but it narrows down the possible forms a lot *if* we can rule out high-dimensional correlations some other way. That's what I'm hoping for: some simple/common conditions which limit the dimension of the long-range correlations, so that gKPD can apply.

This post says that those long-range correlations...

I'm still confused. What direction of GKPD do you want to use? It sounds like
you want to use the low-dimensional statistic => exponential family direction.
Why? What is good about some family being exponential?

The alignment problem in different capability regimes

Claim: the core of the alignment problem is conserved across capability levels. If a particular issue only occurs at a particular capability level, then the issue is usually "not really about alignment" in some sense.

Roughly speaking, if I ask a system for something, and then the result is not really what I wanted, but the system "could have" given the result I wanted in some sense, then that's an alignment problem regardless of whether the system is a superintelligent AI or google maps. Whether it's a simple system with a bad user interface, or a giant ML...

I think this is a reasonable definition of alignment, but it's not the one
everyone uses.
I also think that for reasons like the "ability to understand itself" thing,
there are pretty interesting differences in the alignment problem as you're
defining it between capability levels.

Information At A Distance Is Mediated By Deterministic Constraints

We can still view these as travelling through many layers - the light waves have to propagate through many lightyears of mostly-empty space (and they could attenuate or hit things along the way). The photo has to last many years (and could randomly degrade a little or be destroyed at any moment along the way).

What makes it feel like "one hop" intuitively is that the information is basically-perfectly conserved at each "step" through spacetime, and there's a symmetry in how the information is represented.

Welcome & FAQ!

I recommend that the title make it clearer that non-members can now submit **alignment forum** content for review, since this post is cross-posted on LW.

You're right. Maybe worth the extra words for now.

Knowledge is not just precipitation of action

Here's a similarly-motivated model which I have found useful for the knowledge of economic agents.

Rather than imagining that agents choose their actions as a function of their information (which is the usual picture), imagine that agents can choose their action for every world-state. For instance, if I'm a medieval smith, I might want to treat my iron differently depending on its composition.

In economic models, it's normal to include lots of constraints on agents' choices - like a budget constraint, or a constraint that our medieval smith cannot produce more...

Traps of Formalization in Deconfusion

Instead of capturing the intuitions present in our confused understanding, John proposes to start with one of the applications and only focus on formalizing the concept for this specific purpose. [...] A later step is to attempt unification of the different formalizations for the many applications.

Important clarification here: in order for this to work well, it is necessary to try multiple different use-cases and then unify. This is not a "start with one and then maybe get around to others in the indefinite future" sort of thing. I generally do not expect t...

Refactoring Alignment (attempt #2)

For a while, I've thought that the strategy of "split the problem into a complete set of necessary sub-goals" is incomplete. It produces problem factorizations, but it's not sufficient to produce *good* problem factorizations - it usually won't cut reality at clean joints. That was my main concern with Evan's factorization, and it also applies to all of these, but I couldn't quite put my finger on what the problem was.

I think I can explain it now: when I say I want a factorization of alignment to "cut reality at the joints", I think what I mean is that each...

I think there's another reason why factorization can be useful here, which is
the articulation of sub-problems to try.
For example, in the process leading up to inventing logical induction, Scott
came up with a bunch of smaller properties to try for. He invented systems which
got desirable properties individually, then growing combinations of desirable
properties, and finally, figured out how to get everything at once. However,
logical induction doesn't have parts corresponding to those different
subproblems.
It can be very useful to individually achieve, say, objective robustness, even
if your solution doesn't fit with anyone else's solutions to any of the other
sub-problems. It shows us a way to do it, which can inspire other ways to do it.
In other words: tackling the whole alignment problem at once sounds too hard.
It's useful to split it up, even if our factorization doesn't guarantee that we
can stick pieces back together to get a whole solution.
Though, yeah, it's obviously better if we can create a factorization of the sort
you want.

Utility Maximization = Description Length Minimization

The construction is correct.

Note that for $M_2$, conceptually we don't need to modify it; we just need to use the original $M_2$ but apply it only to the subcomponents of the new $X$-variable which correspond to the original $X$-variable. Alternatively, we can take the approach you do: construct $M_2'$ which has a distribution over the new $X$, but "doesn't say anything" about the new components, i.e. it's just maxentropic over the new components. This is equivalent to ignoring the new components altogether.

Ah yes, that's right. Yeah, I just wanted to make this part fully explicit to confirm my understanding. But I agree it's equivalent to just let $M_2'$ ignore the extra $X_0'$ (or whatever) component.
Thanks very much!

Search-in-Territory vs Search-in-Map

It seems to me maybe the interesting thing is whether you can talk about a search algorithm in terms of particular kinds of abstractions rather than anything else, which if you go far enough around comes back to your position, but with more explained.

+1

Reward Is Not Enough

Good explanation, conceptually.

Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.

Um, unreliably, at least by default. Like, some humans are hedonists, others
aren't.
I think there's a "hardcoded" credit assignment algorithm. When there's a reward
prediction error, that algorithm primarily increments the reward-prediction /
value associated with whatever stuff in the world model became newly active
maybe half a second earlier. And maybe to a lesser extent, it also increments
the reward-prediction / value associated with anything else you were thinking
about at the time. (I'm not sure of the gory details here.)
Anyway, insofar as "the reward signal itself" is part of the world-model, it's
possible that reward-prediction / value will wind up attached to that concept.
And then that's a desire to wirehead. But it's not inevitable. Some of the
relevant dynamics are:
* Timing—if credit goes mainly to signals that slightly precede the reward
prediction error, then the reward signal itself is not a great fit.
* Explaining away—once you have a way to accurately predict some set of reward
signals, it makes the reward prediction errors go away, so the credit
assignment algorithm stops running for those signals. So the first good
reward-predicting model gets to stick around by default. Example: we learn
early in life that the "eating candy" concept predicts certain reward
signals, and then we get older and learn that the "certain neural signals in
my brain" concept predicts those same reward signals too. But just learning
that fact doesn't automatically translate into "I really want those certain
neural signals in my brain". Only the credit assignment algorithm can make a
thought appealing, and if the rewards are already being predicted then the
credit assignment algorithm is inactive. (This is kinda like the behaviorism
concept of blocking [https://en.wikipedia.org/wiki/Blocking_effect].)
* There may be some kind of bias to assign credit to predictive models that are
simple functions of sensory inputs, when such

Reward Is Not Enough

Nice post!

I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:

- The argument from Why Subagents?. This is about utility maximizers rather than reward maximizers, but it points in a similar qualitative direction. Summary: once we allow internal state, utility-maximizers are not the only inexploitable systems; markets/committees of utility-maximizers also work.
- The argument from Fixing The Good Regulator Theorem. That post uses some incoming information...

Thanks!
I had totally forgotten about your subagents post.
I've been thinking that they kinda blend together in model-based RL, or at least
the kind of (brain-like) model-based RL AGI that I normally think about. See
this comment
[https://www.lesswrong.com/posts/zzXawbXDwCZobwF9D/my-agi-threat-model-misaligned-model-based-rl-agent?commentId=fyHkc5yhzxS4ywyCd]
and surrounding discussion. Basically, one way to do model-based RL is to have
the agent create a predictive model of the reward and then judge plans based on
their tendency to maximize "the reward as currently understood by my predictive
model". Then "the reward as currently understood by my predictive model" is
basically a utility function. But at the same time, there's a separate
subroutine that edits the reward prediction model (≈ utility function) to ever
more closely approximate the true reward function (by some learning algorithm,
presumably involving reward prediction errors).
In other words: At any given time, the part of the agent that's making plans and
taking actions looks like a utility maximizer. But if you lump together that
part plus the subroutine that keeps editing the reward prediction model to
better approximate the real reward signal, then that whole system is a
reward-maximizing RL agent.
Please tell me if that makes any sense or not; I've been planning to write
pretty much exactly this comment (but with a diagram) into a short post.

Testing The Natural Abstraction Hypothesis: Project Intro

If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But also my intuition agrees with yours, this system doesn't seem like it should be chaotic...

Abstraction Talk

Heads up, there's a lot of use of visuals - drawing, gesturing at things, etc - so a useful transcript may take some work.

Yeah, we can have a try and see whether it ends up being worth publishing.

Testing The Natural Abstraction Hypothesis: Project Intro

Nice!

A couple notes:

- Make sure to check that the values in the jacobian aren't exploding - i.e. there are no values like 1e30 or 1e200 or anything like that. Exponentially large values in the jacobian probably mean the system is chaotic.
- If you want to avoid explicitly computing the jacobian, write a method which takes in a (constant) vector $v$ and uses backpropagation to return $v^T \frac{\partial x_t}{\partial x_0}$. This is the same as the time-0-to-time-t jacobian dotted with $v$, but it operates on size-n vectors rather than n-by-n jacobian matrices, so should be a lot cheaper.
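The backprop trick in the second bullet can be sketched in plain NumPy for an assumed linear step function (so per-step Jacobians are known in closed form; real code would get them from an autodiff library). The time-0-to-time-t Jacobian of a composed simulation is the product of per-step Jacobians, and the reverse pass multiplies $v$ through their transposes one matrix-vector product at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 20
# Hypothetical linear dynamics x_{t+1} = A x_t (stands in for a real simulator):
A = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))

def step(x):
    return A @ x

def step_jacobian(x):
    return A  # Jacobian of a linear step is just A

x = rng.standard_normal(n)
jacs = []
for _ in range(T):
    jacs.append(step_jacobian(x))
    x = step(x)

v = rng.standard_normal(n)

# Reverse ("backprop") pass: v^T (J_T ... J_1), touching only size-n vectors.
vT = v.copy()
for J in reversed(jacs):
    vT = J.T @ vT

# Sanity check against the explicitly assembled n-by-n Jacobian product.
J_full = np.eye(n)
for J in jacs:
    J_full = J @ J_full
assert np.allclose(vT, J_full.T @ v)
```

The reverse pass costs $O(T n^2)$ here versus $O(T n^3)$ for assembling the full Jacobian, which is the whole reason for the suggestion.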

Another little update, speed issue solved for now by adding SymPy's fortran wrappers to the derivative calculations - calculating the SVD isn't (yet?) the bottleneck. Can now quickly get results from 1,000+ step simulations of 100s of particles.

Unfortunately, even for the pretty stable configuration below, the values are indeed exploding. I need to go back through the program and double-check the logic, but I don't think it should be chaotic; if anything I would expect the values to hit zero.

It might be that there's some kind of quasi-chaotic behaviour...

SGD's Bias

My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, th...
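The condition-number point can be illustrated with a toy example (this setup is mine, not from the comment): on the quadratic $f(x) = \tfrac{1}{2}(x_1^2 + k x_2^2)$, gradient descent is stable on the sharp axis only when the learning rate is below $2/k$, so for a large condition number $k$ the step sizes that reach the sharp minimum are prohibitively small.

```python
import numpy as np

def gd(lr, k=100.0, steps=500):
    # Plain gradient descent on f(x) = 0.5 * (x1^2 + k * x2^2);
    # the gradient is (x1, k * x2). Condition number of the Hessian is k.
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = np.array([x[0], k * x[1]])
        x = x - lr * grad
    return x

# lr < 2/k = 0.02: converges to the minimum (slowly, because of the flat axis).
assert np.linalg.norm(gd(lr=0.015, steps=500)) < 1e-2
# lr > 2/k: blows up along the sharp axis instead of converging.
assert np.linalg.norm(gd(lr=0.025, steps=50)) > 1e2
```

With any practically fast step size, GD simply cannot sit in a minimum whose sharp direction violates the $2/\text{lr}$ curvature bound, which is one mechanism (independent of noise) pushing it toward flat minima.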

SGD's Bias

I'm still wrapping my head around this myself, so this comment is quite useful.

Here's a different way to set up the model, where the phenomenon is more obvious.

Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state $i$, it's equally likely to transition to $i+1$ or $i-1$). The rate at which the system leaves state $i$ serves a role analogous to the...
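A quick simulation makes the phenomenon concrete (the ring topology and the specific leave rates below are assumed for illustration): with unbiased jumps but a state-dependent leave rate, the jump chain visits all states equally often, yet occupancy time in state $i$ scales like $1/r_i$, so the walker spends most of its time where the leave rate ("noise") is smallest.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
rates = np.linspace(1.0, 10.0, n)  # leave rate grows with state index (assumed)

occupancy = np.zeros(n)
state = 0
for _ in range(200_000):
    # Holding time in the current state is exponential with mean 1/rate...
    occupancy[state] += rng.exponential(1.0 / rates[state])
    # ...then an unbiased +/-1 jump (ring topology to avoid boundary effects).
    state = (state + rng.choice([-1, 1])) % n

occupancy /= occupancy.sum()
predicted = (1.0 / rates) / (1.0 / rates).sum()

# Empirical time-occupancy matches the 1/rate prediction, and the walker
# strongly favors the low-leave-rate states:
assert np.abs(occupancy - predicted).max() < 0.05
assert occupancy[0] > 2 * occupancy[-1]
```

This is the discrete analogue of the drift-toward-low-noise bias: no per-jump bias anywhere, yet the stationary occupancy is far from uniform.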

That makes sense. Now it's coming back to me: you zoom your microscope into one
tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see
systematically faster air molecules moving rightward and slower molecules moving
leftward, because they're carrying the temperature from their last collision.
Whereas in uniform temperature, there's "detailed balance" (just as many
molecules going along a path vs going along the time-reversed version of that
same path, and with the same speed distribution).
Thinking about the diode-resistor thing more, I suspect it would be a
waste-of-time nerd-sniping rabbit hole, especially because of the time-domain
aspects (the fluctuations are faster on one side vs the other I think) which
don't have any analogue in SGD. Sorry to have brought it up.

SGD's Bias

does it represent a bias towards less variance over the different gradients one can sample at a given point?

Yup, exactly.

Formal Inner Alignment, Prospectus

This is a good summary.

I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:

- It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
- The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and malignness of the universal prior...

Could you elaborate on that? I do think that learning-normativity is more about
outer alignment. However, some ideas might cross-apply.
Well, it still seems like a good name to me, so I'm curious what you are
thinking here. What name would communicate better?
Again, I need more unpacking to be able to say much (or update much).
Well, the optimization-under-uncertainty is an attempt to make a frame which can
contain both, so this isn't necessarily a problem... but I am curious what feels
non-tight about inner agency.
I still agree with the hypothetical me making the opposite point ;p The problem
is that certain things are being conflated, so both "uncertainty can't be
separated from goals" and "uncertainty can be separated from goals" have true
interpretations. (I have those interpretations clear in my head, but
communication is hard.)
OK, so.
My sense of our remaining disagreement...
We agree that the pointers/uncertainty could be factored (at least informally --
currently waiting on any formalism).
You think "optimization under uncertainty" is doing something different, and I
think it's doing something close.
Specifically, I think "optimization under uncertainty" importantly is not
necessarily best understood as the standard Bayesian thing where we (1) start
with a utility function, (2) provide a prior, so that we can evaluate expected
value (and 2.5, update on any evidence), (3) provide a search method, so that we
solve the whole thing by searching for the highest-expectation element. Many
examples of optimization-under-uncertainty strain this model. Probably the
pointer/uncertainty model would do a better job in these cases. But, the
Bayesian model is kind of the only one we have, so we can use it provisionally.
And when we do so, the approximation of pointer-vs-uncertainty that comes out
is:
Pointer: The utility function.
Uncertainty: The search plus the prior, which in practice can blend together
into "inductive bias".
This isn't perfect, by any means...

Understanding the Lottery Ticket Hypothesis

Picture a linear approximation, like this:

The tangent space at that point is the whole line labelled "tangent".

The main difference between the tangent space and the space of neural-networks-for-which-the-weights-are-very-close is that the tangent space extrapolates the linear approximation indefinitely; it's not just limited to the region near the original point. (In practice, though, that difference does not actually matter much, at least for the problem at hand - we do stay close to the original point.)

The reason we want to talk about "the tangent space"...
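A one-parameter toy version of the picture (my example, not from the post): take a "network" with a single weight, $f(w) = \sin(w)$. The tangent space at $w_0$ is the full line $f(w_0) + f'(w_0)(w - w_0)$, extrapolated indefinitely, not just the patch of networks with weights near $w_0$.

```python
import numpy as np

w0 = 0.3
f = np.sin
df = np.cos  # derivative of sin

def tangent(w):
    # The linear approximation extended over ALL w, i.e. the tangent space.
    return f(w0) + df(w0) * (w - w0)

# Near w0 the tangent approximation is excellent...
assert abs(tangent(w0 + 1e-3) - f(w0 + 1e-3)) < 1e-6
# ...but the tangent space itself extends far beyond the neighbourhood,
# where it no longer tracks the true function at all:
assert abs(tangent(w0 + 10.0) - f(w0 + 10.0)) > 1.0
```

The "in practice we stay close to the original point" remark corresponds to only ever evaluating `tangent` at small offsets, where the two objects (tangent space vs. nearby networks) are nearly interchangeable.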

Understanding the Lottery Ticket Hypothesis

I don’t yet understand this proposal. In what way do we decompose this parameter tangent space into "lottery tickets"? Are the lottery tickets the cross product of subnetworks and points in the parameter tangent space? The subnetworks alone? If the latter then how does this differ from the original lottery ticket hypothesis?

The tangent space version is meant to be a fairly large departure from the original LTH; subnetworks are no longer particularly relevant at all.

We can imagine a space of generalized-lottery-ticket-hypotheses, in which the common theme i... (read more)

I confess I don't really understand what a tangent space is, even after reading
the wiki article on the subject. It sounds like it's something like this: Take a
particular neural network. Consider the "space" of possible neural networks that
are extremely similar to it, i.e. they have all the same parameters but the
weights are slightly different, for some definition of "slightly." That's the
tangent space. Is this correct? What am I missing?

Formal Inner Alignment, Prospectus

I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we *have* to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty...

Trying to lay this disagreement out plainly:

According to you, the inner alignment problem should apply to *well-defined optimization problems*, meaning optimization problems which have been given *all* the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).

According to me/Evan, the inner alignment problem should apply to *optimization under uncertainty*, which is a notion of optimization where you *don't*...

45moThe way I'm currently thinking of things, I would say the reverse also applies
in this case.
We can turn optimization-under-uncertainty into well-defined optimization by
assuming a prior. The outer alignment problem (in your sense) involves getting
the prior right. Getting the prior right is part of "figuring out what we want".
But this is precisely the source of the inner alignment problems in the
paul/evan sense: Paul was pointing out a previously neglected issue about the
Solomonoff prior, and Evan is talking about inductive biases of machine learning
algorithms (which is sort of like the combination of a prior and imperfect
search).
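To make that reduction concrete (a toy example with made-up hypotheses, actions, and numbers): fixing a prior turns optimization-under-uncertainty into ordinary optimization of an expected value, E[U(a)] = sum over h of prior(h) * U(a, h).

```python
# Hypothetical prior over hypotheses and a hypothetical utility table.
prior = {"h1": 0.7, "h2": 0.3}
utility = {("act_a", "h1"): 1.0, ("act_a", "h2"): 0.0,
           ("act_b", "h1"): 0.6, ("act_b", "h2"): 0.8}

def expected_utility(action):
    # E[U(a)] = sum_h prior(h) * U(a, h)
    return sum(p * utility[(action, h)] for h, p in prior.items())

# Once the prior is fixed, this is a plain, well-defined optimization.
best = max(["act_a", "act_b"], key=expected_utility)
print(best, expected_utility(best))  # act_a 0.7
```

Getting the prior right is then visibly part of the problem specification, which is why it can be filed under "figuring out what we want".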
So both you and Evan and Paul are agreeing that there's this problem with the
prior (/ inductive biases). It is distinct from other outer alignment problems
(because we can, to a large extent, factor the problem of specifying an expected
value calculation into the problem of specifying probabilities and the problem
of specifying a value function / utility function / etc). Everyone would seem to
agree that this part of the problem needs to be solved. The disagreement is just
about whether to classify this part as "inner" and/or "outer".
What is this problem like? Well, it's broadly a quality-of-prior problem, but it
has a different character from other quality-of-prior problems. For the most
part, the quality of priors can be understood by thinking about average error
being low, or mistakes becoming infrequent, etc. However, here, this kind of
thinking isn't sufficient: we are concerned with rare but catastrophic errors.
Thinking about these things, we find ourselves thinking in terms of "agents
inside the prior" (or agents being favored by the inductive biases).
To what extent "agents in the prior" should be lumped together with "agents in
imperfect search", I am not sure. But the term "inner optimizer" seems relevant.
A good example of optimization-under-uncertainty that doesn't look like that (at
least, not overtly) is most ap

Formal Inner Alignment, Prospectus

### Example: Dr Nefarious

I'll make a case here that manipulation of imperfect internal search should be considered *the* inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.

Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents' models. We have a model which knows of Dr Nefarious' existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already faile... (read more)

While I agree that outer objective, training data and prior should be considered
together, I disagree that it makes the inner alignment problem dissolve except
for manipulation of the search. In principle, if you could indeed ensure through
a smart choice of these three parameters that there is only one global optimum,
only "bad" (meaning high loss) local minima, and that your search process will
always reach the global optimum, then I would agree that the inner alignment
problem disappears.
But answering "what do we even want?" at this level of precision seems basically
impossible. I expect that it's pretty much equivalent to specifying exactly the
result we want, which we are quite unable to do in general.
So my perspective is that the inner alignment problem appears because of
inherent limits to our outer alignment capabilities. And that in realistic
settings where we cannot rule out multiple very good local minima, the sort of
reasoning underpinning the inner alignment discussion is the best approach we
have to address such problems.
That being said, I'm not sure how this view interacts with yours or Evan's, or
if this is a very standard use of the terms. But since that's part of the
discussion Abram is pushing, here is how I use these terms.

Hm, I want to classify "defense against adversaries" as a separate category from
both "inner alignment" and "outer alignment".
The obvious example is: if an adversarial AGI hacks into my AGI and changes its
goals, that's not any kind of alignment problem, it's a
defense-against-adversaries problem.
Then I would take that notion and extend it by saying "yes interacting with an
adversary presents an attack surface, but also merely imagining an adversary
presents an attack surface too". Well, at least in weird hypotheticals. I'm not
convinced that this would really be a problem in practice, but I dunno, I
haven't thought about it much.
Anyway, I would propose that the procedure for defense against adversaries in
general is: (1) shelter an AGI from adversaries early in training, until it's
reasonably intelligent and aligned, and then (2) trust the AGI to defend itself.
I'm not sure we can do any better than that.
In particular, I imagine an intelligent and self-aware AGI that's aligned in
trying to help me would deliberately avoid imagining an adversarial
superintelligence that can acausally hijack its goals!
That still leaves the issue of early training, when the AGI is not yet motivated
to not imagine adversaries, or not yet able. So I would say: if it does imagine
the adversary, and then its goals do get hijacked, then at that point I would
say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a
normal security hole—I would say the AGI is aligned before the adversary
exploits that hole, and misaligned after.) Then what? Well, presumably, we will
need to have a procedure that verifies alignment before we release the AGI from
its training box. And that procedure would presumably be indifferent to how the
AGI came to be misaligned. So I don't think that's really a special problem we
need to think about.

So, I think I could write a much longer response to this (perhaps another post),
but I'm more or less not persuaded that problems should be cut up the way you
say.
As I mentioned in my other reply, your argument that Dr. Nefarious problems
shouldn't be classified as inner alignment is that they are apparently outer
alignment. If inner alignment problems are roughly "the internal objective
doesn't match the external objective" and outer alignment problems are roughly
"the outer objective doesn't meet our needs/goals", then there's no reason why
these have to be mutually exclusive categories.
In particular, Dr. Nefarious problems can be both.
But more importantly, I don't entirely buy your notion of "optimization". This
is the part that would require a longer explanation to be a proper reply. But
basically, I want to distinguish between "optimization" and "optimization under
uncertainty". Optimization under uncertainty is not optimization -- that is, it
is not optimization of the type you're describing, where you have a well-defined
objective which you're simply feeding to a search. Given a prior, you can reduce
optimization-under-uncertainty to plain optimization (if you can afford the
probabilistic inference necessary to take the expectations, which often isn't
the case). But that doesn't mean that you do, and anyway, I want to keep them as
separate concepts even if one is often implemented by the other.
Your notion of the inner alignment problem applies only to optimization.
Evan's notion of inner alignment applies (only!) to optimization under
uncertainty.

This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr Nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the complement of outer alignment! Objective quality vs search quality is a nice dividing line, but it doesn't cluster together the problems people have been trying to cluster together.

Parsing Chris Mingard on Neural Networks

Yeah, I don't think there was anything particularly special about the complexity measure they used, and I wouldn't be surprised if some other measures did as-well-or-better at predicting which functions fill large chunks of parameter space.

Parsing Chris Mingard on Neural Networks

Minor note: I think "mappings with low Kolmogorov complexity occupy larger volumes" is an *extremely* misleading way to explain what the evidence points to here. (Yes, I know Chris used that language in his blog post, but he really should not have.) The intuition of "simple mappings occupy larger volumes" is an accurate summary, but the relevant complexity measure is very much not Kolmogorov complexity. It is a measure which does not account for every computable way of compressing things, only some *specific* ways of compressing things.

In a sense, the results ... (read more)

Thanks for this clarification John.
But did Mingard et al show that there is some specific practical complexity
measure that explains the size of the volumes occupied in parameter space better
than alternative practical complexity measures? If so, then I think we would have
uncovered an even more detailed understanding of which mappings occupy large
volumes in parameter space, and since neural nets just generally work so well in
the world, we could say that we know something about what kind of compression is
relevant to our particular world. But if they just used some practical
complexity measure as a rough proxy for Kolmogorov complexity then it leaves
open the door for mappings occupying large volumes in parameter space to be
simple according to many more ways of compressing things than were used in their
particular complexity measure.

FAQ: Advice for AI Alignment Researchers

I still don’t know much about GANs, Gaussian processes, Bayesian neural nets, and convex optimization.

FWIW, I'd recommend investing the time to learn about convex optimization even if you don't think you need it yet. Unlike the other topics on this list, the ideas from convex optimization are relevant to a wide variety of things the name does not suggest. Some examples:

- how to think about optimization in high dimensions (even for non-convex functions)
- using constraints as a more-gearsy way of representing relevant information (as opposed to e.g. just wrappin

Yeah that's definitely the one on the list that I think would be most useful.
I may also be understating how much I know about it; I've picked up some over
time, e.g. linear programming, minimax, some kinds of duality, mirror descent,
Newton's method.

NTK/GP Models of Neural Nets Can't Learn Features

They wouldn't be random linear combinations, so the central limit theorem estimate wouldn't directly apply. E.g. this circuit transparency work basically ran PCA on activations. It's not immediately obvious to me what the right big-O estimate would be, but intuitively, I'd expect the PCA to pick out exactly those components dominated by change in activations - since those will be the components which involve large correlations in the activation patterns across data points (at least that's my intuition).

I think this claim is basically wrong:

... (read more)And to the exten

Hmm, so regarding the linear combinations, it's true that there are some linear
combinations that will change by Θ(1) in the large-width limit -- just use the
vector of partial derivatives of the output at some particular input, this sum
will change by the amount that the output function moves during the regression.
Indeed, I suspect(but don't have a proof) that these particular combinations
will span the space of linear combinations that change non-trivially during
training. I would dispute "we expect most linear combinations to change" though
-- the CLT argument implies that we should expect almost all combinations to not
appreciably change. Not sure what effect this would have on the PCA and still
think it's plausible that it doesn't change at all (actually, I think Greg Yang
states that it doesn't change in section 9 of his paper, haven't read that part
super carefully though)
So I think I was a bit careless in saying that the NTK can't do transfer
learning at all -- a more exact statement might be "the NTK does the minimal
amount of transfer learning possible". What I mean by this is, any learning
algorithm can do transfer learning if the task we are 'transferring' to is
sufficiently similar to the original task -- for instance, if it's just the
exact same task but with a different data sample. I claim that the 'transfer
learning' the NTK does is of this sort. As you say, since the tangent kernel
doesn't change at all, the net effect is to move where the network starts in the
tangent space. Disregarding convergence speed, the impact this has on
generalization is determined by the values set by the old function on axes of
the NTK outside of the span of the partial derivatives at the new function's
data points. This means that, for the NTK to transfer anything from one task to
another, it's not enough for the tasks to both feature, for instance, eyes. It's
that the eyes have to correlate with the output in the exact same way in both
tasks. Indeed, the transfer l

NTK/GP Models of Neural Nets Can't Learn Features

Alright, I buy the argument on page 15 of the original NTK paper.

I'm still very skeptical of the interpretation of this as "NTK models can't learn features". In general, when someone proves some interesting result which seems to contradict some combination of empirical results, my default assumption is that the proven result is being interpreted incorrectly, so I have a high prior that that's what's happening here. In this case, it could be that e.g. the "features" relevant to things like transfer learning are not individual neuron activations - e.g. IIRC ... (read more)

I don't think taking linear combinations will help, because adding terms to the
linear combination will also increase the magnitude of the original activation
vector -- e.g. if you add together N^{1/2} units, the magnitude of the sum of their
original activations will with high probability be Θ(N^{1/4}), dwarfing the O(1)
change due to change in the activations. But regardless, it can't help with
transfer learning at all, since the tangent kernel (which determines learning in
this regime) doesn't change by definition.
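A quick numerical check of that magnitude claim (toy simulation, assuming i.i.d. unit-variance activations; numbers hypothetical): the sum of N^{1/2} of them has typical size Θ(N^{1/4}).

```python
import numpy as np

rng = np.random.default_rng(0)

def sum_rms(N, trials=2000):
    k = int(np.sqrt(N))                      # number of units combined
    sums = rng.normal(size=(trials, k)).sum(axis=1)
    return np.sqrt(np.mean(sums**2))         # typical magnitude of the sum

for N in (10**4, 10**6):
    print(N, sum_rms(N) / N**0.25)           # ~1: the sum scales as N^{1/4}
```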
What empirical results do you think are being contradicted? As far as I can
tell, the empirical results we have are 'NTK/GP have similar performance to
neural nets on some, but not all, tasks'. I don't think transfer/feature
learning is addressed at all. You might say these results are suggestive
evidence that NTK/GP captures everything important about neural nets, but this
is precisely what is being disputed with the transfer learning arguments.
I can imagine doing an experiment where we find the 'empirical tangent kernel'
of some finite neural net at initialization, solve the linear system, and then
analyze the activations of the resulting network. But it's worth noting that
this is not what is usually meant by 'NTK' -- that usually includes taking the
infinite-width limit at the same time. And to the extent that we expect the
activations to change at all, we no longer have reason to think that this linear
system is a good approximation of SGD. That's what the above mathematical
results mean -- the same mathematical analysis that implies that network
training is like solving a linear system, also implies that the activations
don't change at all.

NTK/GP Models of Neural Nets Can't Learn Features

Ok, that's at least a plausible argument, although there are some big loopholes. Main problem which jumps out to me: what happens after *one* step of backprop is not the relevant question. One step of backprop is not enough to solve a set of linear equations (i.e. to achieve perfect prediction on the training set); the relevant question is what happens after one step of Newton's method, or after enough steps of gradient descent to achieve convergence.

What would convince me more is an empirical result - i.e. looking at the internals of an actual NTK model, tr... (read more)

The result that NTK does not learn features in the large N limit is not in
dispute at all -- it's right there on page 15 of the original NTK paper
[https://arxiv.org/abs/1806.07572], and indeed holds after arbitrarily many
steps of backprop. I don't think that there's really much room for loopholes in
the math here. See Greg Yang's paper for a lengthy proof that this holds for all
architectures. Also worth noting that when people 'take the NTK limit' they
often don't initialize an actual net at all, they instead use analytical
expressions for what the inner product of the gradients would be at N=infinity
to compute the kernel directly.

NTK/GP Models of Neural Nets Can't Learn Features

During training on the 'old task', NTK stays in the 'tangent space' of the network's initialization. This means that, to first order, none of the functions/derivatives computed by the individual neurons change at all, only the output function does.

Eh? Why does this follow? Derivatives make sense; the derivatives staying approximately-constant is one of the assumptions underlying NTK to begin with. But the functions computed by individual neurons should be able to change for exactly the same reason the output function changes, assuming the network has more than one layer. What am I missing here?

The asymmetry between the output function and the intermediate neuron functions
comes from backprop -- from the fact that the gradients are backprop-ed through
weight matrices with entries of magnitude O(N^{-1/2}). So the gradient of the output
w.r.t. itself is obviously 1, then the gradient of the output w.r.t. each neuron
in the preceding layer is O(N^{-1/2}), since you're just multiplying by a vector
with those entries. Then by induction all other preceding layers' gradients are
the sum of N random things of size O(1/N), and so are of size O(N^{-1/2}) again. So
taking a step of backprop will change the output function by O(1) but the
intermediate functions by O(N^{-1/2}), vanishing in the large-width limit.
(This is kind of an oversimplification since it is possible to have changing
intermediate functions while doing backprop, as mentioned in the linked paper.
But this is the essence of why it's possible in some limits to move around using
backprop without changing the intermediate neurons)
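Here's a toy simulation of that scaling (assumed NTK-style initialization with weight entries of size ~N^{-1/2}; purely illustrative, not from the linked papers): the gradient of the scalar output w.r.t. a hidden layer has entries of size ~N^{-1/2}, and backprop through one more weight matrix preserves that scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_grad_rms(N):
    a = rng.normal(size=N) / np.sqrt(N)       # hidden -> output weights
    W = rng.normal(size=(N, N)) / np.sqrt(N)  # hidden -> hidden weights
    g = W.T @ a                               # grad w.r.t. the layer below:
                                              # N terms of size O(1/N) each
    return np.sqrt(np.mean(g**2))

for N in (64, 256, 1024):
    # rms * sqrt(N) is roughly constant, i.e. entries of g scale as N^{-1/2}
    print(N, hidden_grad_rms(N) * np.sqrt(N))
```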

Updating the Lottery Ticket Hypothesis

Plausible.

Here's intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there's multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.

That seems right, but also reminds me of the point that you need to randomly
initialize your neural nets for gradient descent to work (because otherwise the
gradients everywhere are the same). Like, in the randomly initialized net, each
edge is going to be part of many subcircuits, both good and bad, and the
gradient is basically "what's your relative contribution to good subcircuits vs.
bad subcircuits?"

Updating the Lottery Ticket Hypothesis

Oh that's definitely interesting. Thanks for the link.

Updating the Lottery Ticket Hypothesis

Yup, I agree that that quote says something which is probably true, given current evidence. I don't think "picking a winning lottery ticket" is a good analogy for what that implies, though; see this comment.

I don't know what the referent of 'that quote' is. If you mean the passage I
quoted from the original lottery ticket hypothesis ("LTH") paper, then I highly
recommend reading a follow-up paper
[http://proceedings.mlr.press/v119/frankle20a.html] which describes how and why
it's wrong for large networks. The abstract of the paper I'm citing here:
--------------------------------------------------------------------------------
Again, assuming "that" refers to the claim in the original LTH paper, I also
don't think it's a good analogy. But by default I think that claim is what "the
lottery ticket hypothesis" refers to, given that it's a widely cited paper that
has spawned a large number of follow-up works.

Updating the Lottery Ticket Hypothesis

Not quite. The linear expansion isn't just over the parameters associated with one layer, it's over all the parameters in the whole net.

Under my current understanding of the "general alignment" angle, a core argument goes like:

I don't necessarily think this is the best framing, and I don't necessarily agree with it (e.g. whether the agent ha... (read more)