Abram Demski

Formal Inner Alignment, Prospectus

Right, but John is disagreeing with Evan's frame, and John's argument that such-and-such problems aren't inner alignment problems is that they *are* outer alignment problems.

Formal Inner Alignment, Prospectus

This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr. Nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the complement of outer alignment! Objective quality vs. search quality is a nice dividing line, but it doesn't cluster together the problems people have been trying to cluster together.

Formal Inner Alignment, Prospectus

I admit that I'm more excited by doing this because you're asking it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).

Thanks!

I'm not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments and the bias of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point.

Right. By "no connection" I specifically mean "we have no strong reason to posit any specific predictions we can make about mesa-objectives from outer objectives or other details of training" -- at least not for training regimes of practical interest. (I will consider this detail for revision.)

I could have also written down my plausibility argument (that there is actually "no connection"), but probably that just distracts from the point here.

(More later!)

My AGI Threat Model: Misaligned Model-Based RL Agent

I *have not* properly read all of that yet, but my *very quick take* is that your argument for a need for online learning strikes me as similar to your argument against the classic inner alignment problem applying to the architectures you are interested in. You find what I call mesa-learning implausible for the same reasons you find mesa-optimization implausible.

Personally, I've come around to the position (seemingly held pretty strongly by other folks, eg Rohin) that mesa-learning is practically inevitable for most tasks.

My AGI Threat Model: Misaligned Model-Based RL Agent

For "access to the reward function", we need to predict what the reward function will do (which may involve hard-to-predict things like "the human will be pleased with what I've done"). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a "reward function model", and the thing-that-predicts-summed-rewards the "value function", and then to change "the value function may be different from the reward function" to "the value function may be different from the expected sum of rewards". Something like that?

Ah, that wasn't quite my intention, but I take it as an acceptable interpretation.

My true intention was that the "reward function calculator" *should indeed be directly accessible* rather than indirectly learned via reward-function-model. I consider this normative (not predictive) due to the considerations about observation-utility agents discussed in Robust Delegation (and more formally in Daniel Dewey's paper). Learning the reward function is asking for trouble.

Of course, *hard-coding* the reward function is *also* asking for trouble, so... **shrug**

My AGI Threat Model: Misaligned Model-Based RL Agent

Hmm. I guess I have this ambiguous thing where I'm not specifying whether the value function is "valuing" world-states, or actions, or plans, or all of the above, or what. I think there are different ways to set it up, and I was trying not to get bogged down in details (and/or not being very careful!)

Sure, but given most reasonable choices, there will be an analogous variant of my claim, right? IE, for most reasonable model-based RL setups, the type of the reward function will be different from the type of the value function, but there will be a "solution concept" saying what it means for the value function to be correct with respect to a set reward function and world-model. This will be your notion of alignment, *not* "are the two equal".

Like, here's one extreme: imagine that the "planner" does arbitrarily-long-horizon rollouts of possible action sequences and their consequences in the world, and then the "value function" is looking at that whole future rollout and somehow encoding how good it is, and then you can choose the best rollout. In this case we *do* want the value function to converge to be (for all intents and purposes) a clone of the reward function.

Well, there's still a type distinction. The reward function gives a value at each time step in the long rollout, while the value function just gives an overall value. So maybe you mean that the ideal value function would be precisely the *sum* of rewards.

But if so, this isn't really what RL people typically call a value function. The point of a value function is to capture the *potential future rewards* associated with a state. For example, if your reward function is to be high up, then the *value* of being near the top of a slide is very low (because you'll soon be at the bottom), even if it's still generating high reward (because you're currently high up).

So the value of a history (even a long rollout of the future) should incorporate anticipated rewards after the end of the history, not just the value observed within the history itself.
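To make the slide example concrete, here is a minimal sketch. The three states, their rewards, the fixed dynamics, and the discount factor are all assumptions chosen for illustration; nothing here comes from the post itself. The reward is "how high up you are", and the value is the discounted sum of the rewards that follow.

```python
# Hypothetical three-state world with fixed dynamics (no actions needed).
GAMMA = 0.9  # assumed discount factor

# Reward is simply "how high up you are" in each state.
reward = {"slide_top": 3.0, "platform": 2.0, "bottom": 0.0}

# From the top of the slide you end up at the bottom; the platform is stable.
next_state = {"slide_top": "bottom", "platform": "platform", "bottom": "bottom"}


def evaluate(reward, next_state, gamma, sweeps=1000):
    """Repeatedly apply V(s) <- r(s) + gamma * V(next(s)) until it settles."""
    value = {s: 0.0 for s in reward}
    for _ in range(sweeps):
        value = {s: reward[s] + gamma * value[next_state[s]] for s in reward}
    return value


value = evaluate(reward, next_state, GAMMA)
for s in reward:
    print(f"{s:10s} reward={reward[s]:.1f} value={value[s]:.2f}")
```

Under these assumptions the top of the slide has the highest immediate reward but a much lower value than the platform, because everything after the first step is spent at the bottom.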

In the rollout architecture you describe, there wouldn't really be any *point* to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).

On the opposite extreme, when you're not doing rollouts at all, and instead the value function is judging particular states or actions, then I guess it should be *less* like the reward function and *more* like "expected upcoming reward assuming the current policy", which I think is what you're saying.

It doesn't seem to me like there is any "more/less like reward" spectrum here. The value function is just different from the reward function. In an architecture where you have a "value function" which operates like a reward function, I would just call it the "estimated reward function" or something along those lines, because RL people invented the value/reward distinction to point at something important (namely the difference between immediate reward and cumulative expected reward), and I don't want to use the terms in a way which gets rid of that distinction.

Like, maybe I'm putting on my shoes because I know that this is the first step of a plan where I'll go to the candy store and buy candy and eat it. I'm motivated to put on my shoes by the image in my head where, a mere 10 minutes from now, I'll be back at home eating yummy candy. In this case, the value function is hopefully approximating the reward function, and specifically approximating what the reward function will do at the moment where I will eat candy.

How is this "approximating the reward function"?? Again, if you feed both the value and reward function the same thing (the imagined history of going to the store and coming back and eating candy), you hope that they produce very different results (the reward function produces a sequence of individual rewards for each moment, including a high reward when you're eating the candy; the value function produces one big number accounting for the positives and negatives of the plan, including estimated future value of the post-candy-eating crash, even though that's not represented inside the history).

Well anyway, your point is well taken. Maybe I'll change it to "the value function might be misaligned with the reward function", or "incompatible", or something...

I continue to feel like you're not seeing that there is a precise formal notion of "the value function is aligned with the reward function", namely, that the value function is the solution to the value iteration equation (the Bellman equation) with respect to a given reward function and world model.
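For reference, one standard way to write that condition is below. The notation is an assumption on my part and not taken from the post: $P$ is the world model, $R$ the reward function, $\gamma$ a discount factor, and $\pi$ a policy.

$$V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V^{*}(s')\bigr]$$

For a fixed policy rather than the optimal one, the corresponding condition is

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V^{\pi}(s')\bigr]$$

In both cases $V$ and $R$ take different inputs, and "aligned" means that $V$ satisfies the equation given $R$ and $P$, not that $V = R$.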

My AGI Threat Model: Misaligned Model-Based RL Agent

(Much of this has been touched on already in our Discord conversation:)

## Inner alignment problem: The value function might be different from the reward function.

In fact that’s an understatement: The value function *will* be different from the reward function. Why? Among other things, because they have different type signatures -- they accept different input!

Surely this isn't relevant! We don't by any means *want* the value function to equal the reward function. What we *want* (at least in standard RL) is for the value function to be the solution to the dynamic programming problem set up by the reward function and world model (or, more idealistically, the reward function and the *actual* world).

The value function is a function of the latent variables in the world-model -- thus, even abstract concepts like “differentiate both sides of the equation” are assigned values. The value function is updated by the reward signals, using (I assume) some generalization of TD learning.

While something like this seems possible, it strikes me as a better fit for systems that do explicit probabilistic reasoning, as opposed to NNs. Like, if we're talking about predicting what ML people will do, the sentence "the value function is a function of the latent variables in the world model" makes a lot more sense than the clarification "even abstract concepts are assigned values". Because it makes more sense for the value to be just another output of the same world-model NN, or perhaps, to be a function of a "state vector" produced by the world-model NN, or *maybe* a function taking the whole activation vector of the world-model NN at a time-step as an __input__, as opposed to a value function which is explicitly creating __output__ values for each node in the value function NN (which is what it sounds like when you say even abstract concepts are assigned values).

I assume that the learned components (world-model, value function, planner / actor) continue to be updated in deployment -- a.k.a. online learning. This is important for the risk model below, but seems very likely -- indeed, unavoidable -- to me:

This seems pretty implausible to me, as we've discussed. Like, yes, it might be a good research direction, and it isn't *terribly* non-prosaic. However, the current direction seems pretty focused on offline learning (even RL, which was originally intended specifically for online learning, has become a primarily offline method!!), and GPT-3 has convinced everyone that the best way to get online learning is to do massive offline training and rely on the fact that if you train on enough variety, learning-to-learn is inevitable.

- Online updating of the world-model is necessary for the AGI to have a conversation, learn some new idea from that conversation, and then refer back to that idea perpetually into the future.

I think my GPT-3 example adequately addresses the first two points, and memory networks adequately address the third.

- Online updating of the value function is then also necessary for the AGI to usefully employ those new concepts. For example, if the deployed AGI has a conversation in which it learns the idea of “Try differentiating both sides of the equation”, it needs to be able to assign and update a value for that new idea (in different contexts), in order to gradually learn how and when to properly apply it.
- Online updating of the value function is *also* necessary for the AGI to break down problems into subproblems. Like if “inventing a better microscope” is flagged by the value function as being high-value, and then the planner notices that “If only I had a smaller laser, then I’d be able to invent a better microscope”, then we need a mechanism for the value function to flag “inventing a smaller laser” as *itself* high-value.

These points are more interesting, but I think it's plausible that architectural innovations could deal with them w/o true online learning.

Prediction can be Outer Aligned at Optimum

# Take 1

# Take 2

## Take 2.1

## Take 2.2

## Take 2.3

I think this is pretty complicated, and stretches the meaning of several of the critical terms employed in important ways. I think what you said is reasonable given the limitations of the terminology, but ultimately, may be subtly misleading.

How I would currently put it (which I think strays further from the standard terminology than your analysis):

Prediction *is not a well-defined optimization problem*.

Maximum-a-posteriori reasoning (with a given prior) is a well-defined optimization problem, and we can ask whether it's outer-aligned. The answer may be "no, because the Solomonoff prior contains malign stuff".

Variational Bayes (with a given prior and variational loss) is similarly well-defined. We can similarly ask whether it's outer-aligned.

Minimizing square loss with a regularizing penalty is well-defined. Etc. Etc. Etc.

But "prediction" is not a clearly specified optimization target. Even if you fix the predictive loss (square loss, Bayes loss, etc) you need to specify a prior in order to get a well-defined expectation to minimize.
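A toy illustration of why the prior has to be specified: the sketch below runs maximum-a-posteriori selection over three candidate coin biases, using the same data under two different priors. The hypotheses, the priors, and the helper name `map_hypothesis` are all hypothetical and chosen only for illustration.

```python
def map_hypothesis(prior, heads, tails):
    """Return the coin bias with the largest (unnormalized) posterior weight."""
    def posterior_weight(p):
        # prior probability times the likelihood of the observed flips
        return prior[p] * p**heads * (1 - p) ** tails
    return max(prior, key=posterior_weight)


# Two assumed priors over the same three hypotheses about P(heads).
uniform = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}
skeptical = {0.2: 0.1, 0.5: 0.8, 0.8: 0.1}

# Same data in both cases: two heads and one tail.
print(map_hypothesis(uniform, heads=2, tails=1))
print(map_hypothesis(skeptical, heads=2, tails=1))
```

The data and the selection rule are identical in both calls, yet the chosen hypothesis differs, so "predict the coin" only becomes a definite optimization target once the prior is fixed.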

So the really well-defined question is whether specific predictive optimization targets are outer-aligned at optimum. And this type of outer-alignment seems to require the target to discourage mesa-optimizers!

This is a problem for the existing terminology, since it means these objectives are not outer-aligned unless they are also inner-aligned.

OK, but maybe you object. I'm assuming that "optimization" means "optimization of a well-defined function which we can completely evaluate". But (you might say), we can also optimize under uncertainty. We do this all the time. In your post, you frame "optimal performance" in terms of loss+distribution. Machine learning treats the data as a sample from the true distribution, and uses this as a proxy, but adds regularizers *precisely because* it's an imperfect proxy (but the regularizers are still just a proxy).

So, in this frame, we think of the true target function as the average loss on the true distribution (ie the distribution which will be encountered in the wild), and we think of gradient descent (and other optimization methods used inside modern ML) as optimizing a proxy (which is totally normal for optimization under uncertainty).
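In symbols, and with notation that is assumed here rather than taken from the post, the true target in this frame is the expected loss on the deployment distribution $D_{\text{deploy}}$:

$$L_{\text{true}}(\theta) = \mathbb{E}_{(x, y) \sim D_{\text{deploy}}}\bigl[\ell(f_\theta(x), y)\bigr]$$

What training actually minimizes is a proxy built from $n$ training samples plus a regularizer $\Omega$ with weight $\lambda$:

$$\hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i) + \lambda\, \Omega(\theta)$$

The first expression cannot be evaluated at training time, which is what makes this optimization under uncertainty.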

With this frame, I think the situation gets pretty complicated.

Sure, ok, if it's just actually predicting the actual stuff, this seems pretty outer-aligned. Pedantic note: the term "alignment" is weird here. It's not "perfectly aligned" in the sense of perfectly forwarding human values. But it could be non-malign, which I think is what people mostly mean by "AI alignment" when they're being careful about meaning.

But this whole frame is saying that once we have outer alignment, *the problem that's left* is the problem of correctly predicting the future. We have to optimize under uncertainty *because we can't predict the future*. An outer-aligned loss function can nonetheless yield catastrophic results *because of distributional shift*. The Solomonoff prior is malign *because it doesn't represent the future with enough accuracy,* instead containing some really weird stuff.

So, with this terminology, the inner alignment problem *is the prediction problem*. If we can predict well enough, then we can set up a proxy which gets us inner alignment (by heavily penalizing malign mesa-optimizers for their future treacherous turns). Otherwise, we're stuck with the inner alignment problem.

So given this use of terminology, "prediction is outer-aligned" is a pretty weird statement. Technically true, but prediction *is the whole inner alignment problem.*

But wait, let's reconsider 2.1.

In this frame, "optimal performance" means optimal at deployment time. This means we get all the strange incentives that come from online learning. We aren't *actually doing* online learning, but *optimal performance* would respond to those incentives anyway.

(You somewhat circumvent this in your "extending the training distribution" section when you suggest proxies such as the Solomonoff distribution rather than using *the actual future* to define optimality. But this can reintroduce the same problem and more besides. Option #1, Solomonoff, is probably accurate enough to re-introduce the problems with self-fulfilling prophecies, besides being malign in other ways. Option #3, using a physical quantum prior, requires a solution to quantum gravity, and is also probably accurate enough to re-introduce the same problems with self-fulfilling prophecies. The only option I consider feasible is #2, human priors, because humans could notice this whole problem and refuse to be part of a weird loop of self-fulfilling prediction.)

So, I think I could write a much longer response to this (perhaps another post), but I'm more or less not persuaded that problems should be cut up the way you say.

As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn't be classified as inner alignment is that they *are* apparently *outer* alignment. If inner alignment problems are roughly "the internal objective doesn't match the external objective" and outer alignment problems are roughly "the outer objective doesn't meet our needs/goals", then there's no reason why these have to be mutually exclusive categories. In particular, Dr. Nefarious problems can be both.

But more importantly, I don't entirely buy your notion of "optimization". This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between "optimization" and "optimization under uncertainty". Optimization under uncertainty *is not optimization* -- that is, it is not optimization of the type you're describing, where you have a well-defined objective which you're simply feeding to a search. Given a prior, you *can* reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn't the case). But that doesn't mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other.

Your notion of the inner alignment problem applies only to *optimization*. Evan's notion of inner alignment applies (only!) to *optimization under uncertainty*.
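As a rough sketch of the reduction mentioned above, consider a toy decision problem. The worlds, prior, and utilities are all invented for illustration. When the world is known, the objective can be evaluated directly; when it is not, a prior turns the problem into plain optimization of an expectation.

```python
# Hypothetical decision problem: all names and numbers are illustrative.
prior = {"rain": 0.3, "sun": 0.7}
utility = {
    "umbrella": {"rain": 1.0, "sun": 0.6},
    "no_umbrella": {"rain": 0.0, "sun": 1.0},
}


def best_in_world(world):
    """Plain optimization: the objective is fully known and can be evaluated."""
    return max(utility, key=lambda action: utility[action][world])


def expected_utility(action):
    """Average the utility of an action over the assumed prior."""
    return sum(prior[w] * utility[action][w] for w in prior)


def best_under_prior():
    """Optimization under uncertainty, reduced to optimizing an expectation."""
    return max(utility, key=expected_utility)


print(best_in_world("rain"), best_in_world("sun"), best_under_prior())
```

Here the expectation is a two-term sum, so the reduction is cheap. The caveat in the text is that for realistic priors this inference step is often unaffordable, which is one reason to keep the two concepts separate.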