## AI ALIGNMENT FORUMAF

Chris van Merwijk

Sorted by New

# Wiki Contributions

NEW EDIT: After reading three giant history books on the subject, I take back my previous edit. My original claims were correct.

Could you edit this comment to add which three books you're referring to?

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this was the only reason (given the size of the context window of gpt3).

Therefore, the waluigi eigen-simulacra are attractor states of the LLM

It seems to me like this informal argument is a bit suspect. Actually I think this argument would not apply to Solomonof Induction.

Suppose we have to programs that have distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (equivalently, p1 i.i.d. samples bernoully from {0,1}, p2 samples  0 i.i.d. with 100%).

Suppose we use a perfect Bayesian reasoner to sample bitstrings, but we do it in precisely the same way LLMs do it according to the simulator model. That is, given a bitstring, we first formulate a posterior over programs, i.e. a "superposition" on programs, which we use  to sample the next bit, then we recompute the posterior, etc.

Then I think the probability of sampling 00000000... is just 50%. I.e. I think the distribution over bitstrings that you end up with is just the same as if you just first sampled the program and stuck with it.

I think tHere's a messy calculation which could be simplified (which I won't do):

Limit of this is 0.5.

I don't wanna try to generalize this, but based on this example it seems like if an LLM was an actual Bayesian, Waluigi's would not be attractors. The informal argument is wrong because it doesn't take into account the fact that over time you sample increasingly many non-waluigi samples, pushing down the probability of Waluigi.

Then again, the presense of a context window completely breaks the above calculation in a way that preserves the point. Maybe the context window is what makes Waluigi's into an attractor? (Seems unlikely actually, given that the context windows are fairly big).

There is a general phenomenon where:

• Person A has mental model X and tries to explain X with explanation Q
• Person B doesn't get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn't actually contain the insights, but P does.
• Person C doesn't get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: ...

It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn't. Some of the evidence for this is in fact contained in your very comment:

"1. Pointing out the "reward chisels computation" point. 2. Having some people tell me it's obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)"
So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).

BTW, it could in fact be that person B's explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about "the" optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let's not get into it).

"Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so."

I have been correcting people for a while on stuff like that (though not on LW, I'm not often on LW), such as that in the generic case we shouldn't expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn't (others did), so your points 1/2/3 also apply to me.

"I do totally buy that you all had good implicit models of the reward-chiseling point". I don't think we just "implicitly" modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I'm not claiming we conveyed everything well to everyone (clearly you haven't either).

"even though reward is not a kind of objective", this is a terminological issue. In my view, calling a "antecedent-computation reinforcement criterion" an "objective" matches my definition of "objective", and this is just a matter of terminology. The term "objective" is ill-defined enough that "even though reward is not a kind of objective" is a terminological claim about objective, not a claim about math/the world.

The idea that RL agents "reinforce antecedent computations" is completely core to our story of deception. You could not make sense of our argument for deception if you didn't look at RL systems in this way. Viewing the base optimizer as "trying" to achieve an "objective" but "failing" because it is being "deceived" by the mesa optimizer is purely a metaphorical/terminological choice. It doesn't negate the fact that we all understood that the base optimizer is just reinforcing "antecedent computations". How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer's optimization process?

I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn't even make sense if you didn't have this insight. (Certainly the fact that we called it an objective doesn't communicate the point, and it isn't meant to).

The core point in this post is obviously correct, and yes people's thinking is muddled if they don't take this into account. This point is core to the Risks from learned optimization paper (so it's not exactly new, but it's good if it's explained in different/better ways).

Is the following a typo?
"So, the  ( works"

first sentence of "CoCo Equilbiria".

Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.

Is there a difference between saying:

• A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't actually encode in any way the "goal" of the model itself.
• A reward function is not an objective function, and the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't actually encode in any way the "goal" of the model itself.

It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn't actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an "objective".

However, it might be that while RFLO pointed out the same mechanistic understanding that you have in mind, but calling it an objective tends in practice to not fully communicate that mechanistic understanding.

Or it might be that I am really not yet understanding that there is an actual diferrence in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.

(On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).

It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?

(Note I am still surprised sometimes that people still think certain wireheading scenario's make sense despite them having read RFLO, so it's plausible to me that we really didn't communicate everyrhing that's in my head about this).

Here is my partial honest reaction, just two points I'm somewhat dissatisfied with (not meant to be exhaustive):
2. "A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure." I would like there to be an argument for this claim that doesn't rely on nanotech, and solidly relies on actually existing amounts of compute. E.g. if the argument relies on running intractable detailed simulations of proteins, then it doesn't count. (I'm not disagreeing with the nanotech example by the way, or saying that it relies on unrealistic amounts of compute, I'd just like to have an argument for this that is very solid and minimally reliant on speculative technology, and actually shows that it is).
6. "We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.". You name "burn all GPU's" as an "overestimate for the rough power level of what you'd have to do", but it seems to me that it would be too weak of a pivotal act? Assuming there isn't some extreme change in generally held views, people would consider this an extreme act of terrorism, and shut you down, put you in jail, and then rebuild the GPU's and go on with what they were planning to do. Moreover, now there is probably an extreme taboo on anything AI safety related. (I'm assuming here that law enforcement finds out that you were the one who did this). Maybe the idea is to burn all GPU's indefinitely and forever (i.e. leave nanobots that continually check for GPU's and burn them when they are created), but even this seems either insufficient or undesirable long term depending on what is counted as a GPU. Possibly I'm not getting what you mean, but it just seems completely too weak as an act.