I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.

(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sure).

I agree with much of the rest of this post, eg the paragraphs beginning with "The solutions to these two problems are pretty different."

Here's our definition in the RL setting for reference (from

A deep RL agent is trained to maximize a reward $R: S \times A \times S \to \mathbb{R}$, where $S$ and $A$ are the sets of all valid states and actions, respectively. Assume that the agent is deployed out-of-distribution; that is, an aspect of the environment (and therefore the distribution of observations) changes at test time. Goal misgeneralization occurs if the agent now achieves low reward in the new environment because it continues to act capably yet appears to optimize a different reward $R' \neq R$. We call $R$ the intended objective and $R'$ the behavioral objective of the agent.

FWIW I think this definition is flawed in many ways (for example, the type signature of the agent's inner goal is different from that of the reward function, bc the agent might have an inner world model that extends beyond the RL environment's state space; and also it's generally sketchy to extend the reward function beyond the training distribution), but I don't know of a different definition that doesn't have similarly-sized flaws.
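
To make the definition a bit more concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the 1-D corridor, the hard-coded `go_right_policy` (standing in for whatever a trained agent might have learned), and the helper names are all hypothetical, and the rewards depend only on the final state rather than on full transitions. In training the coin always sits at the right wall, so "reach the coin" (the intended objective) and "reach the right wall" (a possible behavioral objective) are indistinguishable; at test time the coin moves and the two come apart.

```python
from typing import Callable

# Hypothetical toy setup: a corridor of cells 0..length-1.
# A policy maps (agent position, coin position) to a step of -1 or +1.
Policy = Callable[[int, int], int]


def intended_reward(final_pos: int, coin_pos: int) -> float:
    """Intended objective: 1 if the agent ends on the coin, else 0."""
    return 1.0 if final_pos == coin_pos else 0.0


def behavioral_reward(final_pos: int, length: int) -> float:
    """A candidate behavioral objective: 1 if the agent ends at the right wall."""
    return 1.0 if final_pos == length - 1 else 0.0


def go_right_policy(agent_pos: int, coin_pos: int) -> int:
    """Stand-in for a learned policy that ignores the coin and always moves right."""
    return 1


def rollout(policy: Policy, length: int, start: int, coin_pos: int, steps: int) -> int:
    """Run the policy for a fixed number of steps, clamping to the corridor walls."""
    pos = start
    for _ in range(steps):
        pos = min(max(pos + policy(pos, coin_pos), 0), length - 1)
    return pos


LENGTH = 10

# Training distribution: the coin is always at the right wall.
train_final = rollout(go_right_policy, LENGTH, start=5, coin_pos=LENGTH - 1, steps=LENGTH)

# Out-of-distribution test: the coin is moved to the left wall.
test_final = rollout(go_right_policy, LENGTH, start=5, coin_pos=0, steps=LENGTH)

print("train, intended reward:", intended_reward(train_final, LENGTH - 1))
print("test, intended reward:", intended_reward(test_final, 0))
print("test, behavioral reward:", behavioral_reward(test_final, LENGTH))
```

The policy still acts capably at test time (it reliably reaches the right wall), but it gets low reward under the intended objective. Note that the sketch sidesteps the flaws mentioned above: it simply assumes both rewards are well-defined on the test environment.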

It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.

(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

To briefly hop in and say something that may be useful: I had a reaction pretty similar to what Eliezer commented, and I don't see continuity or "Things will be weird before getting extremely weird" as a crux. (I don't know why you think he does, and don't know what he thinks, but would guess he doesn't think it's a crux either)

Yeah we're on the same page here, thanks for checking!

For one thing, you use the word “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim than with an ‘(almost) definitely not’ claim, right?

I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the problem of actually aligning the toddler AGI. (Another step is getting labs to even seriously attempt to box it and align it, which maybe is an out-of-scope consideration here but it does make me more pessimistic).

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary.

I agree we're not talking about a superintelligent adversary, and I agree that boxing is doable for some forms of toddler AGI. I do think you need coherence; if the toddler AGI is incoherent, then any "aligned" behavioral properties it has will also be incoherent, and something unpredictable (and so probably bad) will happen when the AGI becomes more capable or more coherent. (Flagging that I'm not sure "coherent" is the right way to talk about this... wish I had a more precise concept here.)

We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

I agree a non-reflective toddler AGI is in many ways easier to deal with. I think we will have problems at the threshold where the tAGI is first able to reflect on its goals and realizes that the RLHF-instilled desires aren't going to imply docile behavior. (If we can speculate about how a superintelligence might extrapolate a set of trained-in desires and realize that this process doesn't lead to a good outcome, then the tAGI can reason the same way about its own desires).

(I agree that if we can get aligned desires that are stable under reflection, then maybe the 'use non-endorsed desires to tide us over' plan could work. Though even then you need to somehow manage to prevent the tAGI from reflecting on its desires until you get the desires to a point where they stay aligned under reflection, and I have no idea how you would do something like that - we currently just don't have that level of fine control over capabilities).

The basic problem here is the double-bind where we need the toddler AGI to be coherent, reflective, capable of understanding human intent (etc) in order for it to be robustly alignable at all, even though those are exactly the incredibly dangerous properties that we really want to stay away from. My guess is that the reason Nate's story doesn't hypothesize a reflectively-endorsed desire to be nondeceptive is that reflectively-stable aligned desires are really hard / dangerous to get, and so it seems better / at least not obviously worse to go for eliezer-corrigibility instead.

Some other difficulties that I see:

  1. The 'capability profile' (ie the relative levels of the toddler AGI's capabilities) is going to be weird / very different from that of humans; that is, once the AGI has human-level coherence and human-level understanding of human intent, it has far-superhuman capabilities in other domains. (Though hopefully we're at least careful enough to remove code from the training data, etc).
  2. A coherent agentic AI at GPT-4 level capabilities could plausibly already be deceptively aligned, if it had sufficient situational awareness, and our toddler AGI is much more dangerous than that.
  3. All of my reasoning here is kind of based on fuzzy confused concepts like 'coherence' and 'capability to self-reflect', and I kind of feel like this should make me more pessimistic rather than more optimistic about the plan.

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I'm arguing that it's definitely not going to work (I don't have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):

  • In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities. (I think you already agree with this, but it seemed worth pointing out that this is a pretty harsh set of prerequisites, especially given that we don't have any fine control over relative capabilities (or situational awareness, or coherence, etc.), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle.)
  • The concept of docility that you want to align it to needs to be very specific and robust against lots of different kinds of thinking. You need it to conclude that you don't want it to deceive you / train itself for a bit longer / escape containment / etc., but at the same time you don't want it to extrapolate out your intent too much (it could be so much more helpful if it did train itself for a little longer, or if it had a copy of itself running on more compute, or it learns that there are some people out there who would like it if the AGI were free, or something else I haven't thought of).
  • You only have limited bits of optimization to expend on getting it to be inner aligned bc of deceptive alignment.
  • There's all the classic problems with corrigibility vs. consequentialism (and you can't get around those by building something that is not a reflective consequentialist, because that again is not stable under capability gains).

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:

  • There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won't affect them because they're not there yet.
  • General capabilities imply the ability to be deceptive if useful in a particular circumstance. So you can't just train away the capability to be deceptive (or maybe you can, but not in a way that is robust wrt general capability gains).
  • Really you want to train against the propensity to be deceptive, rather than the capability. But propensities also change with capability level; becoming more capable is all about having more ways to achieve your goals. So eliminating propensity to be deceptive at a lower capability level does not eliminate the propensity at a higher capability level.
  • The robust way to get rid of propensity to be deceptive is to reach an attractor where more capability == less deception (within the capability range we care about), because the AI's terminal goals on some level include 'being nondeceptive'.
  • Before we can align the AI's goals to human intent in this way, the AI needs to have a good understanding of human intent, good situational awareness, and be a (more or less) unified / coherent agent. If it's not, then its goals / propensities will shift as it becomes more capable (or more situationally aware, or more coherent, etc.).
  • This is a pretty harsh set of prerequisites, and is probably outside of the range of circumstances where people usually hope their method to avoid deception will work.
  • Even if methods to detect deception (narrowly conceived) work, we cannot tell apart an agent that is actually nondeceptive / aligned from an agent that e.g. just aims to play the training game (and will do something unspecified once it reaches a capability threshold that allows it to breach containment).
  • A specific (maybe too specific) problem that can still happen in this scenario: you might get an AI that is overall capable, but just learns to not think long enough about scenarios that would lead it to try to be deceptive. This can still happen at the maximum capability levels at which we might hope to still contain an AGI that we are trying to align (ie somewhere around human level, optimistically).

I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed".

(Though of course it's important to spell the argument out)
