Richard Ngo

Former AI safety research engineer, now AI governance researcher at OpenAI. Blog:


Shaping safer goals
AGI safety from first principles


Thanks for the comments Vika! A few responses:

It might be good to clarify that this is an example architecture and the claims apply more broadly.

Makes sense, will do.

Phase 1 and 2 seem to map to outer and inner alignment respectively. 

That doesn't quite seem right to me. In particular:

  • Phase 3 seems like the most direct example of inner misalignment; I basically think of "goal misgeneralization" as a more academically respectable way of talking about inner misalignment.
  • Phase 1 introduces the reward misspecification problem (which I treat as synonymous with "outer alignment") but also notes that policies might become misaligned by the end of phase 1 because they learn goals which are "robustly correlated with reward because they’re useful in a wide range of environments", which is a type of inner misalignment.
  • Phase 2 discusses both policies which pursue reward as an instrumental goal (which seems more like inner misalignment) and also policies which pursue reward as a terminal goal. The latter doesn't quite feel like a central example of outer misalignment, but it also doesn't quite seem like a central example of reward tampering (because "deceiving humans" doesn't seem like an example of "tampering" per se). Plausibly we want a new term for this - the best I can come up with after a few minutes' thinking is "reward fixation", but I'd welcome alternatives.

Supposing there is no misspecification in phase 1, do the problems in phase 2 still occur? "How likely is deceptive alignment?" seems to argue that they may not occur, since a model that has perfect proxies when it becomes situationally aware would not then become deceptively aligned.

It seems very unlikely for an AI to have perfect proxies when it becomes situationally aware, because the world is so big and there's so much it won't know. In general I feel pretty confused about Evan talking about perfect performance, because it seems like he's taking a concept that makes sense in very small-scale supervised training regimes, and extending it to AGIs that are trained on huge amounts of constantly-updating (possibly on-policy) data about a world that's way too complex to predict precisely.

I'm confused why mechanistic interpretability is listed under phase 3 in the research directions - surely it would make the most difference for detecting the emergence of situational awareness and deceptive alignment in phase 2, while in phase 3 the deceptively aligned model will get around the interpretability techniques. 

Mechanistic interpretability seems helpful in phase 2, but there are other techniques that could help in phase 2, in particular scalable oversight techniques. Whereas interpretability seems like the only thing that's really helpful in phase 3 - if it gets good enough then we'll be able to spot agents trying to "get around" our techniques, and/or intervene to make their concepts generalize in more desirable ways.

How? E.g. Jacob left a comment here about his motivations, does that count as a falsification? Or, if you'd say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul's comment here mentioning the discussions that took place before launching the GPT-3 work count as a falsification? If not, why not?

At a sufficiently high level of abstraction, I agree that "cost of experimenting" could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like "our inability to coordinate as a civilization" or "the power of intelligence" or "a lack of interpretability", etc. Given this, John's comment seemed mainly like a rhetorical flourish rather than a contentful claim about the structure of the difficult parts of the alignment problem.

Also, I think that the "on our first try" thing isn't a great framing, because there are always precursors (e.g. we landed a man on the moon "on our first try" but also had plenty of tries at something kinda similar). Then the question is how similar, and how relevant, the precursors are - something where I expect our differing attitudes about the value of empiricism to be the key crux.

RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF.

The smiley faces example feels confusing as a "classic" outer alignment problem because AGIs won't be trained on a reward function anywhere near as limited as smiley faces. An alternative like "AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad" feels more realistic, but also lacks the intuitive force of the smiley face example - it's much less clear in this example why generalization will go badly, given the breadth of the data collected.

I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that's being made here is failing to recognize that reality doesn't grade on a curve when it comes to understanding the world - your arguments can be false even if nobody has refuted them. That's particularly true when it comes to very high-level abstractions, like the ones this field is built around (and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment).

Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that's fine, this might be necessary, and so it's good to have some people pushing in this direction, but it seems like a bunch of people around here don't just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically always been made is a red flag at all.

I think it's possible to criticise work on RLHF while taking seriously the possibility that empirical work on our biggest models is necessary for solving alignment. But criticisms like this one seem to showcase a kind of blindspot. I'd be more charitable if people in the LW cluster had actually tried to write up the arguments for things like "why inner misalignment is so inevitable". But in general people have put shockingly little effort into doing so, with almost nobody trying to tackle this rigorously. E.g. I was surprised when my debates with Eliezer involved him still using all the same intuition-pumps as he did in the sequences, because to me the obvious thing to do over the next decade is to flesh out the underlying mental models of the key issue, which would then allow you to find high-level intuition pumps that are both more persuasive and more trustworthy.

I'm more careful than John about throwing around aspersions on which people are "actually trying" to solve problems. But it sure seems to me that blithely trusting your own intuitions because you personally can't imagine how they might be wrong is one way of not actually trying to solve hard problems.

There's one attitude towards alignment techniques which is something like "do they prevent all catastrophic misalignment?" And there's another which is more like "do they push out the frontier of how advanced our agents need to be before we get catastrophic misalignment?" I don't think the former approach is very productive, because by that standard no work is ever useful. So I tend to focus on the latter, with the theory of victory being "push the frontier far enough that we get a virtuous cycle of automating alignment work".

Huh, I thought you agreed with statements like "if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier".

I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn't help you very much if you have no idea how rockets work).

I don't really know what "reliable empirical feedback" means in this context - if you have sufficiently reliable feedback mechanisms, then you've solved most of the alignment problem. But, out of the things John listed:

Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, "getting what you measure" in slow takeoff

I expect that we'll observe a bunch of empirical examples of each of these things happening (except for the hard takeoff phase change), and not know how to fix them.

Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.

I don't think this is the core reason that alignment is hard - even if we had access to a bunch of evidence about AGI misbehavior now, I think it'd still be hard to convert that into a solution for alignment. Nor do I believe we'll see no empirical evidence of power-seeking behavior before it's too late (and I think opinions amongst alignment researchers are pretty divided on this question).

No, I'm thinking of cases where Alice>Bob, and trying to gesture towards the distinction between "Bob knows that Alice believes X" and "Bob can use X to make predictions".

For example, suppose that Bob is a mediocre physicist and Alice just invented general relativity. Bob knows that Alice believes that time and space are relative, but has no idea what that means. So when trying to make predictions about physical events, Bob should still use Newtonian physics, even when those calculations require assumptions that contradict Alice's known beliefs.

Interesting post! Two quick comments:

Sometimes we analyze agents from a logically omniscient perspective. ... However, this omniscient perspective eliminates Vingean agency from the picture.

Another example of this happening comes when thinking about utilitarian morality, which by default doesn't treat other agents as moral actors (as I discuss here).

Bob has minimal use for attributing beliefs to Alice, because Bob doesn't think Alice is mistaken about anything -- the best he can do is to use his own beliefs as a proxy, and try to figure out what Alice will do based on that.

This makes sense when you think in terms of isolated beliefs, but less sense when you think in terms of overarching world-models/worldviews. Bob may know many specific facts about what Alice believes, but be unable to tie those together into a coherent worldview, or understand how they're consistent with his other beliefs. So when Bob is a bounded agent, his best strategy for predicting Alice may be:

  • Maintain a model of Alice's beliefs which contains the specific things Alice is known to believe, and use that to predict Alice's actions in domains closely related to those beliefs.
  • For anything which isn't directly implied by Alice's known beliefs, use Bob's own world-model to make predictions about what will achieve Alice's goals.
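
A minimal sketch of the two-step strategy above, in Python. Everything here is hypothetical and for illustration only: the function name, the dictionary of known beliefs, and the toy world-model are assumptions, and "closely related" domains are simplified to an exact key lookup.

```python
from typing import Callable, Dict, Tuple


def predict_alice(
    domain: str,
    alice_known_beliefs: Dict[str, str],
    bob_world_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (prediction, source) for Alice's behaviour in `domain`.

    Hypothetical helper: a real notion of "closely related" would be fuzzier
    than the exact dictionary lookup used here.
    """
    # Step 1: if Alice's known beliefs directly cover this domain, use them.
    if domain in alice_known_beliefs:
        return alice_known_beliefs[domain], "alice_model"
    # Step 2: otherwise fall back on Bob's own world-model.
    return bob_world_model(domain), "bob_model"


# Toy data (assumed for illustration, loosely following the physics example).
alice_known_beliefs = {"simultaneity": "depends on the observer's frame"}


def bob_world_model(domain: str) -> str:
    # Bob can only reason with the physics he actually understands.
    return f"Newtonian estimate for {domain}"


print(predict_alice("simultaneity", alice_known_beliefs, bob_world_model))
print(predict_alice("projectile_range", alice_known_beliefs, bob_world_model))
```

The second return value records which model produced the prediction, which makes it easy to see how often Bob is forced back onto his own beliefs as a proxy.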