Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter



Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying.

(Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty handwavy and speculative argument".)

Ultimately I think you just want to compare various kinds of models and ask how likely they are to arise (assuming you are staying within the scaled up neural networks as AGI paradigm). Some models you could consider:

  1. The idealized aligned model, which does whatever it thinks is best for humans
  2. The savvy aligned model, which wants to help humans but knows that it should play into human biases (e.g. by being sycophantic) in order to get high reward and not be selected against by gradient descent
  3. The deceptively aligned model, which wants some misaligned goal (say paperclips), but knows that it should behave well until it can execute a treacherous turn
  4. The bag of heuristics model, which (like a human) has a mix of various motivations, and mostly executes past strategies that have worked out well, imitating many of them from broader culture, without a great understanding of why they work, which tends to lead to high reward without extreme consequentialism.

(Really I think everything is going to be (4) until significantly past human-level, but will be on a spectrum of how close they are to (2) or (3).)

Plausibly you don't get (1) because it doesn't get particularly high reward relative to the others. But (2), (3) and (4) all seem like they could get similarly high reward. I think you could reasonably say that Goodharting is the reason you get (2), (3), or (4) rather than (1). But then amongst (2), (3) and (4), probably only (3) causes an existential catastrophe through misalignment.

You could then consider other factors like simplicity or training dynamics to say which of (2), (3) and (4) are likely to arise, but (a) this is no longer about Goodharting, (b) it seems incredibly hard to make arguments about simplicity / training dynamics that I'd actually trust, (c) the arguments often push in opposite directions (e.g. shard theory vs how likely is deceptive alignment), (d) a lot of these arguments also depend on capability levels, which introduces another variable into the mix (now allowing for arguments like this one).

The argument I'm making above is one about training dynamics. Specifically, the claim is that if you are on a path towards (3), it will probably take some bad actions initially (attempts at deception that fail), and if you successfully penalize those, that would plausibly switch the model towards (2) or (4).

I'm not claiming that you figure out whether the model's underlying motivations are bad. (Or, reading back what I wrote, I did say that but it's not what I meant, sorry about that.) I'm saying that when the model's underlying motivations are bad, it may take some bad action. If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.

It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.
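One hedged way to read that last sentence is as a simple Bayesian update. The sketch below is a toy, not something from the comment: the function name, the parameters, and all the numbers are hypothetical, and it assumes a well-motivated model never produces a flagged action. It also models only the inference from "no bad action was caught", and ignores the fact that penalizing bad actions changes the model itself.

```python
def posterior_good(prior_good: float,
                   p_bad_action: float,
                   detection_rate: float,
                   n_episodes: int) -> float:
    """Posterior probability of good motivations after n_episodes in which
    no bad action was caught. All inputs are illustrative assumptions."""
    # A badly motivated model stays unflagged in an episode unless it
    # both takes a bad action and that action is noticed.
    p_clean_given_bad = (1.0 - p_bad_action * detection_rate) ** n_episodes
    # Assumption: a well-motivated model is never flagged.
    p_clean_given_good = 1.0
    evidence = (prior_good * p_clean_given_good
                + (1.0 - prior_good) * p_clean_given_bad)
    return prior_good * p_clean_given_good / evidence


if __name__ == "__main__":
    # Hypothetical numbers: better detection gives a stronger update.
    for rate in (0.0, 0.3, 0.9):
        print(rate, posterior_good(0.5, 0.1, rate, 50))
```

Setting `p_bad_action` to zero corresponds to the first possibility above, a model that never acts badly until it is certain it will not be caught: in that case a clean record gives no update at all, whatever the detection rate.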

I think you're missing the primary theory of change for all of these techniques, which I would say is particularly compatible with your "follow-the-trying" approach.

While all of these are often analyzed from the perspective of "suppose you have a potentially-misaligned powerful AI; here's what would happen", I view that as an analysis tool, not the primary theory of change.

The theory of change that I most buy is that as you are training your model, while it is developing the "trying", you would like it to develop good "trying" and not bad "trying", and one way to make this more likely is to notice when bad "trying" develops and penalize it if so, with the hope that this leads to good "trying".

This is illustrated in the theory-of-change diagram below, where to put it in your terminology:

  1. Each of the clouds (red or blue) consists of models that are "trying"
  2. The grey models outside of clouds are models that are not "trying" or are developing "trying"
  3. The "deception rewarded" point occurs when a model that is developing "trying" does something bad due to instrumental / deceptive reasoning
  4. The "apply alignment technique" means that you use debate / RRM / ELK instead of vanilla RLHF, which allows you to notice it doing something bad and penalize it instead of rewarding it.

Some potential objections + responses:

  1. But the model will be "trying" right after pretraining, before you've even done any finetuning!
    1. Response: I don't think this is obvious, but if that is the case, that just means you should also be doing alignment work during pretraining.
  2. But all of these techniques are considering models that already have all their concepts baked in, rather than developing them on the fly!
    1. Response: I agree that's what we're thinking about now, and I agree that eventually we will need to think about models that develop concepts on the fly. But I think the overall theory of change here would still apply in that setting, even if we need to somewhat change the techniques to accommodate this new kind of capability.

Yes, that's right, though I'd say "probable" not "possible" (most things are "possible").

Depends what the aligned sovereign does! Also depends what you mean by a pivotal act!

In practice, during the period of time where biological humans are still doing a meaningful part of alignment work, I don't expect us to build an aligned sovereign, nor do I expect to build a single misaligned AI that takes over: I instead expect there to be a large number of AI systems, that could together obtain a decisive strategic advantage, but could not do so individually.

I think that skews it somewhat but not very much. We only have to "win" once in the sense that we only need to build an aligned Sovereign that ends the acute risk period once, similarly to how we only have to "lose" once in the sense that we only need to build a misaligned superintelligence that kills everyone once.

(I like the discussion on similar points in the strategy-stealing assumption.)

[People at AI labs] expected heavy scrutiny by leadership and communications teams on what they can state publicly. [...] One discussion with a person working at DeepMind is pending approval before publication. [...] We think organizations discouraging their employees from speaking openly about their views on AI risk is harmful, and we want to encourage more openness.

(I'm the person in question.)

I just want to note that in the case of DeepMind:

  • I don't expect "heavy" scrutiny by leadership and communications teams (though it is not literally zero)
  • For the discussion with me, the ball is in the authors' court: the transcript needs to be cleaned up more. I haven't even sent it to the relevant people at DeepMind yet (and have said so to Andrea).
  • While DeepMind obviously cares about what I say, I think it is mostly inaccurate to say that DeepMind has discouraged me from speaking about my views on AI risk (as should be evident from my many comments on this forum).

(Nothing in the post contradicts what I'm saying here, but I'm worried that readers would get a mistaken impression from it.)

Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:

  1. Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
  2. Malleable motivations: There is a "nearby" model M_good (i.e. a model with minor changes to the weights relative to M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
  3. Strong optimization: If there's a "nearby" setting of model weights that gets lower training loss, your finetuning process will find it (or something even better). Note this is a combination of human factors like "the developers wrote correct code" and background technical facts like "the shape of the loss landscape is favorable".
  4. Correct rewards: You accurately detect when a model output is a failure vs not a failure.
  5. Good exploration: During finetuning there are many different inputs that trigger the failure.

(In reality each of these is going to lie on a spectrum, and the question is how high you are on each spectrum, and some of them can substitute for others. I'm going to ignore these complications and keep talking as though they are discrete properties.)

Claim 1: you will get a model that has training loss at least as good as that of [M_orig without failures]. ((1) and (2) establish that M_good exists and behaves as [M_orig without failures], (3) establishes that we get M_good or something better.)

Claim 2: you will get a model that does strictly better than M_orig on the training loss. ((4) and (5) together establish that M_orig gets higher training loss than M_good, and we've already established that you get something at least as good as M_good.)

Corollary: Suppose your training loss plateaus, giving you model M, and M exhibits some failure. Then at least one of (1)-(5) must not hold.
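The two claims and the corollary can be written compactly as follows. This is a sketch under added notation, not part of the original argument: the symbols $\mathcal{L}$ (training loss) and $M^{*}$ (the model that finetuning returns) are assumptions introduced here, and the corollary step reads the plateaued model $M$ as playing the role of $M_{\text{orig}}$.

```latex
\begin{align*}
\text{Claim 1, from (1)--(3):} \quad
  & \mathcal{L}(M^{*}) \le \mathcal{L}(M_{\text{good}}) \\
\text{From (4) and (5):} \quad
  & \mathcal{L}(M_{\text{good}}) < \mathcal{L}(M_{\text{orig}}) \\
\text{Claim 2:} \quad
  & \mathcal{L}(M^{*}) \le \mathcal{L}(M_{\text{good}}) < \mathcal{L}(M_{\text{orig}})
\end{align*}
% Corollary (by contradiction): take M_orig := M, the plateaued model that
% still exhibits the failure. If (1)--(5) all held, Claim 2 would give a
% model with strictly lower training loss that finetuning would find,
% contradicting the plateau. So at least one of (1)--(5) fails.
```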

Generally when thinking about a deep learning failure I think about which of (1)-(5) was violated. In the case of AI misalignment via deep learning failure, I'm primarily thinking about cases where (4) and/or (5) fail to hold.

In contrast, with ChatGPT jailbreaking, it seems like (4) and (5) probably hold. The failures are very obvious (so it's easy for humans to give rewards), and there are many examples of them already. With Bing it's more plausible that (5) doesn't hold.

To people holding up ChatGPT and Bing as evidence of misalignment: which of (1)-(5) do you think doesn't hold for ChatGPT / Bing, and do you think a similar mechanism will underlie catastrophic misalignment risk?

In that case, I think you should try and find out what the incentive gradient is like for other people before prescribing the actions that they should take. I'd predict that for a lot of alignment researchers your list of incentives mostly doesn't resonate, relative to things like:

  1. Active discomfort at potentially contributing to a problem that could end humanity
  2. Social pressure + status incentives from EAs / rationalists to work on safety and not capabilities
  3. Desire to work on philosophical or mathematical puzzles, rather than mucking around in the weeds of ML engineering
  4. Wanting to do something big-picture / impactful / meaningful (tbc this could apply to both alignment and capabilities)

For reference, I'd list (2) and (4) as the main things that affect me, with maybe a little bit of (3), and I used to also be pretty affected by (1). None of the things you listed feel like they affect me much (now or in the past), except perhaps wishful thinking (though I don't really see that as an "incentive").

Errr, I feel like we already agree on this point?

Yes, sorry, I realized that right after I posted and replaced it with a better response, but apparently you already saw it :(

What I meant to say was "I think most of the time closing overhangs is more negative than positive, and I think it makes sense to apply OP's higher bar of scrutiny to any proposed overhang-closing proposal".

But like, why? I wish people would argue for this instead of flatly asserting it and then talking about increased scrutiny or burdens of proof (which I also don't like).
