Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter


Decoupling deliberation from competition

Planned summary for the Alignment Newsletter:

Under a longtermist lens, one problem to worry about is that even after building AI systems, humans will spend more time competing with each other than figuring out what they want, which may then lead to their values changing in an undesirable way. For example, we may have powerful persuasion technology that everyone uses to persuade people to their line of thinking; it seems bad if humanity’s values are determined by a mix of effective persuasion tools, especially if persuasion significantly diverges from truth-seeking.

One solution to this is to coordinate to _pause_ competition while we deliberate on what we want. However, this seems rather hard to implement. Instead, we can at least try to _decouple_ competition from deliberation, by having AI systems acquire <@flexible influence@>(@The strategy-stealing assumption@) on our behalf (competition), and having humans separately think about what they want (deliberation). As long as the AI systems are competent enough to shield the humans from the competition, the results of the deliberation shouldn’t depend too much on competition, thus achieving the desired decoupling.

The post has a bunch of additional concrete details on what could go wrong with such a plan that I won’t get into here.

Re-Define Intent Alignment?

Are you the historical origin of the robustness-centric approach?

Idk, probably? It's always hard for me to tell; so much of what I do is just read what other people say and make the ideas sound sane to me. But stuff I've done that's relevant:

  • Talk at CHAI saying something like "daemons are just distributional shift" in August 2018, I think. (I remember Scott attending it.)
  • Talk at FHI in February 2020 that emphasized a risk model where objectives generalize but capabilities don't.
  • Talk at SERI conference a few months ago that explicitly argued for a focus on generalization over objectives.

Especially relevant stuff other people have done that has influenced me:

(My views were pretty set by the time Evan wrote the clarifying inner alignment terminology post; it's possible that his version that's closer to generalization-focused was inspired by things I said, you'd have to ask him.)

Progress on Causal Influence Diagrams

Planned summary for the Alignment Newsletter:

Many of the problems we care about (reward gaming, wireheading, manipulation) are fundamentally a worry that our AI systems will have the _wrong incentives_. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are <@graphical models@>(@Understanding Agent Incentives with Causal Influence Diagrams@) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has. This can then be used for several applications:

1. We can analyze what happens when you intervene on the agent’s action. Depending on whether the RL algorithm uses the original or modified action in its update rule, we may or may not see the algorithm disable its off switch.

2. We can <@avoid reward tampering@>(@Designing agent incentives to avoid reward tampering@) by removing the connections from future rewards to utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its _current_ reward function.

3. A multiagent version allows us to recover concepts like Nash equilibria and subgames from game theory, using a very simple, compact representation.

Re-Define Intent Alignment?

(Meta: was this meant to be a question?)

In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.

I don't think this is actually a con of the generalization-focused approach. From the post you link, one of the two questions in that approach (the one focused on robustness) is:

How do we ensure the model generalizes acceptably out of distribution?

Part of the problem is to come up with a good definition of "acceptable", such that this is actually possible to achieve. (See e.g. the "Defining acceptable" section of this post, or the beginning of this post.) But if you prefer to bake in the notion of intent, you could make the second question

How do we ensure the model continues to try to help us when out of distribution?

[AN #156]: The scaling hypothesis: a plan for building AGI

"what is the measure of data-sets in the N-datapoint hypercube such that the trained model is aligned?", perhaps also weighting by ease of specification in some sense.

You're going to need the ease of specification condition, or something similar; else you'll probably run into no-free-lunch considerations (at which point I think you've stopped talking about anything useful).

[AN #156]: The scaling hypothesis: a plan for building AGI

The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn.

?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?

EDIT: Ah, you mean the fourth bullet point in ESRogs's response. I was thinking of that as one example of how such reasoning could go wrong, as opposed to the only case. So in that case the model_1 predicts a treacherous turn confidently, but this is the wrong epistemic state to be in because it is also plausible that it just "fills in words" instead.

Seems to me the conclusion of this argument is that "In general it's not true that the AI is trying to achieve its training objective." 

Isn't that effectively what I said? (I was trying to be more precise since "achieve its training objective" is ambiguous, but given what I understand you to mean by that phrase, I think it's what I said?)

we have no idea what it'll do; treacherous turn is a real possibility because that's what'll happen for most goals it could have, and it may have a goal for all we know.

This seems reasonable to me (and seems compatible with what I said)

[AN #156]: The scaling hypothesis: a plan for building AGI

Yeah, I agree with all this. I still think the pretraining objective basically doesn't matter for alignment (beyond being "reasonable") but I don't think the argument I've given establishes that.

I do think the arguments in support of Claim 2 are sufficient to at least raise Claim 3 to attention (and thus Claim 4 as well).

Fractional progress estimates for AI timelines and implied resource requirements

Planned summary:

One methodology for forecasting AI timelines is to ask experts how much progress they have made toward human-level AI within their subfield over the last T years. You can then extrapolate linearly to see when 100% of the problem will be solved. The post linked above collects such estimates, with a typical estimate being 5% of a problem being solved in the twenty year period between 1992 and 2012. Overall these estimates imply a timeline of 372 years.

This post provides a reductio argument against this pair of methodology and estimate. The core argument is that if you linearly extrapolate, then you are effectively saying “assume that business continues as usual: then how long does it take”? But “business as usual” in the case of the last 20 years involves an increase in the amount of compute used by AI researchers by a factor of ~1000, so this effectively says that we’ll get to human-level AI after a 1000^{372/20} = 10^56 increase in the amount of available compute. (The authors do a somewhat more careful calculation that breaks apart improvements in price and growth of GDP, and get 10^53.)
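As a quick check on the arithmetic above, here is a minimal sketch that redoes the 1000^{372/20} calculation in log space. The variable names are mine, and it only reproduces the rough back-of-the-envelope figure, not the authors' more careful 10^53 calculation.

```python
import math

# Numbers taken from the summary above; variable names are illustrative.
years_remaining = 372      # implied timeline from the expert estimates
window_years = 20          # length of the "business as usual" period
growth_per_window = 1000   # ~1000x growth in compute over that period

# Number of 20-year windows that fit into the implied timeline.
windows = years_remaining / window_years

# Work in log10 so the result is the exponent of the compute increase.
log10_increase = windows * math.log10(growth_per_window)

print(f"implied compute increase ~ 10^{log10_increase:.1f}")
```

The exponent comes out a little under 56, which matches the rounded 10^56 quoted above.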

This is a stupendously large amount of compute: it far dwarfs the amount of compute used by evolution, and even dwarfs the maximum amount of irreversible computing we could have done with all the energy that has ever hit the Earth over its lifetime (the bound comes from Landauer’s principle).

Given that evolution _did_ produce intelligence (us), we should reject the argument. But what should we make of the expert estimates then? One interpretation is that “proportion of the problem solved” behaves more like an exponential, because the inputs are growing exponentially, and so the time taken to do the last 90% can be much less than 9x the time taken for the first 10%.
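To make the exponential interpretation concrete, here is a toy sketch. It is my own illustrative model, not something from the post: assume the rate of progress is proportional to inputs, and inputs grow by a constant factor G per time window, so cumulative progress is P(t) = c * (G^t - 1). The function name and the choice of G = 1000 per window (borrowed from the compute growth figure above) are assumptions for illustration only.

```python
import math

def windows_to_finish(first_window_fraction, growth_per_window):
    """Total windows needed to reach 100% under the toy model
    P(t) = c * (G**t - 1), where P(1) equals first_window_fraction."""
    # Fit c so that the first window accounts for the observed fraction.
    c = first_window_fraction / (growth_per_window - 1)
    # Solve c * (G**t - 1) = 1 for t.
    return math.log(1 + 1 / c) / math.log(growth_per_window)

# Hypothetical inputs: 10% solved in the first window, 1000x growth per window.
total = windows_to_finish(0.10, 1000)
last_90_over_first_10 = total - 1  # first 10% took exactly one window

print(f"last 90% takes {last_90_over_first_10:.2f}x the time of the first 10%")
print("linear extrapolation would say 9x")
```

Under these assumptions the last 90% takes roughly a third of the time the first 10% took, rather than 9x as long. The specific number depends entirely on the assumed growth rate and functional form, so it should be read as an illustration of the shape of the argument and not as a forecast.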

Planned opinion:

This seems like a pretty clear reductio to me, though it is possible to argue that this argument doesn’t apply because compute isn’t the bottleneck, i.e. even with infinite compute we wouldn’t know how to make AGI. (That being said, I mostly do think we could build AGI if only we had enough compute; see also <@last week’s highlight on the scaling hypothesis@>(@The Scaling Hypothesis@).)

[AN #156]: The scaling hypothesis: a plan for building AGI

Wrote a separate comment here (in particular I think claims 1 and 4 are directly relevant to safety)
