Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter



The theory-practice gap

Planned summary for the Alignment Newsletter:

We can think of alignment as roughly being decomposed into two “gaps” that we are trying to reduce:

1. The gap between proposed theoretical alignment approaches (such as iterated amplification) and what we might do without such techniques (aka the <@unaligned benchmark@>(@An unaligned benchmark@))
2. The gap between actual implementations of alignment approaches, and what those approaches are theoretically capable of.

(This distinction is fuzzy. For example, the author puts “the technique can’t answer NP-hard questions” into the second gap while I would have had it in the first gap.)

We can think of some disagreements in AI alignment as different pictures about how these gaps look:

1. A stereotypical “ML-flavored alignment researcher” thinks that the first gap is very small, because in practice the model will generalize appropriately to new, more complex situations, and continue to do what we want. Such people would then be more focused on narrowing the second gap, by working on practical implementations.
2. A stereotypical “MIRI-flavored alignment researcher” thinks that the first gap is huge, such that it doesn’t really matter if you narrow the second gap, because even if you reduced that gap to zero you would still be doomed with near certainty.

Other notes:

I think some people think that amplified humans are actually just as capable as the unaligned benchmark. I think this is basically the factored cognition hypothesis. 

I don't see how this is enough. Even if this were true, vanilla iterated amplification would only give you an average-case / on-distribution guarantee. Your alignment technique also needs to come with a worst-case / off-distribution guarantee. (Another way of thinking about this is that it needs to deal with potential inner alignment failures.)

By the time we get to AGI, will we have alignment techniques that are even slightly competitive? I think it’s pretty plausible the answer is no.

I'm not totally sure what you mean here by "alignment techniques". Is this supposed to be "techniques that we can justify will be intent aligned in all situations", or perhaps "techniques that empirically turn out to be intent aligned in all situations"? If so, I agree that the answer is plausibly (even probably) no.

But what we'll actually deploy is some more capable technique that we don't justifiably know is intent aligned, and that (probably) wouldn't be intent aligned in some exotic circumstances. It still seems plausible that in practice we never hit those exotic circumstances (because they never arise, or because we've retrained the model before reaching them, etc), and the technique is intent aligned in all the circumstances the model actually encounters.

I think I'm like 30% on the proposition that before AGI, we're going to come up with some alignment scheme that just looks really good and clearly solves most of the problems with current schemes.

Fwiw, if you mostly mean something that resolves the gap between the unaligned benchmark and theoretical alignment approaches, without relying on empirical generalization / contingent empirical facts we learn from experiments, and you require it to solve abstract problems like this, I feel more pessimistic: maybe 20% (which comes from starting at ~10% and then updating on "well, if I had done this same reasoning in 2010, I think I would have been too pessimistic about the progress made since then").

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

Yeah, I agree with that.

(I don't think we have experience with deep Bayesian versions of IRL / preference comparison at CHAI, and I was thinking about advice on who to talk to)

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

This seems very related to Inverse Reward Design. There was at least one project at CHAI trying to scale up IRD, but it proved challenging to get working -- if you're thinking of similar approaches it might be worth pinging Dylan about it.

Grokking the Intentional Stance

Yeah, I agree with all of that.

[AN #164]: How well can language models write code?

Oh yeah, I definitely agree that this is not strong evidence for typical skeptic positions (and I'd guess the authors would agree).

[AN #164]: How well can language models write code?

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

(Idk if you were trying to argue something else with the comparison, but I don't think it's clear that this is a reasonable comparison; there are tons of objections you could bring up. For example, humans have to work from pixels whereas the language model gets tokens, making its job much easier.)

Second comment: Speculation about scaling trends:

I didn't check the numbers, but that seems pretty reasonable. I think there's a question of whether it actually saves time in the current format -- it might be faster to simply write the program than to write down a clear natural language description of what you want along with test cases.

The Blackwell order as a formalization of knowledge

Having studied the paradoxical results, I don't think they are paradoxical for particularly interesting reasons. (Which is not to say that the paper is bad! I don't expect I would have noticed these problems given just a definition of the Blackwell order! Just that I would recommend against taking this as progress towards "understanding knowledge", and more like "an elaboration of how not to use Blackwell orders".)

Proposition 2. There exist random variables S, X1, X2 and a function f from the support of S to a finite set S' such that the following holds:

1) S and X1 are independent given f(S).



This is supposed to be paradoxical because (3) implies that there is a task where it is better to look at X1, and (1) implies that any decision rule that uses X1 as computed through S can also be translated to an equivalently good decision rule that uses X1 as computed through f(S). So presumably it should still be better to look at X1 than X2. However, (2) implies that once you compute through f(S) you're better off using X2. Paradox!

This argument fails because (3) is a claim about utility functions (tasks) with type signature S × A → ℝ, while (2) is a claim about utility functions (tasks) with type signature S' × A → ℝ. It is possible to have a task that is expressible with S, where X1 is better (to make (3) true), but then you can't express it with S', and so (2) can be true but irrelevant. Their example follows precisely this format.
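One way to see concretely that "which signal is better" depends on the task: define the value of a channel as the best expected utility a Bayesian decision-maker can achieve from its observations. Here is a minimal Python sketch with hypothetical channels and tasks (none of these numbers come from the paper):

```python
# Value of an observation channel for a given prior and task.
# All numbers below are hypothetical, not taken from the paper.

def channel_value(prior, kappa, utility):
    """prior[s] = P(S = s); kappa[s][x] = P(X = x | S = s); utility[s][a]."""
    states = range(len(prior))
    obs = range(len(kappa[0]))
    actions = range(len(utility[0]))
    # For each observation x, act to maximize posterior expected utility.
    return sum(
        max(sum(prior[s] * kappa[s][x] * utility[s][a] for s in states)
            for a in actions)
        for x in obs)

prior = [0.5, 0.5]
kappa1 = [[0.9, 0.1], [0.4, 0.6]]  # better at detecting s = 0
kappa2 = [[0.6, 0.4], [0.1, 0.9]]  # better at detecting s = 1

task_a = [[2, 0], [0, 1]]  # correctly identifying s = 0 matters more
task_b = [[1, 0], [0, 2]]  # correctly identifying s = 1 matters more

print(round(channel_value(prior, kappa1, task_a), 3),
      round(channel_value(prior, kappa2, task_a), 3))  # 1.2 1.05 (kappa1 wins)
print(round(channel_value(prior, kappa1, task_b), 3),
      round(channel_value(prior, kappa2, task_b), 3))  # 1.05 1.2 (kappa2 wins)
```

Neither of these two channels is a garbling of the other, so they are Blackwell-incomparable: which one is "better" genuinely depends on the utility function, which is the kind of dependence on the task's type signature the argument above turns on.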

Proposition 3. There exist random variables S, X and there exists a function f from the support of S to a finite set S' such that the following holds: κ_{S→f(S)} · κ_{f(S)→X} ⊐ κ_{S→X}.

It seems to me that this is saying "there exists a utility function U and a function f, such that if we change the world to be Markovian according to f, then an agent can do better in the changed world than it could in the original world". I don't see why this is particularly interesting or relevant to a formalization of knowledge.

To elaborate, running the actual calculations from their counterexample, we get:


If you now compare κ_{S→f(S)} · κ_{f(S)→X} to κ_{S→X}, that is basically asking the question "let us consider these two ways in which the world could determine which observation I get, and see which of the ways I would prefer". But κ_{S→f(S)} · κ_{f(S)→X} is not the way that the world determines how you get the observation X -- that's exactly given by the κ_{S→X} table above (you can check consistency with the joint probability distribution in the paper). So I don't really know why you'd do this comparison.

The decomposition κ_{S→X} = κ_{S→f(S)} · κ_{f(S)→X} can make sense if X is independent of S given f(S), but in this case, the resulting matrix would be exactly equal to κ_{S→X} -- which is as it should be when you are comparing two different views on how the world generates the same observation.
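To sanity-check that conditional-independence claim: when X depends on S only through f(S), composing the two channel matrices recovers the direct channel exactly. A small sketch with made-up numbers (f sends states 0 and 1 to one cell, state 2 to another):

```python
# Check: if X is independent of S given f(S), then
# kappa_{S -> f(S)} * kappa_{f(S) -> X} equals kappa_{S -> X}.
# All numbers are hypothetical, not taken from the paper.

def matmul(A, B):
    """Multiply two row-stochastic matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

kappa_S_to_f = [[1, 0],   # s = 0 maps to the first cell of the coarse-graining
                [1, 0],   # s = 1 maps to the same cell
                [0, 1]]   # s = 2 maps to the second cell

kappa_f_to_X = [[0.7, 0.3],   # P(X | first cell)
                [0.2, 0.8]]   # P(X | second cell)

# Because X depends on S only through f(S), the direct channel is just
# each state's row copied from its cell's observation distribution:
kappa_S_to_X = [[0.7, 0.3], [0.7, 0.3], [0.2, 0.8]]

assert matmul(kappa_S_to_f, kappa_f_to_X) == kappa_S_to_X
```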

(Perhaps the interpretation under which this is paradoxical is one where κ_{A→B} means that B is computed from A, and matrix multiplication corresponds to composition of computations, but that just seems like a clearly invalid interpretation, since in their example their probability distribution is totally inconsistent with "f(S) is computed from S, and then X is computed from f(S)".)

More broadly, the framework in this paper provides a definition of "useful" information that allows all possible utility functions. However, it's always possible to construct a utility function (given their setup) for which any given information is useful, and so the focus on decision-making isn't going to help you distinguish between information that is and isn't relevant / useful, in the sense that we normally mean those words.
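To make this concrete: even a barely informative channel is strictly valuable for the tailored task "guess S", so decision-relevance over all utility functions can't separate relevant from irrelevant information. A sketch with hypothetical numbers:

```python
# For any channel that is at all informative about S, the task
# "guess S correctly" makes observing it strictly better than acting blind.
# Numbers are hypothetical illustrations.

def value_with_obs(prior, kappa, utility):
    # Pick the best action separately for each observation.
    return sum(
        max(sum(prior[s] * kappa[s][x] * utility[s][a] for s in range(len(prior)))
            for a in range(len(utility[0])))
        for x in range(len(kappa[0])))

def value_without_obs(prior, utility):
    # Pick a single action up front, with no observation.
    return max(sum(prior[s] * utility[s][a] for s in range(len(prior)))
               for a in range(len(utility[0])))

prior = [0.5, 0.5]
kappa = [[0.6, 0.4], [0.4, 0.6]]   # only a sliver of information about S
utility = [[1, 0], [0, 1]]         # the tailored task: guess S itself

print(value_with_obs(prior, kappa, utility))   # 0.6
print(value_without_obs(prior, utility))       # 0.5
```

The gap (0.6 vs 0.5) shows the channel is "useful" for this constructed task even if it would strike us as irrelevant to anything we actually care about.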

You could try to fix this by narrowing down the space of utility functions, or introducing limits on computation, or something else along those lines, though then I would rather apply these sorts of fixes to traditional Shannon information.

The alignment problem in different capability regimes

Planned summary for the Alignment Newsletter:

One reason that researchers might disagree on what approaches to take for alignment is that they might be solving different versions of the alignment problem. This post identifies two axes on which the “type” of alignment problem can differ. First, you may consider AI systems with differing levels of capability, ranging from subhuman to wildly superintelligent, with human-level somewhere in the middle. Second, you might be thinking about different mechanisms by which this leads to bad outcomes, where possible mechanisms include <@the second species problem@>(@AGI safety from first principles@) (where AIs seize control of the future from us), the “missed opportunity” problem (where we fail to use AIs as well as we could have, but the AIs aren’t themselves threatening us), and a grab bag of other possibilities (such as misuse of AI systems by bad actors).

Depending on where you land on these axes, you will get to rely on different assumptions that change what solutions you would be willing to consider:

1. **Competence.** If you assume that the AI system is human-level or superintelligent, you probably don’t have to worry about the AI system causing massive problems through incompetence (at least, not to a greater extent than humans do).

2. **Ability to understand itself.** With wildly superintelligent systems, it seems reasonable to expect them to be able to introspect and answer questions about their own cognition, which could be a useful ingredient in a solution that wouldn't work in other regimes.

3. **Inscrutable plans or concepts.** With sufficiently competent systems, you might be worried about the AI system making dangerous plans you can’t understand, or reasoning with concepts you will never comprehend. Your alignment solution must be robust to this.

Planned opinion:

When I talk about alignment, I am considering the second species problem, with AI systems whose capability level is roughly human-level or more (including “wildly superintelligent”).

I agree with this comment thread that the core _problem_ in what-I-call-alignment stays the same across capability levels, but the solutions can change across capability levels. (Also, other people mean different things by “alignment”, such that this would no longer be true.)

Measurement, Optimization, and Take-off Speed

I moved this post to the Alignment Forum -- let me know if you don't want that.

Distinguishing AI takeover scenarios

"I won't assume any of them" is distinct from "I will assume the negations of them".

I'm fairly confident the analysis is also meant to apply to situations in which things like (1)-(5) do hold.

(Certainly I personally am willing to apply the analysis to situations in which (1)-(5) hold.)
