Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter


Formal Inner Alignment, Prospectus

Planned summary for the Alignment Newsletter:

This post outlines a document that the author plans to write in the future, in which he will define the inner alignment problem formally, and suggest directions for future research. I will summarize that document when it comes out, but if you would like to influence that document, check out the post.

Agency in Conway’s Game of Life

Planned summary for the Alignment Newsletter:

Conway’s Game of Life (GoL) is a simple cellular automaton which is Turing-complete. As a result, it should be possible to build an “artificial intelligence” system in GoL. One way that we could phrase this is: if we imagine a GoL board with 10^30 rows and 10^30 columns, and we are able to set the initial state of the top left 10^20 by 10^20 square, can we set that initial state appropriately such that, after a suitable amount of time, the full board evolves to a desired state (perhaps a giant smiley face), for the vast majority of possible initializations of the remaining area?

This requires us to find some setting of the initial 10^20 by 10^20 square that has expandable, steerable influence. Intuitively, the best way to do this would be to build “sensors” and “effectors” to serve as inputs and outputs, and then have some program decide what the effectors should do based on the input from the sensors, where the “goal” of the program is to steer the world towards the desired state. Thus, this is a framing of the problem of AI (both capabilities and alignment) in GoL, rather than in our native physics.
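As background, the GoL update rule itself is simple to implement. Here is a minimal sketch (my own illustration, not from the post), representing the board as a set of live-cell coordinates so that a very large, mostly-empty board never needs to be stored explicitly:

```python
from itertools import product

def step(live):
    """One Game of Life update on a set of (row, col) live cells.

    A dead cell with exactly 3 live neighbors becomes alive;
    a live cell with 2 or 3 live neighbors survives.
    """
    # Count live neighbors for every cell adjacent to some live cell.
    counts = {}
    for (r, c) in live:
        for dr, dc in product((-1, 0, 1), repeat=2):
            if (dr, dc) != (0, 0):
                key = (r + dr, c + dc)
                counts[key] = counts.get(key, 0) + 1
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# A "blinker" oscillates with period 2:
blinker = {(0, 1), (1, 1), (2, 1)}
assert step(step(blinker)) == blinker
```

The hard part of the hypothetical is not this update rule, of course, but finding an initial pattern whose evolution under it robustly steers the rest of the board.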

Planned opinion:

With the tower of abstractions we humans have built, we now naturally think in terms of inputs and outputs for the agents we build. This hypothetical seems good for shaking us out of that mindset, as we don’t really know what the analogous inputs and outputs in GoL would be, and so we are forced to consider those aspects of the design process as well.

AXRP Episode 7 - Side Effects with Victoria Krakovna

Planned summary for the Alignment Newsletter:

This podcast goes over the problem of side effects, and impact regularization as an approach to handle this problem. The core hope is that impact regularization would enable “minimalistic” value alignment, in which the AI system may not be doing exactly what we want, but at the very least it will not take high impact actions that could cause an existential catastrophe.

An impact regularization method typically consists of a _deviation measure_ and a _baseline_. The baseline is what we compare the agent to, in order to determine whether it had an “impact”. The deviation measure is used to quantify how much impact there has been, when comparing the state generated by the agent to the one generated by the baseline.

Deviation measures are relatively uncontroversial – there are several possible measures, but they all seem to do relatively similar things, and there aren’t any obviously bad outcomes traceable to problems with the deviation measure. However, that is not the case with baselines. One typical baseline is the **inaction** baseline, where you compare against what would have happened if the agent had done nothing. Unfortunately, this leads to _offsetting_: as a simple example, if some food was going to be thrown away and the agent rescues it, it then has an incentive to throw it away again, since that would minimize impact relative to the case where it had done nothing. A solution is the **stepwise inaction** baseline, which compares to the case where the agent does nothing starting from the previous state (instead of from the beginning of time). However, this then prevents some beneficial offsetting: for example, if the agent opens the door to leave the house, then the agent is incentivized to leave the door open.
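The offsetting problem can be made concrete with a toy sketch (my own hypothetical simplification, not any particular published method): states are sets of features, the no-op action leaves the state unchanged, and the deviation measure just counts differing features.

```python
def deviation(state, baseline):
    # Deviation measure: number of features that differ (symmetric difference).
    return len(state ^ baseline)

def penalty_inaction(state, initial_state):
    # Inaction baseline: compare to doing nothing since the start of the
    # episode (here the no-op leaves the state unchanged).
    return deviation(state, initial_state)

def penalty_stepwise(state, prev_state):
    # Stepwise inaction baseline: compare to doing nothing from the
    # previous state.
    return deviation(state, prev_state)

initial = frozenset({"food_in_trash"})
after_rescue = frozenset({"food_saved"})
after_rethrow = frozenset({"food_in_trash"})

# Under the inaction baseline, re-throwing the rescued food away zeroes
# the penalty, so the agent is incentivized to offset its own rescue.
assert penalty_inaction(after_rescue, initial) == 2
assert penalty_inaction(after_rethrow, initial) == 0

# Under the stepwise baseline, re-throwing the food is itself an impact.
assert penalty_stepwise(after_rethrow, after_rescue) == 2
```

The door example runs the other way: under the stepwise baseline, closing a door the agent just opened registers as fresh impact, so beneficial offsetting is discouraged.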

As a result, the author is interested in seeing more work on baselines for impact regularization. In addition, she wants to see impact regularization tested in more realistic scenarios. That being said, she thinks that the useful aspect of impact regularization research so far is in bringing conceptual clarity to what we are trying to do with AI safety, and in identifying the interference and offsetting behaviors, and the incentives for them.

[AN #149]: The newsletter's editorial policy

I really like the long summaries, and would be sad to see them go

Fwiw I still expect to do them; this is an "on the margin" thing. Like, I still would do a long summary for bio anchors, but maybe I do something shorter for infra-Bayesianism.

Frame this as a 'request for summaries', link to the papers you won't get round to, but offer to publish any sufficiently good summaries of those papers that someone sends you in a future newsletter.

Hmm, intriguing. That might be worth trying.

[AN #149]: The newsletter's editorial policy

Other results from the survey:

There were 66 responses, though at least one was a duplicate. (I didn't deduplicate in any of the analyses below; I doubt it will make a big difference.) Looking at names (when provided), it looks like people in the field were quite a bit more likely to respond than the typical reader. Estimating 5 min on average per response (since many provided qualitative feedback as well), that's 5.5 hours of person-time answering the survey.

My main takeaways (described in more detail below):

  • The newsletter is still useful to people.
  • Long summaries are not as useful as I thought.
  • On the current margin I should move my focus away from the Alignment Forum, since the most involved readers seem to read most of the Alignment Forum already.
  • It would be nice to do more "high-level opinions": if you imagine a tree where the root node is "did we build safe / beneficial AI" and lower nodes delve into subproblems, it would be useful to have opinions discuss how the current paper / article relates to the top-level node. I don't think I'll make a change of this form right now, but I might in the future.

I think these takeaways are probably worth 5-10 hours of time? It's close though.


Average rating of various components of the newsletter (arranged in ascending order of popularity):

3.88 Long summaries (full newsletter dedicated to one topic)
3.91 Source of interesting things to read
3.95 Opinions
4.02 Highlights
4.27 Regular summaries
4.47 Newsletter overall

(The question was a five-point scale: "Useless, Meh, Fairly useful, Keep doing this, This is amazing!", which I then converted to 1-5 and averaged across respondents.)
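The conversion described above can be sketched as follows (the sample responses are hypothetical):

```python
# Map the five-point scale labels to numeric scores, then average.
SCALE = {
    "Useless": 1,
    "Meh": 2,
    "Fairly useful": 3,
    "Keep doing this": 4,
    "This is amazing!": 5,
}

def average_rating(responses):
    """Average a list of scale labels as numeric 1-5 scores."""
    scores = [SCALE[r] for r in responses]
    return sum(scores) / len(scores)

# Hypothetical responses for one component of the newsletter:
sample = ["Keep doing this", "This is amazing!", "Fairly useful", "Keep doing this"]
print(average_rating(sample))  # 4.0
```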

The newsletter overall is far more popular than any of the individual components. In hindsight this makes sense -- different people will find different components valuable, and people will probably subscribe even if just one or two components are valuable to them. So everyone rates the newsletter highly, but only a subset rates any given component highly.

I was surprised to see that the long summaries were the least popular, since people have previously said, without any prompting from me, that they especially liked them. I will probably be less likely to do long summaries in the future.


In the "value of the newsletter" qualitative section, the most common thing by far was people saying that it helped them stay abreast of the field -- especially for articles that are not on the Alignment Forum.


One or two people suggested adding links to interesting papers that I wouldn't have time to summarize. I actually did this when the newsletter first started, but it seemed like no one was clicking on those links, so I stopped. I'm pretty sure that would still be the case now, so I'm not planning to restart that practice.

Pitfalls of the agent model

Planned summary for the Alignment Newsletter:

It is common to view AI systems through the “agent lens”, in which the AI system implements a fixed, unchanging policy that given some observations takes some actions. This post points out several ways in which this “fixed, unchanging policy” assumption can lead us astray.

For example, AI designers may assume that the AI systems they build must have unchanging decision algorithms, and therefore believe that there will be a specific point at which influence is “handed off” to the AI system, before which we have to solve a wide array of philosophical and technical problems.

[AN #139]: How the simplicity of reality explains the success of neural nets

Hmm, I think you're right. I'm not sure what I was thinking when I wrote that. (Though I give it like 50% that if past-me could explain his reasons, I'd agree with him.)

Possibly I was thinking of epochal double descent, but that shouldn't matter because we're comparing the final outcome of SGD to random sampling, so epochal double descent doesn't come into the picture.

Low-stakes alignment

Yeah, all of that seems right to me (and I feel like I have a better understanding of why assumptions on inputs are better than assumptions on outputs, which was more like a vague intuition before). I've changed the opinion to:

I like the low-stakes assumption as a way of saying "let's ignore distributional shift for now". Probably the most salient alternative is something along the lines of "assume that the AI system is trying to optimize the true reward function". The main way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem). This seems to be a lot nicer because it is harder to "unfairly" exploit a not-too-strong assumption on an input than one on an output. See the comment thread for more discussion.

Mundane solutions to exotic problems

Planned summary for the Alignment Newsletter:

The author’s goal is to find “mundane” or simple algorithms that solve even “exotic” problems in AI alignment. Why should we expect this is possible? If an AI system is using powerful, exotic capabilities to evade detection, shouldn’t we need powerful, exotic algorithms to fight that? The key idea here is that we can instead have a mundane algorithm that leverages the exotic capabilities of the AI system to produce an exotic oversight process. For example, we could imagine that a mundane algorithm could be used to create a question-answerer that knows everything the model knows. We could then address <@gradient hacking@>(@Gradient hacking@) by asking the question “what should the loss be?” In this case, our model has an exotic capability: very strong introspective access to its own reasoning and the training process that modifies it. (This is what is needed to successfully hack gradients). As a result, our question answerer should be able to leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients, even if our normal hardcoded loss would not do so.

Low-stakes alignment

I guess the natural definition is ...

I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward function is "bigger" than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.

It feels like the low-stakes setting is also mostly assuming away embeddedness problems? I suppose it still includes e.g. cases where the AI system subtly changes the designer's preferences over the course of training, but it excludes e.g. direct modification of the reward, taking over the training process, etc.

I agree that "actually trying" is still hard to define, though you could avoid that messiness by saying that the goal is to provide a reward such that any optimal policy for that reward would be beneficial / aligned (and then the assumption is that a policy that is "actually trying" to pursue the objective would not do as well as the optimal policy but would not be catastrophically bad).

Just to reiterate, I agree that the low-stakes formulation is better; I just think that my reasons for believing that are different from "it's a clean subproblem". My reason for liking it is that it doesn't require you to specify a perfect reward function upfront, only a reward function that is "good enough", i.e. it incentivizes the right behavior on the examples on which the agent is actually trained. (There might be other reasons too that I'm failing to think of now.)
