Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.


Alignment Newsletter Three Year Retrospective (Rohin Shah) (summarized by Rohin): It’s (two days until) the third birthday of this newsletter! In this post, I reflect on the two years since the previous retrospective (AN #53). There aren’t any major takeaways, so I won’t summarize all of it here. Please do take this 2 minute survey though. I’ll also copy over the “Advice to readers” section from the post:

Don’t treat [newsletter entries] as an evaluation of people’s work. As I mentioned above, I’m selecting articles based in part on how well they fit into my understanding of AI alignment. This is a poor method for evaluating other people’s work. Even if you defer to me completely and ignore everyone else’s views, it still would not be a good method, because often I am mistaken about how important the work is even on my own understanding of AI alignment. Almost always, my opinion about a paper I feel meh about will go up after talking to the authors about the work.

I also select articles based on how useful I think it would be for other AI alignment researchers to learn about the ideas presented. (This is especially true for the choice of what to highlight.) This can be very different from how useful the ideas are to the world (which is what I’d want out of an evaluation): incremental progress on some known subproblem like learning from human feedback could be very important, but still not worth telling other AI alignment researchers about.

Consider reading just the highlights section. If you’re very busy, or you find yourself just not reading the newsletter each week because it’s too long, I recommend just reading the highlights section. I select pretty strongly for “does this seem good for researchers to know?” when choosing the highlight(s).

If you’re busy, consider using the spreadsheet database as your primary mode of interaction. Specifically, rather than reading the newsletter each week, you could instead keep the database open, and whenever you see a vaguely interesting new paper, you can check (via Ctrl+F) whether it has already been summarized, and if so you can read that summary. (Even I use the database in this way, though I usually know whether or not I’ve already summarized the paper before, rather than having to check.)

Also, there may be a nicer UI to interact with this database in the near future :)



Formal Methods for the Informal Engineer: Workshop Recommendations (Gopal Sarma et al) (summarized by Rohin): This is the writeup from the Formal Methods for the Informal Engineer (AN #130) workshop. The main thrust is a call for increased application of formal methods in order to increase confidence in critical AI/ML systems, especially in the life sciences. They provide five high-level recommendations for this purpose.


Semi-informative priors over AI timelines (Tom Davidson) (summarized by Rohin): This report aims to analyze outside view evidence for AI timelines. In this setting, “outside view” roughly means that we take into account when AI research started, and how its inputs (data, compute, researcher time) have changed over time, but nothing else. The report considers four potential reference classes from which an outside view can be formed.

For each reference class, we’re going to use it to estimate how hard we would have thought AGI would be before we had tried to build AGI at all, and then we’re going to update that probability based on the observation that we’ve tried for some amount of calendar time / researcher time / compute, and haven’t yet gotten AGI. The report uses a simple generalization of Laplace’s Rule to actually synthesize it all together; I’m not going to go into that here.

I found the reference classes most interesting and will summarize them here. Note that the author says that the main contribution is in the framework, and that the individual reference classes are much less well done (there are several suggestions on other reference classes to investigate in the future). With that caveat, in order of the weight assigned to each, the four references classes are:

1. STEM goal: AGI is a highly ambitious but feasible technology that a serious STEM field is explicitly trying to develop. Looking at other such examples, the author suggests putting between 5% and 50% on developing AGI in 50 years.

2. Transformative technology: AGI is a technological development that would have a transformative effect on the nature of work and society. While these have been incredibly rare, we might expect that their probability increases with more technological development, making it more likely to occur now. Based on this, the author favors an upper bound of 1% per year on AGI.

3. Futurism goal: AGI is a high-impact technology that a serious STEM field is trying to build in 2020. There are a lot of such technologies, but we probably shouldn’t expect too many high-impact technologies to work out. The author suggests this should put it at below 1% per year.

4. Math conjecture: AGI is kinda sorta like a notable math conjecture. AI Impacts investigated (AN #97) the rate at which notable math conjectures are resolved, and their results imply 1/170 chance per year of a conjecture being resolved.

Aggregating these all together, the author favors assigning 0.1% - 1% per year at the beginning of AI research in 1956, with a point estimate of 0.3%. After updating on the fact that we don’t yet have AGI, the framework gives 1.5% - 9% for AGI by 2036 and 7% - 33% for AGI by 2100.

We can also run the same analysis where you get a new “chance” to develop AGI every time you increase the researcher pool by a constant fraction. (This is almost like having a log uniform prior on how many researcher hours are needed to get AGI.) Since there have been a few large booms in AI, this gives somewhat higher probabilities than the previous method, getting to 2% - 15% for AGI by 2036. Doing the same thing for compute gets 2% - 22% for AGI by 2036.

A weighted aggregation of all of the methods together (with weights set by intuition) gives 1% - 18% for AGI by 2036, and 5% - 35% for AGI by 2100.

Rohin's opinion: This seems like a good quantification of what the outside view suggests for AI timelines. Unfortunately, I have never really spent much time figuring out how best to combine outside view and inside view evidence, because research generally requires you to think about a detailed, gearsy, inside-view model, and so outside views feel pretty irrelevant to me. (They’re obviously relevant to Open Phil, who have to make funding decisions based on AI timelines, and so really do benefit from having the better estimates of timelines.) So I will probably continue to act based on the bio anchors framework (AN #121).

This is also why I haven’t highlighted this particular piece, despite the content being excellent. I generally highlight things that would be valuable for technical alignment researchers to read; my guess is that timelines are actually not that important for researchers to have good beliefs about (though inside-view models that predict timelines are important).

Some feedback on the report takes issue with the use of Laplace’s Rule because it models each “attempt” to make AGI as independent, which is obviously false. I’m not too worried about this; while the model might be obviously wrong, I doubt that a more sophisticated model would give very different results; most of the “oomph” is coming from the reference classes.


My research methodology (Paul Christiano) (summarized by Rohin): This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:

1. Come up with some alignment algorithm that solves the issues identified so far

2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.

This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same class as that scenario will happen, or we need to go back to step 1 and come up with a new algorithm.

This methodology could play out as follows:

Step 1: RL with a handcoded reward function.

Step 2: This is vulnerable to specification gaming (AN #1).

Step 1: RL from human preferences over behavior, or other forms of human feedback.

Step 2: The system might still pursue actions that are bad that humans can't recognize as bad. For example, it might write a well researched report on whether fetuses are moral patients, which intuitively seems good (assuming the research is good). However, this would be quite bad if the AI wrote the report because it calculated that it would increase partisanship leading to civil war.

Step 1: Use iterated amplification to construct a feedback signal that is "smarter" than the AI system it is training.

Step 2: The system might pick up on inaccessible information (AN #104) that the amplified overseer cannot find. For example, it might be able to learn a language just by staring at a large pile of data in that language, and then seek power whenever working in that language, and the amplified overseer may not be able to detect this.

Step 1: Use imitative generalization (AN #133) so that the human overseer can leverage facts that can be learned by induction / pattern matching, which neural nets are great at.

Step 2: Since imitative generalization ends up learning a description of facts for some dataset, it may learn low-level facts useful for prediction on the dataset, while not including the high-level facts that tell us how the low-level facts connect to things we care about.

The post also talks about various possible objections you might have, which I’m not going to summarize here.

Rohin's opinion: I'm really like having a candidate algorithm in mind when reasoning about alignment. It is a lot more concrete, which makes it easier to make progress and not get lost, relative to generic reasoning from just the assumption that the AI system is superintelligent.

I'm less clear on how exactly you move between the two steps -- from my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithms learn, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2. Our algorithms are instead meant to chip away at the problem, by continually increasing our control over these patterns of thought. It seems like the author has a better-defined sense of what does and doesn't count as a valid step 2, and that makes this methodology more fruitful for him than it would be for me. More discussion here.



Evaluating Agents without Rewards (Brendon Matusch et al) (summarized by Rohin): How can we evaluate algorithms for exploration? This paper suggests that we look at a variety of proxy objectives, such as reward obtained, similarity to human behavior, empowerment, and entropy of the visited state distribution.

The authors evaluate two algorithms (ICM and RND (AN #31)) as well as three baselines (noop agent, random agent, and PPO) on three Atari games and the Minecraft TreeChop task (AN #56), producing a list of proxy objective values for each combination. Their analysis then concludes that intrinsic objectives correlate with human behavior more strongly than task rewards do.

Rohin's opinion: I’m a big fan of thinking harder about metrics and evaluation (AN #135), and I do generally like the approach of “just look at a bunch of proxy statistics to help understand what’s happening”. However, I’m not sure how much I believe in the ones used in this paper -- with the exception of task reward, they are computed by downsampling the pixel inputs really far (8x8 with each pixel taking on 4 possible values), in order to create a nice discrete distribution that they can compute objectives over. While you can downsample quite a lot without losing much important information, this seems too far to me.

I’m also a big fan of their choice of Minecraft as one of their environments. The issue with Atari is that the environments are too “linear” -- either you do the thing that causes you to score points or win the game, or you die; unsurprisingly many objectives lead to you scoring points. (See the large-scale study on curiosity (AN #20).) However, on Minecraft there doesn’t seem to be much difference between the agents -- you’d be hard-pressed to tell the difference between the random agent and the trained agents based only on the values of the proxy objectives. To be fair, this may not be a problem with the objectives: it could be that the agents haven’t been trained for long enough (they were trained for 12 million steps, because the Minecraft simulator is quite slow).


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

New Comment