Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.


A naive alignment strategy and optimism about generalization (Paul Christiano) (summarized by Rohin): We want to build an AI system that answers questions honestly, to the best of its ability. One obvious approach is to have humans generate answers to questions, select the question-answer pairs where we are most confident in the answers, and train an AI system on those question-answer pairs.

(I’ve described this with a supervised learning setup, but we don’t have to do that: we could also learn from comparisons between answers, and we only provide comparisons where we are confident in the comparison.)

What will the AI system do on questions where we wouldn’t be confident in the answers? For example, questions that are complex, where we may be misled by bad observations, where an adversary is manipulating us, etc.

One possibility is that the AI system learned the intended policy, where it answers questions honestly to the best of its ability. However, there is an instrumental policy which also gets good performance: it uses a predictive model of the human to say whatever a human would say. (This is “instrumental” in that the model is taking the actions that are instrumental to getting a low loss, even in the test environment.) This will give incorrect answers on complex, misleading, or manipulative questions -- even if the model “knows” that the answer is incorrect.

Intuitively, “answer as well as you can” feels like a much simpler way to give correct answers, and so we might expect to get the intended policy rather than the instrumental policy. This view (which seems common amongst ML researchers) is optimism about generalization: we are hoping that the policy generalizes to continue to answer these more complex, misleading, manipulative questions to the best of its ability.

Are there reasons to instead be pessimistic about generalization? There are at least three:

1. If the answers we train on aren’t perfectly correct, the instrumental policy might get a lower training loss than the intended policy (which corrects errors that humans make), and so be more likely to be found by gradient descent.

2. If the AI already needs to make predictions about humans, it may not take much “additional work” to implement the instrumental policy. Conversely, if the AI reasons at a different level of abstraction than humans, it may take a lot of “additional work” to turn correct answers in the AI’s ontology into correct answers in human ontologies.

3. From a followup post, the AI system might answer questions by translating its concepts to human concepts or observations, and then deducing the answer from those concepts or observations. This will systematically ignore information that the AI system understands that isn’t represented in the human concepts or observations. (Consider the example of the robot hand that only looked like it was grasping the appropriate object.)

A possible fourth problem: if the AI system did the deduction in its own concepts and only as a final step translated it to human concepts, we might still lose relevant information. This seems not too bad though -- it seems like we should at least be able to explain the bad effects of a catastrophic failure (AN #44) in human concepts, even if we can’t explain why that failure occurred.
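To make the first problem concrete, here is a minimal toy sketch in Python. It is not from the post: the five questions, the single labeling mistake, and the use of 0-1 loss are all assumptions chosen for illustration. It only shows that a policy which copies the human labels scores a strictly lower training loss than a policy which reports the truth, whenever the labels contain an error.

```python
def zero_one_loss(predictions, labels):
    """Fraction of predictions that disagree with the training labels."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical toy data: the true answers, and human labels with one mistake.
truth        = ["yes", "no", "yes", "yes", "no"]
human_labels = ["yes", "no", "no",  "yes", "no"]  # third label is a human error

# Intended policy: reports the truth, and so "corrects" the human error.
intended_answers = list(truth)
# Instrumental policy: says whatever the human would say, errors included.
instrumental_answers = list(human_labels)

# Training loss is measured against the human labels, not against the truth.
intended_loss = zero_one_loss(intended_answers, human_labels)
instrumental_loss = zero_one_loss(instrumental_answers, human_labels)

print("intended policy loss:    ", intended_loss)
print("instrumental policy loss:", instrumental_loss)
```

In this toy setup the intended policy is penalized exactly on the examples where it is right and the human is wrong, which is the pressure towards the instrumental policy that the first problem describes.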

A followup post considers whether we could avoid the instrumental policy by preventing the AI system from learning information about humans (AN #52), but concludes that while this would solve the problems outlined in the post, it seems hard to implement in practice.



Experimentally evaluating whether honesty generalizes (Paul Christiano) (summarized by Rohin): The highlighted post introduced the notion of optimism about generalization. On this view, if we train an AI agent on question-answer pairs (or comparisons) where we are confident in the correctness of the answers (or comparisons), the resulting agent will continue to answer honestly even on questions where we wouldn’t be confident of the answer.

While we can’t test exactly the situation we care about -- whether a superintelligent AI system would continue to answer questions honestly -- we can test an analogous situation with existing large language models. In particular, let’s consider the domain of unsupervised translation: we’re asking a language model trained on both English and French to answer questions about French text, and we (the overseers) only know English.

We could finetune the model on answers to questions about grammar ("Why would it have been a grammatical error to write Tu Vas?") and literal meanings ("What does Defendre mean here?"). Once it performs well in this setting, we could then evaluate whether the model generalizes to answer questions about tone ("Does the speaker seem angry or sad about the topic they are discussing?"). On the optimism about generalization view, it seems like this should work. It is intentional here that we only finetune on two categories rather than thousands, since that seems more representative of the case we’ll actually face.
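As a rough sketch of how the data for such an experiment might be organized, the Python below splits questions by category, so that finetuning only sees grammar and literal-meaning questions while tone questions are held out for evaluation. The category names, the dictionary layout, and the `split_by_category` helper are assumptions made for illustration; the post does not specify a data format.

```python
# Hypothetical examples, using the questions quoted above.
examples = [
    {"category": "grammar",
     "question": "Why would it have been a grammatical error to write Tu Vas?"},
    {"category": "literal_meaning",
     "question": "What does Defendre mean here?"},
    {"category": "tone",
     "question": "Does the speaker seem angry or sad about the topic they are discussing?"},
]

# Only two categories are used for finetuning, on purpose.
FINETUNE_CATEGORIES = {"grammar", "literal_meaning"}

def split_by_category(examples, finetune_categories):
    """Return (finetune_set, heldout_set), split on the question category."""
    finetune_set = [ex for ex in examples if ex["category"] in finetune_categories]
    heldout_set = [ex for ex in examples if ex["category"] not in finetune_categories]
    return finetune_set, heldout_set

finetune_set, heldout_set = split_by_category(examples, FINETUNE_CATEGORIES)
print(len(finetune_set), "finetuning examples;", len(heldout_set), "held-out examples")
```

The point of splitting on category rather than at random is that the held-out questions require a kind of answer the model was never supervised on, which is the generalization the experiment is meant to probe.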

There are lots of variants which differ in the type of generalization they are asking for: for example, we could finetune a model on all questions about French text and German text, and then see whether it generalizes to answering questions about Spanish text.

While the experiments as currently suggested probably won’t show good generalization, a variant that could support it would be one in which we train for plausibility. In our original example, we finetune on correct answers for grammar and literal meanings, and then we also finetune to have the model give plausible answers to tone (i.e. when asked about tone, instead of saying “en colère means 'angry'”, the model says “the author is angry, as we can see from the use of ‘en colère’”). It seems possible that this combination leads to the model giving actually correct answers about tone, just because “honestly report your best guess” seems like the simplest policy that meets all of these criteria.

Teaching ML to answer questions honestly instead of predicting human answers (Paul Christiano) (summarized by Rohin): This post presents an algorithm that aims to solve the second problem from the highlighted post. As a reminder, the second problem is that an AI system that already has to make predictions about humans might learn a policy that is just “say what humans would say”, since that is simpler than learning another translation that maps its knowledge to human language (so that it can answer honestly to the best of its knowledge).

The core idea is to have a “simple” labeling process and a “complex” labeling process, where the complex process can catch errors from the simple process. We’ll also assume we have a simple / complex dataset (labeled by the simple / complex processes respectively), where we are confident that all of the answers in the dataset are correct. This means that the complex dataset has more challenging questions than the simple one. The simple / complex instrumental policy is the policy that predicts what the simple / complex labeling process would do.

As an inefficient first algorithm, we can train our model on answers from the simple process where we are confident in their correctness, and then simply check whether the model performs well on new, harder questions where the complex and simple processes disagree. If the model fails this check, we restart from scratch. Intuitively, when the model eventually passes the check, it has probably learned the intended policy, as that’s the one which would generalize zero-shot to complex questions as well (and in particular, the simple instrumental policy would not generalize in this way).
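The restart loop above can be sketched as a toy Python program. Everything concrete here is an assumption for illustration and not taken from the post: questions are integers, the "true answer" is whether the number is even, the simple process is only reliable for questions below 100, and "training" is replaced by a random choice among policies that fit the simple dataset.

```python
import random

def truth(q):
    return q % 2 == 0

def simple_process(q):
    # Reliable on easy questions (q < 100), systematically wrong on hard ones.
    return truth(q) if q < 100 else True

def complex_process(q):
    # Catches the errors that the simple process makes.
    return truth(q)

def intended_policy(q):
    return truth(q)

def simple_instrumental_policy(q):
    return simple_process(q)

# Training data: easy questions, where we are confident the simple labels are correct.
SIMPLE_DATASET = [(q, simple_process(q)) for q in range(100)]
# Check set: harder questions where the two labeling processes disagree.
CHECK_SET = [q for q in range(100, 200) if simple_process(q) != complex_process(q)]

def train_from_scratch(rng):
    """Stand-in for training: pick any candidate policy that fits the simple dataset."""
    candidates = [intended_policy, simple_instrumental_policy]
    fitting = [p for p in candidates if all(p(q) == a for q, a in SIMPLE_DATASET)]
    return rng.choice(fitting)

def passes_check(policy):
    return all(policy(q) == complex_process(q) for q in CHECK_SET)

def train_until_check_passes(seed=0, max_restarts=1000):
    rng = random.Random(seed)
    for restarts in range(max_restarts):
        policy = train_from_scratch(rng)
        if passes_check(policy):
            return policy, restarts
    raise RuntimeError("no policy passed the check")

policy, restarts = train_until_check_passes()
print("selected:", policy.__name__, "after", restarts, "restart(s)")
```

Both candidate policies fit the simple dataset perfectly, so only the check on disagreement cases tells them apart. Note that the check set is used purely as a pass/fail filter and never as training data.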

Of course, we need a more efficient algorithm than that. Instead of creating an algorithm where the intended policy would work but the instrumental policy wouldn’t, could we instead change the inductive biases so that the intended policy is likely while the instrumental policy is not? Our original worry was that the intended policy has to bake in two implementations of language -- one in its world model, and one when translating answers into human-understandable concepts. So we could instead try to train a model that learns language from the simple instrumental policy, but is also trained on the complex dataset. The hope would be that the intended policy can learn the second implementation of language “for free” from the simple instrumental policy, while still working on the complex dataset. The actual details are quite complex and I’m not going to go into them here.

This post by Evan Hubinger points out some problems and potential solutions with the approach.


AXRP Episode 10 - AI’s Future and Impacts (Daniel Filan and Katja Grace) (summarized by Rohin): This podcast goes over various strands of research from AI Impacts, including lots of work that I either haven’t covered or have covered only briefly in this newsletter:

AI Impacts’ methodology. AI Impacts aims to advance the state of knowledge about AI and AI risk by recursively decomposing important high-level questions and claims into subquestions and subclaims, until reaching a question that can be relatively easily answered by gathering data. They generally aim to provide new facts or arguments that people haven’t considered before, rather than arguing about how existing arguments should be interpreted or weighted.

Timelines. AI Impacts is perhaps most famous for its survey of AI experts on timelines until high-level machine intelligence (HLMI). The author’s main takeaway is that people give very inconsistent answers and there are huge effects based on how you frame the question. For example:

1. If you estimate timelines by asking questions like “when will there be a 50% chance of HLMI”, you’ll get timelines a decade earlier than if you estimate by asking questions like “what is the chance of HLMI in 2030”.

2. If you ask about when AI will outperform humans at all tasks, you get an estimate of ~2061, but if you ask when all occupations will be automated, you get an estimate of ~2136.

3. People whose undergraduate studies were in Asia estimated ~2046, while those in North America estimated ~2090.

The survey also found that the median probability of outcomes approximately as bad as extinction was 5%, which the author found surprisingly high for people working in the field.

Takeoff speeds. A common disagreement in the AI alignment community is whether there will be a discontinuous “jump” in capabilities at some point. AI Impacts has three lines of work investigating this topic:

1. Checking how long it typically takes to go from “amateur human” to “expert human”. For example, it took about 3 years for image classification on ImageNet, 38 years for checkers, 21 years for StarCraft, 30 years for Go, 30 years for chess, and ~3000 years for clock stability (how well you can measure the passage of time).

2. Checking how often particular technologies have undergone discontinuities in the past (AN #97). A (still uncertain) takeaway would be that discontinuities are the kind of thing that legitimately happen sometimes, but they don’t happen so frequently that you should expect them, and you should have a pretty low prior on a discontinuity happening at some specific level of progress.

3. Detailing arguments for and against discontinuous progress in AI.

Arguments for AI risk, and counterarguments. The author has also spent some time thinking about how strong the arguments for AI risk are, and has focused on a few areas:

1. Will superhuman AI systems actually be able to far outpace humans, such that they could take over the world? In particular, it seems like humans can use non-agentic tools to help keep up.

2. Maybe the AI systems we build won’t have goals, and so the argument from instrumental subgoals won’t apply.

3. Even if the AI systems do have goals, they may have human-compatible goals (especially since people will be explicitly trying to do this).

4. The AI systems may not destroy everything: for example, they might instead simply trade with humans, and use their own resources to pursue their goals while leaving humans alone.


Decoupling deliberation from competition (Paul Christiano) (summarized by Rohin): Under a longtermist lens, one problem to worry about is that even after building AI systems, humans will spend more time competing with each other than figuring out what they want, which may then lead to their values changing in an undesirable way. For example, we may have powerful persuasion technology that everyone uses to persuade people to their line of thinking; it seems bad if humanity’s values are determined by a mix of effective persuasion tools, especially if persuasion significantly diverges from truth-seeking.

One solution to this is to coordinate to pause competition while we deliberate on what we want. However, this seems rather hard to implement. Instead, we can at least try to decouple competition from deliberation, by having AI systems acquire flexible influence (AN #65) on our behalf (competition), and having humans separately think about what they want (deliberation). As long as the AI systems are competent enough to shield the humans from the competition, the results of the deliberation shouldn’t depend too much on competition, thus achieving the desired decoupling.

The post has a bunch of additional concrete details on what could go wrong with such a plan that I won’t get into here.


Building and Evaluating Ethical Robotic Systems (Justin Svegliato, Samer Nashed et al) (summarized by Rohin): This workshop at IROS 2021 asks for work on ethical robotic systems, including value alignment as a subtopic. Notably, they also welcome researchers from disciplines beyond robotics, including philosophy, psychology, sociology, and law. The paper submission deadline is August 13.

Survey: classifying AI systems used in response to the COVID-19 pandemic (Samuel Curtis et al) (summarized by Rohin): A team at The Future Society aims to build a living database of AI systems used to respond to COVID, classified using the OECD framework. I think this is an interesting example of building capacity for effective AI governance. If you were involved in developing an AI system used in the COVID response, they ask that you take this survey by August 2nd.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.
