Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that, while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

AI Safety Papers (Ozzie Gooen) (summarized by Rohin): AI Safety Papers (announced here) is an app to interactively explore a previously collected database of AI safety work (AN #130). I believe it contains every article in this newsletter (at least up to a certain date; it doesn’t automatically update) along with their summaries, so you may prefer to use that to search past issues of the newsletter instead of the spreadsheet I maintain.

On the Opportunities and Risks of Foundation Models (Rishi Bommasani et al) (summarized by Rohin): The history of AI is one of increasing emergence and homogenization. With the introduction of machine learning, we moved from a large proliferation of specialized algorithms that specified how to compute answers to a small number of general algorithms that learned how to compute answers (i.e. the algorithm for computing answers emerged from the learning algorithm). With the introduction of deep learning, we moved from a large proliferation of hand-engineered features for learning algorithms to a small number of architectures that could be pointed at a new domain and discover good features for that domain. Recently, the trend has continued: we have moved from a large proliferation of trained models for different tasks to a few large “foundation models” which learn general algorithms useful for solving specific tasks. BERT and GPT-3 are central examples of foundation models in language; many NLP tasks that previously required different models are now solved using finetuned or prompted versions of BERT and/or GPT-3.

Note that, while language is the main example of a domain with foundation models today, we should expect foundation models to be developed in an increasing number of domains over time. The authors call these “foundation” models to emphasize that (1) they form a fundamental building block for applications and (2) they are not themselves ready for deployment; they are simply a foundation on which applications can be built. Foundation models have been enabled only recently because they depend on having large scale in order to make use of large unlabeled datasets using self-supervised learning to enable effective transfer to new tasks. It is particularly challenging to understand and predict the capabilities exhibited by foundation models because their multitask nature emerges from the large-scale training rather than being designed in from the start, making the capabilities hard to anticipate. This is particularly unsettling because foundation models also lead to significantly increased homogenization, where everyone is using the same few models, and so any new emergent capability (or risk) is quickly distributed to everyone.

The authors argue that academia is uniquely suited to study and understand the risks of foundation models. Foundation models are going to interact with society, both in terms of the data used to create them and the effects on people who use applications built upon them. Thus, analysis of them will need to be interdisciplinary; this is best achieved in academia due to the concentration of people working in the various relevant areas. In addition, market-driven incentives need not align well with societal benefit, whereas the research mission of universities is the production and dissemination of knowledge and creation of global public goods, allowing academia to study directions that would have large societal benefit that might not be prioritized by industry.

All of this is just a summary of parts of the introduction to the report. The full report is over 150 pages and goes into detail on capabilities, applications, technologies (including technical risks), and societal implications. I’m not going to summarize it here, because it is long and a lot of it isn’t that relevant to alignment; I’ll instead note down particular points that I found interesting.

- (pg. 26) Some studies have suggested that foundation models in language don’t learn linguistic constructions robustly; even if they use it well once, they may not do so again, especially under distribution shift. In contrast, humans can easily “slot in” new knowledge into existing linguistic constructions.

- (pg. 34) This isn’t surprising but is worth repeating: many of the capabilities highlighted in the robotics section are very similar to the ones that we focus on in alignment (task specification, robustness, safety, sample efficiency).

- (pg. 42) For tasks involving reasoning (e.g. mathematical proofs, program synthesis, drug discovery, computer-aided design), neural nets can be used to guide a search through a large space of possibilities. Foundation models could be helpful because (1) since they are very good at generating sequences, you can encode arbitrary actions (e.g. in theorem proving, they can use arbitrary instructions in the proof assistant language rather than being restricted to an existing database of theorems), (2) the heuristics for effective search learned in one domain could transfer well to other domains where data is scarce, and (3) they could accept multimodal input: for example, in theorem proving for geometry, a multimodal foundation model could also incorporate information from geometric diagrams.

- (Section 3) A significant portion of the report is spent discussing potential applications of foundation models. This is the most in-depth version of this I have seen; anyone aiming to forecast the impacts of AI on the real world in the next 5-10 years should likely read this section. It’s notable to me how nearly all of the applications have an emphasis on robustness and reliability, particularly in truth-telling and logical reasoning.

- (Section 4.3) We’ve seen a few (AN #152) ways (AN #155) in which foundation models can be adapted. This section provides a good overview of the various methods that have been proposed in the literature. Note that adaptation is useful not just for specializing to a particular task like summarization, but also for enforcing constraints, handling distributional shifts, and more.

- (pg. 92) Foundation models are commonly evaluated by their performance on downstream tasks. One limitation of this evaluation paradigm is that it makes it hard to distinguish between the benefits provided by better training, data, adaptation techniques, architectures, etc. (The authors propose a bunch of other evaluation methodologies we could use.)

- (Section 4.9) There is a review of AI safety and AI alignment as it relates to foundation models, if you’re interested. (I suspect there won’t be much new for readers of this newsletter.)

- (Section 4.10) The section on theory emphasizes studying the pretraining-adaptation interface, which seems quite good to me. I especially liked the emphasis on the fact that pretraining and adaptation work on different distributions, and so it will be important to make good modeling assumptions about how these distributions are related.

TECHNICAL AI ALIGNMENT

PROBLEMS

AI Risk for Epistemic Minimalists (Alex Flint) (summarized by Rohin): This post makes a case for working on AI risk using four robust arguments:

1. AI is plausibly impactful because it is the first system that could plausibly have long-term influence or power without using humans as building blocks.

2. The impact is plausibly concerning because in general, when humans gain power quickly (as they would with AI), this tends to increase existential risk.

3. We haven’t already addressed the concern: we haven’t executed a considered judgment about the optimal way to roll out AI technology.

4. It seems possible to take actions that decrease the concern, simply because there are so many possible actions that we could take; at least some of them should have some useful effect.

Rohin's opinion: There’s definitely room to quibble with some of these arguments as stated, but I think this sort of argument basically works. Note that it only establishes that it is worth looking into AI risk; to justify the specific things people are doing (especially in AI alignment) you need significantly more specific and detailed arguments.

TECHNICAL AGENDAS AND PRIORITIZATION

Some criteria for sandwiching projects (Daniel Ziegler) (summarized by Rohin): This post outlines the pieces needed in order to execute a “sandwiching” project on aligning narrowly superhuman models (AN #141), with the example of answering questions about a text when humans have limited access to that text. (Imagine answering questions about a paper, where the model can read the full paper but human labelers can only read the abstract.) The required pieces are:

1. Aligned metric: There needs to be some way of telling whether the project succeeded, i.e. the technique made the narrowly superhuman model more aligned. In the Q&A case, we get the aligned metric by seeing how humans answer when they can read the entire text.

2. A narrowly superhuman model: The model must have the capability to outperform the labelers on the task. In the Q&A case, we get this by artificially restricting the input that the labelers get (relative to what the model gets). In other cases we could use labelers who lack the relevant domain expertise that the model instead knows.

3. Headroom on the aligned metric: Baseline methods (such as training from labeler feedback) should not perform very well, so that there is room for a better technique to improve performance. It would be especially nice if making the model larger led to no improvement in the aligned metric; this would mean that we are working in a situation that is primarily an alignment failure.

4. A natural plan of attack: We have some approach for doing better than the baseline. For the Q&A example, we could train one model that selects the most relevant piece of text (by training on labelers’ ratings of relevance) and another model that answers the question given that relevant piece.

Rohin's opinion: This seems like a good way to generate good concrete empirical projects to work on. It does differ from the original post in placing less of an emphasis on “fuzzy” tasks, where aligned metrics are hard to come by, though it isn’t incompatible with it (in a “fuzzy” task, you probably still want as aligned a metric as you can get in order to measure progress).

INTERPRETABILITY

Automating Auditing: An ambitious concrete technical research proposal (Evan Hubinger) (summarized by Rohin): A core worry with inner alignment is that we cannot determine whether a system is deceptive or not just by inspecting its behavior, since it may simply be behaving well for now in order to wait until a more opportune moment to deceive us. In order for interpretability to help with such an issue, we need worst-case interpretability that surfaces all the problems in a model. When we hear “worst-case”, we should be thinking of adversaries.

This post considers the auditing game, in which an attacker introduces a vulnerability in the model to violate some known specification, and the auditor must find and describe the vulnerability given only the modified model (i.e. it does not get to see the original model, or what the adversary did). The attacker aims to produce the largest vulnerability that they can get away with, and the auditor aims to describe the vulnerability as completely as possible. Note that both the attacker and the auditor can be humans (potentially assisted by AI tools). This game forms a good benchmark for worst-case interpretability work.

While the author is excited about direct progress on this game (i.e. better and better human auditors), he is particularly interested in fully automating the auditors. For example, we could collect a dataset of possible attacks and the corresponding desired audit, and finetune a large language model on such a dataset.

Rohin's opinion: I like the auditing game as a framework for constructing benchmarks for worst-case interpretability -- you can instantiate a particular benchmark by defining a specific adversary (or distribution of adversaries). Automating auditing against a human attacker seems like a good long-term goal, but it seems quite intractable given current capabilities.

AI GOVERNANCE

What the AI Community Can Learn From Sneezing Ferrets and a Mutant Virus Debate (Jasmine Wang) (summarized by Rohin): If you can modify bird flu to be transmitted in ferrets, should your experimental methods be published in full? When this question arose, the National Science Advisory Board for Biosecurity (NSABB) unanimously recommended that key methodological details should not be published. The World Health Organization (WHO) disagreed, calling for full publication in order to enable better science, and arguing that it would be too hard to create a mechanism to grant researchers with a legitimate need access to the redacted information. At this point, many bird flu researchers declared a voluntary moratorium on such research, until the controversy settled. Ultimately, the NSABB reversed its position and the paper was published.

This post suggests four lessons for the AI community to learn:

1. Third-party institutions like the NSABB can lead to better-considered outcomes. In particular, they can counteract publish-or-perish incentives and provide additional expertise and context (the NSABB had clearance for secret information that researchers could not access).

2. These institutions don’t happen “by default”. The NSABB was only established after the anthrax attacks of 2001, and most other countries don’t have an analogous body.

3. However, the powers of such institutions are limited. The NSABB is geographically limited and was not able to create a mechanism for sharing information to only those with legitimate need.

4. Researchers must take on some responsibility as well. For example, the voluntary moratorium allowed for the development of better policy.

Rohin's opinion: The four claims seem quite plausible to me. The post also argues that this suggests that the AI community should create its own third-party institution rather than depending on a state-led institution, but I didn’t really follow the argument for this, nor do I agree with the conclusion. On one hand, it’s plausible that the AI community could create such an institution before some crisis, while states could not (claim 2), and that such a community-led institution would be more binding on researchers across different countries (part of claim 3). But on the other hand, such institutions seem much worse at binding companies (from which I expect most of the risk) and presumably would have much less context than a state-led institution (claim 1).

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

AI ALIGNMENT FORUM
AF