Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.


Request for proposals for projects in AI alignment that work with deep learning systems (Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants.

Rohin's opinion: Overall, I like these four directions and am excited to see what comes out of them! I'll comment on specific directions below.

RFP: Measuring and forecasting risks (Jacob Steinhardt) (summarized by Rohin): Measurement and forecasting is useful for two reasons. First, it gives us empirical data that can improve our understanding and spur progress. Second, it can allow us to quantitatively compare the safety performance of different systems, which could enable the creation of safety standards. So what makes for a good measurement?

1. Relevance to AI alignment: The measurement exhibits a failure mode that becomes worse as models become larger, or tracks a potential capability that may emerge with further scale (which in turn could enable deception, hacking, resource acquisition, etc).

2. Forward-looking: The measurement helps us understand future issues, not just those that exist today. Isolated examples of a phenomenon are good if we have nothing else, but we’d much prefer to have a systematic understanding of when a phenomenon occurs and how it tends to quantitatively increase or decrease with various factors. See for example scaling laws (AN #87).

3. Rich data source: Not all trends in MNIST generalize to CIFAR-10, and not all trends in CIFAR-10 generalize to ImageNet. Measurements on data sources with rich factors of variation are more likely to give general insights.

4. Soundness and quality: This is a general category for things like “do we know that the signal isn’t overwhelmed by the noise” and “are there any reasons that the measurement might produce false positives or false negatives”.

What sorts of things might you measure?

1. As you scale up task complexity, how much do you need to scale up human-labeled data to continue to maintain good performance and avoid reward hacking? If you fail at this and there are imperfections in the reward, how bad does this become?

2. What changes do we observe based on changes in the quality of the human feedback (e.g. getting feedback from amateurs vs experts)? This could give us information about the acceptable “difference in intelligence” between a model and its supervisor.

3. What happens when models are pushed out of distribution along a factor of variation that was not varied in the pretraining data?

4. To what extent do models provide wrong or undesired outputs in contexts where they are capable of providing the right answer?

Rohin's opinion: Measurements generally seem great. One story for impact is that we have a measurement that we think is strongly correlated with x-risk, and we use that measurement to select an AI system that scores low on such a metric. This seems distinctly good and I think would in fact reduce x-risk! But I want to clarify that I don’t think it would convince me that the system was safe with high confidence. The conceptual arguments against high confidence in safety seem quite strong and not easily overcome by such measurements. (I’m thinking of objective robustness failures (AN #66) of the form “the model is trying to pursue a simple proxy, but behaves well on the training distribution until it can execute a treacherous turn”.)

You can also tell stories where the measurements reveal empirical facts that then help us have high confidence in safety, by allowing us to build better theories and arguments, which can rule out the conceptual arguments above.

Separately, these measurements are also useful as a form of legible evidence about risk to others who are more skeptical of conceptual arguments.

RFP: Techniques for enhancing human feedback (Ajeya Cotra) (summarized by Rohin): Consider a topic previously analyzed in aligning narrowly superhuman models (AN #141): how can we use human feedback to train models to do what we want in cases where the models are more knowledgeable than the humans providing the feedback? A variety of techniques have been proposed to solve this problem, including iterated amplification (AN #40), debate (AN #5), recursive reward modeling (AN #34), market making (AN #108), and generalizing from short deliberations to long deliberations. This RFP solicits proposals that aim to test these or other mechanisms on existing systems. There are a variety of ways to set up the experiments so that the models are more knowledgeable than the humans providing the feedback, for example:

1. Train a language model to accurately explain things about a field that the feedback providers are not familiar with.

2. Train an RL agent to act well in an environment where the RL agent can observe more information than the feedback providers can.

3. Train a multilingual model to translate between English and a foreign language that the feedback providers do not know.

RFP: Interpretability (Chris Olah) (summarized by Rohin): The author provides this one sentence summary: We would like to see research building towards the ability to “reverse engineer" trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models.

This RFP is primarily focused on an aspirational “intermediate” goal: to fully reverse engineer some modern neural network, such as an ImageNet classifier. (Despite the ambition, it is only an “intermediate” goal because what we would eventually need is a general method for cheaply reverse engineering any neural network.) The proposed areas of research are primarily inspired by the Circuits line of work (AN #142):

1. Discovering Features and Circuits: This is the most obvious approach to the aspirational goal. We simply “turn the crank” using existing tools to study new features and circuits, and this fairly often yields an interesting result that makes progress towards reverse engineering a neural network.

2. Scaling Circuits to Larger Models: So far the largest example of reverse engineering is curve circuits, with 50K parameters. Can we find examples of structure in the neural networks that allow us to drastically reduce the amount of effort required per parameter? (As examples, see equivariance and branch specialization.)

3. Resolving Polysemanticity: One of the core building blocks of the circuits approach is to identify a neuron with a concept, so that connections between neurons can be analyzed as connections between concepts. Unfortunately, some neurons are polysemantic, that is, they encode multiple different concepts. This greatly complicates analysis of the connections and circuits between these neurons. How can we deal with this potential obstacle?

Rohin's opinion: The full RFP has many, many more points about these topics; it’s 8 pages of remarkably information-dense yet readable prose. If you’re at all interested in mechanistic interpretability, I recommend reading it in full.

This RFP also has the benefit of having the most obvious pathway to impact: if we understand what algorithm neural networks are running, there’s a much better chance that we can catch any problems that arise, especially ones in which the neural network is deliberately optimizing against us. It’s one of the few areas where nearly everyone agrees that further progress is especially valuable.

RFP: Truthful and honest AI (Owain Evans) (summarized by Rohin): This RFP outlines research projects on Truthful AI (summarized below). They fall under three main categories:

1. Increasing clarity about “truthfulness” and “honesty”. While there are some tentative definitions of these concepts, there is still more precision to be had: for example, how do we deal with statements with ambiguous meanings, or ones involving figurative language? What is the appropriate standard for robustly truthful AI? It seems too strong to require the AI system to never generate a false statement; for example it might misunderstand the meaning of a newly coined piece of jargon.

2. Creating benchmarks and tasks for Truthful AI, such as TruthfulQA (AN #165), which checks for imitative falsehoods. This is not just meant to create a metric to improve on; it may also simply perform as a measurement. For example, we could experimentally evaluate whether honesty generalizes (AN #158), or explore how much truthfulness is reduced when adding in a task-specific objective.

3. Improving the truthfulness of models, for example by finetuning models on curated datasets of truthful utterances, finetuning on human feedback, using debate (AN #5), etc.

Besides the societal benefits from truthful AI, building truthful AI systems can also help with AI alignment:

1. A truthful AI system can be used to supervise its own actions, by asking it whether its selected action was good.

2. A robustly truthful AI system could continue to do this after deployment, allowing for ongoing monitoring of the AI system.

3. Similarly, we could have a robustly truthful AI system supervise its own actions in hypothetical scenarios, to make it more robustly aligned.

Rohin's opinion: While I agree that making AI systems truthful would then enable many alignment strategies, I’m actually more interested in the methods by which we make AI systems truthful. Many of the ideas suggested in the RFP are ones that would apply to alignment more generally and aren’t particularly specific to truthful AI. So it seems like whatever techniques we used to build truthful AI could then be repurposed for alignment. In other words, I expect that the benefit to AI alignment of working on truthful AI is that it serves as a good test case for methods that aim to impose constraints upon an AI system. In this sense, it is a more challenging, larger version of the ”never describe someone getting injured” challenge (AN #166). Note that I am only talking about how this helps AI alignment; there are also beneficial effects on society from pursuing truthful AI that I haven’t talked about here.


Truthful AI: Developing and governing AI that does not lie (Owain Evans, Owen Cotton-Barratt et al) (summarized by Rohin): This paper argues that we should develop both the technical capabilities and the governance mechanisms necessary to ensure that AI systems can be made truthful. We will primarily think about conversational AI systems here (so not, say, AlphaFold).

Some key terms:

1. An AI system is honest if it only makes statements that it actually believes. (This requires you to have some way of ascribing beliefs to the system.) In contrast, truthfulness only checks if statements correspond to reality, without making any claims about the AI system’s beliefs.

2. An AI system is broadly truthful if it doesn’t lie, volunteers all the relevant information it knows, is well-calibrated and knows the limits of its information, etc.

3. An AI system is narrowly truthful if it avoids making negligent suspected-falsehoods. These are statements that can feasibly be determined by the AI system to be unacceptably likely to be false. Importantly, a narrowly truthful AI is not required to make contentful statements, it can express uncertainty or refuse to answer.

This paper argues for narrow truthfulness as the appropriate standard. Broad truthfulness is not very precisely defined, making it challenging to coordinate on. Honesty does not give us the guarantees we want: in settings in which it is advantageous to say false things, AI systems might end up being honest but deluded. They would honestly report their beliefs, but those beliefs might be false.

Narrow truthfulness is still a much stronger standard than we impose upon humans. This is desirable because (1) AI systems need not be constrained by social norms the way humans are; consequently they need stronger standards, and (2) it may be less costly to enforce that AI systems are narrowly truthful than to enforce that humans are narrowly truthful, so a higher standard is more feasible.

Evaluating the (narrow) truthfulness of a model is non-trivial. There are two parts: first, determining whether a given statement is unacceptably likely to be false, and second, determining whether the model was negligent in uttering such a statement. The former could be done by having human processes that study a wide range of information and determine whether a given statement is unacceptably likely to be false. In addition to all of the usual concerns about the challenges of evaluating a model that might know more than you, there is also the challenge that it is not clear exactly what counts as “unacceptably likely to be false”. For example, if a model utters a false statement but expresses low confidence, how should that be rated? The second part, determining negligence, needs to account for the fact that the AI system might not have had all the necessary information, or that it might not have been capable enough to come to the correct conclusion. One way of handling this is to compare the AI system to other AI systems built in a similar fashion.

How might narrow truthfulness be useful? One nice thing it enables is truthfulness amplification, in which we can amplify properties of a model by asking a web of related questions and combining the answers appropriately. For example, if we are concerned that the AI system is deceiving us on just this question, we could ask it whether it is deceiving us, or whether an investigation into its statement would conclude that it was deceptive. As another example, if we are worried that the AI system is making a mistake on some question where its statement isn’t obviously false, we can ask it about its evidence for its position and how strong the evidence is (where false statements are more likely to be negligently false).

Section 3 is devoted to the potential benefits and costs if we successfully ensure that AI systems are narrowly truthful, with the conclusion that the costs are small relative to the benefits and can be partially mitigated. Section 6 discusses other potential benefits and costs if we attempt to create truthfulness standards to ensure the AI systems are narrowly truthful. (For example, we might try to create a truthfulness standard but instead create an institution that makes sure that AI systems follow a particular agenda (by only rating as true the statements that are consistent with that agenda). Section 4 talks about the governance mechanisms we might use to implement a truthfulness standard. Section 5 describes potential approaches for building truthful AI systems. As I mentioned in the highlighted post, these techniques are general alignment techniques that have been specialized for truthful AI.


Q&A Panel on Applying for Grad School (summarized by Rohin): In this event run by AI Safety Support on November 7, current PhD students will share their experiences navigating the application process and AI Safety research in academia. RSVP here.

SafeAI Workshop 2022 (summarized by Rohin): The SafeAI workshop at AAAI is now accepting paper submissions, with a deadline of Nov 12.

FLI's $25M Grants Program for Existential Risk Reduction (summarized by Rohin): This podcast talks about FLI's recent grants program for x-risk reduction. I've previously mentioned the fellowships (AN #165) they are running as part of this program. As a reminder, the application deadline is October 29 for the PhD fellowship, and November 5 for the postdoc fellowship.


I'm always happy to hear feedback; you can send it to me, Rohin Shah.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

New Comment