Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
Some AI research areas and their relevance to existential safety (Andrew Critch) (summarized by Rohin): This long post explains the author’s beliefs about a variety of research topics relevant to AI existential safety. First, let’s look at some definitions.
While AI safety alone just means getting AI systems to avoid risks (including e.g. the risk of a self-driving car crashing), AI existential safety means preventing AI systems from posing risks at least as bad as human extinction. AI alignment on the other hand is about getting an AI system to try to / succeed at doing what a person or institution wants them to do. (The “try” version is intent alignment, while the “succeed” version is impact alignment.)
Note that AI alignment is not the same thing as AI existential safety. In addition, the author makes the stronger claim that it is insufficient to guarantee AI existential safety, because AI alignment tends to focus on situations involving a single human and a single AI system, whereas AI existential safety requires navigating systems involving multiple humans and multiple AI systems. Just as AI alignment researchers worry that work on AI capabilities for useful systems doesn’t engage enough with the difficulty of alignment, the author worries that work on alignment doesn’t engage enough with the difficulty of multiagent systems.
The author also defines AI ethics as the principles that AI developers and systems should follow, and AI governance as identifying and enforcing norms for AI developers and systems to follow. While ethics research may focus on resolving disagreements, governance will be more focused on finding agreeable principles and putting them into practice.
Let’s now turn to how to achieve AI existential safety. The main mechanism the author sees is to anticipate, legitimize, and fulfill governance demands for AI technology. Roughly, governance demands are those properties which there are social and political pressures for, such as “AI systems should be fair” or “AI systems should not lead to human extinction”. If we can anticipate these demands in advance, then we can do technical work on how to fulfill or meet these demands, which in turn legitimizes them, that is, it makes it clearer that the demand can be fulfilled and so makes it easier to create common knowledge that it is likely to become a legal or professional standard.
We then turn to various different fields of research, which the author ranks on three axes: helpfulness to AI existential safety (including potential negative effects), educational value, and neglectedness. Note that for educational value, the author is estimating the benefits of conducting research on the topic to the researcher, and not to (say) the rest of the field. I’ll only focus on helpfulness to AI existential safety below, since that’s what I’m most interested in (it’s where the most disagreement is, and so where new arguments are most useful), but I do think all three axes are important.
The author ranks both preference learning and out of distribution robustness lowest on helpfulness to existential safety (1/10), primarily because companies already have a strong incentive to have robust AI systems that understand preferences.
Multiagent reinforcement learning (MARL) comes only slightly higher (2/10), because since it doesn’t involve humans its main purpose seems to be to deploy fleets of agents that may pose risks to humanity. It is possible that MARL research could help by producing cooperative agents (AN #116), but even this carries its own risks.
Agent foundations is especially dual-use in this framing, because it can help us understand the big multiagent system of interactions, and there isn’t a restriction on how that understanding could be used. It consequently gets a low score (3/10), that is a combination of “targeted applications could be very useful” and “it could lead to powerful harmful forces”.
Minimizing side effects starts to address the challenges the author sees as important (4/10): in particular, it can allow us both to prevent accidents, where an AI system “messes up”, and it can help us prevent externalities (harms to people other than the primary stakeholders), which are one of the most challenging issues in regulating multiagent systems.
Fairness is valuable for the obvious reason: it is a particular governance demand that we have anticipated, and research on it now will help fulfill and legitimize that demand. In addition, research on fairness helps get people to think at a societal scale, and to think about the context in which AI systems are deployed. It may also help prevent centralization of power from deployment of AI systems, since that would be an unfair outcome.
The author would love it if AI/ML pivoted to frequently think about real-life humans and their desires, values and vulnerabilities. Human-robot interaction (HRI) is a great way to cause more of that to happen, and that alone is valuable enough that the author assigns it 6/10, tying it with fairness.
As we deploy more and more powerful AI systems, things will eventually happen too quickly for humans to monitor. As a result, we will need to also automate the process of governance itself. The area of computational social choice is well-posed to make this happen (7/10), though certainly current proposals are insufficient and more research is needed.
Accountability in ML is good (8/10) primarily because as we make ML systems accountable, we will likely also start to make tech companies accountable, which seems important for governance. In addition, in a CAIS (AN #40) scenario, better accountability mechanisms seem likely to help in ensuring that the various AI systems remain accountable, and thus safer, to human society.
Finally, interpretability is useful (8/10) for the obvious reasons: it allows developers to more accurately judge the properties of systems they build, and helps in holding developers and systems accountable. But the most important reason may be that interpretable systems can make it significantly easier for competing institutions and nations to establish cooperation around AI-heavy operations.
Rohin's opinion: I liked this post: it’s a good exploration of what you might do if your goal was to work on technical approaches to future governance challenges; that seems valuable and I broadly agree with it (though I did have some nitpicks in this comment).
There is then an additional question of whether the best thing to do to improve AI existential safety is to work on technical approaches to governance challenges. There’s some pushback on this claim in the comments that I agree with; I recommend reading through it. It seems like the core disagreement is on the relative importance of risks: in particular, it sounds like the author thinks that existing incentives for preference learning and out-of-distribution robustness are strong enough that we mostly don’t have to worry about it, whereas governance will be much more challenging; I disagree with at least that relative ranking.
It’s possible that we agree on the strength of existing incentives -- I’ve claimed (AN #80) a risk of 10% for existentially bad failures of intent alignment if there is no longtermist intervention, primarily because of existing strong incentives. That could be consistent with this post, in which case we’d disagree primarily on whether the “default” governance solutions are sufficient for handling AI risk, where I’m a lot more optimistic than the author.
Understanding RL Vision (Jacob Hilton et al) (summarized by Robert): This work presents an interface for interpreting the vision of a reinforcement learning agent trained with PPO on the CoinRun game. This game is procedurally generated, which means the levels are different in every episode of playing. The interface primarily uses attribution from a hidden layer to the output of the value function. This interface is used in several ways.
First, they use the interface to dissect failed trajectories of the policy (it fails in 1 out of 200 levels). They're able to understand why the failures occurred using their interface: for example, in one case the view of the agent at the top of its jump means it can't see any platforms below it, so doesn't move to the right fast enough to reach the platform it was jumping for, leading it to miss the platform and fail the level. Second, they use the interface to discover "hallucinations", where the value function mistakes one element of the environment for another, causing its value to drop or rise significantly. Often these hallucinations only last a single time-step, so they don't affect performance.
Finally, they use the attributions specifically to hand-edit the weights of the model to make it "blind" to buzzsaws (one of the hazards) by zeroing the feature which recognises them. After doing this, they show that the edited agent fails a lot more from buzzsaw failures but no more from other types of failures, which gives a quantitative justification for their interpretation of the feature as buzzsaw-recognising.
From using this interface, they propose the diversity hypothesis: Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction). This is based on the fact that interpretable features arise more when the agent is trained on a wider variety of levels. There also seems to be a qualitative link to generalisation - a wider distribution of training levels leads to better interpretability (measured qualitatively) and better generalisation (measured quantitatively).
Robert's opinion: I'm in favour of work on interpretability in reinforcement learning, and it's good to see the team at OpenAI working on it. I think this is a(nother) demonstration from them that interpretability research is often mostly about engineering and user interface design, followed by extended use of the produced interface; none of the methods proposed here are especially novel, but the combined interface and subsequent insights gained from its use are.
I also think the diversity hypothesis seems (in the abstract) plausible, and seems to have some supporting evidence from supervised learning (in particular computer vision): harder tasks tend to lead to better representations, and adversarially robust networks produce more interpretable representations, while also generalising better. One problem with verifying this hypothesis in other settings (or even more formally in this setting) is having to measure what it means for a representation to be "more interpretable". In general, I think this is related to the phenomena of shortcut learning in deep learning: shortcuts in tasks will tend to mean that the network won't have learned a robust or interpretable feature set, whereas if there are no shortcuts and the network needs to do the task "as a human would", then it's more likely that the representations will be more robust.
Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems (Zana Buçinca, Phoebe Lin et al) (summarized by Flo): As humans and AI systems have different strengths, it might make sense to combine them into human+AI teams for decision-making tasks. However, this does not always work well: if the human puts too little trust in a competent AI, the AI is of little use, and if they put too much trust in an incompetent AI, they might make worse decisions than had they been on their own. A lot of explainability research has focused on instilling more trust in AI systems without asking how much trust would be appropriate, even though there is research showing that hiding model bias instead of truthfully revealing it can increase trust in an AI system.
The authors conduct two experiments using an AI system that predicts nutrition information from pictures of food. In the first experiment, participants were asked to predict the AI's decision based on the ground truth and one of two types of explanations. In the inductive condition, the explanation consisted of a series of images the AI had identified as similar. In the deductive condition, subjects were shown a list of main ingredients identified by the AI. Subjects put more trust in the inductive explanations but were equally good at predicting the system's output in both cases. In the second experiment, a new set of subjects was asked to predict nutritional values with the help of the AI's predictions. Overall, access to the AI strongly improved the subjects' accuracy from below 50% to around 70%, which was further boosted to a value slightly below the AI's accuracy of 75% when users also saw explanations. This time, subjects put more trust in the AI when given deductive explanations, but performed better when given inductive explanations, as they were more likely to go against the AI's wrong decisions in that case.
The authors hypothesize that the between-task difference in which explanations are trusted more is connected to the cognitive effort required by the tasks and for understanding the explanations, combined with human reluctance to exert mental effort. They suggest to pay more attention to the exact form of the human-AI interaction and recommend to view AI-based decision aids as sociotechnical systems that are to be evaluated by their usefulness for actual decision making, rather than trust.
Flo's opinion: I am not sure whether the authors used an actual AI system or just handcrafted the input-prediction-explanation tuples, and how that might affect the correlation between explanations and the system's outputs, which can influence trust. Overall, the study reinforces my prior that trust induced by explanations is not a good predictor of an AI system's usefulness, but I am more sceptical that the differences between inductive and deductive explanations will be the same in different contexts.
AGI Predictions (Amanda Ngo et al) (summarized by Rohin): A collection of interesting questions relevant to AI safety, as well as aggregated predictions from readers of the post.
AlphaFold: a solution to a 50-year-old grand challenge in biology (The AlphaFold team et al) (summarized by Rohin): The newest results from AlphaFold (AN #36) on the CASP-14 assessment give it a median score of 92.4 GDT across all targets, where a score of 90 is informally considered to be competitive with results obtained from experimental methods. The system also shows some signs of real-world usability: for example, it was used earlier this year to predict the structure of two COVID proteins, which were later borne out by experimental results (that took several months to obtain, if I understand correctly).
Rohin's opinion: Obviously this is an astounding accomplishment for DeepMind (conflict of interest notice: I work at DeepMind). I feel like I should have some opinion on what this means for the future of AI systems, but unfortunately I think I don’t know enough about protein folding to have any interesting takes.
From an outside view perspective, it seems like this is an example of deep learning crushing a task that a) humans put a lot of effort into and b) humans weren’t evolutionarily designed for. This is exactly what we saw with Go, Dota and StarCraft, and so this isn’t much of an update for me. Yes, this is a case of it being used in a real-world problem rather than a synthetic game, but that doesn’t seem particularly relevant.
Asya's opinion: I think this is particularly interesting because this model is closer to being a source of revenue than solutions to other problems. This makes me think machine learning research might actually solve enough important problems to pay for itself in the near future.
Transformers for Image Recognition at Scale (Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Neil Houlsby et al) (summarized by Flo): This paper applies transformers to image classification in a fairly straightforward way: First, an input image is divided into 16x16 pixel patches on a grid. Then, a linear projection of the patch is combined with a learnt positional embedding and fed into a standard transformer pipeline. Lastly, a standard MLP head is applied on top of the transformer for the classification. When trained on ImageNet, this architecture overfits and does not reach SOTA performance. However, it can compete with the previous SOTA on the larger ImageNet-21k (14M images) and outcompete it on JFT (300M images), while needing four times less compute for training. By finetuning the JFT model on ImageNet, the transformer narrowly outperforms the previous best ImageNet classifier.
The positional embeddings learnt by the model look meaningful in that each is most similar to others in the same row or column. Also, some of the attention heads in early layers attend to multiple distant patches, while others are a lot more local. This means that some heads in the early layers have a wide receptive field, which is something that convolution kernels cannot achieve. Overall, given enough data, the transformer seems to be able to learn inductive biases used by CNNs without being limited to them.
Read more: Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Flo's opinion: Intuitively, inductive biases become less and less useful the more training data we have, but I would have thought that in the current regime CNNs have too weak rather than too strong inductive biases, so the results are surprising. What is even more surprising is how simple the model is: It does not seem to use any data augmentation, unsupervised pretraining or other tricks like noisy student-teacher training, such that there are many promising avenues for immediate improvements. Also, I would imagine that using something more sophisticated than a linear projection to embed the 16x16 patches could go a long way.
Metaculus AI Progress Tournament (summarized by Rohin): Metaculus is running an AI forecasting tournament, with up to $50,000 in prizes. The tournament starts December 14, and will continue till around mid-June, and will involve forecasting targets on a 6-24 month timescale. You can pre-register to forecast now.
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.