[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL



Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


AI Research Considerations for Human Existential Safety (Andrew Critch et al) (summarized by Rohin): This research agenda out of CHAI directly attacks the problem longtermists care about: how to prevent AI-related existential catastrophe. This is distinctly different from the notion of being "provably beneficial": a key challenge for provable beneficence is defining what we even mean by "beneficial". In contrast, there are avenues for preventing AI-caused human extinction that do not require an understanding of "beneficial": most trivially, we could coordinate to never build AI systems that could cause human extinction.

Since the focus is on the impact of the AI system, the authors need a new phrase for this kind of AI system. They define a prepotent AI system to be one that cannot be controlled by humanity and has the potential to transform the world in a way that is at least as impactful as humanity as a whole. Such an AI system need not be superintelligent, or even an AGI; it may have powerful capabilities in a narrow domain such as technological autonomy, replication speed, or social acumen that enable prepotence.

By definition, a prepotent AI system is capable of transforming the world drastically. However, there are a lot of conditions that are necessary for continued human existence, and most transformations of the world will not preserve these conditions. (For example, consider the temperature of the Earth or the composition of the atmosphere.) As a result, human extinction is the default outcome from deploying a prepotent AI system, and can only be prevented if the system is designed to preserve human existence with very high precision relative to the significance of its actions. They define a misaligned prepotent AI system (MPAI) as one whose deployment leads to human extinction, and so the main objective is to avert the deployment of MPAI.

The authors break down the risk of deployment of MPAI into five subcategories, depending on the beliefs, actions and goals of the developers. The AI developers could fail to predict prepotence, fail to predict misalignment, fail to coordinate with other teams on deployment of systems that aggregate to form an MPAI, accidentally (unilaterally) deploy MPAI, or intentionally (unilaterally) deploy MPAI. There are also hazardous social conditions that could increase the likelihood of risks, such as unsafe development races, economic displacement of humans, human enfeeblement, and avoidance of talking about x-risk at all.

Moving from risks to solutions, the authors categorize their research directions along three axes based on the setting they are considering. First, is there one or multiple humans; second, is there one or multiple AI systems; and third, is it helping the human(s) comprehend, instruct, or control the AI system(s). So, multi/single instruction would involve multiple humans instructing a single AI system. While we will eventually need multi/multi, the preceding cases are easier problems from which we could gain insights that help solve the general multi/multi case. Similarly, comprehension can help with instruction, and both can help with control.

The authors then go on to list 29 different research directions, which I'm not going to summarize here.

Rohin's opinion: I love the abstract and introduction, because of their directness at actually stating what we want and care about. I am also a big fan of the distinction between provably beneficial and reducing x-risk, and the single/multi analysis.

The human fragility argument, as applied to generally intelligent agents, is a bit tricky. One interpretation is that the "hardness" stems from the fact that you need a bunch of "bits" of knowledge / control in order to keep humans around. However, it seems like a generally intelligent AI should easily be able to keep humans around "if it wants", and so the bits already exist in the AI. (As an analogy: we make big changes to the environment, but we could easily preserve deer habitats if we wanted to.) Thus, it is really a question of what "distribution" you expect the AI system is sampled from: if you think we'll build AI systems that try to do what humanity wants, then we're probably fine, but if you think that there will be multiple AI systems that each do what their users want, but the users have conflicts, the overall system seems more "random" in its goals, and so more likely to fall into the "default" outcome of human extinction.

The research directions are very detailed, and while there are some suggestions that don't seem particularly useful to me, overall I am happy with the list. (And as the paper itself notes, what is and isn't useful depends on your models of AI development.)

Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text (Felix Hill et al) (summarized by Nicholas): This paper proposes the Simulation-to-Human Instruction Following via Transfer from Text (SHIFTT) method for training an RL agent to receive commands from humans in natural language. One approach to this problem is to train an RL agent to respond to commands based on a template; however, this is not robust to small changes in how humans phrase the commands. In SHIFTT, you instead begin with a pretrained language model such as BERT and first feed the templated commands through the language model. This is then combined with vision inputs to produce a policy. The human commands are later fed through the same language model, and they find that the model has zero-shot transfer to the human commands even if they differ in structure.

Nicholas's opinion: Natural language is a very flexible and intuitive way to convey instructions to AI. In some ways, this shifts the alignment problem from the RL agent to the supervised language model, which just needs to learn how to correctly interpret the meaning behind human speech. One advantage of this approach is that the language model is separately trained so it can be tested and verified for safety criteria before being used to train an RL agent. It also may be more competitive than alternatives such as reward modeling that require training a new reward model for each task.

I do see a couple downsides to this approach, however. The first is that humans are not perfect at conveying their values in natural language (e.g. King Midas wishing for everything he touches to turn to gold), and natural language may not have enough information to convey complex preferences. Even if humans give precise and correct commands, the language model needs to verifiably interpret those commands correctly. This could be difficult as current language models are difficult to interpret and contain many harmful biases.

Grounding Language in Play (Corey Lynch et al) (summarized by Robert): This paper presents a new approach to learning to follow natural language human instruction in a robotics setting. It builds on similar ideas to Learning Latent Plans from Play (AN #65), in that it uses unsupervised "play" data (trajectories of humans playing on the robot with no goal in mind).

The paper combines several ideas to enable training a policy which can follow natural language instructions with only limited human annotations.

* In Hindsight Instruction Pairing, human annotators watch small trajectories from the play data, and label them with the instruction which is being completed in the clip. This instruction can take any form, and means we don't need to choose the instructions and ask humans to perform specific tasks.

* Multicontext Imitation Learning is a method designed to allow goal-conditioned policies to be learned with multiple different types of goals. For example, we can have lots of example trajectories where the goal is an end state image (as these can be generated automatically without humans), and just a small amount of example trajectories where the goal is a natural language instruction (gathered using Hindsight Instruction Pairing). The approach is to learn a goal embedding network for each type of goal specification, and a single shared policy which takes the goal embedding as input.

Combining these two methods enables them to train a policy and embedding networks end to end using imitation learning from a large dataset of (trajectory, image goal) pairs and a small dataset of (trajectory, natural language goal) pairs. The policy can follow very long sequences of natural language instructions in a fairly complex grasping environment with a variety of buttons and objects. Their method performs better than the Learning from Play (LfP) method, even though LfP uses a goal image as the goal conditioning, instead of a natural language instruction.

Further, they propose that instead of learning the goal embedding for the natural language instructions, they use a pretrained large language model to produce the embeddings. This improves the performance of their method over learning the embedding from scratch, which the authors claim is the first example of the knowledge in large language models being transferred and improving performance in a robotics domain. This model also performs well when they create purposefully out of distribution natural language instructions (i.e. with weird synonyms, or google-translated from a different language).

Robert's opinion: I think this paper shows two important things:

1. Embedding the natural language instructions in the same space as the image conditioning works well, and is a good way of extending the usefulness of human annotations.

2. Large pretrained language models can be used to improve the performance of language-conditioned reinforcement learning (in this case imitation learning) algorithms and policies.

Methods which enable us to scale human feedback to complex settings are useful, and this method seems like it could scale well, especially with the use of pretrained large language models which might reduce the amount of language annotations needed further.



From ImageNet to Image Classification (Dimitris Tsipras et al) (summarized by Flo): ImageNet was crowdsourced by presenting images to MTurk workers who had to select images that contain a given class from a pool of images obtained via search on the internet. This is problematic, as an image containing multiple classes will basically get assigned to a random suitable class which can lead to deviations between ImageNet performance and actual capability to recognize images. The authors used MTurk and allowed workers to select multiple classes, as well as one main class for a given image in a pool of 10000 ImageNet validation images. Around 20% of the images seem to contain objects representing multiple classes and the average accuracy for these images was around 10% worse than average for a wide variety of image classifiers. While this is a significant drop, it is still way better than predicting a random class that is in the image. Also, advanced models were still able to predict the ImageNet label in cases where it does not coincide with the main class identified by humans, which suggest that they exploit biases in the dataset generation. While the accuracy of model predictions with respect to the newly identified main class still increased with better accuracy in predicting labels, the accuracy gap seems to grow and we might soon hit a point where gains in ImageNet accuracy don't correspond to improved image classification.

Read more: Paper: From ImageNet to Image Classification: Contextualizing Progress on Benchmarks

Flo's opinion: I generally find these empiricial tests of whether ML systems actually do what they are assumed to do quite useful for better calibrating intuitions about the speed of AI progress, and to make failure modes more salient. While we have the latter, I am confused about what this means for AI progress: on one hand, this supports the claim that improved benchmark progress does not necessarily translate to better real world applicability. On the other hand, it seems like image classification might be easier than exploiting the dataset biases present in ImageNet, which would mean that we would likely be able to reach even better accuracy than on ImageNet for image classification with the right dataset.

Focus: you are allowed to be bad at accomplishing your goals (Adam Shimi) (summarized by Rohin): Goal-directedness (AN #35) is one of the key drivers of AI risk: it's the underlying factor that leads to convergent instrumental subgoals. However, it has eluded a good definition so far: we cannot simply say that it is the optimal policy for some simple reward function, as that would imply AlphaGo is not goal-directed (since it was beaten by AlphaZero), which seems wrong. Basically, goal-directedness should not be tied directly to competence. So, instead of only considering optimal policies, we can consider any policy that could have been output by an RL algorithm, perhaps with limited resources. Formally, we can construct a set of policies for G that can result from running e.g. SARSA with varying amounts of resources with G as the reward, and define the focus of a system towards G to be the distance of the system’s policy to the constructed set of policies.

Rohin's opinion: I certainly agree that we should not require full competence in order to call a system goal-directed. I am less convinced of the particular construction here: current RL policies are typically terrible at generalization, and tabular SARSA explicitly doesn’t even try to generalize, whereas I see generalization as a key feature of goal-directedness.

You could imagine the RL policies get more resources and so are able to understand the whole environment without generalization, e.g. if they get to update on every state at least once. However, in this case realistic goal-directed policies would be penalized for “not knowing what they should have known”. For example, suppose I want to eat sweet things, and I come across a new fruit I’ve never seen before. So I try the fruit, and it turns out it is very bitter. This would count as “not being goal-directed”, since the RL policies for “eat sweet things” would already know that the fruit is bitter and so wouldn’t eat it.



Identifying Statistical Bias in Dataset Replication (Logan Engstrom et al) (summarized by Flo): One way of dealing with finite and fixed test sets and the resulting possibility of overfitting on the test set is dataset replication, where one tries to closely mimic the original process of dataset creation to obtain a larger test set. This can lead to bias if the difficulty of the new test images is distributed differently than in the original test set. A previous attempt at dataset replication on ImageNet tried to get around this by measuring how often humans under time pressure correctly answered a yes/no question about an image's class (dubbed selection frequency), which can be seen as a proxy for classification difficulty.

This data was then used to sample candidate images for every class which match the distribution of difficulty in the original test set. Still, all tested models performed worse on the replicated test set than on the original. Parts of this bias can be explained by noisy measurements combined with disparities in the initial distribution of difficulty, which are likely as the original ImageNet data was prefiltered for quality. Basically, the more noisy our estimates for the difficulty are, the more the original distribution of difficulty matters. As an extreme example, imagine a class for which all images in the original test set have a selection frequency of 100%, but 90% of candidates in the new test set have a selection frequency of 50%, while only 10% are as easy to classify as the images in the original test set. Then, if we only use a single human annotator, half of the difficult images in the candidate pool are indistinguishable from the easy ones, such that most images ending up in the new test set are more difficult to classify than the original ones, even after the adjustment.

The authors then replicate the ImageNet dataset replication with varying amounts of annotators and find that the gap in accuracy between the original and the new test set progressively shrinks with reduced noise from 11.7% with one annotator to 5.7% with 40. Lastly, they discuss more sophisticated estimators for accuracy to further lower bias, which additionally decreases the accuracy gap down to around 3.5%.

Flo's opinion: This was a pretty interesting read and provides evidence against large effects of overfitting on the test set. On the other hand, results like this also seem to highlight how benchmarks are mostly useful for model comparison, and how nonrobust they can be to fairly benign distributional shift.

Cold Case: The Lost MNIST Digits (Chhavi Yadav et al) (summarized by Flo): As the MNIST test set only contains 10,000 samples, concerns that further improvements are essentially overfitting on the test set have been voiced. Interestingly, MNIST was originally meant to have a test set of 60,000, as large as the training set, but the remaining 50,000 digits have been lost. The authors made many attempts to reconstruct the way MNIST was obtained from the NIST handwriting database as closely as possible and present QMNIST(v5) which features an additional 50,000 test images for MNIST, while the rest of the images are very close to the originals from MNIST. They test their dataset using multiple classification methods and find little difference in whether MNIST or QMNIST is used for training, but the test error on the additional 50,000 images is consistently higher than on the original 10,000 test images or their reconstruction of these. While the concerns about overuse of a test set are justified, the measured effects were mostly small and their relevance might be outweighed by the usefulness of paired differences for statistical model selection.

Flo's opinion: I am confused about the overfitting part, as most methods they try (like ResNets) don't seem to have been selected for performance on the MNIST test set. Granted, LeNet seems to degrade more than other models, but it seems like the additional test images in QMNIST are actually harder to classify. This seems especially plausible with the previous summary in mind and because the authors mention a dichotomy between the ease of classification for NIST images generated by highschoolers vs government employees but don’t seem to mention any attempts to deal with potential selection bias.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.