Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
TECHNICAL AI ALIGNMENT
OTHER PROGRESS IN AI
Language Models are Few-Shot Learners (Tom B. Brown et al) (summarized by Rohin): The biggest GPT-2 model (AN #46) had 1.5 billion parameters, and since its release people have trained language models with up to 17 billion parameters. This paper reports GPT-3 results, where the largest model has 175 billion parameters, a 10x increase over the previous largest language model. To get the obvious out of the way, it sets a new state of the art (SOTA) on zero-shot language modeling (evaluated only on Penn Tree Bank, as other evaluation sets were accidentally a part of their training set).
The primary focus of the paper is on analyzing the few-shot learning capabilities of GPT-3. In few-shot learning, after an initial training phase, at test time models are presented with a small number of examples of a new task, and then must execute that task for new inputs. Such problems are usually solved using meta-learning or finetuning, e.g. at test time MAML takes a few gradient steps on the new examples to produce a model finetuned for the test task. In contrast, the key hypothesis with GPT-3 is that language is so diverse, that doing well on it already requires adaptation to the input, and so the learned language model will already be a meta-learner. This implies that they can simply "prime" the model with examples of a task they care about, and the model can learn what task is supposed to be performed, and then perform that task well.
For example, consider the task of generating a sentence using a newly made-up word whose meaning has been explained. In one notable example, the prompt for GPT-3 is:
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
Given this prompt, GPT-3 generates the following example sentence for "farduddle":
One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.
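The priming setup above can be sketched as plain string assembly. This is a minimal illustration, not the paper's code: the function name `build_few_shot_prompt` and the newline separator are assumptions made for this example.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Concatenate solved (context, completion) pairs, then the unsolved query.

    The model is expected to continue the text after the final context.
    No weights are updated; any "learning" happens within the forward pass.
    """
    parts: List[str] = []
    for context, completion in examples:
        parts.append(context)
        parts.append(completion)
    parts.append(query)  # left open for the model to complete
    return "\n".join(parts)


# One solved example (one-shot), taken from the prompt quoted above.
examples = [
    (
        'A "whatpu" is a small, furry animal native to Tanzania. '
        "An example of a sentence that uses the word whatpu is:",
        "We were traveling in Africa and we saw these very cute whatpus.",
    ),
]
query = (
    'To do a "farduddle" means to jump up and down really fast. '
    "An example of a sentence that uses the word farduddle is:"
)
prompt = build_few_shot_prompt(examples, query)
print(prompt)
```

Passing an empty `examples` list gives the zero-shot case, and a longer list gives the few-shot case.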
The paper tests on several downstream tasks for which benchmarks exist (e.g. question answering), and reports zero-shot, one-shot, and few-shot performance on all of them. On some tasks, the few-shot version sets a new SOTA, despite not being finetuned using the benchmark’s training set; on others, GPT-3 lags considerably behind finetuning approaches.
The paper also consistently shows that few-shot performance increases as the number of parameters increases, and the rate of increase is faster than the corresponding rate for zero-shot performance. While they don’t outright say it, we might take this as suggestive evidence that as models get larger, they are more incentivized to learn “general reasoning abilities”.
The most striking example of this is in arithmetic, where the smallest 6 models (up to 6.7 billion parameters) have poor performance (< 20% on 2-digit addition), then the next model (13 billion parameters) jumps to > 50% on 2-digit addition and subtraction, and the final model (175 billion parameters) achieves > 80% on 3-digit addition and subtraction and a perfect 100% on 2-digit addition (all in the few-shot regime). They explicitly look for their test problems in the training set, and find very few examples, suggesting that the model really is learning “how to do addition”; further, when it is incorrect, it tends to make mistakes like “forgetting to carry a 1”.
On broader impacts, the authors talk about potential misuse, fairness and bias concerns, and energy usage concerns, and say about these issues what you’d expect. One interesting note: “To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed.” They find that while there was significant discussion of misuse, there were no successful deployments. They also consulted with professional threat analysts about the possibility of well-resourced actors misusing the model. According to the paper: “The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.”
Rohin's opinion: For a long time, I’ve heard people quietly hypothesizing that with a sufficient diversity of tasks, regular gradient descent could lead to general reasoning abilities allowing for quick adaptation to new tasks. This is a powerful demonstration of this hypothesis.
One critique is that GPT-3 still takes far too long to “identify” a task -- why does it need 50 examples of addition in order to figure out that what it should do is addition? Why isn’t 1 sufficient? It’s not like there are a bunch of other conceptions of “addition” that need to be disambiguated. I’m not sure what’s going on mechanistically, but we can infer from the paper that as language models get larger, the number of examples needed to achieve a given level of performance goes down, so it seems like there is some “strength” of general reasoning ability that goes up (see also this commentary). Still, it would be really interesting to figure out mechanistically how the model is “reasoning”.
This also provides some empirical evidence in support of the threat model underlying inner alignment concerns (AN #58): they are predicated on neural nets that implicitly learn to optimize. (To be clear, I think it provides empirical support for neural nets learning to “reason generally”, not neural nets learning to implicitly “perform search” in pursuit of a “mesa objective” -- see also Is the term mesa optimizer too narrow? (AN #78).)
An overview of 11 proposals for building safe advanced AI (Evan Hubinger) (summarized by Rohin): This post describes eleven “full” AI alignment proposals (where the goal is to build a powerful, beneficial AI system using current techniques), and evaluates them on four axes:
1. Outer alignment: Would the optimal policy for the specified loss function be aligned with us? See also this post.
2. Inner alignment: Will the model that is actually produced by the training process be aligned with us?
3. Training competitiveness: Is this an efficient way to train a powerful AI system? More concretely, if one team had a “reasonable lead” over other teams, would they keep at least some of the lead if they used this algorithm?
4. Performance competitiveness: Will the trained model have good performance (relative to other models that could be trained)?
Seven of the eleven proposals are of the form “recursive outer alignment technique” plus “technique for robustness (AN #81)”. The recursive outer alignment technique is either debate (AN #5), recursive reward modeling (AN #34), or some flavor of amplification (AN #42). The technique for robustness is either transparency tools to “peer inside the model”, relaxed adversarial training (AN #70), or intermittent oversight by a competent supervisor. An additional two proposals are of the form “non-recursive outer alignment technique” plus “technique for robustness” -- the non-recursive techniques are vanilla reinforcement learning in a multiagent environment, and narrow reward learning.
Another proposal is Microscope AI, in which we train AI systems to simply understand vast quantities of data, and then by peering into the AI system we can learn the insights that the AI system learned, leading to a lot of value. We wouldn’t have the AI system act in the world, thus eliminating a large swath of potential bad outcomes. Finally, we have STEM AI, where we try to build an AI system that operates in a sandbox and is very good at science and engineering, but doesn’t know much about humans. Intuitively, such a system would be very unlikely to deceive us (and probably would be incapable of doing so).
The post contains a lot of additional content that I didn’t do justice to in this summary. In particular, I’ve said nothing about the analysis of each of these proposals on the four axes listed above; the full post talks about all 44 combinations.
Rohin's opinion: I’m glad this post exists: while most of the specific proposals could be found by patching together content spread across other blog posts, there was a severe lack of a single article laying out a full picture for even one proposal, let alone all eleven in this post.
I usually don’t think about outer alignment as what happens with optimal policies, as assumed in this post -- when you’re talking about loss functions in the real world (as I think this post is trying to do), optimal behavior can be weird and unintuitive, in ways that may not actually matter. For example, arguably for any loss function, the optimal policy is to hack the loss function so that it always outputs zero (or perhaps negative infinity).
Planning with Uncertain Specifications (Ankit Shah et al) (summarized by Rohin): Suppose you recognize that there are no “certain specifications”, and so infer a distribution over specifications. What do you then do with that distribution? This paper looks at this problem in the context where the specifications are given by formulas in linear temporal logic (which can express temporal non-Markovian constraints). They identify four possibilities:
1. Most likely: Plan with respect to the most likely specification.
2. Most coverage: Satisfy as many formulas as possible, ignoring their probability (as long as they have non-zero probability).
3. Chance constrained: Like the above, except you weight by probabilities, and drop the least likely formulas up to a parameter δ.
4. Least regret: Like the above, with δ set to zero.
Intuitively, the Most likely criterion won’t be very robust since it only takes one specification into account, Most coverage aims for maximum robustness, and Chance constrained interpolates, with larger δ trading robustness for gains in task performance. This is exactly the pattern we see in a task where a robot must set a dinner table.
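The four criteria can be compared on a toy example. The sketch below is a simplification and not the paper's formulation: it assumes each candidate plan either satisfies or violates each formula, and the names `formula_weights`, `choose_plan`, the formula labels, and the probabilities are all made up for illustration.

```python
from typing import Dict, List, Set, Tuple

Spec = Tuple[str, float]  # (hypothetical formula name, probability)


def formula_weights(specs: List[Spec], criterion: str, delta: float = 0.0) -> Dict[str, float]:
    """Return the weight each formula receives under the given criterion."""
    if criterion == "most_likely":
        name, _ = max(specs, key=lambda s: s[1])
        return {name: 1.0}
    if criterion == "most_coverage":
        # Every formula with non-zero probability counts equally.
        return {name: 1.0 for name, p in specs if p > 0}
    if criterion == "least_regret":
        delta = 0.0  # least regret is chance constrained with delta = 0
    elif criterion != "chance_constrained":
        raise ValueError(f"unknown criterion: {criterion}")
    weights: Dict[str, float] = {}
    dropped = 0.0
    # Walk from least to most likely, dropping formulas while the
    # total dropped probability mass stays within delta.
    for name, p in sorted(specs, key=lambda s: s[1]):
        if dropped + p <= delta:
            dropped += p
        else:
            weights[name] = p
    return weights


def choose_plan(plans: Dict[str, Set[str]], specs: List[Spec],
                criterion: str, delta: float = 0.0) -> str:
    """Pick the plan whose satisfied formulas have the largest total weight."""
    weights = formula_weights(specs, criterion, delta)

    def score(plan: str) -> float:
        return sum(weights.get(f, 0.0) for f in plans[plan])

    return max(plans, key=score)


specs = [("phi1", 0.4), ("phi2", 0.3), ("phi3", 0.15), ("phi4", 0.1), ("phi5", 0.05)]
plans = {
    "bold": {"phi1"},                    # satisfies only the most likely formula
    "hedged": {"phi2", "phi3"},          # most probability mass covered
    "broad": {"phi3", "phi4", "phi5"},   # most formulas covered
}
print(choose_plan(plans, specs, "most_likely"))               # bold
print(choose_plan(plans, specs, "most_coverage"))             # broad
print(choose_plan(plans, specs, "least_regret"))              # hedged
print(choose_plan(plans, specs, "chance_constrained", 0.35))  # bold
```

In this toy setup, raising δ from 0 to 0.35 moves the choice from the hedged plan to the bold one, matching the intuition that larger δ gives up robustness.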
Rohin's opinion: Ultimately, I hope that in cases like this, the agent plans conservatively initially, but also tries to learn which specification is actually correct, allowing it to become more bold over time. Nonetheless, it seems quite difficult to do this well, and even then we likely will have this tradeoff between robustness and task performance. This is the case with humans too: if you try to please everyone (robustness), you’ll end up pleasing no one (task performance).
Suphx: Mastering Mahjong with Deep Reinforcement Learning (Junjie Li et al) (summarized by Rohin): Mahjong is a large imperfect information game with complex rules where turn order can be interrupted. This makes it challenging to solve with existing techniques like MCTS and counterfactual regret minimization. This paper details what was necessary to build Suphx, an AI system that is stronger than 99.99% of humans. Some highlights:
- Like the original AlphaGo, they first learned from human gameplay and then finetuned using reinforcement learning, with deep CNNs as their models. They learned both action models and value models. They added an entropy bonus to ensure that the policy remained stochastic enough to continue learning over the course of RL.
- They have five learned action models, corresponding to five different decisions that need to be made in Mahjong, as well as a rule-based system for deciding whether or not to declare a winning hand.
- To handle imperfect information, they first train an oracle agent that gets access to all information, and then slowly reduce the amount of information that it gets to observe.
- They could use search to improve the performance online, but did not do so in their evaluation (since Suphx was playing on a website with time constraints). Suphx with search would probably be significantly stronger.
Rohin's opinion: I am a bit curious how they removed observations from the oracle agent, given that you usually have to keep the structure of the input to a neural net constant. Perhaps they simply zeroed out the observations they didn't want?
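The zeroing-out guess above could look something like the following sketch. It is an illustration of that guess, not a confirmed description of Suphx: the function names, the linear schedule, and the flat feature lists are all assumptions.

```python
import random
from typing import List


def keep_prob_schedule(step: int, total_steps: int) -> float:
    """Assumed linear decay from 1.0 at step 0 to 0.0 at total_steps."""
    return max(0.0, 1.0 - step / total_steps)


def mask_oracle_features(oracle: List[float], keep_prob: float,
                         rng: random.Random) -> List[float]:
    """Zero out each oracle-only feature independently with probability 1 - keep_prob."""
    return [x if rng.random() < keep_prob else 0.0 for x in oracle]


def build_input(public: List[float], oracle: List[float], step: int,
                total_steps: int, rng: random.Random) -> List[float]:
    """Concatenate public features with masked oracle features.

    The input length never changes, so the network architecture can stay
    fixed while the hidden information is gradually removed.
    """
    keep_prob = keep_prob_schedule(step, total_steps)
    return public + mask_oracle_features(oracle, keep_prob, rng)


rng = random.Random(0)
public = [0.5, 0.25]        # e.g. features visible to every player
oracle = [1.0, 2.0, 3.0]    # e.g. features for opponents' hidden tiles
for step in (0, 50, 100):
    print(step, build_input(public, oracle, step, total_steps=100, rng=rng))
```

At step 0 the agent sees everything, and by the final step the oracle features are all zero, leaving an agent that only uses information available in real play.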
Mastering Complex Control in MOBA Games with Deep Reinforcement Learning (Deheng Ye et al) (summarized by Rohin): This paper presents an AI system that can play the Multi-player Online Battle Arena (MOBA) game Honor of Kings. They are inspired by OpenAI Five (AN #13) (and Honor of Kings sounds quite similar to Dota, though it is 1v1 instead of 5v5), and have a similar learning setup: reinforcement learning using PPO. Their architecture requires an off-policy algorithm (I’m not sure why, maybe they have stale parameters across their rollout servers), so they add an importance sampling correction to the PPO objective, as well as an additional type of gradient clipping. The input is a combination of the image and underlying game state info. The resulting agents are able to beat top human players, and in an event with the public, the AI system lost only 4 out of 2100 matches. Unlike OpenAI Five, this required only around 100 hours to train (though it’s unclear how much compute was used).
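To make the objective concrete, here is a per-sample sketch of a PPO clipped surrogate with an importance ratio and a second clip. The second clip shown (a lower bound of `c * advantage` when the advantage is negative) is one plausible form of the "additional type of clipping" and is an assumption, as are the function name and the default values of `eps` and `c`.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      eps: float = 0.2, c: float = 3.0) -> float:
    """Per-sample surrogate objective (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s) is the importance sampling correction
    for data gathered by a stale policy.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Standard PPO clipped objective.
    standard = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        # Assumed extra clip: with very stale data the ratio can be huge,
        # making ratio * advantage unboundedly negative. Bound it below.
        return max(standard, c * advantage)
    return standard


# On-policy sample: the ratio is 1, so the objective is just the advantage.
print(clipped_surrogate(-1.0, -1.0, advantage=1.0))              # 1.0
# Very off-policy sample with negative advantage: bounded at c * advantage.
print(clipped_surrogate(math.log(10.0), 0.0, advantage=-1.0))    # -3.0
```

Without the second clip, the last call would return roughly -10, so a single stale sample could dominate the gradient.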
More Efficient NLP Model Pre-training with ELECTRA (Kevin Clark et al) (summarized by Flo): There are two main approaches to pretraining for NLP: language models (LMs), which iteratively predict the next word in a given incomplete sentence, and masked language models (MLMs), which predict the identities of a few masked words in an otherwise complete sentence. While looking beyond just the previous words (bidirectionality) can be advantageous, MLMs only learn to predict the masked words, which reduces how much is learnt from a given sentence.
The authors present an alternative approach, ELECTRA, that outperforms RoBERTa while requiring less than a third of the compute. This is achieved by changing the form of the pretraining task from predicting words to discriminating fake words: Instead of masking, some words are replaced by words generated by an MLM and the trained model has to classify these as fake. This way, we get bidirectionality, but also a more dense signal, as the model has to produce an output for every single word, not just the masked ones. While this looks similar to GANs, the generator is only trained on the usual MLM loss and is not incentivized to fool the discriminator, as GANs don't seem to work well on sequence data.
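A toy sketch of how the discriminator's training data could be constructed follows. The uniform sampler stands in for the MLM generator, and the function name, vocabulary, and replacement probability are assumptions for illustration only.

```python
import random
from typing import List, Tuple


def corrupt(tokens: List[str], vocab: List[str], replace_prob: float,
            rng: random.Random) -> Tuple[List[str], List[int]]:
    """Replace some tokens and label every position as real (0) or fake (1)."""
    corrupted: List[str] = []
    labels: List[int] = []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Stand-in for sampling a replacement from a small MLM generator.
            new_tok = rng.choice(vocab)
        else:
            new_tok = tok
        corrupted.append(new_tok)
        # If the sampled token happens to equal the original, it counts as real.
        labels.append(int(new_tok != tok))
    return corrupted, labels


rng = random.Random(0)
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
tokens = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = corrupt(tokens, vocab, replace_prob=0.3, rng=rng)
print(corrupted)
print(labels)  # one label per token, so every position yields a loss term
```

The key property is that `labels` has the same length as `tokens`: the discriminator gets a training signal at every position, whereas an MLM only gets one at the masked positions.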
Read more: Paper: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Flo's opinion: I found it a bit surprising that replacing word prediction with fake discrimination would help that much, but from the ablations, it seems like this is really mostly an instrument to get a loss signal for every single word, which is a cool idea. On a more zoomed-out perspective, results like this seem to show that gains in algorithmic efficiency (AN #99) are not fundamentally slowing down.
DADS: Unsupervised Reinforcement Learning for Skill Discovery (Archit Sharma et al) (summarized by Rohin): Reinforcement learning in robotics typically plans directly on low-level actions. However, it sure seems like there is a simple set of primitives like walking, running, shuffling, etc. that are inherent to the robot morphology. What if we could learn these primitives, and then plan using those primitives? This paper introduces a method for learning these primitives without a reward function. They simply optimize skills for predictability and diversity (by maximizing the mutual information between the next state and the skill being executed, conditioned on the current state).
They can then use these primitives for model-based planning for a downstream task. You can think of this as a regular RL problem, except that an action in their "action space" takes the form "execute skill X for T timesteps". They use model-predictive control (MPC), in which you sample a bunch of trajectories, and execute the first action of the trajectory that gets the highest reward. Since each of their high-level actions determines the policy for T timesteps, they can scale to much longer horizon tasks than MPC can usually be used for. They show that this approach is competitive with regular model-based RL.
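The planning loop described above can be sketched as random-shooting MPC over skills. Everything concrete here is a toy assumption: the four hand-written "skills", the perfect `skill_model` (standing in for the learned skill dynamics), the Manhattan-distance reward, and the function names.

```python
import random
from typing import Dict, List, Optional, Tuple

State = Tuple[float, float]

# Hypothetical skills: each moves the agent one unit per low-level timestep.
SKILLS: Dict[str, Tuple[float, float]] = {
    "east": (1.0, 0.0), "west": (-1.0, 0.0),
    "north": (0.0, 1.0), "south": (0.0, -1.0),
}
T = 5  # low-level timesteps covered by one high-level action


def skill_model(state: State, skill: str) -> State:
    """Toy stand-in for learned skill dynamics: state after running a skill for T steps."""
    dx, dy = SKILLS[skill]
    return (state[0] + T * dx, state[1] + T * dy)


def reward(state: State, goal: State) -> float:
    """Negative Manhattan distance to the goal."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))


def plan_first_skill(state: State, goal: State, horizon: int,
                     num_samples: int, rng: random.Random) -> Optional[str]:
    """Sample skill sequences, score them with the model, return the best first skill."""
    names: List[str] = list(SKILLS)
    best_return, best_first = float("-inf"), None
    for _ in range(num_samples):
        sequence = [rng.choice(names) for _ in range(horizon)]
        s, total = state, 0.0
        for skill in sequence:
            s = skill_model(s, skill)
            total += reward(s, goal)
        if total > best_return:
            best_return, best_first = total, sequence[0]
    return best_first


rng = random.Random(0)
state, goal = (0.0, 0.0), (10.0, 5.0)
for _ in range(3):  # replan after every executed skill
    skill = plan_first_skill(state, goal, horizon=3, num_samples=200, rng=rng)
    state = skill_model(state, skill)  # the toy environment matches the model
    print(skill, state)
```

A horizon of 3 high-level actions here covers 15 low-level timesteps, which is the sense in which planning over skills reaches further than planning over raw actions.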
Read more: Paper: Dynamics-Aware Unsupervised Discovery of Skills
Rohin's opinion: I think unsupervised learning is likely to be key in getting more powerful and general AI systems without requiring a truly staggering amount of expert data, and this is a great example of what that might look like. Note though that the learned primitives are certainly not what you'd expect of a human: for example, the humanoid learns to vaguely shuffle in a direction, rather than walking. In addition, they did require specifying an "x-y prior" that required skills to be diverse based on x-y coordinates, which is why the skills learned navigation primitives, as opposed to e.g. distinct types of flailing.
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.