Alignment Newsletter #46

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

Better Language Models and Their Implications (Alec Radford, Jeffrey Wu, Dario Amodei, Ilya Sutskever et al): OpenAI has trained a scaled-up GPT model (GPT-2) using unsupervised learning (specifically, predicting the next word given a very large context) on a very large dataset with presumably very large compute. The resulting language model can produce impressive language samples (with some cherry-picking) that to my eye are particularly good at handling long-range dependencies, which makes sense since it is based on the Transformer (see Transformer-XL entry in AN #44). It sets a new state of the art on 7 out of 8 language modeling tasks, including difficult datasets such as LAMBADA, without using the training data for those tasks. It can also be used for more structured tasks by providing a particular context -- for example, to summarize a document, you can provide the document followed by "TL;DR:" in order to induce GPT-2 to "predict" a summary. (They use a different prediction algorithm in order to improve summarization results, but I suspect even with regular prediction you'd get something in the right ballpark.) On these more structured tasks, it doesn't get anywhere near the state of the art set by specialized systems -- but again, this is without any finetuning for the specific task that we are testing.
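The "TL;DR:" trick can be sketched in a few lines. This is only an illustration of the prompting idea, not OpenAI's code: the function names are made up for this example, and `generate` is a hypothetical stand-in for any next-word language model that continues a prompt.

```python
from typing import Callable


def build_summary_prompt(document: str, cue: str = "TL;DR:") -> str:
    """Append the cue so that a next-word predictor continues with a summary."""
    return f"{document.strip()}\n{cue}"


def summarize(document: str, generate: Callable[[str], str]) -> str:
    """Ask a language model to "predict" a summary.

    `generate` is an assumption: any callable that takes a prompt and
    returns the model's continuation of it as text.
    """
    prompt = build_summary_prompt(document)
    return generate(prompt).strip()
```

No task-specific training happens here; the only thing that makes this "summarization" is the context the model is asked to continue.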

The paper argues that in order to get generally capable AI systems, we will need to train them on many different tasks, as in meta-learning. However, we might expect that we need hundreds of thousands of tasks in order to learn something general, just as we need hundreds of thousands of examples in order to develop good classifiers. Prediction of the next word in natural language is particularly good for this, because in order to predict well across a huge variety of text, you need to become good at many different tasks such as question answering, summarization, and even translation. The biggest challenge is in creating a dataset that has sufficient diversity -- they do this by scraping all outbound links from Reddit with at least 3 karma.

Unusually for research, but in accordance with its charter (AN #2), OpenAI has decided not to release the model publicly, citing the possibility of malicious uses of the model. This has been controversial, with the debate raging for days on Twitter. I haven't paid enough attention to the debate to give a reasonable summary so you'll have to rely on other sources for that.

Rohin's opinion: These are some pretty impressive results. I'm surprised that all of this came from a single order of magnitude more data and model size; I would have expected it to take more than that. I think this lends a lot of support to the hypothesis that unsupervised learning with sufficient amounts of compute and diverse data can lead to generally capable AI systems. (See this SlateStarCodex post for a more detailed version of this take.) This is also some evidence that we will have AI systems that can pass the Turing Test before we have general AI systems, that is, the Turing Test is not AI-complete.

Thinking About Risks From AI: Accidents, Misuse and Structure (Remco Zwetsloot et al) (summarized by Richard): The authors argue that in addition to risk from misuse of AI and "accidents", we should pay attention to the structural perspective: how AI changes the broader environment and incentives of various actors. Possible examples include creating winner-take-all competition or creating overlap between offensive and defensive actions. In the face of these effects, even competent and well-intentioned decision-makers might be pressured into making risky choices. To ameliorate this problem, more people should focus on AI policy, particularly social scientists and historians; and we should think hard about creating collective norms and institutions for AI.

Richard's opinion: This post makes an important point in a clear and concise way. My only concern is that "structural problems" is such a broad heading that practically anything can be included, making it more difficult to specifically direct attention towards existential threats (the same is true for the term "accidents", which to me doesn't properly reflect the threat of adversarial behaviour from AI). I don't know how to best handle this tradeoff, but think it's a point worth raising.

Rohin's opinion: I just wanted to add a note on why we've highlighted this piece. While many of the particular concrete examples have been explained before, the underlying system for thinking about AI is new and useful. I particularly liked the distinction made between focusing on agency in AI (which leads you to think about accidents and misuse) vs. thinking about incentives and structure (which leads you to think about the entire causal chain leading up to the moment where an agent causes something bad to happen).

Technical AI alignment

Reward learning theory

"Normative assumptions" need not be complex, Humans interpreting humans and Anchoring vs Taste: a model (Stuart Armstrong): We have seen before that since humans are not perfectly rational, it is impossible (AN #31) to deduce their preferences, even with a simplicity prior, without any additional assumptions. This post makes the point that those assumptions need not be complex -- for example, if we could look at the "source code" of an agent, and we can find one major part with the same type signature as a reward function, and another major part with the type signature of a planner, then we can output the first part as the reward function. This won't work on humans, but we can hope that a similarly simple assumption that bakes in a lot of knowledge about humans could allow us to infer human preferences.

Since we seem to be very capable of inferring preferences of other humans, we might want to replicate our normative assumptions. The key idea is that we model ourselves and others in very similar ways. So, we could assume that if H is a human and G another human, then G's models of H's preferences and rationality are informative of H's preferences and rationality.

Stuart then shows how we could apply this to distinguish between choices made as a result of anchoring bias vs. actual taste preferences. Suppose that in condition 1, our human H would pay $1 or $3 for the same bar of chocolate depending on whether they were anchored on $0.01 or $100, and in condition 2 they would pay $1 or $3 depending on whether the chocolate has nuts. Ideally, we'd call the first case a bias, and the second one a preference. But in both cases, H's choice was determined by access to some information, so how can we distinguish between them? If we have access to H's internal model, we might expect that in the nuts case the information about nuts passes through a world model that then passes it on to a reward evaluator, whereas in the anchoring case the world model throws the information away, but it still affects the reward evaluator through a side channel. So we could add the normative assumption that only information that goes through the world model can be part of preferences. Of course, we could imagine another agent where the anchoring information goes through the world model and the nuts goes through the side channel -- but this agent is not human-like.
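As a toy illustration of the anchoring-vs-taste distinction, here is a minimal sketch of such an agent. The $1/$3 payments and the $100 anchor come from the example above; the additive form of the bias, the class and function names, and the idea that the anchor simply adds $2 are assumptions made for this sketch, not Stuart's model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    has_nuts: bool
    anchor_price: Optional[float] = None  # price H was shown beforehand, if any


def world_model(obs: Observation) -> dict:
    """Keep facts about the chocolate; the anchor is thrown away here."""
    return {"has_nuts": obs.has_nuts}


def preference(obs: Observation) -> float:
    """Value computed only from information that passed through the world model."""
    state = world_model(obs)
    return 3.0 if state["has_nuts"] else 1.0


def anchoring_side_channel(obs: Observation) -> float:
    """Hypothetical bias term: the anchor reaches the evaluator directly."""
    if obs.anchor_price is not None and obs.anchor_price >= 100:
        return 2.0
    return 0.0


def observed_payment(obs: Observation) -> float:
    """What H actually pays: preference plus the side-channel effect."""
    return preference(obs) + anchoring_side_channel(obs)
```

Under the normative assumption, `preference` is what we report as H's values, while the gap between `observed_payment` and `preference` is labeled bias. An agent wired the other way around would produce the same behavior but would not be human-like.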

Rohin's opinion: There's one possible view where you look at the impossibility result around inferring preferences, and think that value alignment is hopeless. I don't subscribe to this view, for basically the reasons given in this post -- while you can't infer preferences for arbitrary agents, it certainly seems possible for humans in particular.

That said, I would expect that we accomplish this by learning a model that implicitly knows how to think about human preferences, rather than by explicitly constructing particular normative assumptions that we think will lead to good behavior. Explicit assumptions will inevitably be misspecified (AN #32), which is fine if we can correct the misspecification in the future, but at least under the threat model of an AI system that prevents us from changing its utility function (which I believe is the threat model Stuart usually considers) this isn't an option available to us.

Philosophical deliberation

The Argument from Philosophical Difficulty (Wei Dai): Since humans disagree wildly on what a good future looks like or what a good ethical theory is, we need to solve these philosophical problems in order to ensure a good future (which here means that we capture "most" of the value that we could get in theory). For example, we need to figure out what to do given that we might be in a simulation, and we need to make sure we don't lose sight of our "true" values in the presence of manipulation (AN #37). AI will tend to exacerbate these problems, for example because it will likely differentially accelerate technological progress relative to moral progress.

One way to achieve this is to make sure the AI systems we build correctly solve these problems. We could either solve the philosophical issues ourselves and program them in, specify a metaphilosophy module that allows the AI to solve philosophy problems itself, or have the AI learn philosophy from humans/defer to humans for philosophical solutions. Other possibilities include coordination to "keep the world stable" over a period of (say) millennia where we solve philosophical problems with AI help, and building corrigible AI systems with the hope that their overseers will want to solve philosophical problems. All of these approaches seem quite hard to get right, especially given "human safety problems", that is, the fact that human moral intuitions likely do not generalize outside the current environment, and that they can be easily manipulated.

Rohin's opinion: This seems like a real problem, but I'm not sure how important it is. It definitely seems worth thinking about more, but I don't want to rule out the possibility that the natural trajectory we will take, assuming we develop useful AI systems, will lead to us solving philosophical problems before doing anything too extreme, or before our values are irreversibly corrupted. I currently lean towards this view; however, I'm very uncertain about this since I haven't thought about it enough. Regardless of importance, it does seem to have almost no one working on it and could benefit from more thought. (See this comment thread for more details.)

Some Thoughts on Metaphilosophy (Wei Dai): This post considers some ways that we could think about what philosophy is. In particular, it highlights perspectives about what philosophy does (answer confusing questions, enable us to generalize out of distribution, solve meta-level problems that can then be turned into fast object-level domain-specific problem solvers) and how it works (slow but general problem solving, interminable debate, a general Turing Machine). Given that we haven't figured out metaphilosophy yet, we might want to preserve option value by e.g. slowing down technological progress until we solve metaphilosophy, or try to replicate human metaphilosophical abilities using ML.

Rohin's opinion: I think this is getting at a human capability I've been thinking about, which I sometimes call explicit or logical reasoning. Its key property is that it generalizes well out of distribution, but is very slow to run. I definitely want to understand it better for the purpose of forecasting what AI will be able to do in the future. It would also be great to understand the underlying principles in order to figure out how to actually get good generalization.

Adversarial examples

On Evaluating Adversarial Robustness (Nicholas Carlini et al)

Verification

Certified Adversarial Robustness via Randomized Smoothing (Jeremy M Cohen et al)

Forecasting

Evidence on good forecasting practices from the Good Judgment Project (Daniel Kokotajlo) (summarized by Richard): This post lists some of the key traits which are associated with successful forecasting, based on work from the Good Judgment Project (which won IARPA's forecasting tournament by a wide margin). The top 5: past performance in the same broad domain; making more predictions on the same question; deliberation time; collaboration on teams; and intelligence. The authors also summarise various other ideas from the Superforecasting book.

Read more: Accompanying blog post

Miscellaneous (Alignment)

Three Biases That Made Me Believe in AI Risk (beth) (summarized by Richard): Beth (not to be confused with AI safety researcher Beth Barnes) argues firstly that the language we use overly anthropomorphises AI, which leads to an exaggerated perception of risks; secondly, that the sense of meaning that working on AI safety provides causes motivated reasoning; and thirdly, that we anchor away from very low numbers (e.g. it seems absurd to assign existential AI risk a probability of 0.0000000000000000000000000000001, since that has so many zeros! Yet Beth thinks this number significantly overestimates the risk.)

Richard's opinion: I'm glad to see this sort of discussion taking place - however, I disagree quite strongly with arguments 1 and 3. On 1: it's true that for current systems, it's often better to describe them without assigning them agency, but only because they're still very simple compared with humans (or smart animals). Whether or not it will be appropriate to consider advanced AI to have intentions and goals is a complex question - I think there are strong arguments for that claim. On 3: I think that it's very reasonable to shy away from very small probabilities without overwhelming amounts of evidence, to counteract standard human overconfidence. Beth's alternative of reasoning using bits of evidence seems like it would push almost everyone towards unjustifiably strong conclusions on most questions, as it does for her on AI risk.
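For readers unfamiliar with "bits of evidence", here is a small sketch of the standard notion: the base-2 logarithm of the factor by which the odds change. This is the usual textbook definition, not necessarily Beth's exact formulation, and the function names and example probabilities are illustrative assumptions.

```python
import math


def odds(p: float) -> float:
    """Convert a probability (strictly between 0 and 1) to odds in favour."""
    return p / (1.0 - p)


def bits_of_evidence(prior: float, posterior: float) -> float:
    """Bits needed to move from `prior` to `posterior`.

    Each bit doubles (positive) or halves (negative) the odds.
    """
    return math.log2(odds(posterior) / odds(prior))


# Moving from even odds to odds of 1:1024 takes roughly -10 bits.
print(bits_of_evidence(0.5, 1 / 1025))
```

Because each bit only halves the odds, reaching a probability with dozens of leading zeros requires on the order of a hundred bits, which is why Richard argues that such confidence should demand overwhelming evidence.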

Would I think for ten thousand years? (Stuart Armstrong): Many ideas in AI safety involve delegating key decisions to simulations that can think longer. This post points out that you need to worry about value drift and other unforeseen problems in this situation. The comments also point out that there will likely be differences between the simulation and the real world that could be problematic (e.g. what prevents the humans from going crazy from isolation?)

Rohin's opinion: Typically, if you argue that a simulation of you that thinks longer can't solve some problem X, the natural response is that this implies you couldn't solve X either. However, the differences between the simulation environment and the real environment could mean that in reality you could solve a problem that you couldn't solve in simulation (e.g. imagine the simulation didn't have access to the Internet). This suggests that if you wanted to do this you'd have to set up the simulation very carefully.

AI strategy and policy

Thinking About Risks From AI: Accidents, Misuse and Structure (Remco Zwetsloot et al): Summarized in the highlights!

Risk factors for s-risks (Tobias Baumann) (summarized by Richard): This post discusses four risk factors for creating extreme disvalue in the universe (s-risks): advanced technology, lack of effort to avoid those outcomes, inadequate security and law enforcement, and polarisation and divergence of values. Tobias notes that he's most worried about cases where most of these factors occur, because the absence of any of them mitigates the threat posed by the others.

Toward AI Security: Global Aspirations for a More Resilient Future (Jessica Cussins Newman): This report analyzes various AI security risks (including both near-term and long-term concerns) and categorizes them, and then analyzes how different national strategies and policies have engaged with these risks. Most interestingly (to me) it comes to the conclusion that most national AI strategies are focused on very different areas and often ignore (in the sense of not mentioning) risks that other countries have highlighted, though there are still some areas for cooperation, such as improving the transparency and accountability of AI systems.

Rohin's opinion: It's pretty strange to me that different governments would take such different approaches to AI - this suggests that either academics, think tanks, policy analysts etc. do not agree on the risks, or that there isn't enough political pressure for some of the risks to make it into the strategies. It seems like the AI community would have a significant opportunity to shape policy in the latter case -- I'd imagine for example that an open letter signed by thousands of researchers could be quite helpful in creating political will. (Of course, creating a comprehensive open letter that most researchers will approve of might be quite hard to do.)

Ethical and societal implications of algorithms, data, and artificial intelligence: a roadmap for research (Nuffield Foundation and Leverhulme Centre for the Future of Intelligence)

Other progress in AI

Reinforcement learning

Introducing PlaNet: A Deep Planning Network for Reinforcement Learning (Danijar Hafner et al)

Deep learning

Better Language Models and Their Implications (Alec Radford, Jeffrey Wu, Dario Amodei, Ilya Sutskever et al): Summarized in the highlights!

Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey (Longlong Jing et al)

News

FHI DPhil Scholarships (Rose Hadshar): The Future of Humanity Institute is accepting applications for scholarships for candidates beginning a DPhil programme.
