The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world.


What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) (Andrew Critch et al) (summarized by Rohin): A robust agent-agnostic process (RAAP) is a process that robustly leads to an outcome, without being very sensitive to the details of exactly which agents participate in the process, or how they work. This is illustrated through a “Production Web” failure story, which roughly goes as follows:

A breakthrough in AI technology leads to a wave of automation of $JOBTYPE (e.g. management) jobs. Any companies that don’t adopt this automation are outcompeted, and so soon most of these jobs are completely automated. This leads to significant gains at these companies and higher growth rates. These semi-automated companies trade amongst each other frequently, and a new generation of "precision manufacturing" companies arises that can build almost anything using robots given the right raw materials. A few companies develop new software that can automate $OTHERJOB (e.g. engineering) jobs. Within a few years, nearly all human workers have been replaced.

These companies are now roughly maximizing production within their various industry sectors. Lots of goods are produced and sold to humans at incredibly cheap prices. However, we can’t understand how exactly this is happening. Even board members of the fully mechanized companies can’t tell whether the companies are serving or merely appeasing humanity; government regulators have no chance.

We do realize that the companies are maximizing objectives that are incompatible with preserving our long-term well-being and existence, but we can’t do anything about it because the companies are both well-defended and essential for our basic needs. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

Notice that in this story it didn’t really matter what job type got automated first (nor did it matter which specific companies took advantage of the automation). This is the defining feature of a RAAP -- the same general story arises even if you change around the agents that are participating in the process. In particular, in this case competitive pressure to increase production acts as a “control loop” that ensures the same outcome happens, regardless of the exact details about which agents are involved.

Another (outer) alignment failure story (Paul Christiano) (summarized by Rohin): Suppose we train AI systems to perform task T by having humans look at the results that the AI system achieves and evaluating how well the AI has performed task T. Suppose further that AI systems generalize “correctly” such that even in new situations they are still taking those actions that they predict we will evaluate as good. This does not mean that the systems are aligned: they would still deceive us into thinking things are great when they actually are not. This post presents a more detailed story for how such AI systems can lead to extinction or complete human disempowerment. It’s relatively short, and a lot of the force comes from the specific details that I’m not going to summarize, so I do recommend you read it in full. I’ll be explaining a very abstract version below.

The core aspects of this story are:

1. Economic activity accelerates, leading to higher and higher growth rates, enabled by more and more automation through AI.

2. Throughout this process, we see some failures of AI systems where the AI system takes some action that initially looks good but that we later find out was quite bad (e.g. investing in a Ponzi scheme that the AI knows is a Ponzi scheme but the human doesn’t).

3. Despite this failure mode being known and lots of work being done on the problem, we are unable to find a good conceptual solution. The best we can do is to build better sensors, measurement devices, checks and balances, etc. in order to provide better reward functions for agents and make it harder for them to trick us into thinking their actions are good when they are not.

4. Unfortunately, since the proportion of AI work keeps increasing relative to human work, this extra measurement capacity doesn’t work forever. Eventually, the AI systems are able to completely deceive all of our sensors, such that we can’t distinguish between worlds that are actually good and worlds which only appear good. Humans are dead or disempowered at this point.

(Again, the full story has much more detail.)

Rohin's opinion: Both the previous story and this one seem quite similar to each other, and seem pretty reasonable to me as a description of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.

A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment or on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. this comment thread.



Rissanen Data Analysis: Examining Dataset Characteristics via Description Length (Ethan Perez et al) (summarized by Rohin): We are often interested in estimating how useful a particular capability might be for a model. For example, for Factored Cognition (AN #36) we're interested in how useful the "decomposition" ability is, that is, how useful it is to decompose the original question into subquestions (as in this paper (AN #95)). This paper proposes a simple methodology: give the model oracle access to the capability in question, and see how much it improves its predictions. This is measured in an online learning setup (rather than in one fell swoop at the end of training), in order to evaluate how useful the capability is in both low and high data regimes.

(The paper frames this as asking how much better you can compress the labels when you have access to the capability, relative to not having the capability. This can be seen as an upper bound on the minimum description length, which in turn is one way of operationalizing Occam's razor. I find the prediction view more intuitive, and as far as I can tell the two views are equivalent in the context of this paper.)
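The link between the prediction view and the compression view can be made concrete with a toy sketch. The code below is not the paper's implementation: it swaps the neural models for a simple count-based predictor with add-one smoothing, and the data generator `make_toy_data` and the "hidden bit" capability are hypothetical stand-ins invented for illustration. It encodes each label using only what was learned from earlier examples, and compares the total number of bits with and without oracle access to the capability.

```python
import math
import random
from collections import defaultdict


def online_code_length(examples, use_capability):
    """Bits needed to encode the labels one at a time, learning as we go.

    examples: list of (capability_output, label) pairs with binary labels.
    use_capability: if False, the predictor ignores capability_output.
    """
    # counts[context][label], starting at 1 for add-one (Laplace) smoothing
    counts = defaultdict(lambda: [1, 1])
    total_bits = 0.0
    for capability_output, label in examples:
        context = capability_output if use_capability else None
        seen = counts[context]
        p_label = seen[label] / (seen[0] + seen[1])
        total_bits += -math.log2(p_label)  # cost of encoding this label
        seen[label] += 1                   # update only after paying the cost
    return total_bits


def make_toy_data(n, noise, seed=0):
    """Hypothetical data: the label is a hidden bit, flipped with probability `noise`."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        hidden_bit = rng.randint(0, 1)
        label = hidden_bit if rng.random() > noise else 1 - hidden_bit
        data.append((hidden_bit, label))
    return data


data = make_toy_data(n=500, noise=0.1)
bits_without = online_code_length(data, use_capability=False)
bits_with = online_code_length(data, use_capability=True)
print(f"without capability: {bits_without:.1f} bits")
print(f"with capability:    {bits_with:.1f} bits")
```

Because every label is scored before the predictor updates on it, the total reflects performance across both the low-data and high-data parts of the sequence. A lower total with the capability means both better online predictions and a shorter description of the labels.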

They then use this framework to investigate a bunch of empirical questions:

1. For question answering models trained from scratch, both ML decompositions and human decompositions are helpful, though ML still has a long way to go to catch up to human decompositions.

2. One way to evaluate gender bias in a dataset is to ask how useful the "capability" of seeing the male-gendered words is, relative to the same question for female-gendered words. This confirms the general male-gendered bias, even in a dataset that has more female-gendered words.

3. Some papers have claimed that neural nets are effectively "bag-of-words" models, i.e. they don't pay attention to the ordering of words in a sentence. They evaluate how useful the capability of "getting the correct order" is, and find that it does lead to significantly better results.

Making AI Safe through Debate (Jeremie Harris and Ethan Perez) (summarized by Rohin): This hour-long podcast is a good introduction to iterated amplification and debate, from a more ML perspective than most other explanations.

AXRP Episode 6 - Debate and Imitative Generalization (Daniel Filan and Beth Barnes) (summarized by Rohin): This podcast covers a bunch of topics, such as debate (AN #5), cross examination (AN #86), HCH (AN #34), iterated amplification (AN #40), and imitative generalization (AN #133) (aka learning the prior (AN #109)), along with themes about universality (AN #81). Recommended for getting a broad overview of this particular area of AI alignment.


LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning (Jian Liu, Leyang Cui et al) (summarized by Dan Hendrycks): LogiQA is a benchmark that attempts to track models' understanding of logic and reason.

It consists of translated questions from the Civil Servants Examination of China, designed to test civil servant candidates.

The questions often require thought and deliberation. Two examples are as follows:

David knows Mr. Zhang's friend Jack, and Jack knows David's friend Ms. Lin. Everyone of them who knows Jack has a master's degree, and everyone of them who knows Ms. Lin is from Shanghai.

Who is from Shanghai and has a master's degree?

A. David.

B. Jack.

C. Mr. Zhang.

D. Ms. Lin.

Last night, Mark either went to play in the gym or visited his teacher Tony. If Mark drove last night, he didn't go to play in the gym. Mark would go visit his teacher Tony only if he and his teacher had an appointment. In fact, Mark had no appointment with his teacher Tony in advance.

Which is true based on the above statements?

A. Mark went to the gym with his teacher Tony last night.

B. Mark visited his teacher Tony last night.

C. Mark didn't drive last night.

D. Mark didn't go to the gym last night.

See Figure 2 of the paper for the answers to these two questions (I don't want to spoil the answers). In the paper, the authors show that RoBERTa models obtain around 36% accuracy, whereas human-level accuracy is around 86%.

Dan Hendrycks' opinion: This is one of the few datasets that poses a challenge to today's Transformers, which makes it noteworthy. Despite its difficulty, accuracy is nontrivial and reliably increasing. In the appendix of a recent work, I and others show that performance on LogiQA is around 50% for an 11 billion parameter Transformer model. (Note that 50% is the model's OOD generalization accuracy or transfer accuracy. The model was fine-tuned to answer some types of multiple-choice questions, but it was not fine-tuned on LogiQA-style questions at all.) Consequently, current models are already attaining nontrivial performance. Having done some LogiQA questions myself, I am surprised accuracy is already this high. Moreover, LogiQA accuracy is reliably increasing, as accuracy is increasing by about 15% for every order of magnitude increase in model size. If trends continue, a 1 trillion parameter model (10x the size of GPT-3) should be able to "solve" LogiQA.

I think LogiQA provides clear evidence that off-the-shelf Transformers are starting to acquire many "System 2" reasoning skills and can perform more than just snap judgments.
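The extrapolation in the opinion above can be written out as a short calculation. This is only a sketch of the arithmetic: it assumes the trend is linear in the logarithm of parameter count and that "about 15%" means 15 percentage points per tenfold increase, neither of which the text states explicitly. The function name and constants are illustrative.

```python
import math


def extrapolate_accuracy(params, ref_params, ref_accuracy, points_per_decade):
    """Accuracy (in %) predicted by a trend that is linear in log10(parameters)."""
    decades = math.log10(params / ref_params)
    return ref_accuracy + points_per_decade * decades


# Figures quoted in the text: about 50% at 11 billion parameters.
REF_PARAMS = 11e9
REF_ACCURACY = 50.0
POINTS_PER_DECADE = 15.0  # assumption: "15%" is read as percentage points

for params in (11e9, 110e9, 1e12):
    acc = extrapolate_accuracy(params, REF_PARAMS, REF_ACCURACY, POINTS_PER_DECADE)
    print(f"{params:.0e} parameters -> about {acc:.0f}% accuracy")
```

With these inputs the 1 trillion parameter estimate lands at roughly 79%, which can be read against the roughly 86% human-level figure quoted above. Changing `POINTS_PER_DECADE` shows how sensitive the estimate is to the assumed slope.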


AI and International Stability: Risks and Confidence-Building Measures (Michael Horowitz et al) (summarized by Flo): Militaries are likely incentivized to integrate machine learning into their operations, and because AI is a general-purpose technology, we cannot expect militaries to not use it at all. Still, it matters a great deal how and for which purposes militaries use AI. While militaries are currently not spending a lot on AI, there are several risks from broader adoption:

1. An acceleration of warfare, and ensuing pressure for more automation as well as increased difficulty of managing escalation.

2. More difficulty in assessing others' strength and less immediate human cost of conflict, leading to more risk-taking.

3. Accidents due to AI systems' brittleness being mistaken for attacks and inflaming tensions.

This paper explores confidence-building measures (CBMs) as a way to reduce the negative effects of military AI use on international stability. CBMs were an important tool during the Cold War. However, as CBMs rely on a shared interest to succeed, their adoption has proven challenging in the context of cybersecurity, where the stakes of conflict are less clear than in the Cold War. The authors present a set of CBMs that could diminish risks from military use of AI and discuss their advantages and downsides. On the broad side, these include building norms around the military use of AI, dialogues between civil actors from different countries with expertise in the military use of AI, military-to-military dialogues, and codes of conduct with multilateral support. On the more specific side, states could publicly signal the importance of Test and Evaluation (T&E), be transparent about T&E processes, and push for international standards for military AI T&E. In addition, they could cooperate on civilian AI safety research, agree on specific rules to prevent accidental escalation (similar to the Incidents at Sea Agreement from the Cold War), clearly mark autonomous systems as such, and declare certain areas as off-limits for autonomous systems. Regarding nuclear weapons, the authors suggest an agreement between states to retain exclusive human control over nuclear launch decisions and a prohibition of uninhabited nuclear launch platforms such as submarines or bombers armed with nuclear weapons.

Read more: Import AI #234

Flo's opinion: While some of the discussed measures like marking autonomous weapon systems are very specific to the military use of AI, others such as the measures focussed on T&E could be useful more broadly to reduce risks from competitive pressures around AI. I believe that military AI use is the largest source of AI risk in the next few years, so I am glad that people are working on this.


FLI Job Postings (summarized by Rohin): The Future of Life Institute has 3 new job postings for full-time-equivalent, remote, policy-focused positions. They're looking for a Director of European Policy, a Policy Advocate, and a Policy Researcher, all primarily focused on AI policy and governance. Additional policy areas of interest may include lethal autonomous weapons, synthetic biology, nuclear weapons policy, and the management of existential and global catastrophic risk. Applications are accepted on a rolling basis.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.
