Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Estimating the Brittleness of AI: Safety Integrity Levels and the Need for Testing Out-Of-Distribution Performance (Andrew L. John) (summarized by Flo): Test, Evaluation, Verification, and Validation (TEVV) is an important barrier for AI applications in safety-critical areas. Current TEVV standards have very different rules for certifying software and certifying human operators. It is not clear which of these processes should be applied to AI systems.

If we treat AI systems as similar to human operators, we would certify them by ensuring that they pass tests of ability. This does not give much of a guarantee of robustness (since only a few situations can be tested), and is only acceptable for humans because humans tend to be more robust to new situations than software. This could be a reasonable assumption for AI systems as well: while systems are certainly vulnerable to adversarial examples, the authors find that AI performance degrades surprisingly smoothly out of distribution in the absence of adversaries, in a plausibly human-like way.

While AI might have some characteristics of operators, there are good reasons to treat it as software. The ability to deploy multiple copies of the same system increases the threat of correlated failures, which is less true of humans. In addition, parallelization can allow for the more extensive testing that is typical of software TEVV. For critical applications, a common standard is that of Safety Integrity Levels (SILs), which correspond to approximate failure rates per hour. Current AI systems fail far more often than current SILs for safety-critical applications demand. For example, an image recognition system would require an accuracy of 0.99999997 at 10 processed frames per second just to reach the weakest SIL used in aviation.
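The arithmetic behind that figure can be checked with a short sketch. It assumes that every misclassified frame counts as a failure and that frames are processed continuously; the function names are illustrative, and the threshold of roughly one failure per thousand hours is inferred from the numbers in the summary, not quoted from any standard.

```python
SECONDS_PER_HOUR = 3600


def failures_per_hour(accuracy: float, frames_per_second: float) -> float:
    """Expected number of misclassified frames per hour of operation.

    Assumption: each misclassified frame is counted as one failure.
    """
    return (1.0 - accuracy) * frames_per_second * SECONDS_PER_HOUR


def required_accuracy(max_failures_per_hour: float, frames_per_second: float) -> float:
    """Per-frame accuracy needed to stay under a given hourly failure budget."""
    return 1.0 - max_failures_per_hour / (frames_per_second * SECONDS_PER_HOUR)


# The summary's example: accuracy 0.99999997 at 10 frames per second.
rate = failures_per_hour(0.99999997, 10)
print(f"failures per hour: {rate:.2e}")  # roughly one failure per thousand hours

# Going the other way: a budget of 1e-3 failures per hour at 10 frames per second.
print(f"required accuracy: {required_accuracy(1e-3, 10):.10f}")
```

At 10 frames per second a system sees 36,000 frames per hour, so even a tiny per-frame error rate adds up quickly.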

However, SILs are often applied at multiple levels, and it is possible to build a system with a strong SIL from weaker components, either by using redundant components that fail independently or by detecting failures sufficiently early. AI modules could therefore still be used safely as parts of a system specifically structured to cope with their failures. For example, we can use out-of-distribution detection to revert to a safe policy in simple applications. This is not possible for higher levels of automation, though, where such a policy might not be available.
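A minimal sketch of both ideas follows. It assumes that component failures are truly independent and that hourly failure rates are small enough to be treated as probabilities. All names (`redundant_failure_rate`, `monitored_failure_rate`, `act`, `ood_score`) are hypothetical and not taken from the paper.

```python
from typing import Callable, Sequence


def redundant_failure_rate(component_rates: Sequence[float]) -> float:
    """Rate at which ALL redundant components fail at once.

    Assumption: failures are independent, so the rates multiply.
    Correlated failures (e.g. copies of one model) break this assumption.
    """
    combined = 1.0
    for rate in component_rates:
        combined *= rate
    return combined


def monitored_failure_rate(component_rate: float, detector_miss_rate: float) -> float:
    """Rate of failures that slip past a monitor that triggers a safe fallback."""
    return component_rate * detector_miss_rate


def act(observation, learned_policy: Callable, safe_policy: Callable,
        ood_score: Callable[[object], float], threshold: float):
    """Fall back to a safe policy when the input looks out of distribution."""
    if ood_score(observation) > threshold:
        return safe_policy(observation)
    return learned_policy(observation)


# Three independent components, each failing about once per hundred hours.
print(redundant_failure_rate([1e-2, 1e-2, 1e-2]))

# One component failing about once per thousand hours, with a monitor that
# misses one failure in a hundred.
print(monitored_failure_rate(1e-3, 1e-2))
```

The multiplication only helps if the independence assumption holds, which is exactly what the concern about correlated failures across copies of the same system calls into question.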

Flo's opinion: While I agree with the general thrust of this article, comparing image misclassification rates to rates of catastrophic failures in aviation seems a bit harsh. I have difficulty imagining an aviation system that fails due to a single input that has been processed wrongly, even though the correlation between subsequent failures given similar inputs might mean that this is not necessary for locally catastrophic outcomes.

Rohin's opinion: My guess is that we’ll need to treat systems based primarily on neural nets similarly to operators. The main reason for this is that the tasks that AI systems will solve are usually not even well-defined enough to have a reliability rate like 0.99999997 (or even a couple of orders of magnitude worse). For example, human performance on image classification datasets is typically under 99%, not because humans are bad at image recognition, but because in many cases what the “true label” should be is ambiguous. For another example, you’d think “predict the next word” would be a nice unambiguous task definition, but then for the question “How many bonks are in a quoit?”, should your answer be “There are three bonks in a quoit” or “The question is nonsense”? (If you’re inclined to say that it’s obviously the latter, consider that many students will do something like the former if they see a question they don’t understand on an exam.)



AI Paradigms and AI Safety: Mapping Artefacts and Techniques to Safety Issues (Jose Hernandez-Orallo et al) (summarized by Rohin) (H/T Haydn Belfield): What should prioritization within the field of AI safety look like? Ideally, we would proactively look for potential issues that could arise with many potential AI technologies, making sure to cover the full space of possibilities rather than focusing on a single area. What does prioritization look like in practice? This paper investigates, and finds that it is pretty different from this ideal.

In particular, they define a set of 14 categories of AI techniques (examples include neural nets, planning and scheduling, and combinatorial optimization), and a set of 10 kinds of AI artefacts (examples include agents, providers, dialoguers, and swarms). They then analyze trends in the amount of attention paid to each technique or artefact, both for AI safety and AI in general. Note that they construe AI safety very broadly by including anything that addresses potential real-world problems with AI systems.

While there are a lot of interesting trends, the main conclusion is that there is an approximately 5-year delay between the emergence of an AI paradigm and safety research into that paradigm. In addition, safety research tends to neglect non-dominant paradigms.

Rohin's opinion: One possible conclusion is that safety research should be more diversified across different paradigms and artefacts, in order to properly maximize expected safety. However, this isn’t obvious: it seems likely that if the dominant paradigm has 50% of the research, it will also have, say, 80% of future real-world deployments, and so it could make sense to have 80% of the safety research focused on it. Rather than try to predict which paradigm will become dominant (a very difficult task), it may be more efficient to simply observe which paradigm becomes dominant and then redirect resources at that time (even though that process takes 5 years to happen).


Avoiding Negative Side Effects due to Incomplete Knowledge of AI Systems (Sandhya Saisubramanian et al) (summarized by Rohin): This paper provides an overview of the problem of negative side effects, and recent work that aims to address it. It characterizes negative side effects based on whether they are severe, reversible, avoidable, frequent, stochastic, observable, or exclusive (i.e. preventing the agent from accomplishing its main task), and describes existing work and how it relates to these characteristics.

In addition to the canonical point that negative side effects arise because the agent’s model is lacking (whether about human preferences or environment dynamics or important features to pay attention to), they identify two other main challenges with negative side effects. First, fixing negative side effects would likely require collecting feedback from humans, which can be expensive and challenging. Second, there will usually be a tradeoff between pursuing the original goal and avoiding negative side effects; we don’t have principled methods for dealing with this tradeoff.

Finally, they provide a long list of potential directions for future side effect research.


Foundational Philosophical Questions in AI Alignment (Lucas Perry and Iason Gabriel) (summarized by Rohin): This podcast starts with the topic of the paper Artificial Intelligence, Values and Alignment (AN #85) and then talks about a variety of different philosophical questions surrounding AI alignment.

Exploring AI Safety in Degrees: Generality, Capability and Control (John Burden et al) (summarized by Rohin) (H/T Haydn Belfield): This paper argues that we should decompose the notion of “intelligence” in order to talk more precisely about AI risk, and in particular suggests focusing on generality, capability, and control. We can think of capability as the expected performance of the system across a wide variety of tasks. For a fixed level of capability, generality can be thought of as how well the capability is distributed across different tasks. Finally, control refers to the degree to which the system is reliable and deliberate in its actions. The paper qualitatively discusses how these characteristics could interact with risk, and shows an example quantitative definition for a simple toy environment.



The Animal-AI Testbed and Competition (Matthew Crosby et al) (summarized by Rohin) (H/T Haydn Belfield): The Animal-AI testbed tests agents on the ability to solve the sorts of tasks that are used to test animal cognition: for example, whether the agent can reach around a transparent obstacle in order to obtain the food inside. This has a few benefits over standard RL environments:

1. The Animal-AI testbed is designed to test for specific abilities, unlike environments based on existing games like Atari.

2. A single agent is evaluated on multiple hidden tasks, preventing overfitting. In contrast, in typical RL environments the test setting is identical to the train setting, and so overfitting would count as a valid solution.

The authors ran a competition at NeurIPS 2019 in which submissions were tested on a wide variety of hidden tasks. The winning submission used an iterative method to design the agent: after using PPO to train an agent with the current reward and environment suite, the designer would analyze the behavior of the resulting agent, tweak the reward and environments, and then continue training, in order to increase robustness. However, the winning agent still falls far short of the perfect 100% that the author can achieve on the tests (though the author is not seeing the tests for the first time, as the agents are).

Read more: Building Thinking Machines by Solving Animal Cognition Tasks

Rohin's opinion: I’m not sure that the path to general intelligence needs to go through replicating embodied animal intelligence. Nonetheless, I really like this benchmark, because its evaluation setup involves new, unseen tasks in order to prevent overfitting, and because of its focus on learning multiple different skills. These features seem important for RL benchmarks regardless of whether we are replicating animal intelligence or not.

Generalized Hindsight for Reinforcement Learning (Alexander C. Li et al) (summarized by Rohin): Hindsight Experience Replay (HER) introduced the idea of relabeling trajectories in order to provide more learning signal for the algorithm. Intuitively, if you stumble upon the kitchen while searching for the bedroom, you can’t learn much about the task of going to the bedroom, but you can learn a lot about the task of going to the kitchen. So even if the original task was to go to the bedroom, we can simply pretend that the trajectory got rewards as if the task was to go to the kitchen, and then update our kitchen-traversal policy using an off-policy algorithm.
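A minimal sketch of this relabeling step is below. It assumes a sparse goal-reaching reward and relabels with the state the trajectory actually ended in (the "kitchen"). The `Transition` type and function names are illustrative, not taken from the HER paper.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

State = Tuple[int, int]  # assumption: states are grid coordinates


@dataclass(frozen=True)
class Transition:
    state: State
    action: int
    next_state: State
    goal: State
    reward: float


def sparse_reward(next_state: State, goal: State) -> float:
    """Assumed reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if next_state == goal else -1.0


def relabel_with_final_state(trajectory: List[Transition]) -> List[Transition]:
    """Pretend the goal was the state the trajectory actually reached."""
    achieved = trajectory[-1].next_state
    return [
        replace(t, goal=achieved, reward=sparse_reward(t.next_state, achieved))
        for t in trajectory
    ]


bedroom, kitchen = (5, 5), (0, 2)
original = [
    Transition((0, 0), 1, (0, 1), bedroom, -1.0),
    Transition((0, 1), 1, kitchen, bedroom, -1.0),
]
relabeled = relabel_with_final_state(original)
print([(t.goal, t.reward) for t in relabeled])
```

The relabeled transitions can then be added to the replay buffer alongside the originals and consumed by any off-policy algorithm.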

HER was limited to goal-reaching tasks, in which a trajectory would be relabeled as attempting to reach the state at the end of the trajectory. What if we want to handle other kinds of goals? The key insight of this paper is that trajectory relabeling is effectively an inverse RL problem: we want to find the task or goal for which the given trajectory is (near-)optimal. This allows us to generalize hindsight to arbitrary spaces of reward functions.

This leads to a simple algorithm: given a set of N possible tasks, when we get a new trajectory, rank how well that trajectory does relative to past experience for each of the N possible tasks, and then relabel that trajectory with the task for which it is closest to optimal (relative to past experience). Experiments show that this is quite effective and can lead to significant gains in sample efficiency. They also experiment with other heuristics for relabeling trajectories, which are less accurate but more computationally efficient.
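The ranking step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it scores a trajectory for each candidate task by the fraction of past returns for that task that it beats, and picks the task with the highest score. The task names, reward functions and the `past_returns` structure are all hypothetical.

```python
from bisect import bisect_left
from typing import Callable, Dict, List, Sequence, Tuple

Step = Tuple[str, str]  # assumption: (state, action) pairs with string labels
RewardFn = Callable[[str, str], float]


def trajectory_return(trajectory: Sequence[Step], reward_fn: RewardFn) -> float:
    """Undiscounted return of a trajectory under one task's reward function."""
    return sum(reward_fn(state, action) for state, action in trajectory)


def relabel_task(trajectory: Sequence[Step],
                 reward_fns: Dict[str, RewardFn],
                 past_returns: Dict[str, List[float]]) -> str:
    """Pick the task for which this trajectory ranks highest against past experience."""
    if not reward_fns:
        raise ValueError("need at least one candidate task")
    best_task, best_percentile = "", -1.0
    for task, reward_fn in reward_fns.items():
        ret = trajectory_return(trajectory, reward_fn)
        history = sorted(past_returns.get(task, []))
        # Fraction of past returns for this task that are strictly worse.
        percentile = bisect_left(history, ret) / len(history) if history else 0.0
        if percentile > best_percentile:
            best_task, best_percentile = task, percentile
    return best_task


reward_fns = {
    "kitchen": lambda s, a: 1.0 if s == "kitchen" else 0.0,
    "bedroom": lambda s, a: 1.0 if s == "bedroom" else 0.0,
}
past_returns = {"kitchen": [0.0, 0.0, 1.0], "bedroom": [0.0, 1.0, 1.0]}
trajectory = [("hall", "forward"), ("kitchen", "stop")]
print(relabel_task(trajectory, reward_fns, past_returns))
```

Comparing percentiles rather than raw returns matters because different tasks can have rewards on very different scales.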

Rohin's opinion: Getting a good learning signal can be a key challenge with RL. I’m somewhat surprised it took this long for HER to be generalized to arbitrary reward spaces -- it seems like a clear win that shouldn’t have taken too long to discover (though I didn’t think of it when I first read HER).

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement (Benjamin Eysenbach, Xinyang Geng et al) (summarized by Rohin): This paper was published at about the same time as the previous one, and has the same key insight. There are three main differences from the previous paper:

1. It shows theoretically that MaxEnt IRL is the “optimal” (sort of) way to relabel data if you want to optimize the multitask MaxEnt RL objective.

2. In addition to using the relabeled data with an off-policy RL algorithm, it also uses the relabeled data with behavior cloning.

3. It focuses on fewer environments and only uses a single relabeling strategy (MaxEnt IRL relabeling).


FHI is hiring Researchers, Research Fellows, and Senior Research Fellows (Anne Le Roux) (summarized by Rohin): FHI is hiring for researchers across a wide variety of topics, including technical AI safety research and AI governance. The application deadline is October 19.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.
