Cross-posted from the BAIR Blog and from my blog, Bounded Regret.

Along with researchers from Google Brain and OpenAI, we are releasing a paper on Unsolved Problems in ML Safety. Due to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. As a preview of the paper, in this post we consider a subset of the paper's directions, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), and steering ML systems ("Alignment").

Overview. In this post, we consider problems in Robustness, Monitoring, and Alignment. In the full paper, we also consider the problem of External Safety.


Robustness research aims to build systems that are less vulnerable to extreme hazards and to adversarial threats. Two problems in robustness are robustness to long tails and robustness to adversarial examples.

Long Tails

Examples of long tail events. First row, left: an ambulance in front of a green light. First row, middle: birds on the road. First row, right: a reflection of a pedestrian. Bottom row, left: a group of people cosplaying. Bottom row, middle: a foggy road. Bottom row, right: a person partly occluded by a board on their back. (Source)

To operate in open-world high-stakes environments, ML systems will need to endure unusual events and tail risks. However, current ML systems are brittle in the face of real-world complexity, and they become unreliable when faced with novel situations like those above. To reduce the chance that ML systems choose a wrong course of action in environments dominated by rare events, models need to be unusually robust.

Adversarial Examples

Adversarial perturbations. An example of an input image altered by an adversarial perturbation. After the adversarial perturbation, the neural network makes a high confidence mistake. (Source)

Adversaries can easily manipulate vulnerabilities in ML systems and cause them to make mistakes. As shown above, carefully crafted small perturbations are enough to break ML systems. In the paper, we focus on this problem but also suggest that researchers consider more realistic settings, like when attackers can create perceptible images or when attack specifications are not known beforehand.


Monitoring research aims to create tools and features that help human operators identify hazards and inspect ML systems. Two problems in monitoring are anomaly detection and backdoor detection. This list is nonexhaustive, and we include other problems in the paper including calibration, honest outputs, and detecting emergent capabilities.

Anomaly Detection

Anomaly detection. On the left is a usual image which belongs to an ImageNet class, so the ImageNet classifier knows how to handle the image. On the right is an anomalous image which does not belong to any ImageNet class. Nonetheless, the model classifies the image with high confidence.

Anomaly detectors can warn human operators of potential hazards, and this can help them reduce their exposure to hazards. For example, anomaly detectors can help detect malicious uses of ML systems or flag novel examples for human review. However, deep learning-based anomaly detectors are not highly reliable, as shown in the figure above.


Backdoors. Depicted is a backdoored facial recognition system that gates building access. The backdoor could be triggered by a specific unique item chosen by an adversary, such as a pair of glasses. If the adversary wears that specific pair of glasses, the backdoored facial recognition will allow the adversary in the building. (Source)

ML systems risk carrying backdoors. Backdoored models behave correctly and benignly in almost all scenarios, but in particular circumstances chosen by the adversary, they have been taught to behave incorrectly. Models trained on massive datasets scraped from online are increasingly likely to be trained on poisoned data and thereby have backdoors injected. Moreover, downstream models are increasingly obtained by a single upstream foundation model, so a single backdoored system could render backdoors commonplace.


Alignment research aims to create safe ML system objectives and have them safely pursued. Two problems in alignment are value learning and proxy gaming, but the paper includes many additional problems.

Value Learning

Estimating human values such as pleasantness. Transformer models can partially separate between pleasant and unpleasant states given diverse open-world inputs. Utility values or pleasantness values are not ground truth values and are products of the model’s own learned utility function. (Source)

Encoding human goals and intent is challenging because many human values are hard to define and measure. How can we teach ML systems to model happiness, good judgment, freedom of action, meaningful experiences, safe outcomes, and more? In the figure above, we show that models are starting to have traction on the problem, but they nonetheless make many mistakes and can only process simple inputs. More research is needed to learn reliable representations for happiness and other human values.

Proxy Gaming

Boatrace Proxy Gaming. An RL agent gained a high score not by finishing the race but by going in the wrong direction, catching on fire, and colliding into other boats. (Source)

Objective proxies can be gamed by optimizers and adversaries. In fact, Goodhart’s law asserts that “When a measure becomes a target, it ceases to be a good measure.” This means that we cannot just learn a proxy for human values---we must also make it robust to optimizers that are incentivized to game the proxy. An example of a reward maximizing agent gaming a video game proxy is in the figure above.

In the full paper, we describe several more problems, clarify each problem’s motivation, and provide concrete research directions. Check out the paper here.

New Comment