[AN #66]: Decomposing robustness into capability robustness and alignment robustness

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Starting this week, we have a few new summarizers; you can always find the whole team here. I (Rohin) will continue to edit all of the summaries and opinions, and add some summaries and opinions of my own.

Audio version here (may not be up yet).

Highlights

2-D Robustness (Vladimir Mikulik) (summarized by Matthew): Typically when we think about machine learning robustness we imagine a scalar quantity representing how well a system performs when it is taken off its training distribution. When considering mesa optimization (AN #58), it is natural to instead decompose robustness into two variables: robust capabilities and robust alignment. When given an environment that does not perfectly resemble its training environment, a mesa optimizer could be dangerous by competently pursuing a mesa objective that is different from the loss function used during training. This combination of robust capabilities without robust alignment is an example of a malign failure, the most worrisome outcome of creating a mesa optimizer.

Matthew's opinion: Decomposing robustness in this way helps me distinguish misaligned mesa optimization from the more general problem of machine learning robustness. I think it's important for researchers to understand this distinction because it is critical for understanding why a failure to solve the robustness problem could plausibly result in a catastrophe rather than merely a benign capabilities failure.

Rohin's opinion: I strongly agree with this distinction, and in fact when I think about the problem of mesa optimization, I prefer to only think about models whose capabilities are robust but whose objective or goal is not, rather than considering the internals of the model and whether or not it is performing search, which seems like a much hairier question.

Technical AI alignment

Iterated amplification

Finding Generalizable Evidence by Learning to Convince Q&A Models (Ethan Perez et al) (summarized by Asya): This paper tries to improve performance on multiple-choice questions about text passages using a technique similar to AI safety via debate (AN #5). The set-up consists of a judge model and one or more evidence agents. First, the judge model is pretrained on samples consisting of a passage, a multiple-choice question about that passage, and the correct answer to that question. Then, in the experimental portion of the set-up, instead of looking at a full passage, the judge model looks at a subsequence of the passage created by combining the outputs from several evidence agents. Each evidence agent has been given the same passage and assigned a particular answer to the question, and must select a limited number of sentences from the passage to present to the judge model to convince it of that answer.

The paper varies several parameters in its setup, including the training process for the judge model, the questions used, the process evidence agents use to select sentences, etc. It finds that for many settings of these parameters, when judge models are tasked with generalizing from shorter passages to longer passages, or easier passages to harder passages, they do better with the new passages when assisted by the evidence agents. It also finds that the sentences given as evidence by the evidence agents are convincing to humans as well as the judge model.

Asya's opinion: I think it's a cool and non-trivial result that debating agents can in fact improve model accuracy. It feels hard to extrapolate much from this narrow example to debate as a general AI safety technique. The judge model is answering multiple-choice questions rather than e.g. evaluating a detailed plan of action, and debating agents are quoting from existing text rather than generating their own potentially fallacious statements.

What are the differences between all the iterative/recursive approaches to AI alignment? (Issa Rice)

Mesa optimization

Utility ≠ Reward (Vladimir Mikulik) (summarized by Rohin): This post describes the overall story from mesa-optimization (AN #58). Unlike the original paper, it focuses on the distinction between a system that is optimized for some task (e.g. a bottle cap), and a system that is optimizing for some task. Normally, we expect trained neural nets to be optimized; risk arises when they are also optimizing.

Agent foundations

Theory of Ideal Agents, or of Existing Agents? (John S Wentworth) (summarized by Flo): There are at least two ways in which a theoretical understanding of agency can be useful: On one hand, such understanding can enable the design of an artificial agent with certain properties. On the other hand, it can be used to describe existing agents. While both perspectives are likely needed for successfully aligning AI, individual researchers face a tradeoff: either they focus their efforts on existence results concerning strong properties, which helps with design (e.g. most of MIRI's work on embedded agency (AN #31)), or they work on proving weaker properties for a broad class of agents, which helps with description (e.g. all logical inductors can be described as markets, summarized next). The prioritization of design versus description is a likely crux in disagreements about the correct approach to developing a theory of agency.

Flo's opinion: To facilitate productive discussions it seems important to disentangle disagreements about goals from disagreements about means whenever we can. I liked the clear presentation of this attempt to identify a common source of disagreements on the (sub)goal level.

Markets are Universal for Logical Induction (John S Wentworth) (summarized by Rohin): A logical inductor is a system that assigns probabilities to logical statements (such as "the millionth digit of pi is 3") over time, that satisfies the logical induction criterion: if we interpret the probabilities as prices of contracts that pay out $1 if the statement is true and $0 otherwise, then there does not exist a polynomial-time trader function with bounded money that can make unbounded returns over time. The original paper shows that logical inductors exist. This post proves that for any possible logical inductor, there exists some market of traders that produces the same prices as the logical inductor over time.

Adversarial examples

E-LPIPS: Robust Perceptual Image Similarity via Random Transformation Ensembles (Markus Kettunen et al) (summarized by Dan H): Convolutional neural networks are one of the best methods for assessing the perceptual similarity between images. This paper provides evidence that perceptual similarity metrics can be made adversarially robust. Out-of-the-box, network-based perceptual similarity metrics exhibit some adversarial robustness. While classifiers transform a long embedding vector to class scores, perceptual similarity measures compute distances between long and wide embedding tensors, possibly from multiple layers. Thus the attacker must alter far more neural network responses, which makes attacks on perceptual similarity measures harder for adversaries. This paper makes attacks even harder for the adversary by using a barrage of input image transformations and by using techniques such as dropout while computing the embeddings. This forces the adversarial perturbation to be substantially larger.

AI strategy and policy

Why Responsible AI Development Needs Cooperation on Safety (Amanda Askell et al) (summarized by Nicholas): AI systems are increasingly being developed by companies, and as such it is important to understand how competition will affect the safety and robustness of these systems. This paper models companies as agents engaging in a cooperate-defect game, where cooperation represents responsible development, and defection represents a failure to develop responsibly. This model yields five factors that increase the likelihood of companies cooperating on safety. Ideally, companies will have high trust that others cooperate on safety, large benefits from mutual cooperation (shared upside), large costs from mutual defection (shared downside), not much incentive to defect when others cooperate (low advantage), and not be harmed too much if others defect when they cooperate (low exposure).

They then suggest four concrete strategies that can help improve norms today. First, companies should help promote accurate beliefs about the benefits of safety. Second, companies should collaborate on research and engineering. Third, companies should be transparent and allow for proper oversight and feedback. Fourth, the community should incentivize adhering to high safety standards by rewarding safety work and penalizing unsafe behavior.

Nicholas's opinion: Given that much of current AI progress is being driven by increases in computation power, it seems likely to me that companies will soon become more significant players in the AI space. As a result, I appreciate that this paper tries to determine what we can do now to make sure that the competitive landscape is conducive to taking proper safety precautions. I do, however, believe that the single step cooperate-defect game which they use to come up with their factors seems like a very simple model for what will be a very complex system of interactions. For example, AI development will take place over time, and it is likely that the same companies will continue to interact with one another. Iterated games have very different dynamics, and I hope that future work will explore how this would affect their current recommendations, and whether it would yield new approaches to incentivizing cooperation.

Other progress in AI

Hierarchical RL

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives (Anirudh Goyal et al) (summarized by Zach): Learning policies that generalize to new environments is a fundamental challenge in reinforcement learning. In particular, humans seem to be adept at learning skills and understanding the world in a way that is compositional, hinting at the source of the discrepancy. Hierarchical reinforcement learning (HRL) has partially addressed the discrepancy by decomposing policies into options/primitives/subpolicies that a top-level controller selects from. However, generalization is limited because the top-level policy must work for all states.

In this paper, the authors explore a novel decentralized approach where policies are still decomposed into primitives, but without a top-level controller. The key idea is to incentivize each primitive to work on a different cluster of states. Every primitive has a variational information bottleneck between the state and predicted action, that allows us to quantify how much information about the state the primitive uses in selecting actions. Intuitively, a primitive that knows how to open gates is going to extract a lot of information about gates from the state to choose an appropriate action, and won’t extract much information in states without gates. So, our high-level controller can just be: check which primitive is using the most state information, and let that primitive choose the action.

The reward R from a trajectory is split amongst the primitives in proportion to how likely each primitive was to be chosen. This is what incentivizes the primitives to use information from the state. The primitives also get a cost in proportion to how much information they use, incentivizing them to specialize to a particular cluster of states. Finally, there is a regularization term that also incentivizes specialization, and in particular prevents a collapse where a single primitive is always active.

To demonstrate effectiveness, the authors compare the baseline HRL methods option-critic and Meta-learning Shared Hierarchy to their method in grid-world and motion imitation transfer tasks. They show that using an ensemble of primitives can outperform more traditional HRL methods in generalization across tasks.

Zach's opinion: Overall, this paper is compelling because the method presented is both promising and provides natural ideas for future work. The method presented here is arguably simpler than HRL and the ability to generalize to new environments is simple to implement. The idea of introducing competition at an information theoretic level seems natural and the evidence for better generalization capability is compelling. It'd be interesting to see what would happen if more complex primitives were used.

Miscellaneous (AI)

Unreproducible Research is Reproducible (Xavier Bouthillier et al) (summarized by Flo): This paper argues that despite the growing popularity of sharing code, machine learning research has a problem with reproducibility. It makes the distinction between the reproducibility of methods/results, which can be achieved by fixing random seeds and sharing code, and the reproducibility of findings/conclusions, which requires that different experimental setups (or at least random seeds) lead to the same conclusion.

Several popular neural network architectures are trained on several image classification datasets several times with different random seeds determining the weight initialization and sampling of data. The relative rankings of the architectures with respect to the test accuracy are found to vary relevantly with the random seed for all data sets, as well as between data sets.

The authors then argue that while the reproducibility of methods can help with speeding up exploratory research, the reproducibility of findings is necessary for empirical research from which robust conclusions can be drawn. They claim that exploratory research that is not based on robust findings can get inefficient, and so call for the machine learning community to do more empirical research.

Flo's opinion: I really like that this paper not just claims that there is a problem with reproducibility, but demonstrates this more rigorously using an experiment. More robust empirical findings seem quite important for getting to a better understanding of machine learning systems in the medium term. Since this understanding is especially important for safety relevant research, where exploratory research seems more problematic by default, I am excited for a push in that direction.

News

Open Phil AI Fellowship (summarized by Rohin): The Open Phil AI Fellowship is seeking applications for its third cohort. Applications are due by October 25. The fellowship is open to current and incoming PhD students, including those with pre-existing funding sources. It provides up to 5 years of support with a stipend of $40,000 and a travel allocation of $10,000.

[-]AlexMennen6y30

I do, however, believe that the single step cooperate-defect game which they use to come up with their factors seems like a very simple model for what will be a very complex system of interactions. For example, AI development will take place over time, and it is likely that the same companies will continue to interact with one another. Iterated games have very different dynamics, and I hope that future work will explore how this would affect their current recommendations, and whether it would yield new approaches to incentivizing cooperation.

It may be difficult for companies to get accurate information about how careful their competitors are being about AI safety. An iterated game in which players never learn what the other players did on previous rounds is the same as a one-shot game. This points to a sixth factor that increases chance of cooperation on safety: high transparency, so that companies may verify their competitors' cooperation on safety. This is closely related to high trust.

AI ALIGNMENT FORUM
AF