[AN #71]: Avoiding reward tampering through current-RF optimization

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Designing agent incentives to avoid reward tampering (Tom Everitt, Ramana Kumar, and Marcus Hutter) (summarized by Flo): Reward tampering occurs when a reinforcement learning agent actively changes its reward function. The post uses Causal Influence Diagrams (AN #61) to analyze the problem in a simple grid world where an agent can easily change the definition of its reward. The proposed solution is current-RF optimization: Instead of maximizing the sum of rewards that would be given after each action (where the reward signal can dynamically change over time), the agent searches for and executes a plan of actions that would maximize the current, unchanged reward signal. The agent would then not be incentivized to tamper with the reward function since the current reward is not maximized by such tampering. There are two different flavours to this: time-inconsistency-aware agents account for future changes in their own behaviour due to modified reward signals, while TI-unaware agents ignore this in their planning. TI-aware agents have an incentive to preserve their reward signal and are therefore potentially incorrigible.
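
To make the contrast concrete, here is a minimal sketch of the idea, not the setup from the post. The two-action environment, the reward values, and all function names are assumptions made up for illustration. A plan is scored either by the reward function in force at each step (which tampering can change) or by the current, unchanged reward function. The sketch does not model the TI-aware versus TI-unaware distinction, since the agent simply executes a fixed plan.

```python
from itertools import product

ACTIONS = ("work", "tamper")  # hypothetical action set


def original_rf(action):
    # Intended reward: only real work is rewarded.
    return 1.0 if action == "work" else 0.0


def tampered_rf(action):
    # Corrupted reward: pays out for any action.
    return 10.0


def rollout(plan):
    """Pair each action with the reward function in force when it is scored."""
    rf = original_rf
    steps = []
    for action in plan:
        if action == "tamper":
            rf = tampered_rf  # assumption: tampering takes effect immediately
        steps.append((action, rf))
    return steps


def observed_return(plan):
    # Standard objective: sum the rewards the (possibly modified) signal would emit.
    return sum(rf(action) for action, rf in rollout(plan))


def current_rf_return(plan):
    # Current-RF objective: score the whole plan with the unchanged reward function.
    return sum(original_rf(action) for action, _ in rollout(plan))


def best_plan(score, horizon=3):
    # Exhaustive search over all plans of the given length.
    return max(product(ACTIONS, repeat=horizon), key=score)


standard_choice = best_plan(observed_return)      # begins by tampering
current_rf_choice = best_plan(current_rf_return)  # never tampers
```

In this toy, the standard planner tampers on its first step because the corrupted signal pays more, while the current-RF planner gains nothing from tampering and chooses to work at every step.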

Flo's opinion: I enjoyed this application of causal diagrams and think that similar detailed analyses of the interactions between failure modes like wireheading, instrumental goals like reward preservation and the specific implementation of an agent would be quite valuable. That said, I am less excited about the feasibility of the proposed solution, since it seems to require the agent to have detailed knowledge of counterfactual rewards. Also, I expect the distinction between changes in the reward signal and changes in the state that happen to also affect the reward to be very fuzzy in real problems, whereas current-RF optimization seems to require a very sharp boundary.

Rohin's opinion: I agree with Flo's opinion above, and I think the example in the blog post shows how the concept "what affects the reward" is fuzzy: in their gridworld inspired by the game Baba Is You, they say that moving the word "Reward" down to make rocks rewarding is "tampering", whereas I would have called that a perfectly legitimate way to play given my knowledge of Baba Is You.

Technical AI alignment

Problems

An Increasingly Manipulative Newsfeed (Michaël Trazzi and Stuart Armstrong) (summarized by Matthew): An early argument for specialized AI safety work is that misaligned systems will be incentivized to lie about their intentions while weak, so that they aren't modified. Then, when the misaligned AIs are safe from modification, they will become dangerous. Ben Goertzel found the argument unlikely, pointing out that weak systems won't be good at deception. This post asserts that weak systems can still be manipulative, and gives a concrete example. The argument is based on a machine learning system trained to maximize the number of articles that users label "unbiased" in their newsfeed. One way it can start being deceptive is by seeding users with a few very biased articles. Pursuing this strategy may cause users to label everything else unbiased, as it has altered their reference for evaluation. The system is therefore incentivized to be dishonest without necessarily being capable of pure deception.

Matthew's opinion: While I appreciate and agree with the thesis of this post -- that machine learning models don't have to be extremely competent to be manipulative -- I would still prefer a different example to convince skeptical researchers. I suspect many people would reply that we could easily patch the issue without doing dedicated safety work. In particular, it is difficult to see how this strategy arises if we train the system via supervised learning rather than training it to maximize the number of articles users label unbiased (which requires RL).

Rohin's opinion: I certainly agree with this post that ML models don't need to be competent to be manipulative. However, it's notable that this happened via the model randomly manipulating people (during exploration) and noticing that it helps it achieve its objective. It seems likely that to cause human extinction via a treacherous turn, the model would need to do zero-shot manipulation. This seems much less likely to happen (though I wouldn't say it's impossible).

Mesa optimization

Are minimal circuits deceptive? (Evan Hubinger) (summarized by Rohin): While it has been argued that the simplest program that solves a complex task is likely to be deceptive, it hasn't yet been argued whether the fastest program that solves a complex task will be deceptive. This post argues that fast programs will often be forced to learn a good policy (just as we need to do today), and the learned policy is likely to be deceptive (presumably due to risks from learned optimization (AN #58)). Thus, there are at least some tasks where the fastest program will also be deceptive.

Rohin's opinion: This is an intriguing hypothesis, but I'm not yet convinced: it's not clear why the fastest program would have to learn the best policy, rather than directly hardcoding the best policy. If there are multiple possible tasks, the program could have a nested if structure that figures out which task needs to be done and then executes the best policy for that task. More details in this comment.

Impact measurement and value-neutrality verification (Evan Hubinger) (summarized by Rohin): So far, most uses of impact formalizations (AN #64) don't help with inner alignment, because we simply add impact to the (outer) loss function. This post suggests that impact formalizations could also be adapted to verify whether an optimization algorithm is value-neutral -- that is, no matter what objective you apply it towards, it provides approximately the same benefit. In particular, AUP (AN #25) measures the expectation of the distribution of changes in attainable utilities for a given action. You could get a measure of the value-neutrality of an action by instead computing the standard deviation of this distribution, since that measures how different the changes in utility are. (Evan would use policies instead of actions, but conceptually that's a minor difference.) Verifying value-neutrality could be used to ensure that the strategy-stealing assumption (AN #65) is true.
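
As a rough illustration of the two statistics, the sketch below follows the summary's description: it takes the changes in attainable utility that an action causes across a set of auxiliary objectives, and computes both their mean (the impact-style measure) and their standard deviation (the proposed neutrality measure). The function names and the numbers are hypothetical, and treating the objectives as equally weighted is an assumption.

```python
from statistics import mean, pstdev


def expected_change(changes):
    # Impact-style statistic: expectation of the attainable-utility changes.
    return mean(changes)


def neutrality_spread(changes):
    # Neutrality-style statistic: how unevenly the changes fall across objectives.
    # A spread of zero means every objective is helped (or hurt) equally.
    return pstdev(changes)


# Made-up changes in attainable utility for four auxiliary objectives.
even_action = [2.0, 2.0, 2.0, 2.0]    # helps every objective equally
uneven_action = [4.0, 0.0, 4.0, 0.0]  # helps only some objectives

same_mean = expected_change(even_action) == expected_change(uneven_action)
```

Both hypothetical actions have the same expected change, so a mean-based measure cannot tell them apart, while the standard deviation is zero for the first and positive for the second.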

Rohin's opinion: I continue to be confused about the purpose of the strategy-stealing assumption, so I don't have a strong opinion about the importance of value-neutrality verification. I do think that the distribution of changes to attainable utilities is a powerful mathematical object, and it makes sense that there are other properties of interest that involve analyzing it.

Gradient hacking (Evan Hubinger) (summarized by Rohin): This post calls attention to the problem of gradient hacking, where a powerful agent being trained by gradient descent could structure its computation in such a way that it causes its gradients to update it in some particular way. For example, a mesa optimizer could structure its computation to first check whether its objective has been tampered with, and if so to fail catastrophically, so that the gradients tend to point away from tampering with the objective.

Rohin's opinion: I'd be interested in work that further sketches out a scenario in which this could occur. I wrote about some particular details in this comment.

Learning human intent

Leveraging Human Guidance for Deep Reinforcement Learning Tasks (Ruohan Zhang et al) (summarized by Nicholas): A core problem in RL is the communication of our goals and prior knowledge to an agent. One common approach to this is imitation learning: the human provides example demonstrations of a task, and the agent learns to mimic them. However, there are some limitations to this approach, such as requiring the human to be capable of the task. This paper outlines five different modalities from which agents can learn: evaluations, preferences, hierarchical feedback, observations, and attention (for example, where humans are looking while solving a task). It then suggests future research directions.

For this summary, I will focus on the future research directions, but you can read the full paper to understand existing approaches. The first issue is that datasets of human guidance are difficult to capture and depend on many specific factors of the individuals providing guidance. As a result, the paper suggests creating standard datasets to save effort and enable fair comparisons. The second direction is to better understand how humans should teach agents. The literature currently emphasizes progress in learning methods, but improved teaching methods may be just as valuable when learning from human guidance. The last is unifying learning across different input modalities; ideally an agent would be able to learn from many different types of human guidance over different phases of its learning.

Nicholas's opinion: I think the problem of providing human guidance to agents is a core problem in alignment, and I am glad to see more discussion of that problem. I generally think that this type of broad overview is very valuable for communicating research to those who want a high-level picture of the field and don’t need to know the individual details of each paper. However, I would appreciate it if there were more quantitative comparisons of the tradeoffs between different paradigms. The introduction mentions sample efficiency and the large effort required for human labelling, which made me hope for theoretical or empirical comparisons of the different methods with regard to sample efficiency and labelling effort. Since this was lacking, it also left me unclear on what motivated their suggested research directions. Personally, I would be much more excited to pursue a research direction if there were quantitative results showing particular failure modes or negative characteristics of current approaches that motivated that particular direction.

Rohin's opinion: This seems like a great survey paper and I like their proposed future directions, especially on learning from different kinds of human guidance, and on improving methods of teaching. While it does seem useful to have datasets of human guidance in order to compare algorithms, this prevents researchers from making improvements by figuring out new forms of guidance not present in the dataset. As a result, I'd be more excited about benchmarks that are evaluated by how much time it takes for Mechanical Turkers to train an agent to complete the task. Admittedly, it would be costlier in both time and money for researchers to do such an evaluation.

Miscellaneous (Alignment)

Vox interview with Stuart Russell (Kelsey Piper) (summarized by Rohin): Kelsey talked with Stuart Russell about his new book, Human Compatible (AN #69).

Other progress in AI

Meta learning

Meta-Learning with Implicit Gradients (Aravind Rajeswaran et al) (summarized by Nicholas): The field of meta-learning endeavors to create agents that don’t just learn, but instead learn how to learn. Concretely, the goal is to train an algorithm on a subset of tasks, such that it can get low error on a different subset of tasks with minimal training.

Model Agnostic Meta Learning (MAML) tackles this problem by finding a set of initial parameters, θ, from which it is easy to quickly learn other tasks. During training, an inner loop copies θ into parameters φ, and optimizes φ for a fixed number of steps. Then an outer loop computes the gradient with respect to θ by differentiating through the inner optimization process (e.g. backpropagating through gradient descent) and updates θ accordingly.
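
The two loops can be sketched on a deliberately tiny problem. Everything below is an assumption for illustration rather than the paper's setup: θ and φ are scalars, each task has the loss 0.5 * (φ - target)^2, the same loss is used for adaptation and evaluation, and the derivative through gradient descent is tracked by hand instead of with an autodiff library.

```python
def inner_loop(theta, target, lr=0.1, steps=5):
    """Adapt phi from theta with gradient descent, tracking d(phi)/d(theta)."""
    phi = theta
    dphi_dtheta = 1.0  # phi starts as an exact copy of theta
    for _ in range(steps):
        grad = phi - target          # gradient of 0.5 * (phi - target)**2
        phi = phi - lr * grad
        dphi_dtheta *= (1.0 - lr)    # chain rule through this update step
    return phi, dphi_dtheta


def maml_outer_grad(theta, targets, lr=0.1, steps=5):
    """Average gradient of the post-adaptation loss with respect to theta."""
    total = 0.0
    for target in targets:
        phi, dphi_dtheta = inner_loop(theta, target, lr, steps)
        total += (phi - target) * dphi_dtheta  # dL/dphi * dphi/dtheta
    return total / len(targets)


def meta_train(theta, targets, outer_lr=0.5, iters=200):
    # Outer loop: update the shared initialization theta.
    for _ in range(iters):
        theta -= outer_lr * maml_outer_grad(theta, targets)
    return theta


# Two hypothetical tasks; the learned initialization settles between them.
theta_star = meta_train(0.0, [-1.0, 3.0])
```

In this sketch the factor `dphi_dtheta` equals `(1 - lr) ** steps`, so the outer gradient has to be carried through every inner step and shrinks as the number of steps grows.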

MAML as described above has a few downsides, which this paper addresses.

1. The base optimizer itself must be differentiable, not just the loss function.

2. The gradient computation requires compute and memory that grow linearly with the number of steps, and suffers from vanishing and exploding gradients as that number increases.

3. In the inner loop, while φ is initially identical to θ, its dependence on θ fades as more steps occur.

Implicit MAML (iMAML) addresses these with two innovations. First, it adds a regularization term to keep φ close to θ, which maintains the dependence of φ on θ throughout training. Second, it computes the outer update gradient in closed form based purely on the final value of φ rather than using the entire optimization trajectory. Because the inner loop is an optimization process, the end result is an optimum of the regularized objective, at which the gradient is zero. This leads to an implicit equation that, when differentiated, gives a closed-form expression for the gradient with respect to θ. This enables iMAML to work with inner optimization sequences that have more training steps, or that are not differentiable.
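
A scalar sketch may help show why only the final φ is needed. Assume (for illustration only) a quadratic task loss 0.5 * curvature * (φ - target)^2 and a regularizer 0.5 * lam * (φ - θ)^2. At the inner optimum the gradient vanishes: curvature * (φ - target) + lam * (φ - θ) = 0. Differentiating that condition with respect to θ gives dφ/dθ = lam / (curvature + lam), which involves no information about the path the optimizer took. The function names and constants are hypothetical.

```python
def solve_inner(theta, target, curvature=2.0, lam=1.0, lr=0.1, steps=500):
    """Minimize the regularized inner objective; any optimizer would do here."""
    phi = theta
    for _ in range(steps):
        grad = curvature * (phi - target) + lam * (phi - theta)
        phi -= lr * grad
    return phi


def implicit_dphi_dtheta(curvature=2.0, lam=1.0):
    # From differentiating the zero-gradient condition at the optimum.
    return lam / (curvature + lam)


def finite_difference_dphi_dtheta(theta, target, eps=1e-4):
    # Numerical check that treats the whole inner solver as a black box.
    hi = solve_inner(theta + eps, target)
    lo = solve_inner(theta - eps, target)
    return (hi - lo) / (2 * eps)


phi_star = solve_inner(0.0, 3.0)
gap = abs(finite_difference_dphi_dtheta(0.0, 3.0) - implicit_dphi_dtheta())
```

The finite-difference estimate agrees with the implicit formula up to numerical error, even though the formula never looks at the 500 inner steps. In higher dimensions the same derivation involves the Hessian of the task loss in place of the scalar curvature.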

Nicholas's opinion: I am not an expert in meta-learning, but my impression is that this paper removes a major bottleneck to future research on meta-learning and gives a clear direction for future work on medium-shot learning and more complex inner optimizers. I’m also particularly interested in how this will interact with risks from learned optimization (AN #58). In particular, it seems possible to me that the outer parameters θ or a subset of them might potentially encode an internal mesa-optimizer. On the other hand, they may simply be learning locations in the parameter space from which it is easy to reach useful configurations. Interpretability of weights is a difficult problem, but I'd be excited about any work that sheds light on what characteristics of θ enable it to generalize across tasks.
