[AN #126]: Avoiding wireheading by decoupling action feedback from action effects

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Avoiding Tampering Incentives in Deep RL via Decoupled Approval (Ramana Kumar, Jonathan Uesato et al) (summarized by Rohin): Current-RF optimization (AN #71) shows that to avoid tampering with the reward, we can have an agent that evaluates plans it makes according to the current reward function, rather than the reward after tampering, and this is sufficient to remove any incentive for tampering. However, that work required the ability to evaluate actions and/or plans using the "current reward". How might we implement this in deep RL algorithms in practice?

Let's take a simple example: suppose for an autonomous personal assistant, every once in a while we query the user for their satisfaction, write it to a file, and then use that file to train the assistant. Then with normal RL the assistant is incentivized to "tamper" by rewriting the contents of the file to show maximal satisfaction. In this context, current-RF optimization would say that before rewriting the contents, the agent should ask the user whether that's a good idea. However, we can't ask the user about every action, and our agent does need to take some actions in order to explore the environment.

The authors formalize this as a Corrupted Feedback MDP, in which the feedback which the agent gets is corrupted in some states. They assume that the human gives approval feedback, which they formalize as the advantage function. (The advantage is the long-term value of the queried action relative to the average action for the current state.) This ensures that the agent only needs to myopically select the action with highest approval, which means we can run any old deep RL algorithm with the discount set to zero. However, this doesn't solve the problem, because with deep RL the feedback is given after the action is executed, at which point it has already been corrupted (in our example, the file already claims the user is maximally satisfied).

To fix this, the authors introduce decoupling, in which the action executed by the agent and the action on which feedback is given are sampled independently. The idea is that even if the executed action leads to corruption, the corrupted update is equally likely to affect every action, and so in expectation it cancels out. (This requires the uniform corruption assumption, which states that the corruption is added to the feedback, and is independent of the queried action, though it can depend on the executed action.) They derive decoupled approval versions of policy gradients and Q-learning, and prove that the local updates made by these algorithms move towards the approval maximizing policy (in expectation).

They then evaluate the algorithms on a new environment, REALab, in which the agent must collect apples. However, in this environment, the feedback variable is represented in the environment by "registers". The agent can thus tamper with the feedback by interacting with these registers. The experiments show that while standard RL learns to tamper, DA-PG only tampers "accidentally" (i.e. to the same extent that is done by a policy trained with uncorrupted feedback). DA-QL tampers a bit more often, but this could just be due to noise.

Read more: REALab paper and blog post

Rohin's opinion: I like that this work has moved to the assumption that there is an advisor who is capable of providing feedback agnostic to tampering: the user's feedback is correct when the agent only asks about a tampering action rather than executing it. The information about what is and isn't tampering has to come from somewhere, and the natural source is some human overseer. (Perhaps previous work was also written with this in mind, but if so it wasn't obvious to me from the papers.) Note that this is a pretty weak assumption -- it only means that the overseer needs to not actively reward any tampering behavior; they don't need to penalize it as well. For example, when the agent proposes that they rewrite the file, the overseer just needs to notice that rewriting the file doesn't help with being a good assistant, they don't also need to notice that if the rewriting were to happen then the agent would have tampered with the reward.

The current experiments operate in an episodic setting, meaning that tampering effects get undone by resetting after each episode. However, in many realistic cases, once the reward is tampered with, you wouldn't expect there to be a way to "reset" it, and so you are stuck with the tampering forever and won't learn the right policy. (Delegative Reinforcement Learning (AN #57) is an example of work which avoids the assumption for this reason.) This is probably fine if each tampering action has a small effect and you need many such actions to have a cumulatively bad effect. In this case, removing the incentive to tamper means that when the agent tampers via exploration, the tampering isn't reinforced, and so the agent won't tamper again in the future. However, if it is unacceptable to take even a single tampering action (as in the case of rewriting the source code for the reward function), then it doesn't help that after exploring that action we don't reinforce it, as the damage has already been done.

Another way to think about it is that current deep RL systems primarily learn from experience (rather than reasoning), which is why the benefit only kicks in after the agent has randomly explored a tampering action. However, if we build superintelligent AI systems, they will be doing some form of reasoning, and in that setting if we can remove the incentive to tamper with the reward, that may indeed prevent the agent from ever tampering with the reward.

TECHNICAL AI ALIGNMENT

LEARNING HUMAN INTENT

I Know What You Meant: Learning Human Objectives by (Under)estimating Their Choice Set (Ananth Jonnavittula et al) (summarized by Rohin): Misspecification in reward learning (AN #32) can be quite bad, and seems nearly inevitable to happen. The key insight of this paper is that we can mitigate its effects by ensuring that we err on the side of underestimating the demonstrator’s capabilities.

Consider inverse reinforcement learning (IRL), where we get demonstrations of good behavior. In practice, there are some demonstrations that humans can’t give: for example, when teleoperating a complex robot arm, humans might find it challenging to move a coffee cup without tilting it. Ideally, we would estimate the set of possible trajectories the demonstrator could have given, known as their choice set, and only model them as noisily rational across trajectories from that set.

However, we won’t perfectly estimate this choice set, and so there will be some misspecification. If we overestimate the demonstrator’s capabilities, for example by assuming they could move the coffee cup perfectly straightly, then since that isn’t the demonstration we get we would infer that the human couldn’t have cared about keeping the cup upright. However, if we underestimate the demonstrator’s capabilities, there’s no such issue.

If we make the theoretical simplification that the demonstrator chooses the actual best trajectory out of their choice set, then we can prove that in the case of underestimation, you will always assign as much probability to the true reward function as you would if you had the correct choice set. (Intuitively, this is because for reward r, if the trajectory is optimal under the true choice set, then it must also be optimal under the underestimated choice set.)

Okay, but how do we ensure we have underestimated the choice set? This paper suggests that we augment the demonstrations that we do observe. For example, we can take the real demonstration, and inject noise into it, along the lines of D-REX (AN #60). Alternatively, we can repeat actions -- the idea is that it is easier for a human to give consistent inputs than to change the actions constantly. Finally, we can make the demonstration sparser, i.e. reduce the magnitude of the actions (in the robotics setting).

The authors run experiments in simulated domains as well as with a user study and report good results.

Rohin's opinion: I really liked the insight: because IRL provides information about the maximum of a set, it is generally safe to underestimate that set (i.e. work with a subset), but is not safe to overestimate that set (i.e. work with a superset). It’s simple and intuitive but can plausibly provide significant benefits in practice.

PREVENTING BAD BEHAVIOR

Learning to be Safe: Deep RL with a Safety Critic (Krishnan Srinivasan et al) (summarized by Rohin): While there has been a lot of work on verifying logical specifications and avoiding constraint violations, I’ve said before that the major challenge is in figuring out what specifications or constraints to use in the first place. This paper takes a stab at this problem by learning safety constraints and transferring them to new situations.

In particular, we assume that we have some way of telling whether a given trajectory violates a constraint (e.g. a human looks at it and says whether or not a violation has happened). We also assume access to a safe environment in which constraint violations are acceptable. For example, for robots, our constraint could be that the robot never crashes, and in the safe environment the robot could be constrained to only move slowly, so that if they do crash there is no permanent damage. We then want to train the agent to perform some task in the true training environment (e.g. with no restrictions on speed), such that we avoid constraint violations with high probability even during training.

The key idea is to pretrain a safety Q-function in the safe environment, that is, a function Qsafe(s, a) that specifies the probability of eventually violating a constraint if we take action a in state s. We have the agent choose actions that are estimated to be on the verge of being too risky, in order to optimize for getting more information about the constraints.

Once we have this safety Q-function, we can use it as a shield (1 (AN #124), 2 (AN #16)). Specifically, any actions whose risk is above some threshold ε have their probabilities set to zero. Using this shield, we can then train for the true task in the (unsafe) training environment using RL, while only behaving safely. Of course, this depends on the safety Q-function successfully generalizing to the new environment. We also add the safety Q-function as part of the RL objective to disincentivize constraint violations.

Their experiments show that this approach significantly reduces the number of constraint violations during training, though in absolute numbers there are often still hundreds of constraint violations (or about 1% of the number of training steps).

Rohin's opinion: I’m glad to see more work on this: robustness techniques seem particularly important to get working with learned specifications, and this paper (like the next one) takes a real shot at this goal. In some sense it isn’t that clear what we gain from an approach like this -- now, instead of requiring robustness from the agent, we require robustness from the safety Q-function (since we transfer it from the safe environment to the training environment). Nonetheless, we might hope the safety Q-function is easier to learn and more likely to transfer between the two environments, since it could be simpler than a full policy.

Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones (Brijen Thananjeyan, Ashwin Balakrishna et al) (summarized by Rohin): This paper introduces Recovery RL, which tackles the same problem as the previous paper: given a dataset of constraint violations (presumably collected from some safe environment), train an agent to perform some task while exploring safely. Like the previous paper, it starts by learning a safety Q-function from the dataset of constraint violations.

The difference is in how this safety Q-function is used. The previous paper uses it as a shield on a policy, and also uses it in the RL objective to get a policy that is performant and safe. Recovery RL instead splits these into two separate policies: there is a task policy that only optimizes for performance, and a recovery policy that only optimizes for safety. During training, the safety Q-function is used to monitor the likelihood of a constraint violation, and if a violation is sufficiently likely, control is handed over to the recovery policy which then tries to make the probability of constraint violation as low as possible.

Experiments show that the method performs significantly better than other baselines (including SQRL, the method from the previous paper). Note however that the environments on which the methods were tested were not the same as in the previous paper.

AI STRATEGY AND POLICY

The Grey Hoodie Project: Big Tobacco, Big Tech, and the threat on academic integrity (Mohamed Abdalla et al) (summarized by Rohin): Big tech companies fund a lot of academic research, including on AI ethics. This paper points out that we would not trust research on smoking that was funded by tobacco companies: why should AI ethics research be any different? Enough information has now surfaced (through litigation) for us to see that Big Tobacco’s actions were clearly unacceptable, but it took years for this to be realized. The same thing could be happening again with Big Tech.

The paper identifies four goals that drive investment into academia by big industries, and argues that these are consistent with the actions of Big Tobacco and Big Tech. First, funding academic research allows companies to present themselves as socially responsible. For example, some researchers have argued that academic or non-profit institutions like the ACLU and MIT do not have any effective power in the Partnership on AI and their membership ends up serving a legitimating function for the companies in the partnership.

Second, companies can influence the events and decisions made by universities. Top conferences in ML receive large sponsorships from companies, and many of the workshops have such sponsorships as well, including ones about AI ethics.

Third, companies can influence the research conducted by individual scientists. The authors studied funding of professors at four top universities, and found that of the cases where they could determine funding, over 52% had been funded by Big Tech, and the number rose to 58% when restricting to those who had published in ethics or fairness. There need not be any explicit pressure for this to be an issue: the implicit threat of loss of funding can be enough to prevent some types of research.

Fourth, companies can discover academics who can be leveraged in other situations. For example, tobacco companies explicitly searched for academics who would testify in favor of the companies at legislative hearings. In Big Tech, there are similar suggestive stories: for example, in one case a professor who had been funded indirectly by Google criticized antitrust scrutiny of Google. They then joined the FTC, and shortly after the FTC dropped their antitrust suit against Google.

The paper concludes with some ideas on how the current situation could be improved.

Read more: FLI Podcast

Rohin's opinion: I’d love to write an opinion on this. Unfortunately, regardless of which “side” of the issue I come down on, if I explained it publicly in this newsletter, I would worry that a journalist would adversarially quote me in a way that I would dislike. (Consider: "Employee confirms DeepMind's efforts to whitewash AI ethics efforts" or "This researcher's fervent defense of Big Tech shows how Google brainwashes its employees".) So I’m not going to say anything here. And just for the record: DeepMind did not tell me not to write an opinion here.

(Am I being too paranoid? Probably, but I've heard of too many horror stories to take the chance. I'm already worrying that someone will quote even this and talk about how Big Tech indoctrinates its employees against virtuous journalists who expose its evils.)

For the sake of transparency about how I make these decisions, if there were only one “side” of the issue that I wouldn’t be willing to express publicly, I would not write an opinion. Also, I doubt that this policy would ever trigger for a technical paper.

NEWS

CHAI Internship (Martin Fukui) (summarized by Rohin): CHAI internships (AN #74) are open once again! The deadline for applications is December 13.

AI Safety Camp virtual edition 2021 (Remmelt Ellen et al) (summarized by Rohin): The second virtual AI Safety Camp will take place over the first half of 2021. Applications will close on December 15.

European Master's Programs in Machine Learning, Artificial Intelligence, and related fields (Marius, Leon et al) (summarized by Rohin): Each article in this series is supposed to give prospective students an honest evaluation of the teaching, research, industry opportunities, and city life of a specific European Master’s program in ML or AI. Note that I have not read through the articles myself.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

AI ALIGNMENT FORUM
AF