[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

SECTIONS

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

ITERATED AMPLIFICATION

MESA OPTIMIZATION

AGENT FOUNDATIONS

FORECASTING

MISCELLANEOUS (ALIGNMENT)

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

NEWS

HIGHLIGHTS

The Alignment Problem for Bayesian History-Based Reinforcement Learners (Tom Everitt et al) (summarized by Rohin): After forgetting its existence for quite a while, I've finally read through this technical report (which won first place in round 2 of the AI alignment prize (AN #3)). It analyzes the alignment problem from an AIXI-like perspective, that is, by theoretical analysis of powerful Bayesian RL agents in an online POMDP setting.

In this setup, we have a POMDP environment, in which the environment has some underlying state, but the agent only gets observations of the state and must take actions in order to maximize rewards. The authors consider three main setups: 1) rewards are computed by a preprogrammed reward function, 2) rewards are provided by a human in the loop, and 3) rewards are provided by a reward predictor which is trained interactively from human-generated data.

For each setup, they consider the various objects present in the formalism, and ask how these objects could be corrupted, misspecified, or misleading. This methodology allows them to identify several potential issues, which I won't get into as I expect most readers are familiar with them. (Examples include wireheading and threatening to harm the human unless they provide maximal reward.)

They also propose several tools that can be used to help solve misalignment. In order to prevent reward function corruption, we can have the agent simulate the future trajectory, and evaluate this future trajectory with the current reward, removing the incentive to corrupt the reward function. (This was later developed into current-RF optimization (AN #71).)

Self-corruption awareness refers to whether or not the agent is aware that its policy can be modified. A self-corruption unaware agent is one that behaves as though it's current policy function will never be changed, effectively ignoring the possibility of corruption. It is not clear which is more desirable: while a self-corruption unaware agent will be more corrigible (in the MIRI sense), it also will not preserve its utility function, as it believes that even if the utility function changes the policy will not change.

Action-observation grounding ensures that the agent only optimizes over policies that work on histories of observations and actions, preventing agents from constructing entirely new observation channels ("delusion boxes") which mislead the reward function into thinking everything is perfect.

The interactive setting in which a reward predictor is trained based on human feedback offers a new challenge: that the human data can be corrupted or manipulated. One technique to address this is to get decoupled data: if your corruption is determined by the current state s, but you get feedback about some different state s', as long as s and s' aren't too correlated it is possible to mitigate potential corruptions.

Another leverage point is how we decide to use the reward predictor. We could consider the stationary reward function, which evaluates simulated trajectories with the current reward predictor, i.e. assuming that the reward predictor will never be updated again. If we combine this with self-corruption unawareness (so that the policy also never expects the policy to change), then the incentive to corrupt the reward predictor's data is removed. However, the resulting agent is time-inconsistent: it acts as though its reward never changes even though it in practice does, and so it can make a plan and start executing it, only to switch over to a new plan once the reward changes, over and over again.

The dynamic reward function avoids this pitfall by evaluating the kth timestep of a simulated trajectory by also taking an expectation over future data that the reward predictor will get. This agent is no longer time-inconsistent, but it now incentivizes the agent to manipulate the data. This can be fixed by building a single integrated Bayesian agent, which maintains a single environment model that predicts both the reward function and the environment model. The resulting agent is time-consistent, utility-preserving, and has no direct incentive to manipulate the data. (This is akin to the setup in assistance games / CIRL (AN #69).)

One final approach is to use a counterfactual reward function, in which the data is simulated in a counterfactual world where the agent executed some known safe default policy. This no longer depends on the current time, and is not subject to data corruption since the data comes from a hypothetical that is independent of the agent's actual policy. However, it requires a good default policy that does the necessary information-gathering actions, and requires the agent to have the ability to simulate human feedback in a counterfactual world.

Read more: Tom Everitt's PhD thesis

Rohin's opinion: This paper is a great organization and explanation of several older papers (that haven't been summarized in this newsletter because they were published before 2018 and I read them before starting this newsletter), and I wish I had read it sooner. It seems to me that the integrated Bayesian agent is the clear winner -- the only downside is the computational cost, which would be a bottleneck for any of the models considered here.

One worry I have with this sort of analysis is that the guarantees you get out of it depends quite a lot on how you model the situation. For example, let's suppose that after I sleep I wake up refreshed and more capable of intellectual work. Should I model this as "policy corruption", or as a fixed policy that takes as an input some information about how rested I am?

TECHNICAL AI ALIGNMENT

ITERATED AMPLIFICATION

Universality Unwrapped (Adam Shimi) (summarized by Rohin): This post explains the ideas behind universality and ascription universality, in a more accessible way than the original posts and with more detail than my summary.

MESA OPTIMIZATION

Mesa-Search vs Mesa-Control (Abram Demski) (summarized by Rohin): This post discusses several topics related to mesa optimization, and the ideas in it led the author to update towards thinking inner alignment problems are quite likely to occur in practice. I’m not summarizing it in detail here because it’s written from a perspective on mesa optimization that I find difficult to inhabit. However, it seems to me that this perspective is common so it seems fairly likely that the typical reader would find the post useful.

AGENT FOUNDATIONS

Radical Probabilism (Abram Demski) (summarized by Rohin): The traditional Bayesian treatment of rational agents assumes that the only way an agent can get new information is by getting some new observation that is known with probability 1. However, we would like a theory of rationality that can allow for agents that also get more information by thinking longer. In such a situation, some of the constraints imposed by traditional Bayesian reasoning no longer apply. This detailed post explores what constraints remain, and what types of updating are allowable under this more permissive definition of rationality.

Read more: The Bayesian Tyrant

Rohin's opinion: I particularly enjoyed this post; it felt like the best explanation in relatively simple terms of a theory of rationality that is more suited to bounded agents that cannot perfectly reason about an environment larger than they are. (Note “simple” really is relative; the post still assumes a lot of technical knowledge about traditional Bayesianism.)

FORECASTING

My AI Timelines Have Sped Up (Alex Irpan) (summarized by Nicholas): Alex Irpan updates his predictions of AGI sooner to:

10% chance by 2035 (previously 2045)

50% chance by 2045 (previously 2050)

90% chance by 2070

The main reasons why are:

- Alex is now more uncertain because research pace over the past five years have been more surprising than expected, faster in some domains, but slower than others.

- Accounting for improvements in tooling. New libraries like TensorFlow and PyTorch have accelerated progress. Even CNNs can be used as a “tool” that provides features for downstream tasks like robotic control.

- He previously thought that labeled data might be a bottleneck, based on scaling laws showing that data needs might increase faster than compute; however, semi- and unsupervised learning have improved significantly, GPT-3 being the latest example of this.

- Alex now believes that compute will play a larger role and that compute can scale faster than algorithms because there is large worldwide consumer demand.

The post ends with a hypothetical description of how AGI may happen soon that I will leave out of the summary but recommend reading.

Nicholas's opinion: My personal opinion on timelines is that I think it is much more informative to draw out the full CDF/PDF of when we will get to AGI instead of percentages by different years. It isn’t included in the post, but you can find Alex’s here. I end up placing higher likelihood on AGI happening sooner than Alex does, but I largely agree with his reasoning.

More uncertainty than the original prediction seems warranted to me; the original prediction had a very high likelihood of AGI between 2045-2050 that I didn’t understand. Of the rest of the arguments, I agree most strongly with the section on tooling providing a speedup. I’d even push the point farther to say that there are many inputs into current ML systems, and all of them seem to be improving at a rapid clip. Hardware, software tools, data, and the number of ML researchers all seem to be on track to improve significantly over the next decade.

MISCELLANEOUS (ALIGNMENT)

The Problem with Metrics is a Fundamental Problem for AI (Rachel Thomas et al) (summarized by Flo): The blog post lists five problems of current AI that are exacerbated by the cheap cost and easy scaling of AI systems combined with the common belief that algorithms are objective and error-free:

1. It is often hard for affected people to address problems in algorithmic decisions

2. The complexity of AI problems can easily lead to a diffusion of responsibility

3. AI can encode biases and sometimes magnify them via feedback loops

4. Big tech companies lack accountability

5. Current AI systems usually focus exclusively on optimizing metrics.

The paper then dives deeper into the last point. They review a series of case studies and form four conclusions. First, measured metrics are usually only a proxy for what we really care about: Youtube's terminal goal is certainly not to maximize viewing time and society does not inherently care about student test scores. Secondly, metrics can and will be gamed: Soviet workers would often achieve their production targets at the cost of some unmeasured aspects of performance, reported waiting times in the English healthcare system were distorted once targets were set for them and evaluating teachers by test scores has led to cheating scandals in the US. Third, metrics tend to overemphasise short-term concerns as they are often easier to measure. This can be seen in businesses like Facebook and Wells Fargo that have faced political backlash, worse access to talent pools, or lawsuits because of an excessive focus on click-through rates and quarterly earnings. Fourth, tech firms often focus on metrics that are associated with addictive environments. For example, "engagement" metrics are used as proxies for user preferences but rarely reflect them accurately in contexts that were optimized for these metrics. The authors then propose three remedies: Using multiple metrics to get a more holistic picture and make gaming harder, combining metrics with qualitative accounts, and involving domain experts and stakeholders that would be personally affected by the deployed system.

Flo's opinion: I found this interesting to read, as it does not really seem to be written from the perspective of AI Safety but still lists some problems that are related to AI safety and governance. Just think of an AI system tasked to help with realizing human preferences magnifying "biases" in its preference elicitation via unwanted feedback loops, or about the lack of firms accountability for socioeconomic disturbances their AI systems could create that the windfall clause (AN #88) was envisioned to mitigate.

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey (Sanmit Narvekar et al) (summarized by Zach): For a variety of learning problems, the training process is organized so that new concepts and tasks leverage previously learned information. This can serve as a broad definition of curriculum learning. This paper gives an overview of curriculum learning and a framework to organize various approaches to the curriculum learning problem. One central difficulty is that there is a broad class of methods that can be considered curricula. At one extreme, we have curricula where new tasks are created to speed up learning. At another extreme, some curricula simply reorder experience samples. For example, the prioritized replay buffer is one such reordering method. Thus, to cover as much of the literature as possible the authors outline a framework for curriculum learning and then use that structure to classify various approaches. In general, the definition, learning, construction, and the evaluation of curricula are all covered in this work. This is done by breaking the curriculum learning problem into three steps: task generation, sequencing, and transfer learning. Using this problem decomposition the authors give an overview of work addressing each component.

Zach's opinion: Before I read this, I thought of curricula as 'hacks' used to improve training. However, the authors' presentation of connections with transfer learning and experience replay has significantly changed my opinion. In particular, the phrasing of curriculum learning as a kind of 'meta-MDP seems particularly interesting to me. Moreover, there seem to be interesting challenges in this field. One such challenge is that there does not seem to be a great amount of theory about why curricula work which could indicate a point of departure for people interested in safety research. Knowing more about theory could help answer safety questions. For example, how do we design curricula so that we can guarantee/check the agent is behaving correctly at each step?

NEWS

Looking for adversarial collaborators to test our Debate protocol (Beth Barnes) (summarized by Rohin): OpenAI is looking for people to help test their debate (AN #86) protocol, to find weaknesses that allow a dishonest strategy to win such debates.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

[-]David Scott Krueger6y10

Regarding curriculum learning: I think its very neglected, and seems likely to be a core component of prosaic alignment approaches. The idea of a "basin of attraction for corrigibility (or other desirable properties)" seems likely to rely on appropriate choice of curriculum.