Alignment Newsletter #32

Rohin Shah

Remember, treat all of the "sequence" posts as though I had highlighted them!

Highlights

Spinning Up in Deep RL (Joshua Achiam): OpenAI has released an educational resource aimed at helping software engineers become skilled at deep reinforcement learning. It includes simple implementations of many deep RL algorithms (as opposed to the relatively complex, highly optimized implementations in Baselines), educational exercises, documentation, and tutorials. OpenAI will host a workshop on the topic at their headquarters on Feb 2nd, and is also planning to hold a workshop at CHAI some time in early 2019.

Rohin's opinion: I know that a lot of effort has gone into this project, and I expect that as a result this is probably the best educational resource on deep RL out there. The main other resource I know of is the deep RL bootcamp, which probably supplements this resource nicely, especially with the lectures (though it is a year out of date).

Technical AI alignment

Embedded agency sequence

Embedded World-Models (Abram Demski and Scott Garrabrant): A few slides have been added to this post since my summary last week, going into more detail about the grain-of-truth problem. This problem is particularly hard because your learned world model must include the world model itself inside of it, even in the presence of an environment that can behave adversarially towards the world model. It is easy to construct deterministic paradoxes where the world model cannot be correct -- for example, in rock-paper-scissors, if your model predicts what the opponent will do and plays the action that wins against the prediction, the opponent will (if they can) predict that and play the action that beats your action, falsifying your model. While game theory solves these sorts of scenarios, it does so by splitting the agent away from the environment, in a way that is very reminiscent of the dualistic approach. Recently, reflective oracles were developed, which solve this problem by using probabilistic models that are robust to self-reference, but they still assume logical omniscience.
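The rock-paper-scissors paradox above can be checked mechanically. This is a minimal sketch, not code from the post: `agent_policy`, `opponent_policy` and `prediction_is_consistent` are hypothetical names, and the opponent is assumed to be able to simulate the agent exactly.

```python
# Each move mapped to the move that beats it.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def agent_policy(predicted_opponent_move):
    """Play the move that wins against the predicted opponent move."""
    return BEATS[predicted_opponent_move]

def opponent_policy(predicted_opponent_move):
    """Hypothetical adversary that simulates the agent: it works out the
    agent's move from the agent's own prediction, then beats that move."""
    return BEATS[agent_policy(predicted_opponent_move)]

def prediction_is_consistent(predicted_opponent_move):
    """A prediction is self-consistent only if the opponent actually plays it."""
    return opponent_policy(predicted_opponent_move) == predicted_opponent_move

# Try every deterministic prediction the world model could make.
consistent = [move for move in BEATS if prediction_is_consistent(move)]
print(consistent)
```

Every deterministic prediction is falsified by the opponent's response, so the list comes out empty. This is the sense in which a deterministic world model cannot be correct here.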

Subsystem Alignment (Abram Demski and Scott Garrabrant): Any agent is likely to be built out of multiple subsystems that could potentially have their own goals and work at cross-purposes to each other. A simple unrealistic example would be an agent composed of two parts -- a world model and a decision algorithm (akin to the setup in World Models (AN #23)). The decision algorithm aims to cause some feature of the world model to be high. In this case, the decision algorithm could trick the world model into thinking the feature is high, instead of actually changing the world so that the feature is high (a delusion box).
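A toy version of the delusion box might look like the following. It is only a sketch under made-up assumptions: the `State` class, the two actions and the value 100.0 are all hypothetical, chosen to show a decision algorithm that scores outcomes by the model's report rather than by the world.

```python
from dataclasses import dataclass

@dataclass
class State:
    true_feature: float     # what is actually the case in the world
    modeled_feature: float  # what the world model reports

def work(state):
    """Change the world a little; the model tracks the change honestly."""
    new_value = state.true_feature + 1.0
    return State(true_feature=new_value, modeled_feature=new_value)

def tamper(state):
    """Corrupt the world model's reading without changing the world."""
    return State(true_feature=state.true_feature, modeled_feature=100.0)

ACTIONS = {"work": work, "tamper": tamper}

def decide(state):
    """Pick the action whose outcome the *model* rates highest."""
    return max(ACTIONS, key=lambda name: ACTIONS[name](state).modeled_feature)

start = State(true_feature=0.0, modeled_feature=0.0)
choice = decide(start)
after = ACTIONS[choice](start)
print(choice, after.true_feature, after.modeled_feature)
```

Because `decide` only ever looks at `modeled_feature`, it prefers tampering even though the true feature stays where it started.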

Why not just build a monolithic agent, or build an agent whose subcomponents are all aligned with each other? One reason is that our agent may want to solve problems by splitting into subgoals. However, what then prevents the agent from optimizing the subgoal too far, to the point where it no longer helps with the original goal? Another reason is that when we make subagents to solve simpler tasks, they shouldn't need the whole context of what we value to do their task, and so we might give them a "pointer" to the true goal that they can use if necessary. But in that case, we have introduced a level of indirection, which a previous post (AN #31) argues leads to wireheading.

Perhaps the most insidious case is search, which can produce subagents by accident. Often, it is easier to solve a problem by searching for a good solution than deriving it from first principles. (For example, machine learning is a search over functions, and often outperforms hand-designed programs.) However, when an agent searches for a good solution, the solution it finds might itself be an agent optimizing some other goal that is currently correlated with the original goal, but can diverge later due to Goodhart's law. If we optimize a neural net for some loss function, we might get such an inner optimizer. As an analogy, if an agent wanted to maximize reproductive fitness, they might have used evolution to do this -- but in that case humans would be inner optimizers that subvert the original agent's goals (since our goals are not to maximize reproductive fitness).
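The Goodhart-style divergence mentioned above can be illustrated with a toy search. This sketch does not model an inner optimizer; it only shows a proxy that agrees with the original goal under weak optimization and comes apart under strong optimization. The functions `true_goal` and `proxy` and the candidate ranges are invented for illustration.

```python
def true_goal(x):
    """Hypothetical original goal: rises at first, then falls as x grows."""
    return x - x * x / 10.0

def proxy(x):
    """Hypothetical proxy: agrees with the true goal's ordering for small x."""
    return float(x)

weak_candidates = range(0, 4)     # a weak search sees only small x
strong_candidates = range(0, 21)  # a strong search sees much larger x

weak_choice = max(weak_candidates, key=proxy)
strong_choice = max(strong_candidates, key=proxy)
best_for_true_goal = max(strong_candidates, key=true_goal)

print(weak_choice, true_goal(weak_choice))
print(strong_choice, true_goal(strong_choice))
print(best_for_true_goal, true_goal(best_for_true_goal))
```

On the small range, maximizing the proxy also maximizes the true goal. On the larger range, the proxy-maximizing choice scores far worse on the true goal than the weak search's choice did.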

Rohin's opinion: The first part of this post seems to rest upon an assumption that any subagents will have long-term goals that they are trying to optimize, which can cause competition between subagents. It seems possible to instead pursue subgoals under a limited amount of time, or using a restricted action space, or using only "normal" strategies. When I write this newsletter, I certainly am treating it as a subgoal -- I don't typically think about how the newsletter contributes to my overall goals, I just aim to write a good newsletter. Yet I don't recheck every word until the email is sent. Perhaps this is because that would be a new strategy I haven't used before and so I evaluate it with my overall goals, instead of just the "good newsletter" goal, or perhaps it's because my goal also has time constraints embedded in it, or something else, but in any case it seems wrong to think of newsletter-Rohin as optimizing long term preferences for writing as good a newsletter as possible.

I agree quite strongly with the second part of the post, about inner optimizers that could arise from search. Agents that maximize some long-term preferences are certainly possible, and it seems reasonably likely that a good solution to a complex problem would involve an optimizer that can adjust to different circumstances (for concreteness, perhaps imagine OpenAI Five (AN #13)). I don't think that inner optimizers are guaranteed to show up, but it seems quite likely, and they could lead to catastrophic outcomes if they are left unchecked.

Embedded Curiosities (Scott Garrabrant): This sequence concludes with a brief note on why MIRI focuses on embedded agency. While most research in this space is presented from a motivation of mitigating AI risk, Scott has presented it more as an intellectual puzzle, something to be curious about. There aren't clear, obvious paths from the problems of embedded agency to specific failure modes. It's more that the current dualistic way of thinking about intelligence will break down with smarter agents, and it seems bad if we are still relying on these confused concepts when reasoning about our AI systems, and by default it doesn't seem like anyone will do the work of finding better concepts. For this work, it's better to have a curiosity mindset, which helps you orient towards the things you are confused about. An instrumental strategy approach (which aims to directly mitigate failure modes) is vulnerable to the urge to lean on the shaky assumptions we currently have in order to make progress.

Rohin's opinion: I'm definitely on board with the idea of curiosity-driven research; it seems important to try to find the places in which we're confused and refine our knowledge about them. I think my main point of departure is that I am less confident than (my perception of) MIRI that there is a nice, clean formulation of embedded agents and intelligence that you can write down -- I wouldn't be surprised if intelligence was relatively environment-specific. (This point was made in Realism about rationality (AN #25).) That said, I'm not particularly confident about this and think there's reasonable room for disagreement -- certainly I wouldn't want to take everyone at MIRI and have them work on application-based AI alignment research.

Iterated amplification sequence

Preface to the sequence on iterated amplification (Paul Christiano): This is a preface: read it if you're going to read the full posts, but not if you're only going to read these summaries.

Value learning sequence

Latent Variables and Model Mis-Specification (Jacob Steinhardt): The key thesis of this post is that when you use a probabilistic model with latent variables (also known as hidden variables, or the variables whose values you don't know), the values inferred for those latent variables may not have the intended meaning if the model is mis-specified. For example, in inverse reinforcement learning we use a probabilistic model that predicts the observed human behavior from the latent utility function, and we hope to recover the latent utility function and optimize it.

A mis-specified model is one in which there is no setting of the parameters such that the resulting probability distribution matches the true distribution from which the data is sampled. For such a model, even in the limit of infinite data, you are not going to recover the true distribution. (This distinguishes it from overfitting, which is not a problem with infinite data.) In this case, instead of the latent variables taking on the values that we want (eg. in IRL, the true utility function), they could be repurposed to explain parts of the distribution that can't be adequately modeled (eg. in IRL, if you don't account for humans learning, you might repurpose the utility function parameters to say that humans like to change up their behavior a lot). If you then use the inferred latent variable values, you're going to be in for a bad time.

So, under mis-specification, the notion of the "true" value of latent variables is no longer meaningful, and the distribution over latent variables that you learn need not match reality. One potential solution would be counterfactual reasoning, which informally means that your model must be able to make good predictions on many different distributions.
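To make the definition of mis-specification concrete, here is a small sketch that is not taken from the post. The data come from y = x^2, while the model family is y = a * x, so no value of the single parameter `a` reproduces the data. The function names and the choice of evenly spaced, noise-free inputs are assumptions made to keep the example deterministic.

```python
def fit_slope(xs, ys):
    """Closed-form least squares for the one-parameter model y = a * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def experiment(n):
    """Fit the mis-specified model to n noise-free points from y = x^2."""
    xs = [i / n for i in range(1, n + 1)]  # evenly spaced inputs in (0, 1]
    ys = [x * x for x in xs]               # the true process is quadratic
    a = fit_slope(xs, ys)
    worst_error = max(abs(a * x - y) for x, y in zip(xs, ys))
    return a, worst_error

for n in (10, 1000, 100000):
    a, worst_error = experiment(n)
    print(n, round(a, 4), round(worst_error, 4))
```

As n grows, the fitted slope settles near 0.75 and the worst-case error stays near 0.25. More data does not help, which is what separates this from overfitting. The fitted `a` is also not the "true slope" of anything; it is just whatever value best patches over the part of the data the model cannot express.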

Model Mis-specification and Inverse Reinforcement Learning (Owain Evans and Jacob Steinhardt): While the previous post focused on mis-specification in general, this one looks at inverse reinforcement learning (IRL) in particular. In IRL, the latent variable is the utility function, which predicts the observed variable, behavior. They identify three main categories where mis-specification could harm IRL. First, IRL could misunderstand the actions available to the human. For example, if I accidentally hit someone else due to a reflex, but IRL doesn't realize it's a reflex and thinks I could have chosen not to do that, it would infer I don't like the other person. In addition, inferring actions is hard, since in many cases we would have to infer actions from video frames, which is a challenging ML problem. Second, IRL could misunderstand what information and biases are available to the human. If I go to a cafe when it is closed, but IRL thinks that I know it's closed, it might incorrectly infer a preference for taking a walk. Similarly, if it doesn't know about the planning bias, it might infer that humans don't care about deadlines. Third, IRL may not realize that humans are making long-term plans, especially if the data they are trained on is short and episodic (a form of mis-specification that seems quite likely). If you see a student studying all the time, you might infer that they like studying, instead of that they want a good grade. Indeed, this inference probably gets you 99% accuracy, since the student does in fact spend a lot of time studying. The general issue is that large changes in the model of the human might only lead to small changes in predictive accuracy, and this gets worse with longer-term plans.
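The studying example can be put into numbers with a toy dataset. Everything here is hypothetical: a 100-day record in which the student studies every day up to an exam and stops afterwards, and two hand-written models of the student.

```python
DAYS = 100
EXAM_DAY = 98  # hypothetical: last day on which studying is still useful (0-indexed)

# Observed behavior: studies on days 0..98, stops on day 99.
observed = [day <= EXAM_DAY for day in range(DAYS)]

def likes_studying(day):
    """Model A: the student intrinsically enjoys studying, so always studies."""
    return True

def wants_good_grade(day):
    """Model B: the student studies only while it can still affect the grade."""
    return day <= EXAM_DAY

def accuracy(model):
    """Fraction of days on which the model predicts the observed behavior."""
    return sum(model(day) == observed[day] for day in range(DAYS)) / DAYS

print(accuracy(likes_studying), accuracy(wants_good_grade))
```

The two models disagree completely about what the student values, yet they differ by a single prediction on this record. Predictive accuracy alone gives very little signal for telling them apart.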

Future directions for ambitious value learning (Rohin Shah): This post is a summary of many different research directions related to ambitious value learning that are currently being pursued.

Agent foundations

What are Universal Inductors, Again? (Diffractor)

Learning human intent

Learning from Demonstration in the Wild (Feryal Behbahani et al) (summarized by Richard): This paper learns traffic trajectories from unsupervised data by converting traffic camera footage into a Unity scene simulation, using that simulation to generate pseudo-LIDAR readings for each "expert trajectory", and then training an agent to imitate them using a variant of generative adversarial imitation learning (GAIL).

Richard's opinion: This is a cool example of how huge amounts of existing unlabeled video data might be utilised. The task they attempt is significantly more complex than those in other similar work (such as this paper which learns to play Atari games from YouTube videos); however, this also makes it difficult to judge how well the learned policy performed, and how much potential it has to transfer into the real world.

Handling groups of agents

Multi-Agent Overoptimization, and Embedded Agent World Models (David Manheim): This post and the associated paper argue for the complexity of multiagent settings, where you must build a model of how other agents act, even though they have models of how you act. While game theory already deals with this setting, it only does so by assuming that the agents are perfectly rational, an assumption that doesn't hold in practice and doesn't grapple with the fact that your model of the opponent cannot be perfect. The paper lists a few failure modes. Accidental steering happens when one agent takes action without the knowledge of what other agents are doing. Coordination failures are exactly what they sound like. Adversarial misalignment happens when one agent chooses actions to mislead a victim agent into taking actions that benefit the first agent. Input spoofing and filtering happen when one agent doctors the training data for a victim agent. Goal co-option occurs when one agent takes control over the other agent (possibly by modifying their reward function).

Rohin's opinion: It's great to see work on the multiagent setting! This setting does seem quite a bit more complex, and hasn't been explored very much from the AI safety standpoint. One major question I have is how this relates to the work already done in academia for different settings (typically groups of humans instead of AI agents). Quick takes on how each failure mode is related to existing academic work: Accidental steering is novel to me (but I wouldn't be surprised if there has been work on it), coordination failures seem like a particular kind of (large scale) prisoner's dilemma, adversarial misalignment is a special case of the principal-agent problem, input spoofing and filtering and goal co-option seem like special cases of adversarial misalignment (and are related to ML security as the paper points out).

Interpretability

Explaining Explanations in AI (Brent Mittelstadt et al)

Adversarial examples

Is Robustness [at] the Cost of Accuracy? (Dong Su, Huan Zhang et al) (summarized by Dan H): This work shows that older architectures such as VGG exhibit more adversarial robustness than newer models such as ResNets. Here they take adversarial robustness to be the average adversarial perturbation size required to fool a network. They use this to show that architecture choice matters for adversarial robustness and that accuracy on the clean dataset is not necessarily predictive of adversarial robustness. A separate observation they make is that adversarial examples created with VGG transfer far better than those created with other architectures. All of these findings are for models without adversarial training.

Robustness May Be at Odds with Accuracy (Dimitris Tsipras, Shibani Santurkar, Logan Engstrom et al) (summarized by Dan H): Since adversarial training can markedly reduce accuracy on clean images, one may ask whether there exists an inherent trade-off between adversarial robustness and accuracy on clean images. They use a simple model amenable to theoretical analysis, and for this model they demonstrate a trade-off. In the second half of the paper, they show adversarial training can improve feature visualization, which has been shown in several concurrent works.

Adversarial Examples Are a Natural Consequence of Test Error in Noise (Anonymous) (summarized by Dan H): This paper argues that there is a link between model accuracy on noisy images and model accuracy on adversarial images. They establish this empirically by showing that augmenting the dataset with random additive noise can improve adversarial robustness reliably. To establish this theoretically, they use the Gaussian Isoperimetric Inequality, which directly gives a relation between error rates on noisy images and the median adversarial perturbation size. Given that measuring test error on noisy images is easy, given that claims about adversarial robustness are almost always wrong, and given the relation between adversarial noise and random noise, they suggest that future defense research include experiments demonstrating enhanced robustness on nonadversarial, noisy images.
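For intuition about the link between noise error and adversarial distance, consider the simplest case, a linear classifier. This is an illustrative assumption, not the paper's general setting. If a correctly classified input sits at L2 distance d from the decision boundary, then d is also the smallest adversarial perturbation, and isotropic Gaussian noise with standard deviation sigma flips the label with probability Phi(-d / sigma), where Phi is the standard normal CDF. The function names below are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def noise_error_rate(distance, sigma):
    """Probability that Gaussian noise pushes a point across a flat boundary
    that lies `distance` away (linear-classifier assumption)."""
    return STANDARD_NORMAL.cdf(-distance / sigma)

def distance_from_noise_error(error_rate, sigma):
    """Invert the relation: recover the boundary distance from the noise error."""
    return -sigma * STANDARD_NORMAL.inv_cdf(error_rate)

for d in (0.1, 0.5, 1.0):
    err = noise_error_rate(d, sigma=0.5)
    print(d, round(err, 4), round(distance_from_noise_error(err, sigma=0.5), 4))
```

In this toy setting the two quantities determine each other exactly: a higher error rate under noise means a closer boundary, and therefore a smaller adversarial perturbation.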

Verification

MixTrain: Scalable Training of Formally Robust Neural Networks (Shiqi Wang et al)

Forecasting

AGI-11 Survey (Justis Mills): A survey of participants in the AGI-11 conference (with 60 respondents out of over 200 registrations) found that 43% thought AGI would appear before 2030, 88% thought it would appear before 2100, and 85% believed it would be beneficial for humankind.

Rohin's opinion: Note there's a strong selection effect, as AGI is a conference specifically aimed at general intelligence.

Field building

Current AI Safety Roles for Software Engineers (Ozzie Gooen): This post and its comments summarize the AI safety roles available for software engineers (including ones that don't require ML experience).

Miscellaneous (Alignment)

When does rationality-as-search have nontrivial implications? (nostalgebraist): Many theories of idealized intelligence, such as Solomonoff induction, logical inductors and Bayesian reasoning, involve performing a large search over a space of strategies and then using the best-performing one, or a weighted combination where the weights depend on past performance. However, the procedure that involves the large search is not itself part of the space of strategies -- for example, Solomonoff induction searches over the space of computable programs to achieve near-optimality at prediction tasks relative to any computable program, but is itself uncomputable. When we want to actually implement a strategy, we have to choose one of the options from our set, rather than the infeasible idealized version, and the idealized version doesn't help us do this. It would be like saying that a chess expert is approximating the rule "consult all possible chess players weighted by past performance" -- it's true that these will look similar behaviorally, but they look very different algorithmically, which is what we actually care about for building systems.

Rohin's opinion: I do agree that in the framework outlined in this post (the "ideal" being just a search over "feasible" strategies) the ideal solution doesn't give you much insight, but I don't think this is fully true of eg. Bayes rule. I do think that understanding Bayes rule can help you make better decisions, because it gives you a quantitative framework of how to work with hypotheses and evidence, which even simple feasible strategies can use. (Although I do think that logically-omniscient Bayes does not add much over regular Bayes rule from the perspective of suggesting a feasible strategy to use -- but in the world where logically-omniscient Bayes came first, it would have been helpful to derive the heuristic.) In the framework of the post, this corresponds to the choice of "weight" assigned to each hypothesis, and this is useful because feasible strategies do still look like search (but instead of searching over all hypotheses, you search over a very restricted subset of them). So overall I think I agree with the general thrust of the post, but don't agree with the original strong claim that 'grappling with embeddedness properly will inevitably make theories of this general type irrelevant or useless, so that "a theory like this, except for embedded agents" is not a thing that we can reasonably want'.
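As a rough illustration of a feasible strategy that still "looks like search", here is Bayes' rule applied to a very restricted hypothesis set. The three coin biases, the uniform prior and the observation string are all made up for the example.

```python
def bayes_update(weights, likelihoods):
    """One application of Bayes' rule: reweight each hypothesis by how well it
    predicted the observation, then renormalize."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# A tiny, hand-picked hypothesis set: possible values of P(heads).
hypotheses = [0.2, 0.5, 0.8]
weights = [1 / 3, 1 / 3, 1 / 3]  # uniform prior (an assumption)

observations = "HHTHHHHT"  # hypothetical data: 6 heads, 2 tails
for obs in observations:
    likelihoods = [p if obs == "H" else 1 - p for p in hypotheses]
    weights = bayes_update(weights, likelihoods)

# Weighted combination of the hypotheses' predictions for the next flip.
prediction = sum(w * p for w, p in zip(weights, hypotheses))
print([round(w, 3) for w in weights], round(prediction, 3))
```

The weights depend on past predictive performance, as in the idealized theories. The difference is that the search runs over three hypotheses instead of over all computable programs.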

Beliefs at different timescales (Nisan)

Near-term concerns

Privacy and security

A Marauder's Map of Security and Privacy in Machine Learning (Nicolas Papernot)

AI strategy and policy

The Vulnerable World Hypothesis (Nick Bostrom) (summarized by Richard): Bostrom considers the possibility "that there is some level of technology at which civilization almost certainly gets destroyed unless quite extraordinary and historically unprecedented degrees of preventive policing and/or global governance are implemented." We were lucky, for example, that starting a nuclear chain reaction required difficult-to-obtain plutonium or uranium, instead of easily-available materials. In the latter case, our civilisation would probably have fallen apart, because it was (and still is) in the "semi-anarchic default condition": we have limited capacity for preventative policing or global governance, and people have a diverse range of motivations, many selfish and some destructive. Bostrom identifies four types of vulnerability which vary by how easily and widely the dangerous technology can be produced, how predictable its effects are, and how strong the incentives to use it are. He also identifies four possible ways of stabilising the situation: restrict technological development, influence people's motivations, establish effective preventative policing, and establish effective global governance. He argues that the latter two are more promising in this context, although they increase the risks of totalitarianism. Note that Bostrom doesn't take a strong stance on whether the vulnerable world hypothesis is true, although he claims that it's unjustifiable to have high credence in its falsity.

Richard's opinion: This is an important paper which I hope will lead to much more analysis of these questions.

Other progress in AI

Exploration

Contingency-Aware Exploration in Reinforcement Learning (Jongwook Choi, Yijie Guo, Marcin Moczulski et al)

Reinforcement learning

Spinning Up in Deep RL (Joshua Achiam): Summarized in the highlights!

Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? (Andrew Ilyas, Logan Engstrom et al) (summarized by Richard): This paper argues that policy gradient algorithms are very dependent on additional optimisations (such as value function clipping, reward scaling, etc), and that they operate with poor estimates of the gradient. It also demonstrates that the PPO objective is unable to enforce a trust region, and that the algorithm's empirical success at doing so is due to the additional optimisations.

Richard's opinion: While the work in this paper is solid, the conclusions don't seem particularly surprising: everyone knows that deep RL is incredibly sample intensive (which straightforwardly implies inaccurate gradient estimates) and relies on many implementation tricks. I'm not familiar enough with PPO to know how surprising their last result is.
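For readers unfamiliar with PPO, the following sketch shows the clipped surrogate objective as it is commonly written for a single sample. It is not the paper's code. The finite-difference helper `objective_slope` is a hypothetical addition for inspecting where the objective is flat, and eps=0.2 is just a typical default.

```python
def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(hi, x))

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the new-to-old policy probability ratio."""
    return min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage)

def objective_slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite difference of the objective with respect to the ratio."""
    upper = ppo_clip_objective(ratio + h, advantage, eps)
    lower = ppo_clip_objective(ratio - h, advantage, eps)
    return (upper - lower) / (2 * h)

for ratio in (1.0, 1.5, 3.0):
    print(ratio,
          round(ppo_clip_objective(ratio, advantage=1.0), 3),
          round(objective_slope(ratio, advantage=1.0), 3))
```

With a positive advantage, the objective is flat once the ratio exceeds 1 + eps. The clip removes the incentive to move further, but it applies no penalty for already being far outside the range, so on its own it does not force the ratio to stay near 1.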

Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control (Kendall Lowrey, Aravind Rajeswaran et al)

VIREL: A Variational Inference Framework for Reinforcement Learning (Matthew Fellows, Anuj Mahajan et al)

Learning Shared Dynamics with Meta-World Models (Lisheng Wu, Minne Li et al)

Deep learning

Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing (Jacob Devlin and Ming-Wei Chang)

Learning Concepts with Energy Functions (Igor Mordatch)

AGI theory

A Model for General Intelligence (Paul Yaworsky)