Building AI systems that require informed consent
Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.
This newsletter is late because I wanted to include Towards formalizing universality and related posts, but I haven't yet understood them well enough to summarize them, so I'm holding them for next week.
Non-Consequentialist Cooperation? (Abram Demski): One possible way to build useful AI systems is to have them try to help us. Taking a more libertarian stance, we could instead build an autonomy-centric robot that only takes actions for which we have given informed consent. We can't ask for explicit consent for every action, since there's no clear way to break actions down, and it would certainly be too onerous to give informed consent to every possible motor command. As a result, our robots will need to infer when we have given consent. This increases the chance of misunderstanding, but we could set a high threshold, so that the robot asks us whenever it is even slightly unsure whether we have given consent.
If we want to precisely define consent, we'll need to solve some of the same problems that impact measures contend with. In particular, we would need to get consent for outcomes that the robot knows will happen as a result of its actions, but not for ones that happen as unforeseen side effects. It's fine to give informed consent for the robot to buy bananas from a grocery store, even if that could cause a hurricane, as long as the robot doesn't know that it would cause a hurricane. Another challenge is that inferring consent requires confronting the fact that humans can be irrational. A third is that we might prevent the robot from taking actions that would help us in ways we can't understand -- consider trying to ask a dog for its informed consent to take it to the vet.
Rohin's opinion: This seems like an interesting idea for how to build an AI system in practice, along the same lines as corrigibility. We notice that value learning is not very robust: if you aren't very good at value learning, then you can get very bad behavior, and human values are sufficiently complex that you do need to be very capable in order to be sufficiently good at value learning. With (a particular kind of) corrigibility, we instead set the goal to be to make an AI system that is trying to help us, which seems more achievable even when the AI system is not very capable. Similarly, if we formalize or learn informed consent reasonably well (which seems easier to do since it is not as complex as "human values"), then our AI systems will likely have good behavior (though they will probably not have the best possible behavior, since they are limited by having to respect informed consent).
However, this also feels different from corrigibility, in that it feels more like a limitation put on the AI system, while corrigibility seems more like a property of the AI's "motivational system". This might be fine, since the AI might just not be goal-directed. One other benefit of corrigibility is that if you are "somewhat" corrigible, then you would like to become more corrigible, since that is what the human would prefer; informed-consent-AI doesn't seem to have an analogous benefit.
AlphaGo Zero and capability amplification (Paul Christiano): AlphaGo Zero works by starting with a randomly chosen policy and value network. Then, it repeatedly applies a "policy improvement" step: it runs MCTS using the policy and value networks to guide the search, which results in moves that are better than the policy and value networks used alone, and then trains the policy and value networks on those moves. Iterated amplification is very similar: it starts with a base policy, and then repeatedly applies a "policy improvement" step that consists of first amplifying the policy and then distilling it. MCTS is analogous to amplification, and training on the resulting moves is analogous to distillation.
So, if a general method of building capable agents is to repeatedly apply "policy improvement" steps, the key challenge is to figure out a "policy improvement" operator that preserves alignment. (We could hopefully get an aligned weak initial agent, e.g. by imitating a human's cognitive procedure.) In iterated amplification, we are hoping that "think longer" (as formalized by increasingly large bounded versions of HCH (AN #34)) would be an alignment-preserving policy improvement operator.
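The shared loop can be sketched in a few lines. This is a toy illustration of the analogy, not either system's actual implementation: `improve` stands in for MCTS (AlphaGo Zero) or amplification (iterated amplification), and `distill` stands in for training the fast networks on the improved policy's outputs; all names and details here are assumptions for illustration.

```python
import random

def improve(policy, n_consults=10):
    """Amplification step: combine several consultations of the current
    policy into a stronger (but slower) policy -- the analogue of MCTS
    guided by the policy/value networks."""
    def amplified(state):
        votes = [policy(state) for _ in range(n_consults)]
        return max(set(votes), key=votes.count)  # majority vote as a stand-in
    return amplified

def distill(amplified_policy, states):
    """Distillation step: train a fast policy to imitate the slow one.
    Here we simply record its answers, as a stand-in for supervised learning."""
    table = {s: amplified_policy(s) for s in states}
    return lambda s: table.get(s, random.choice([0, 1]))

def train(initial_policy, states, iterations=3):
    """Repeatedly apply the 'policy improvement' operator: amplify, then distill."""
    policy = initial_policy
    for _ in range(iterations):
        policy = distill(improve(policy), states)
    return policy
```

The alignment question then becomes whether each pass through `improve` preserves alignment, since `distill` only tries to imitate what `improve` produces.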
Directions and desiderata for AI alignment (Paul Christiano): What further research needs to be done to get iterated amplification to work? There are three main areas: robustness, reward learning, and amplification.
With reliability and robustness, we want our AI systems to never fail catastrophically, whereas the simple setup of iterated amplification only optimizes performance in the average case. This is discussed more in Techniques for Optimizing Worst-Case Performance, which talks about adversarial training, interpretability, and verification.
Reward learning, as I understand it here, is the same thing as narrow value learning, described in this newsletter. We've talked a lot about the challenges of reward learning, so I'll leave that part out.
Finally, deliberation and amplification is about how to train an ML system to exceed human performance. At present, the only way we have of doing this is to use reinforcement learning to optimize some simple objective. However, given that "human values" is not a simple objective, we need to figure out another way that allows us to get superhuman performance on the tasks that we actually care about. You could do this with ambitious value learning, but not with narrow value learning. Alternatively, you could try to replicate human cognition and run it longer, either by using iterated amplification, or by using inverse reinforcement learning on cognition.
Then, the post goes on to talk about desiderata for any solution to AI alignment, proposing that the solution should be secure, competitive and scalable.
A secure AI system is one that works even when the environment is behaving adversarially. This is not possible in full generality because of no-free-lunch theorems, but we could at least hope that our AI system is never actively "trying" to hurt us. This is desirable because it holds us to a relatively clear standard that we can evaluate currently, rather than trying to predict how the future will look (a notoriously hard task) and evaluating whether our AI system will work in that future. In addition, the typical iterative method of trying things out and seeing how they do may not work in the future, because technological progress may accelerate due to recursive improvement. If we instead focus on arguments and analysis that suggest that our AI systems are secure, we would not need the iterative method. Finally, there could in fact be adversaries (that themselves might be AI systems) that we need to deal with.
An AI system is competitive if it is almost as efficient and capable as the best AI system we could build at the same time (which may be unaligned). This is necessary because without it, it becomes very likely that someone will build the unaligned version for the efficiency gains; avoiding this would require global coordination, which seems quite difficult. In addition, it is relatively easy to tell when a system is competitive, whereas if an AI system is uncompetitive, it is hard to tell how uncompetitive it is. Once again, with an uncompetitive AI system, we would need to predict the future in order to tell whether or not it will be okay to rely on that system.
A scalable AI system is one that remains aligned as the underlying technologies become more and more powerful. Without scalability, we would need to predict how far a particular scheme will take us (which seems hard), and we would need to continually invest in alignment research in order to "keep up" with improving capabilities, which might be particularly hard to do in the face of accelerating progress due to recursive improvement.
These are very demanding desiderata. However, while it may seem impossible to get a secure, competitive, scalable solution to AI alignment, we don't know why yet, and impossibility claims are notoriously hard to get right. In any case, even if it is impossible, it would be worthwhile to clarify exactly why these desiderata are impossible to meet, in order to figure out how to deal with the resulting problems.
Rohin's opinion: The three directions of reliability/robustness, reward learning, and amplification seem great. Robustness seems particularly hard to achieve. While there is current work on adversarial training, interpretability and verification, even if all of the problems that researchers currently work on were magically solved, I don't have a story for how that leads to robustness of (say) an agent trained by iterated amplification. Normally, I'd also talk about how amplification seems to require some sort of universality assumption, but now I can just refer you to ascription universality (described in this newsletter).
I am more conflicted about the desiderata. They seem very difficult to satisfy, and they don't seem strictly necessary to achieve good outcomes. The underlying view here is that we should aim for something that we know is sufficient to achieve good outcomes, and only weaken our requirements if we find a fundamental obstacle. My main issue with this view is that even if it is true that the requirements are impossible to satisfy, it seems very hard to know this, and so we may spend a lot of time trying to satisfy these requirements and most of that work ends up being useless. I can imagine that we try to figure out ways to achieve robustness for several years in order to get a secure AI system, and it turns out that this is impossible to do in a way where we know it is robust, but in practice any AI system that we train will be sufficiently robust that it never fails catastrophically. In this world, we keep trying to achieve robustness, never find a fundamental obstruction, but also never succeed at creating a secure AI system.
Another way of phrasing this is that I am pessimistic about the prospects of conceptual thinking, which seems to be the main way by which we could find a fundamental obstruction. (Theory and empirical experiments can build intuitions about what is and isn't hard, but given the complexities of the real world it seems unlikely that either would give us the sort of crystallized knowledge that Paul is aiming for.) Phrased this way, I put less credence in this opinion, because I think there are a few examples of conceptual thinking being very important, though not that many.
What is narrow value learning? (Rohin Shah): This post introduces the concept of narrow value learning (as opposed to ambitious value learning (AN #31)), in which an AI system is trained to produce good behavior within a particular domain, without expecting generalization to novel circumstances. Most current work in the ML community on preference learning or reward learning falls under narrow value learning.
Ambitious vs. narrow value learning (Paul Christiano): While previously I defined narrow value learning as obtaining good behavior in a particular domain, this post defines it as learning the narrow subgoals and instrumental values that the human is pursuing. I believe that these are pointing at the same underlying concept. There are a few reasons to be optimistic about using only narrow value learning to build very powerful AI systems. First, it should be relatively easy to infer many instrumental goals that humans have, such as "acquiring more resources under my control", "better understanding the world and what I want", "remaining in control of deployed AI systems", etc. which an AI system could then pursue with all of its ingenuity. Second, we could infer enough of human preferences to keep humans in a safe environment where they can deliberate in order to figure out what they want to do with the future. Third, humans could use these narrow AI systems as tools in order to implement sophisticated plans, allowing them to perform tasks that we would currently consider to be beyond human ability.
Human-AI Interaction (Rohin Shah): One of the lessons of control theory is that you can achieve significantly stronger guarantees if you are able to make use of feedback. Self-driving cars without any sensors would be basically impossible to build. Ambitious value learning (AN #31) aims to find the utility function that will determine the optimal behavior for the rest of time -- without any feedback. However, human preferences and values will evolve over time as we are exposed to new technologies, cultural norms, and governance structures. This is analogous to the environmental disturbances that control theory assumes, and just as with control theory it seems likely that we will need to have feedback, in the form of some data about human preferences, to accommodate these changes.
This suggests we might consider an AI design where the AI system constantly elicits information from humans about their preferences using narrow value learning techniques, and acts based on its current understanding. The obvious way of doing this would be to have an estimate of the reward that is updated over time, and actions are chosen based on the current estimate. However, this still has several problems. Most notably, if the AI system chooses actions that are best according to the current reward estimate, it still has convergent instrumental subgoals, and in particular actions that disable the narrow value learning system to lock in the current reward estimate would be rated very highly. Another problem is that this model assumes that human preferences and values change "by magic" and any such change is good -- but in reality, we likely want to make sure this process is "good", and in particular does not end up being determined by the AI manipulating our preferences.
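The "obvious way" above can be made concrete with a minimal sketch (the actions, ratings, and update rule here are all assumptions for illustration, not from the post): the agent keeps a running reward estimate, updates it from human feedback, and acts greedily on the current estimate. Note that nothing in the greedy step prevents the agent from preferring an action that disables the feedback channel, which is exactly the lock-in problem described above.

```python
# Candidate actions; the last one illustrates the lock-in failure mode.
ACTIONS = ["fetch_coffee", "clean_desk", "disable_feedback_channel"]

def act(reward_estimate):
    """Greedy choice w.r.t. the current reward estimate. If disabling the
    feedback channel ever scores highest, nothing here rules it out."""
    return max(ACTIONS, key=lambda a: reward_estimate.get(a, 0.0))

def update(reward_estimate, action, human_rating, lr=0.5):
    """Narrow value learning stand-in: nudge the estimate toward feedback."""
    old = reward_estimate.get(action, 0.0)
    reward_estimate[action] = old + lr * (human_rating - old)
    return reward_estimate

# One interaction step: act, get feedback, update the estimate.
estimate = {"fetch_coffee": 0.2}
a = act(estimate)
estimate = update(estimate, a, human_rating=1.0)
```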
Comments on CAIS (Richard Ngo): This post is another summary of and response to Comprehensive AI Services, after mine last week (AN #40). I recommend it to get a different take on an important set of ideas. It delves much more into the arguments between the CAIS perspective and the single-AGI-agent perspective than my summary did.
No surjection onto function space for manifold X (Stuart Armstrong)
Non-Consequentialist Cooperation? (Abram Demski): Summarized in the highlights!
Risk-Aware Active Inverse Reinforcement Learning (Daniel S. Brown, Yuchen Cui et al)
On the Utility of Model Learning in HRI (Rohan Choudhury, Gokul Swamy et al)
Hierarchical system preferences and subagent preferences (Stuart Armstrong): Often, when looking at a hierarchical system, you can ascribe preferences to the system as a whole, as well as to individual parts or subagents of the system. Often, any divergence between the parts can be either interpreted as a difference between the goals of the parts, or a failure of rationality of one of the parts. The post gives a particular algorithm, and shows how based on the structure of the code of the subagent, we could either infer that the subagent is mistaken, or that it has different goals.
Now, we could infer meta-preferences by seeing how the system tends to self-modify -- perhaps we notice that it tends to amplify the "goals" of one particular subagent, in which case we can infer that it has a meta-preference for those goals. But without that, there's no correct answer to what the true goals are. In the post's own words: "In the absence of some sort of meta-preferences, there are multiple ways of establishing the preferences of a hierarchical system, and many of them are equally valid."
Personalized explanation in machine learning (Johannes Schneider et al)
The American Public’s Attitudes Concerning Artificial Intelligence (Baobao Zhang et al): This presents results from a survey of Americans about their attitudes towards AI. There's no compelling overall story I can tell, so you might as well look at the executive summary, which presents a few interesting highlights. One interesting fact: the median respondent thinks that we're more likely than not to have "high-level machine intelligence" within 10 years! You could also read Vox's take, which emphasizes that the public is concerned about long-term AI risk.
Building an AI World (Tim Dutton et al) (summarized by Richard): This report summarises the AI strategies released by 18 different countries and regions. In particular, it rates how much emphasis they put on each of 8 areas. Broadly speaking, countries were most focused on Industrial strategy, Research, and AI talent (in that order), moderately focused on Ethics and Data, and least focused on Future of work, AI in governments, and Inclusion.
Richard's opinion: Since this report discusses neither its methodology nor the implications of its findings, it's difficult to know what conclusions to draw from it. The overall priorities seem to be roughly what I would have expected, except that I'm positively surprised by how much emphasis was placed on ethics.
Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions (Rui Wang et al) (summarized by Richard): The POET algorithm uses evolutionary strategies to evolve a population of pairs of tasks and agents. During each iteration, it first generates a new environment by perturbing an existing environment, then optimises each agent for its paired environment, then attempts to transfer agents between existing environments to improve performance (in case one environment turns out to be a useful "stepping stone" towards another). New environments are kept if they are neither too hard nor too easy for the current population of agents. This algorithm was tested using the Bipedal Walker environment, where it significantly outperformed standard evolutionary search.
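The summarized loop can be sketched as follows. This is a rough outline of one POET-style iteration under stated assumptions: `perturb`, `optimize`, and `score` are placeholder functions supplied by the caller, and the acceptance/transfer logic is simplified relative to the paper.

```python
def poet_step(pairs, perturb, optimize, score, min_score=0.1, max_score=0.9):
    """One iteration over a population of (environment, agent) pairs."""
    # 1. Generate new environments by perturbing existing ones.
    new_envs = [perturb(env) for env, _ in pairs]
    # 2. Keep only environments that are neither too easy nor too hard
    #    for the current population of agents.
    for env in new_envs:
        best = max(score(env, agent) for _, agent in pairs)
        if min_score < best < max_score:
            # Pair the new environment with the agent that does best on it.
            agent = max((a for _, a in pairs), key=lambda a: score(env, a))
            pairs.append((env, agent))
    # 3. Optimize each agent for its paired environment.
    pairs = [(env, optimize(env, agent)) for env, agent in pairs]
    # 4. Transfer: give each environment whichever agent performs best on it,
    #    so environments can act as "stepping stones" for one another.
    agents = [a for _, a in pairs]
    return [(env, max(agents, key=lambda a: score(env, a))) for env, _ in pairs]
```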
Richard's opinion: I think that the "problem problem" is going to become increasingly important in RL, and that this is a promising approach. Note that this paper's contribution seems to be mainly that it combines ideas from previous papers on minimal criteria coevolution and innovation engines.
Self-supervised Learning of Image Embedding for Continuous Control (Carlos Florensa et al)
Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization (Takayuki Osa)
Non-Consequentialist Cooperation? (Abram Demski): [...]
You could definitely think of it as a limitation to put on a system, but I actually wasn't thinking of it that way when I wrote the post. I was trying to imagine something which only operates from this principle. Granted, I didn't really explain how that could work. I was imagining that it does something like sample from a probability distribution which is (speaking intuitively) the intersection of what you expect it to do and what you would like it to do.
(It now seems to me that although I put "non-consequentialist" in the title of the post, I didn't explain the part where it isn't consequentialist very well. Which is fine, since the post was very much just spitballing.)
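[The "intersection" idea can be made concrete as sampling in proportion to the product of two distributions over candidate policies. This is a toy sketch; the policies and probabilities below are made up for illustration and are not from the post.]

```python
import random

policies = ["walk_to_store", "drive_to_store", "build_nanotech_factory"]

# What the user expects the robot might do (i.e., where consent is plausible).
p_expected = {"walk_to_store": 0.6, "drive_to_store": 0.4,
              "build_nanotech_factory": 0.0}
# What the robot infers the user would like.
p_wanted = {"walk_to_store": 0.3, "drive_to_store": 0.5,
            "build_nanotech_factory": 0.2}

def sample_consented(p_expected, p_wanted):
    """Sample a policy in proportion to expected * wanted -- intuitively,
    the 'intersection' of the two distributions."""
    weights = [p_expected[p] * p_wanted[p] for p in policies]
    if sum(weights) == 0:
        return None  # nothing is both expected and wanted: ask the human
    return random.choices(policies, weights=weights)[0]

# "build_nanotech_factory" can never be sampled: the user doesn't expect it,
# so its weight is zero no matter how highly the robot rates it.
```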
> I was imagining that it does something like sample from a probability distribution which is (speaking intuitively) the intersection of what you expect it to do and what you would like it to do.
When I use the word "limitation", I intend to include things like this. I think the way I'm using it is something like "a method of restricting possible behaviors that is not implemented through changing the "motivational system"". The idea being that if the AI is badly motivated, and our limitation doesn't eliminate all of the bad behaviors (a very strong condition that seems hard to meet), then it is likely that the AI will end up picking out one of the bad behaviors. This is similar in spirit to the ideas behind nearest unblocked strategies.
Here, we are eliminating bad behaviors by choosing a particular set of behaviors (ones that violate your expectations of what the robot will do) and not considering them at all. This seems like it is not about the "motivational system", and if this were implemented in a robot that does have a separate "motivational system" (i.e. it is goal-directed), I worry about a nearest unblocked strategy.
> This seems like it is not about the "motivational system", and if this were implemented in a robot that does have a separate "motivational system" (i.e. it is goal-directed), I worry about a nearest unblocked strategy.
I am confused about where you think the motivation system comes into my statement. It sounds like you are imagining that what I said is a constraint, which could somehow be coupled with a separate motivation system. If that's your interpretation, that's not what I meant at all, unless random sampling counts as a motivation system. I'm saying that all you do is sample from what's consented to.
But, maybe what you are saying is that in "the intersection of what the user expects and what the user wants", the first is functioning as a constraint, and the second is functioning as a motivation system (basically the usual IRL motivation system). If that's what you meant, I think that's a valid concern. What I was imagining is that you are trying to infer "what the user wants" not in terms of end goals, but rather in terms of actions (really, policies) for the AI. So, it is more like an approval-directed agent to an extent. If the human says "get me groceries", the job of the AI is not to infer the end state the human is asking the robot to optimize for, but rather, to infer the set of policies which the human is trying to point at.
There's no optimization on top of this that could find perverse instantiations of the constraints; the AI just follows the policy which it infers the human would like. Of course, the powerful learning system required for this to work may perversely instantiate these beliefs (i.e., there may be daemons, aka inner optimizers).
(The most obvious problem I see with this approach is that it seems to imply that the AI can't help the human do anything which the human doesn't already know how to do. For example, if you don't know how to get started filing your taxes, then the robot can't help you. But maybe there's some way to differentiate between more benign cases like that and less benign cases like using nanotechnology to more effectively get groceries?)
A third interpretation of your concern is that you're saying that if the thing is doing well enough to get groceries, there has to be powerful optimization somewhere, and wherever it is, it's going to be pushing toward perverse instantiations one way or another. I don't have any argument against this concern, but I think it mostly amounts to a concern about inner optimizers.
(I feel compelled to mention again that I don't feel strongly that the whole idea makes any sense. I just want to convey why I don't think it's about constraining an underlying motivation system.)
> But, maybe what you are saying is that in "the intersection of what the user expects and what the user wants", the first is functioning as a constraint, and the second is functioning as a motivation system (basically the usual IRL motivation system).
This is basically what I meant. Thanks for clarifying that you meant something else.
> The most obvious problem I see with this approach is that it seems to imply that the AI can't help the human do anything which the human doesn't already know how to do.
Yeah, this is my concern with the thing you actually meant. (It's also why I incorrectly assumed that "what the user wants" was meant to be goal-directed optimization, as opposed to about policies the user approves of.) It could work combined with something like amplification where you get to assume that the overseer is smarter than the agent, but then it's not clear if the part about "what the user expects" buys you anything over the "what the user wants" part.
This does seem like a concern, but it wasn't the one I was thinking about. It also seems like a concern about basically any existing proposal. Usually when talking about concerns I don't bring up the ones that are always concerns, unless someone explicitly claims that their solution obviates that concern.
Can you expand on your reasons for pessimism?
Idk, it seems hard to do; I personally have had trouble doing it; the future is vast and complex and hard to fit in your head. When trying to make an argument that eliminates all possible bad behaviors while including all the good ones, it seems like you're going to forget some cases. Proofs let you avoid this because they hold you to a very high standard, but there's no equivalent with conceptual thinking.
(These aren't very clear/are confused, if that wasn't obvious already.)
Another way of putting it is that conceptual thinking doesn't seem to have great feedback loops, which experiments clearly have and theory kind of has (you can at least get the binary true/false feedback once you prove any particular theorem).