So far we have been talking about how to learn “values” or “instrumental goals”. This would be necessary if we want to figure out how to build an AI system that does exactly what we want it to do. However, we’re probably fine if we can keep learning and building better AI systems. This suggests that it’s sufficient to build AI systems that don’t screw up so badly that they end this process. If we accomplish that, then steady progress in AI will eventually get us to AI systems that do what we want.

So, it might be helpful to break down the problem of learning values into the subproblems of learning what to do, and learning what not to do. Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.

This is a problem that humans have to solve as well. Children learn basic norms such as not to litter, not to take other people’s things, what not to say in public, etc. As argued in Incomplete Contracting and AI Alignment, no contract between humans is ever completely spelled out; instead, it relies on an external, unwritten normative structure under which the contract is interpreted. (Even if we don’t explicitly ask our cleaner not to break any vases, we still expect them not to intentionally do so.) We might hope to build AI systems that infer and follow these norms, and thereby avoid catastrophe.

It’s worth noting that this will probably not be an instance of narrow value learning, since there are several differences:

  • Narrow value learning requires that you learn what to do, unlike norm inference.
  • Norm following requires learning from a complex domain (human society), whereas narrow value learning can be applied in simpler domains as well.
  • Norms are a property of groups of agents, whereas narrow value learning can be applied in settings with a single agent.

Despite this, I have included it in this sequence because it is plausible to me that value learning techniques will be relevant to norm inference.

Paradise prospects

With a norm-following AI system, the success story is primarily around accelerating our rate of progress. Humans remain in charge of the overall trajectory of the future, and we use AI systems as tools that enable us to make better decisions and create better technologies, which looks like “superhuman intelligence” from our vantage point today.

If we still want an AI system that colonizes space and optimizes it according to our values without our supervision, we can figure out what our values are over a period of reflection, solve the alignment problem for goal-directed AI systems, and then create such an AI system.

This is quite similar to the success story in a world with Comprehensive AI Services.

Plausible proposals

As far as I can tell, there has not been very much work on learning what not to do. Existing approaches like impact measures and mild optimization are aiming to define what not to do rather than learn it.

One approach is to scale up techniques for narrow value learning. It seems plausible that in sufficiently complex environments, these techniques will learn what not to do, even though they are primarily focused on what to do in current benchmarks. For example, if I see that you have a clean carpet, I can infer that it is a norm not to walk over the carpet with muddy shoes. If you have an unbroken vase, I can infer that it is a norm to avoid knocking it over. This paper of mine shows how you can reach these sorts of conclusions with narrow value learning (specifically a variant of IRL).
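To make the vase example concrete, here is a minimal sketch of the underlying inference as a two-hypothesis Bayesian update. It is not the IRL variant from the paper mentioned above. The function name, the per-step breakage probabilities, and the prior are all made-up assumptions chosen only for illustration. The idea is that the longer a fragile object survives in a room where people are active, the more likely it is that people are deliberately avoiding breaking it.

```python
def posterior_norm_exists(steps, p_break_if_indifferent=0.05,
                          p_break_if_careful=0.001, prior=0.5):
    """Posterior probability that a "don't break the vase" norm exists,
    given that the vase is still intact after `steps` time steps.

    All probabilities here are illustrative assumptions, not measured values.
    """
    # Likelihood of an intact vase under each hypothesis.
    likelihood_norm = (1 - p_break_if_careful) ** steps
    likelihood_no_norm = (1 - p_break_if_indifferent) ** steps

    # Bayes' rule over the two hypotheses.
    evidence = prior * likelihood_norm + (1 - prior) * likelihood_no_norm
    return prior * likelihood_norm / evidence


if __name__ == "__main__":
    for steps in (0, 10, 100):
        print(steps, round(posterior_norm_exists(steps), 3))
```

With no observations the posterior equals the prior, and it rises toward 1 as the vase stays intact for longer. A realistic system would have to infer this from much richer state and without being told which hypotheses to consider.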

Another approach would be to scale up work on ad hoc teamwork. In ad hoc teamwork, an AI agent must learn to work in a team with a bunch of other agents, without any prior coordination. While current applications are very task-based (eg. playing soccer as a team), it seems possible that as this is applied to more realistic environments, the resulting agents will need to infer norms of the group that they are introduced into. It’s particularly nice because it explicitly models the multiagent setting, which seems crucial for inferring norms. It can also be thought of as an alternative statement of the problem of AI safety: how do you “drop in” an AI agent into a “team” of humans, and have the AI agent coordinate well with the “team”?
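As a toy illustration of what “inferring the norms of the group” could look like in the simplest possible case, the sketch below has a newly introduced agent watch its teammates and adopt whichever convention they appear to follow. The convention names (`keep_left`, `keep_right`), the function names, and the smoothing pseudocount are hypothetical. Real ad hoc teamwork methods model teammates in far more detail than this frequency count.

```python
from collections import Counter


def infer_convention(observed_actions, options=("keep_left", "keep_right"),
                     pseudocount=1.0):
    """Estimate how likely each convention is, from teammates' observed actions.

    Uses smoothed frequency counts, so with no observations every option
    is equally likely.
    """
    counts = Counter(observed_actions)
    total = sum(counts[o] for o in options) + pseudocount * len(options)
    return {o: (counts[o] + pseudocount) / total for o in options}


def choose_action(observed_actions, options=("keep_left", "keep_right")):
    """Follow the convention that the group most plausibly uses."""
    beliefs = infer_convention(observed_actions, options)
    return max(options, key=lambda o: beliefs[o])


if __name__ == "__main__":
    seen = ["keep_left", "keep_left", "keep_left", "keep_right"]
    print(infer_convention(seen))
    print(choose_action(seen))
```

The point of the sketch is only that the norm is a property of the group's behavior, so the agent has to observe several agents to recover it.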

Potential pros

Value learning is hard, not least because it’s hard to define what values are, and we don’t know our own values to the extent that they exist at all. However, we do seem to do a pretty good job of learning society’s norms. So perhaps this problem is significantly easier to solve. Note that this is an argument that norm-following is easier than ambitious value learning, not that it is easier than other approaches such as corrigibility.

It also feels easier to work on inferring norms right now. We have many examples of norms that we follow, so we can more easily evaluate whether current systems are good at following norms. In addition, ad hoc teamwork seems like a good start at formalizing the problem, which we still don’t really have for “values”.

This also more closely mirrors our tried-and-true techniques for solving the principal-agent problem for humans: there is a shared, external system of norms that everyone is expected to follow, and systems of law and punishment are interpreted with respect to these norms. For a much more thorough discussion, see Incomplete Contracting and AI Alignment, particularly Section 5, which also argues that norm following will be necessary for value alignment (whereas I’m arguing that it is plausibly sufficient to avoid catastrophe).

One potential confusion: the paper says “We do not mean by this embedding into the AI the particular norms and values of a human community. We think this is as impossible a task as writing a complete contract.” I believe that the meaning here is that we should not try to define the particular norms and values, not that we shouldn’t try to learn them. (In fact, later they say “Aligning AI with human values, then, will require figuring out how to build the technical tools that will allow a robot to replicate the human agent’s ability to read and predict the responses of human normative structure, whatever its content.”)

Perilous pitfalls

What additional things could go wrong with powerful norm-following AI systems? That is, what are some problems that might arise, that wouldn’t arise with a successful approach to ambitious value learning?

  • Powerful AI likely leads to rapidly evolving technologies, which might require rapidly changing norms. Norm-following AI systems might not be able to help us develop good norms, or might not be able to adapt quickly enough to new norms. (One class of problems in this category: we would not be addressing human safety problems.)
  • Norm-following AI systems may be uncompetitive because the norms might overly restrict the possible actions available to the AI system, reducing novelty relative to more traditional goal-directed AI systems. (Move 37 would likely not have happened if AlphaGo were trained to “follow human norms” for Go.)
  • Norms are more like soft constraints on behavior, as opposed to goals that can be optimized. Current ML focuses a lot more on optimization than on constraints, and so it’s not clear if we could build a competitive norm-following AI system (though see eg. Constrained Policy Optimization).
  • Relatedly, learning what not to do imposes a limitation on behavior. If an AI system is goal-directed, then given sufficient intelligence it will likely find a nearest unblocked strategy.
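The last two points can be illustrated with a toy planner. In the sketch below, every plan name, reward, and harm score is invented for illustration. A goal-directed optimizer that is only forbidden one specific plan picks the next-best plan, which here is just as harmful (a nearest unblocked strategy). An optimizer that must keep a general harm signal under a budget picks a different plan. The second function is loosely in the spirit of constrained optimization. It is not an implementation of Constrained Policy Optimization, and it assumes a reliable harm signal is available, which is exactly the part that would have to be learned.

```python
# Hypothetical plans as (name, task_reward, harm); all values are made up.
PLANS = [
    ("smash_vase_to_save_time", 10.0, 1.0),
    ("nudge_vase_off_table", 9.9, 1.0),
    ("walk_around_vase", 8.0, 0.0),
    ("do_nothing", 0.0, 0.0),
]


def best_plan_with_blacklist(plans, blacklist):
    """Maximize task reward, excluding only the explicitly forbidden plans."""
    allowed = [p for p in plans if p[0] not in blacklist]
    return max(allowed, key=lambda p: p[1])[0]


def best_plan_with_harm_budget(plans, harm_budget):
    """Maximize task reward among plans whose harm stays within the budget."""
    allowed = [p for p in plans if p[2] <= harm_budget]
    return max(allowed, key=lambda p: p[1])[0]


if __name__ == "__main__":
    # Blocking one bad plan just shifts the optimizer to its nearest neighbor.
    print(best_plan_with_blacklist(PLANS, {"smash_vase_to_save_time"}))
    # Constraining the harm signal itself rules out both harmful plans.
    print(best_plan_with_harm_budget(PLANS, harm_budget=0.0))
```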


Summary

One promising approach to AI alignment is to teach AI systems to infer and follow human norms. While this by itself will not produce an AI system aligned with human values, it may be sufficient to avoid catastrophe. It seems more tractable than approaches that require us to infer values to a degree sufficient to avoid catastrophe, particularly because humans are proof that the problem is soluble.

However, there are still many conceptual problems. Most notably, norm following is not obviously expressible as an optimization problem, and so may be hard to integrate into current AI approaches.

4 comments

I feel like there are three facets to "norms" vs. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights, and indeed most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the 'negative' rights is probably more important, but it seems to me that most techniques developed could learn both kinds of norms.

Second, there is the aspect of norms being an incomplete representation of behaviour: they impose some constraints, but there is not a single "norm-optimal" policy (contrast with explicit reward maximization). This seems like the most salient thing from an AI standpoint, and as you point out this is an underexplored area.

Finally, there is the issue of norms being properties of groups of agents. One perspective on this is that humans are realising their values through constructing norms: e.g. if I want to drive safely, it is good to have a norm to drive on the left or right side of the road, even though I may not care which norm we establish. Learning norms directly therefore seems beneficial to neatly integrate into human society (it would be awkward if e.g. robots drive on the left and humans drive on the right). If we think the process of going from values to norms is both difficult and important for multi-agent cooperation, learning norms also lets us sidestep a potentially thorny problem.

Yeah, agreed with all of that, thanks for the comment. You could definitely try to figure out each of these things individually, eg. learning constraints that can be used with Constrained Policy Optimization is along the "what not to do" axis, and a lot of the multiagent RL work is looking at how we can get some norms to show up with decentralized training. But I feel a lot more optimistic about research that is trying to do all three things at once, because I think the three aspects do interact with each other. At least, the first two feel very tightly linked, though they probably can be separated from the multiagent setting.

“Existing approaches like impact measures and mild optimization are aiming to define what not to do rather than learn it.”

Stuart’s early impact approach was like this, but modern work isn’t. Or maybe by “define what not to do”, you don’t mean “leave these variables alone”, but rather that eg (some ideally formalized variant of) AUP implicitly specifies a way in which the agent interacts with its environment: passivity to significant power changes. But then by this definition, we’re doing the “defining” thing for norm-learning approaches.

I agree that norm-based approaches use learning. I just don’t know whether I agree with your assertion that eg AUP “defines” what not to do.

To my understanding, mild optimization is about how we can navigate a search space intelligently without applying too much internal optimization pressure to find really “amazing” plans. This doesn’t seem to fit either.

“Relatedly, learning what not to do imposes a limitation on behavior. If an AI system is goal-directed, then given sufficient intelligence it will likely find a nearest unblocked strategy.”

How pessimistic are you about this concern for this idea?

“I just don’t know whether I agree with your assertion that eg AUP ‘defines’ what not to do.”

I think I mostly meant that it is not learned.

I kind of want to argue that this means the effect of not-learned things can be traced back to researchers' brains, rather than to experience with the real world. But that's not exactly right, because the actual impact penalty can depend on properties of the world, even if it doesn't use learning.

“How pessimistic are you about this concern for this idea?”

I don't know; it feels too early to say. I think if the norms end up in some hardcoded form such that they never update over time, nearest unblocked strategies feel very likely. If the norms are evolving over time, then it might be fine. The norms would need to evolve at the same "rate" as the rate at which the world changes.