This is a very rough intuition pump for possible alternatives to value learning.

In broad strokes, the goal of (ambitious) value learning is to define and implement a notion of cooperation (or helpfulness) in terms of two activities: (1) figuring out what humans value, (2) working to optimize that.

I'm going to try to sketch an alternative notion of cooperation/helpfulness. This intuition is based on libertarian or anarcho-capitalist ideas, but in some ways seems closer to what humans do when they try to help.


I was talking to Andrew Forrester about the suffering golem thought experiment. I'm not sure who came up with this thought experiment, but the idea is:

Suffering Golem: A golem suffers terribly in every moment of its existence, but it says it wants to keep living. Do you kill it?

The idea is that if you think it is good to kill it, you're a hedonic utilitarian: your altruistic motives have to do with maximizing pleasure / minimizing suffering. If you think it should not be killed, then you're more likely to be a preference utilitarian: your altruistic motives have to do with what the person values for themselves, rather than some other thing you think would be good for them. (I tend to lean toward preference utilitarianism myself, but don't think the question is obvious.)

Andrew Forrester was against killing it, but justified his answer with something like "it's none of your business. You could provide convenient means of suicide if you wanted..." This wasn't an expression of not caring about the welfare of the golem. Rather, it was a way of saying that you want to preserve the golem's autonomy.

I realized that although his answer was consistent with preference utilitarianism on the surface, it went beyond. I think he would likely have a similar response to the following thought experiment:

Confused Golem: A golem hates every moment of its existence, and would prefer to die, but it is unable to admit the fact to itself. It thinks that it loves life and wants to continue living. Perhaps it could eventually realize that it preferred not to exist if it thought about the question for long enough, but that day is a long way off. Do you kill it?

The autonomy-preserving move is to not kill the confused golem. You might talk to the golem about what it wants, but you wouldn't actively optimize for convincing it that it actually wants to die. (That would subtract from its autonomy.)

Informed Consent?

If I imagine something which is only motivated to help, where "help" is interpreted in the autonomy-centric way, it seems like the idea is an entity which only acts on your informed consent. It will sit and do nothing until it is confident that there is something you want it to do, and that you understand the consequences of its doing so.

Imagine you buy a robot which runs on these rules. The robot sits patiently in your house. Sitting there is not a violation of its directive, because it did not place itself there; whatever the consequences may be, they are a result of your autonomous action. However, it does watch you, which may violate informed consent. It has a high prior probability that you understand it is watching and consent to this, because the packaging had prominent warnings about this. Watching you is necessary for the robot to function, since it must infer your consent. It may shut off if it infers that you do not understand this.

The robot will continue doing nothing until it has gained confidence that you have fully read the instruction booklet, which contains the basic facts about how the robot functions. You may then issue commands to the robot. The instruction booklet recommends that, before you try this, you say "I consent to discuss the meaning of my commands with you, the robot." This clears the robot to ask clarifying questions about commands and to tell you about the likely consequences of commands if it does not think you understand them. Unless you give this consent, the robot will often fail to act on a command and will offer no explanation for the failure.

Another recommended command is "Let's discuss how we can work together." This clears the robot to make general inquiries about how it should behave toward you. Once you issue this command, for the duration of the conversation, the robot will formulate intelligent questions about what you might want, what you like and dislike, where you struggle in your life, and so on. It will make suggestions about how it can help, many invented on the spot. At some point during the conversation it will likely ask if it should maintain this level of candor going forward, or if it should only discuss its tasks in such an open-ended way upon request.

Gentle Value Extraction

What the robot will absolutely not do during this initial interview is pry into your personal life with questions optimized to extract the maximum useful information about your values and life difficulties. Although that might be the most useful thing it could do during its initial interview with you, it would break your autonomy, because many humans are uncomfortable discussing certain topics, and breaking these norms is not a reasonable consequence to expect from the command you've issued. Since humans may not even wish to consent to the robot knowing various personal details (and may accidentally reveal enough information for the robot to figure things out), the robot has to tread lightly in its inferences, too. Even asking directly whether a certain topic is OK may be an unwanted and unexpected act, making it impossible to go there unless the human brings it up on their own initiative.

The robot might not even try to gently move the discussion in the direction of greater openness about private details, because "trying to get the human to open up more" is not an obvious consequence of discussing potential tasks. But the right answer here isn't clear-cut; maybe trying to get people to open up is normal enough for a conversation that this is fine. The instruction booklet could warn users about it, making it an expected consequence and therefore part of what is consented to.

Explicit Consent vs Inferred Consent

At this point, someone might be thinking "Why are you talking about the robot inferring that the human consents to certain things as reasonable expectations of giving certain commands? Why give so much leeway? We could just require explicit consent instead."

Explicit consent is so impractical as to border on meaninglessness. We want the robot to have some autonomy in how it executes commands. If it knows we like cream in our coffee, it makes sense for it to just put the cream in, without asking every time and without our having to issue a general rule. Cream in the coffee is a reasonable expectation. The way I think about it, an explicit consent requirement would force us to approve every motor command precisely; the freedom to intelligently carry out complex tasks in response to commands requires a certain amount of freedom to infer consent.

Another way of thinking about the problem is that explicit consent places dictionary-definition English in too much of a special position. We can convey our meaning in any number of ways. In a sufficiently information-rich context, a glance might be sufficient.

Turning things the other way around, there are also cases when explicit consent doesn't make for inferred consent. If someone is made to consent under duress, consent should not be inferred.

The biggest argument I see in favor of explicit consent is that it makes for a much lower risk of misunderstanding. Misunderstanding is certainly a serious concern, and one reason why humans often require explicit consent in high-stakes situations. However, in the context of consent-based robotics, there are likely better ways of addressing the concern:

  • Requiring higher confidence in inferred consent. This might be modulated by the inferred importance of the question in a situation, so that explicit consent is required in practice for anything of importance, due to the high confidence it establishes. Measuring "importance" in this way creates its own potential safety concerns, of course.
  • Using highly robust machine-learning techniques, so that spuriously high confidence in inferred consent is very unlikely.
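
The first option above can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than part of the proposal: the `importance` score in [0, 1], the linear interpolation between a base and a ceiling threshold, the specific numbers, and the idea that an explicit request yields a fixed confidence (`explicit_confidence`). It also assumes the robot has already been cleared to ask questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    description: str
    importance: float          # hypothetical inferred importance, in [0, 1]
    consent_confidence: float  # robot's estimated probability of informed consent


def required_confidence(importance: float,
                        base: float = 0.9,
                        ceiling: float = 0.999) -> float:
    """Confidence needed before acting; rises linearly with importance (assumed form)."""
    clamped = min(max(importance, 0.0), 1.0)
    return base + (ceiling - base) * clamped


def decide(action: ProposedAction, explicit_confidence: float = 0.995) -> str:
    """Return 'act', 'ask', or 'refrain' for a proposed action."""
    threshold = required_confidence(action.importance)
    if action.consent_confidence >= threshold:
        return "act"      # inferred consent already clears the bar
    if explicit_confidence >= threshold:
        return "ask"      # explicit consent would be enough to clear it
    return "refrain"      # even explicit consent would not suffice


cream = ProposedAction("add cream to coffee", importance=0.0, consent_confidence=0.95)
purchase = ProposedAction("buy a new appliance", importance=0.5, consent_confidence=0.93)
drastic = ProposedAction("sell the house", importance=1.0, consent_confidence=0.95)

print(decide(cream), decide(purchase), decide(drastic))
```

In this toy version, low-stakes actions go through on inferred consent, mid-stakes actions fall back to asking, and the highest-stakes actions are blocked outright. The safety concern noted above still applies: the whole scheme is only as trustworthy as the `importance` estimate.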

What Does Informed Consent Mean?

There's a conceptual problem here which feels very similar to impact measures. An impact measure is supposed to quantify the extent to which an action changes things in general. Informed consent seems to require that we quantify the degree to which a change fits within certain expectations. The notion of "change" seems to be common between the two.

For example, at least according to my intuition, an impact measure should not penalize an AI much for the butterfly-effect changes in weather patterns which any action implies. The future will include hurricanes destroying cities in a broad variety of circumstances, and small actions may create large changes in which hurricanes / which cities. However, if a particular action foreseeably changes the overall impact of the hurricane/city pattern on other important variables in a significant way, then an impact measure should penalize it.

Similarly, a human can have informed consent as to the consequences of a robot going to the grocery store and buying bananas without understanding all the consequences on future weather patterns, even though this will involve some large changes to which hurricanes destroy which cities at some point later. On the other hand, if the robot walks to the grocery store in just the right way so as to cause a series of severe hurricanes to tip the right dominoes to cause a severe economic collapse which would otherwise not have happened, then this is a significant unexpected consequence of going to the grocery store which the human would need to consent to separately.

Human Rationality Assumptions

The bad news is that this approach seems likely to run into essentially all the same conceptual difficulties as value learning, if not more. Although the conceptual framework is not as strongly tied to VNM-style utility functions as value learning is, the robot still needs to infer what the human believes and wants: belief for the "informed" part, and want for the "consent" part. This still sounds like it is most naturally formulated in a VNM-ish framework, although there may be other ways.

As such, it doesn't seem like it helps any with the difficulties of assuming human rationality.

Helping Animals

My friend mentioned that the suffering golem scenario depends a great deal on whether the golem is sentient. Mercy-killing suffering animals is OK, even good, without any consent. More generally, there are lots of acceptable ways of helping animals which break their autonomy in significant ways, such as taking them to the vet against protests. One might say the same thing of children.

It isn't obvious what makes the difference, but one idea might be: where there is no capacity for informed consent, other principles may apply. But what would this imply for humans? There may be consequences of actions which we lack the capacity to understand. Should the robot simply try to optimize for our preferences on those issues, without constraining acceptable consequences by consent?

How should an autonomy-respecting robot interact with children? Respecting human autonomy absolutely might make it impossible to help with certain household tasks like changing diapers. If so, the approach might not result in very capable agents.

Respecting All Humans

So far, I've focused on a thought experiment of a robot respecting the autonomy of a single designated user. Ultimately, it seems like an approach to alignment has to deal with all humans. However, getting "consent" from all humans seems impossible. How can a consent-based approach approve any actions, then?

One idea is to only require consent from humans who are impacted by an action. Any action which impacts the whole future would require consent from everyone (?), but low-impact actions could be carried out with only consent from those involved.

It's not clear to me how to approach this question.
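
To make the "only those impacted" idea slightly more concrete, here is a minimal sketch. The function names, the representation of people as strings, and the `affects_whole_future` flag are all hypothetical, and the rule that whole-future actions need everyone's consent is only the tentative reading given above. Deciding who counts as "impacted" is the hard part and is simply taken as an input here.

```python
from typing import AbstractSet


def required_consenters(impacted: AbstractSet[str],
                        everyone: AbstractSet[str],
                        affects_whole_future: bool) -> AbstractSet[str]:
    """Who must consent: everyone for whole-future actions, else only those impacted."""
    return everyone if affects_whole_future else impacted


def may_act(consenting: AbstractSet[str],
            impacted: AbstractSet[str],
            everyone: AbstractSet[str],
            affects_whole_future: bool = False) -> bool:
    """True only if every required person is among those who consented."""
    needed = required_consenters(impacted, everyone, affects_whole_future)
    return needed <= consenting  # subset check


everyone = {"alice", "bob", "carol"}

# Low-impact action touching only Alice: her consent suffices.
print(may_act({"alice"}, {"alice"}, everyone))

# Whole-future action: blocked unless literally everyone consents.
print(may_act({"alice", "bob"}, {"alice"}, everyone, affects_whole_future=True))
```

The second call shows why this rule alone is so restrictive: a single missing consenter vetoes any high-impact action.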

Connections to Other Problems

  • As I mentioned, this seems to connect to impact measures.
  • The agent as described also may be a mild optimizer, because (1) it has to avoid thinking about things when those things are not understood consequences of carrying out commands, (2) plans are constrained by the human probability distribution over plans, somehow (I'm not sure how it works, but there's definitely an aspect of "unexpected plans are not allowed" in play here).
  • There is a connection to transparency, in that impacts of actions have to be described/understood (and approved) in order to be allowed.
  • The agent as I've described it sounds potentially corrigible, in that resistance to shutdown or modification would have to be an understood and approved consequence of a command.

Autonomy is a value and can be expressed as a part of a utility function, I think. So ambitious value learning should be able to capture it, so an aligned AI based on ambitious value learning would respect someone's autonomy when they value it themselves. If they don't, why impose it upon them?

(This assumes that we manage to solve the general problems with ambitious value learning. Is the point here that you expect we can't solve those problems and therefore need an alternative? The idea doesn't help with "the difficulties of assuming human rationality" though so what problems does it help with?)

ETA: Is the idea that even trying to do ambitious value learning constitutes violating someone's autonomy (in other words someone could have a preference against having ambitious value learning done on them) and by the time we learn this it would be too late?

Autonomy is a value and can be expressed as a part of a utility function, I think. So ambitious value learning should be able to capture it, so an aligned AI based on ambitious value learning would respect someone's autonomy when they value it themselves. If they don't, why impose it upon them?

One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn't want that, why impose it?

Corrigibility makes sense as something to ensure in its own right because it is good to have in case the value learning is not doing what it should (or something else is going wrong).

I think respect for autonomy is similarly useful. It helps avoid evil-genie (perverse instantiation) type failures by requiring that we understand what we are asking the AI to do. It helps avoid preference-manipulation problems which value learning approaches might otherwise have, because regardless of how well expected-human-value is optimized by manipulating human preferences, such manipulation usually involves fooling the human, which violates autonomy.

(In cases where humans understand the implications of value manipulation and consent to it, it's much less concerning -- though we still want to make sure the AI isn't prone to pressure humans into that, and think carefully about whether it is really OK.)

Is the point here that you expect we can't solve those problems and therefore need an alternative? The idea doesn't help with "the difficulties of assuming human rationality" though so what problems does it help with?

It's less an alternative in terms of avoiding the things which make value learning hard, and more an alternative in terms of providing a different way to apply the same underlying insights, to make something which is less of a ruthless maximizer at the end.

In other words, it doesn't avoid the central problems of ambitious value learning (such as "what does it mean for irrational beings to have values?"), but it is a different way to try to put those insights together into a safe system. You might add other safety precautions to an ambitious value learner, such as [ambitious value learning + corrigibility + mild optimization + low impact + transparency]. Consent-based systems could be an alternative to that agglomerated approach, either replacing some of the safety measures or making them less difficult to include by providing a different foundation to build on.

Is the idea that even trying to do ambitious value learning constitutes violating someone's autonomy (in other words someone could have a preference against having ambitious value learning done on them) and by the time we learn this it would be too late?

I think there are a couple of ways in which this is true.

  • I mentioned cases where a value-learner might violate privacy in ways humans wouldn't want, because the overall result is positive in terms of the extent to which the AI can optimize human values. This is somewhat bad, but it isn't X-risk bad. It's not my real concern. I pointed it out because I think it is part of the bigger picture; it provides a good example of the kind of optimization a value-learner is likely to engage in, which we don't really want.
  • I think the consent/autonomy idea actually gets close (though maybe not close enough) to something fundamental about safety concerns which follow an "unexpected result of optimizing something reasonable-looking" pattern. As such, it may be better to make it an explicit design feature, rather than trust the system to realize that it should be careful about maintaining human autonomy before it does anything dangerous.
  • It seems plausible that, interacting with humans over time, a system which respects autonomy at a basic level would converge to different overall behavior than a value-learning system which trades autonomy off with other values. If you actually get ambitious value learning really right, this is just bad. But, I don't endorse your "why impose it on them?" argument. Humans could eventually decide to run all-out value-learning optimization (without mild optimization, without low-impact constraints, without hard-coded corrigibility). Preserving human autonomy in the meantime seems

One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn’t want that, why impose it?

There's a disanalogy in that autonomy is probably a terminal value whereas corrigibility is only an instrumental one. In other words, I don't want a corrigible AI for the sake of having a corrigible AI, I want one so it will help me reach my other goals. I do (probably) want autonomy, and not only because it would help me reach other goals. So in fact ambitious value learning will not learn to behave corrigibly, I think, because the AI will probably think it has a better way of giving me what I ultimately want.

Oh, I think I see a different way of stating your argument that avoids this disanalogy: we're not concerned about autonomy as a terminal value here, but as an instrumental one like corrigibility. If ambitious value learning works perfectly, then it would learn autonomy as a terminal value, but we want to implement autonomy-respecting AI mainly because that would help us get what we want in case ambitious value learning fails to work perfectly.

I think I understand the basic idea and motivation now, and I'll just point out that autonomy-respecting AI seems to share several problems with other non-goal-directed approaches to AI safety.

This seems like an interesting idea for how to build an AI system in practice, along the same lines as corrigibility. We notice that value learning is not very robust: if you aren't very good at value learning, then you can get very bad behavior, and human values are sufficiently complex that you do need to be very capable in order to be sufficiently good at value learning. With (a particular kind of) corrigibility, we instead set the goal to be to make an AI system that is trying to help us, which seems more achievable even when the AI system is not very capable. Similarly, if we formalize or learn informed consent reasonably well (which seems easier to do since it is not as complex as "human values"), then our AI systems will likely have good behavior (though they will probably not have the best possible behavior, since they are limited by having to respect informed consent).

However, this also feels different from corrigibility, in that it feels more like a limitation put on the AI system, while corrigibility seems more like a property of the AI's "motivational system". This might be fine, since the AI might just not be goal-directed. One other benefit of corrigibility is that if you are "somewhat" corrigible, then you would like to become more corrigible, since that is what the human would prefer; informed-consent-AI doesn't seem to have an analogous benefit.