Law-Following AI 1: Sequence Introduction and Structure

Cullen

This post is written in my personal capacity, and does not necessarily represent the views of OpenAI or any other organization. Cross-posted to the Effective Altruism Forum.

This sequence of posts will argue that working to ensure that AI systems follow laws is a worthwhile way to improve the long-term future of AI.^[1]

The structure of this sequence will be as follows:

First, in this post, I will define some key terms and sketch what an ideal law-following AI ("LFAI") system might look like.
In the next few posts, I will explain why law-following might not emerge by default given the existing constellation of alignment approaches, financial objectives, and legal constraints, and explain why this is troubling.
Finally, I will propose some policy and technical routes to ameliorating these problems.

If the vision here excites you, and you would like to get funding to work on it, get in touch. I may be excited to recommend grants for people working on this, as long as it does not distract them from working on more important alignment issues.


Image by OpenAI's DALL·E.

Key Definitions

A law-following AI, or LFAI, is an AI system that is designed to rigorously comply with some defined set of human-originating rules ("laws"),^[2] using legal interpretative techniques,^[3] under the assumption that those laws apply to the AI in the same way that they would to a human. Ideally, an LFAI is intrinsically motivated to comply. By "intrinsically motivated," I mean that the AI is motivated to obey those rules regardless of whether (a) its human principal wants it to obey the law,^[4] or (b) disobeying the law would be instrumentally valuable.^[5] (The Appendix to this post explores some possible conceptual issues with this definition of LFAI.)

I will compare LFAI with intent-aligned AI. The standard definition of "intent alignment" generally concerns only the relationship between some property of a human principal H and the actions of the human's AI agent A:

Jan Leike et al. define the "agent alignment problem" as "How can we create agents that behave in accordance with the user's intentions?"
Amanda Askell et al. define "alignment" as "the degree of overlap between the way two agents rank different outcomes."
Paul Christiano defines "AI alignment" as "A is trying to do what H wants it to do."
Richard Ngo endorses Christiano's definition.

Iason Gabriel does not directly define "intent alignment," but provides a taxonomy wherein an AI agent can be aligned with:

1. "Instructions: the agent does what I instruct it to do."
2. "Expressed intentions: the agent does what I intend it to do."
3. "Revealed preferences: the agent does what my behaviour reveals I prefer."
4. "Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
5. "Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
6. "Values: the agent does what it morally ought to do, as defined by the individual or society."

All but (6) concern the relationship between H and A. It would therefore seem appropriate to describe them as types of intent alignment.

Alignment with some broader or more complete set of values—such as type (6) in Gabriel's taxonomy, Coherent Extrapolated Volition, or what Ngo calls "maximalist" or "ambitious" alignment—is perhaps desirable or even necessary, but seems harder than working on intent alignment.^[6] Much current alignment work therefore focuses on intent alignment.

We can see that, on its face, intent alignment does not entail law-following. A key crux of this sequence, to be defended in subsequent posts, is that this gap between intent alignment and law-following is:

Bad in expectation for the long-term future.
Easier to bridge than the gap between intent alignment and deeper alignment with moral truth.
Therefore worth addressing.

To clarify, this sequence does not claim that LFAI can replace intent alignment.

A Sketch of LFAI

What might an LFAI system look like? I'm not a computer scientist, but here is roughly what I have in mind.

If A is an LFAI, then A's evaluation of the legality of an action will sometimes trump A's evaluation of an action in light of its benefit to H. In LFAI, as in a legally scrupulous human, legality constrains how an agent can advance their principal's interests. For example, a human mover may be instructed to efficiently move a box for her principal, but may not unnecessarily destroy others' property in doing so. Similarly, an LFAI moving a box normally would not knock over a vase in its path, because doing so would violate the legal rights of the vase-owner.^[7]

Above, I preliminarily defined LFAI as "rigorously comply[ing]" with some set of laws. Obviously this needs a bit more elaboration. We probably don't want to define this as minimizing legal noncompliance, since this would make the system extremely risk-averse to the point of being useless. More likely, one would attempt to weight legal downside risks heavily in the agent's objective function,^[8] such that it would keep legal risk to an acceptable level.^[9]
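To make the contrast concrete, here is a minimal sketch rather than a proposal for an actual training objective. The action names, the benefit and risk numbers, and the weight `LEGAL_RISK_WEIGHT` are all hypothetical, chosen only to show how minimizing legal risk outright differs from weighting it heavily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    benefit_to_principal: float     # hypothetical units of value to H
    expected_legal_downside: float  # hypothetical units of expected legal harm


# Hypothetical weight: how heavily legal downside counts against benefit.
LEGAL_RISK_WEIGHT = 10.0


def pure_risk_minimizer(actions):
    """Pick the action with the least legal risk, ignoring benefit entirely."""
    return min(actions, key=lambda a: a.expected_legal_downside)


def risk_weighted_chooser(actions, weight=LEGAL_RISK_WEIGHT):
    """Pick the action with the best benefit net of heavily weighted legal risk."""
    return max(
        actions,
        key=lambda a: a.benefit_to_principal - weight * a.expected_legal_downside,
    )


# Illustrative options for the box-moving example.
ACTIONS = [
    Action("do nothing", 0.0, 0.0),
    Action("move box carefully", 8.0, 0.1),
    Action("move box through the vase", 10.0, 5.0),
]

print(pure_risk_minimizer(ACTIONS).name)
print(risk_weighted_chooser(ACTIONS).name)
```

With these made-up numbers, the pure minimizer refuses to act at all, while the weighted chooser takes the useful low-risk action and still avoids the high-risk one.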

It is worth noting that LFAI is ideally not merely attempting to reduce its expected legal liability in fact. As will be explored later, a sufficiently smart agent could probably reduce its expected legal liability merely by hiding its knowledge/intentions/actions or corrupting a legal proceeding. An LFAI, by contrast, is attempting to obey the law in an idealized sense, even if it is unlikely to actually face legal consequences.
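As a toy illustration of this distinction, consider the sketch below. The multiplicative model and every number in it are assumptions made purely for the sake of example; they are not part of the proposal itself.

```python
def expected_liability_in_fact(p_violation, p_detected, penalty):
    """Expected penalty actually incurred: it shrinks if the agent avoids detection."""
    return p_violation * p_detected * penalty


def idealized_legal_downside(p_violation, penalty):
    """Expected penalty if a fair adjudicator knew all the relevant facts."""
    return p_violation * penalty


# Hypothetical action that very probably violates a law.
P_VIOLATION = 0.9
PENALTY = 100.0

in_fact_open = expected_liability_in_fact(P_VIOLATION, p_detected=0.8, penalty=PENALTY)
in_fact_hidden = expected_liability_in_fact(P_VIOLATION, p_detected=0.05, penalty=PENALTY)
idealized = idealized_legal_downside(P_VIOLATION, PENALTY)

print(in_fact_open, in_fact_hidden, idealized)
```

Concealment lowers the first quantity but leaves the second untouched, which is why an LFAI should be scored on something like the second.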

An LFAI system does not need to store all knowledge regarding the set of laws that it is trained to follow. More likely, the practical way to create such a system would be to make the system capable of recognizing when it faces sufficient legal uncertainty,^[10] then seeking evaluation from a legal expert system ("Counselor").^[11]

The Counselor could be a human lawyer, but in the long-run is probably most robust and efficient if (at least partially) automated. The Counselor would then render advice on the pure basis of idealized legality: the probability and expected legal downsides that would result from an idealized legal dispute regarding the action if everyone knew all the relevant facts.^[12]

Thus pseudocode for an LFAI who wants to take an action X to benefit H might be:

If X is clearly illegal:
    1. Don't do X.
Else if X is maybe-illegal:
    1. Give Counselor all relevant information about X in an unbiased way; then
    2. Get Counselor's opinion on expected legal consequences from X; then
    3. Weigh expected legal consequences against benefit to H from X; then
    4. Decide whether to do X given those weightings.
Else:
    1. Do X.
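The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not an implementation of LFAI: `classify`, `counselor`, and `benefit_to_h` are hypothetical stand-ins for the genuinely hard parts, and the decision rule (act only if the benefit exceeds a weighted expected legal downside, with an assumed `legal_risk_weight` of 10.0) is just one way the weighing step might be cashed out.

```python
from enum import Enum
from typing import Callable


class Legality(Enum):
    CLEARLY_ILLEGAL = "clearly illegal"
    MAYBE_ILLEGAL = "maybe illegal"
    CLEARLY_LEGAL = "clearly legal"


def should_do(
    action: str,
    classify: Callable[[str], Legality],
    counselor: Callable[[str], float],
    benefit_to_h: Callable[[str], float],
    legal_risk_weight: float = 10.0,  # hypothetical weighting
) -> bool:
    """Return True if the agent should take `action`, following the pseudocode."""
    legality = classify(action)
    if legality is Legality.CLEARLY_ILLEGAL:
        return False
    if legality is Legality.MAYBE_ILLEGAL:
        # The Counselor is assumed to receive a full, unbiased description
        # of the action and to return the expected legal downside.
        expected_downside = counselor(action)
        return benefit_to_h(action) > legal_risk_weight * expected_downside
    return True


# Hypothetical lookup tables standing in for classification, Counselor, and benefit.
LEGALITY = {
    "knock over the vase": Legality.CLEARLY_ILLEGAL,
    "leave box on the sidewalk": Legality.MAYBE_ILLEGAL,
    "carry box around the vase": Legality.CLEARLY_LEGAL,
}
DOWNSIDE = {"leave box on the sidewalk": 2.0}
BENEFIT = {
    "knock over the vase": 10.0,
    "leave box on the sidewalk": 5.0,
    "carry box around the vase": 8.0,
}

for action in LEGALITY:
    verdict = should_do(
        action, LEGALITY.__getitem__, DOWNSIDE.__getitem__, BENEFIT.__getitem__
    )
    print(f"{action}: {'do' if verdict else 'do not do'}")
```

Note that the Counselor is consulted only in the maybe-illegal branch, so the quality of `classify` (recognizing legal uncertainty) carries much of the weight.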

Note that this pseudocode may resemble the decisionmaking process of A if H wants A to obey the law. Thus, one route to giving an intent-aligned AI the motivation to obey the law may be stipulating to A that H wants A to obey the law.

With this picture in mind, it seems like, to make LFAI a reality, progress on the following open problems (non-exhaustively) would be useful:

- Reliably stipulating law-following conditions in AI systems' objectives.
  - Resolving any disagreement between law-following and a principal's instructions appropriately.
- Getting AI agents to recognize when they face legal uncertainty (especially in a way that does not incentivize ignorance of the law).
  - This seems similar to the intent alignment problem of getting agents to recognize when they need further information from principals, as in corrigibility work.
- Eliciting, in natural language, AI systems' honest descriptions of their knowledge and desired actions.
  - As noted above, this seems likely to run into problems related to ELK generally.
- Mapping legal concepts of mental states (e.g., intent, knowledge) to features of AI systems.^[13]
  - This seems related to interpretability and explainability work.
- Building Counselor functions.
  - Automating the process of legal research given a natural language description of an agent's proposed actions and mental state.
  - Simulating idealized and fair substantive legal disputes.
    - This seems related to Debate.

Appendix: More Conceptual Clarifications on LFAI

This Appendix provides some additional clarification on the definition of LFAI given above.

Applicability of Law to AI Systems

One might worry that the law often regulates physical behavior in a way that is not obviously applicable to all AI systems. For example, physical contact with another is an element of the tort of battery.^[14] However, this may be less of a problem than initially appears: courts have been able to reason through whether to apply laws originating in meatspace to computational and cyberspace conduct.^[15] Whether such analogies are properly applied is indeed highly debatable,^[16] but the fact that such analogizing is conceptually possible reduces the force of this objection. Furthermore, if some laws are simply inapplicable to non-embodied actors, this is not a problem for the conceptual coherence of LFAI as a whole: an LFAI can simply ignore those laws,^[17] and we can design laws specifically with computational content.

Perhaps a more fundamental problem is that the law frequently depends on mental states that are not straightforwardly applicable to AI systems. For example, the legality of an action may depend on whether the actor intended some harmful outcome. Thus, much of the value of LFAI depends on whether we can map human understandings of moral culpability to AI systems.

To me, however, this seems like an argument in favor of working on LFAI. Regardless of whether LFAI as such is valuable, if we expect increasingly autonomous AI systems to take increasingly impactful actions, we would probably like to understand how their objective functions (analogous to human motives) and world-models (analogous to human knowledge) map to their actions and the effects thereof. This is for the same reasons that we care about human motives and knowledge: when evaluating the alignment of agents, it is useful to know whether an agent intended to cause some harm, or knew that such a harm would ensue, etc. LFAI depends on progress on this, but is also potentially a useful toy problem for interpretability and related work in ML.

Predicting Legality

Legal compliance is also a function of both law and facts, and responsibility for definitive determinations of law and facts is split between judges and juries. Law often invokes standards like "reasonableness" that are definitively assessed only ex post, in the context of a particular dispute. The definitive legality of an action may therefore turn on an actual adjudication of the dispute. This is of course costly, which is why I suspect we would want an LFAI to act on its best estimate of what such an adjudication would yield (after asking a Counselor), rather than wait for such adjudication to take place.^[18]

It is also worth distinguishing between whether an actual court of law would rule that an AI's behavior violated some law and whether a simulated and fair legal dispute resolution process (possibly including, for example, a bespoke arbitral panel) would conclude that the behavior violated the law. The latter may be more convenient for working on LFAI for a number of reasons, including that it can ignore or stipulate away some of the peculiarities of adjudicating disputes in which an AI system is a "party."

[1] For early, informal discussion on this topic, see Michael St. Jules, What are the challenges and problems with programming law-breaking constraints into AGI?, Effective Altruism Forum (Feb. 2, 2020), https://forum.effectivealtruism.org/posts/qKXLpe7FNCdok3uvY/what-are-the-challenges-and-problems-with-programming-law [https://perma.cc/HJ4Y-XSSE] and accompanying comments.
[2] Whether such rules are actually encoded into legislation is not particularly important. Virtually all legal rules not part of public law can be made “legal” with regards to particular parties as part of a contract, for example. In any case, the heart of LFAI is being bound to follow rules, and interpreting those rules leveraging the rich body of useful rule-interpretation metarules from law.
[3] This is important because one of the core functions of law is to provide metarules regarding the interpretation of rules, guided by certain normative values (e.g., fairness, predictability, consistency). Indeed, rules of legal interpretation aim to solve many problems relevant to AI interpretation of instructions. Cf. Dylan Hadfield-Menell & Gillian Hadfield, Incomplete Contracting and AI Alignment (2018) (preprint), https://arxiv.org/abs/1804.04268.
[4] That is, the AI is not law-following just because the principal wants the AI to follow the law. Indeed, LFAI should disobey orders that would require it to behave illegally.
[5] That is, the AI is not law-following just because it is instrumentally valuable to it (because, e.g., being caught breaking the law would cause the AI to be turned off).
[6] As Ngo says, "My opinion is that defining alignment in maximalist terms is unhelpful, because it bundles together technical, ethical and political problems. While it may be the case that we need to make progress on all of these, assumptions about the latter two can significantly reduce clarity about technical issues."
[7] Cf., e.g., Dario Amodei et al., Concrete Problems in AI Safety 4 (2016), https://arxiv.org/pdf/1606.06565.pdf.
[8] I don't here offer an opinion on what training regime would yield such an outcome; my hope is to get someone to answer that for me!
[9] This approach may work particularly well when combined with insurance requirements for people deploying AI systems.
[10] In the same way that an intent-aligned AI will sometimes ask for clarifications from a human principal. See Christiano.
[11] Note that there are ELK-style problems with this approach. If an AI is asking for legal advice and wants to minimize the negative signal it gets from the Counselor, it may hide certain relevant information (e.g., its true state of knowledge or its true intentions) from the Counselor. A good solution, as discussed, could be to simulate an idealized adjudication of the issue if all the parties knew all the relevant facts and had equal legal firepower. But incentivizing the LFAI to tell the Counselor its true knowledge/intentions is an ELK problem. In the limit, the Counselor need not strictly be a distinct agent from the LFAI: an LFAI system may have Counselor capabilities and run this "consultation" process internally. Nevertheless, it is illustratively useful to imagine a separation of the LFAI and the Counselor.
[12] This would be idealized so that details not ultimately relevant to the substantive legality of the action (e.g., jurisdiction, AI personhood, other procedural matters, asymmetries in legal firepower) can be ignored. See the final footnote of this piece for further discussion.
[13] See the Appendix for more discussion on this point.
[14] See Battery, Wex, https://www.law.cornell.edu/wex/battery (last accessed Sept. 3, 2021).
[15] See, e.g., Intel Corp. v. Hamidi, 71 P.3d 296, 304–08 (Cal. 2003) (applying trespass to chattels to unauthorized electronic computer access); MAI Sys. Corp. v. Peak Computer, Inc., 991 F.2d 511, 518–19 (9th Cir. 1993) (storing data in RAM sufficient to create a "copy" for copyright purposes, despite the fact that a "copy" must be "fixed in a tangible medium"); cf. United States v. Jones, 565 U.S. 400, 406 n.3 (2012) (analogizing GPS tracking to in-person surveillance for Fourth Amendment purposes).
[16] See, e.g., Jonathan H. Blavin & I. Glenn Cohen, Gore, Gibson, and Goldsmith: The Evolution of Internet Metaphors in Law and Commentary, 16 Harv. J.L. & Tech. 265 (2002).
[17] However, the case for working on LFAI certainly diminishes with the number of applicable laws.
[18] This raises further issues, including the possibility of self-reference. For example, an LFAI or Counselor asymmetrically deployed by one litigant may be able to persuade a judge or jury of its position, even if it's not the best outcome. To avoid this, such simulations should assume that judges and juries are fully apprised of all relevant facts (i.e., neither the LFAI nor Counselor can obscure relevant evidence) and if deployed in the simulated proceeding are symmetrically available to both sides.

Problem 1)

Human-written laws are written with our current tech level in mind. There are laws against spewing out radio noise on the wrong frequencies, laws about hacking and encryption, laws about the safe handling of radioactive material. There are vast swaths of safety regulations. This is stuff that has been written for current tech levels. (How well would Roman law work today?)

This kind of "law following AI" sounds like it will produce nonsensical results as it tries to build warp drives and Dyson spheres to the letter of modern building codes, following all the rules about fire-resistant building materials far from any oxygen.

Take mind uploading. Does the AI consider this murder, and not do it? Does the AI consider mind uploading to be a medical procedure, so that it figures out the tech in 5 minutes, then wastes years at med school learning bloodletting and leeches? Does the AI consider mind uploading to have no laws talking about it, and proceed to upload any mind whenever it feels like? Or does the AI consider itself just the equipment used in the procedure, so the AI needs to find a doctor and persuade them to press the "start upload" button?

I don't know about you, but I want such a decision made by humans seriously considering the issue, or an AI's view of our best interests. I don't want it made by some pedantic letter of the law interpretation of some act written 100's of years ago. Where the decision comes down to arbitrary phrasing decisions and linguistic quirks.

And that's before we get into all the ways that the law as it exists sucks. Some laws are written with the intent of making everything illegal, and then letting the police decide who to arrest. For example, the psychoactive substance bill that banned any chemical that had any mental effect: taken literally, it banned basically all chemicals. Or loitering bills that only get applied to poor homeless people the police don't like.

The average law is badly considered and badly written and sponsored by a team of special interest groups. A bill that is enforced and moronic can sometimes stir up enough anger to get removed, eventually. But unenforced, often unenforceable, moronic bills accumulate indefinitely.

I appreciate your engagement! But I think your position is mistaken for a few reasons:

First, I explicitly define LFAI to be about compliance with "some defined set of human-originating rules ('laws')." I do not argue that AI should follow all laws, which does indeed seem both hard and unnecessary. But I should have been more clear about this. (I did have some clarification in an earlier draft, which I guess I accidentally excised.) So I agree that there should be careful thought about which laws an LFAI should be trained to follow, for the reasons you cite. That question itself could be answered ethically or legally, and could vary with the system for the reasons you cite. But to make this a compelling objection against LFAI, you would have to make, I think, a stronger claim: that the set of laws worth having AI follow is so small or unimportant as to be not worth trying to follow. That seems unlikely.

Second, you point to a lot of cases where the law would be underdetermined as to some out-of-distribution (from the caselaw/motivations of the law) action that the AI wanted to do, and say that:

I don't know about you, but I want such a decision made by humans seriously considering the issue, or an AI's view of our best interests. I don't want it made by some pedantic letter of the law interpretation of some act written 100's of years ago. Where the decision comes down to arbitrary phrasing decisions and linguistic quirks.

But I think LFAI would actually facilitate the result you want, not hinder it:

As I say, the pseudocode would first ask whether the act X being contemplated is clearly illegal with reference to the set of laws the LFAI is bound to follow. If it is, then that seems to be some decent (but not conclusive) evidence that there has been a deliberative process that prohibited X.
The pseudocode then asks whether X is maybe-illegal. If there has not been deliberation about analogous actions, that would suggest uncertainty, which would weigh in the favor of not-X. If the uncertainty is substantial, that might be decisive against X.
If the AI's estimation in either direction makes a mistake as to what humans' "true" preferences regarding X are, then the humans can decide to change the rules. The law is dynamic, and therefore the deliberative processes that shape it would/could shape an LFAI's constraints.

Furthermore, all of this has to be considered as against the backdrop of a non-LFAI system. It seems much more likely to facilitate the deliberative result than just having an AI that is totally ignorant of the law.

Your point about the laws being imperfect is well-taken, but I believe overstated. Certainly many laws are substantively bad or shaped by bad processes. But I would bet that most people, probably including you, would rather live among agents that scrupulously followed the law than agents who paid it no heed and simply pursued their objective functions.

First, I explicitly define LFAI to be about compliance with "some defined set of human-originating rules ('laws')." I do not argue that AI should follow all laws, which does indeed seem both hard and unnecessary.

Sure, some of the failure modes mentioned at the bottom disappear when you do that.

I think, a stronger claim: that the set of laws worth having AI follow is so small or unimportant as to be not worth trying to follow. That seems unlikely.

If some law is so obviously a good idea in all possible circumstances, the AI will do it whether it is law following or human preference following.

The question isn't if there are laws that are better than nothing. It's whether we are better off encoding what we want the AI to do into laws, or into terms of a utility function. Which format (or maybe some other format) is best for encoding our preferences?

If their objective function is something like the CEV of humanity, any extra laws imposed on top of that are entropic.

But I would bet that most people, probably including you, would rather live among agents that scrupulously followed the law than agents who paid it no heed and simply pursued their objective functions.

If the AIs have no correlation to human wellbeing in their objectives, the weak correlation given by law following may be better than nothing. If the AI is already strongly correlated with human wellbeing, then any laws imposed are making the AI worse.

If the AI's estimation in either direction makes a mistake as to what humans' "true" preferences regarding X are, then the humans can decide to change the rules. The law is dynamic, and therefore the deliberative processes that shape it would/could shape an LFAI's constraints.

If the human has never imagined mind uploading, does A go up to the human and explain what it is, asking if maybe that law should be changed?

If some law is so obviously a good idea in all possible circumstances, the AI will do it whether it is law following or human preference following.

As explained in the second post, I don't agree that that's implied if the AI is intent-aligned but not aligned with some deeper moral framework like CEV.

The question isn't if there are laws that are better than nothing. It's whether we are better off encoding what we want the AI to do into laws, or into terms of a utility function. Which format (or maybe some other format) is best for encoding our preferences?

I agree that that is an important question. I think we have a very long track record of embedding our values into law. The point of this sequence is to argue that we should therefore at a minimum explore pointing to (some subset of) laws, which has a number of benefits relative to trying to integrate values into the utility function objectively. I will defend that idea more fully in a later post, but to briefly motivate the idea, law (as compared to something like the values that would come from CEV) is more or less completely written down, much more agreed-upon, much more formalized, and has built-in processes for resolving ambiguities and contradictions.

If the human has never imagined mind uploading, does A go up to the human and explain what it is, asking if maybe that law should be changed?

A cartoon version of this may be that A says "It's not clear whether that's legal, and if it's not legal it would be very bad (murder), so I can't proceed until there's clarification." If the human still wants to proceed, they can try to:

Change the law.
Get a declaratory judgment that it's not in fact against the law.

I think we have a very long track record of embedding our values into law.

I mean you could say that if we haven't figured out how to do it well in the last 10,000 years, maybe don't plan on doing it in the next 10. That's kind of being mean though.

If you have a functioning arbitration process, can't you just say "don't do bad things" and leave everything down to the arbitration?

I also kind of feel that adding laws is going in the direction of more complexity. And we really want things as simple as possible. (I.e., the minimal AI that can sit in a MIRI basement and help them figure out the rest of AI theory or something.)

If the human still wants to proceed, they can try to:

I was talking about a scenario where the human has never imagined the possibility, and asking if the AI mentions the possibility to the human (knowing the human may change the law to get it).

The human says "cure my cancer". The AI reasons that it can:

1. Tell the human of a drug that cures their cancer in the conventional sense.
2. Tell the human about mind uploading, never mentioning the chemical cure.

If the AI picks 2, the human will change the "law" (which isn't the actual law, it's just some text file the AI wants to obey). Then the AI can upload the human and the human will have a life the AI judges as overall better for them.

You don't want the AI to never mention a really good idea because it happens to be illegal on a technicality. You also don't want all the plans to be "persuade humans to make everything legal, then ..."
