This post is written in my personal capacity, and does not necessarily represent the views of OpenAI or any other organization. Cross-posted to the Effective Altruism Forum.
In the first post of this sequence, I defined "law-following AI" ("LFAI") and "intent alignment." In this post, I will begin to motivate the case for working on LFAI by showing that an AI agent A that is only intent-aligned (i.e., without additional law-following constraints directly on the agent) would in many cases break the law to advance its human principal H's interests.
If you agree that it is generally good for agents to be bound by law, then this should be a pro tanto reason to support work on LFAI in addition to intent alignment. If you need more convincing that LFAI would be generally good, the next few posts will explore some more specific motivations for LFAI.
Image by OpenAI's DALL·E
I suspect this point will need little argumentation for many of my readers. Evasion and obfuscation seem likely to be generally promising means of ensuring self-preservation (because detection would increase the probability of punishment and constraint), which in turn is plausibly a convergent instrumental goal of optimizing AI systems.
For example, to evade detection and attribution, A could:
A sufficiently intelligent AI agent could likely consistently fool humans using these and other (unforeseen) techniques.
Even in the best-case scenario, where the agent is detected and within the jurisdiction of a well-functioning legal system, it would be reasonable to question whether A or H could be effectively subject to normal legal processes. If A were motivated to do so, A could help H escape liability by, for example:
As I said in the previous post, on its face, intent alignment does not entail law-following. Part of law is coercing prosocial behavior: law incentivizes agents to behave in ways that they do not intrinsically want to behave. If A is aligned with H, whether A obeys the law depends on whether H wants A to obey the law. Subsequent posts will examine what legal consequences H might face if A causes legally cognizable harms. However, even if an adequate theory of liability for H were available, it would seem impossible to hold H liable if nobody could produce evidence that some agent of H's was responsible for those harms. As argued above, a sufficiently intelligent agent probably could consistently avoid leaving any such evidence.
Detection and attribution would not solve the problem, however. Even if H was compelled, under court order, to instruct A to behave in some way, it's not clear that A would follow the order. Consider again Iason Gabriel's taxonomy of alignment. We can see that, for most types of intent alignment, an intent-aligned agent would likely not obey compelled instructions that are against H's true wishes:
"Instructions: the agent does what I instruct it to do."
"Expressed intentions: the agent does what I intend it to do."
"Revealed preferences: the agent does what my behaviour reveals I prefer."
"Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
"Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
Now, it may be the case that H actually does want A to obey the order, though compelled, if the failure of A to obey would lead to liability for H that is worse than the results of A's obedience (e.g., because H will be held in contempt of court if A does not actually obey). However, note that "[o]rdinarily, one charged with contempt of court for failure to comply with a court order makes a complete defense by proving that he is unable to comply." H can comply with an order that requires H to command A to do something, but it may be impossible for H to actually force A to comply if the order is against H's true wishes (to which A is aligned). If so, H could have an impossibility defense to contempt. A, understanding this, may continue on without complying because A understands that H will not actually be held in contempt. H can therefore benefit from A's disobedience. A will therefore be lawless.
A's behavior here would be functionally similar to that of a trustee acting pursuant to distress clauses in asset protection trusts ("APTs"). While these provisions can prevent a contempt charge, the burden of proof on the alleged contemnor is high.
As a matter of policy, however, courts may decide to pre-commit to a contempt standard that does not allow for an impossibility defense when the defendant's AI agent refuses to obey orders issued pursuant to a court order. Analogously, courts are imposing heightened impossibility standards in response to APTs, in an attempt to make their use more onerous. If this pre-commitment is credible, it may change the agent's behavior because H may then genuinely desire A to perform (because H will be held in contempt otherwise). However, such a policy may be contrary both to precedent and to more fundamental notions of fairness and due process: in some cases A's refusal to comply may be a surprise to H, since H may have had a long history of observing A scrupulously complying with H's orders, and H did not implement principal–agent alignment for the purpose of evading court orders. If so, H may be able to invoke impossibility more easily, since the impossibility was not as clearly intentionally self-induced as in the APT case. Furthermore, I would intuitively not expect courts to advance such a reform until they have faced multiple such instances of AI disobedience. This seems bad if we expect the earliest deployed AI agents to have an outsized impact on society. In any case, I would not expect the mere possibility of favorable post-AGI law reform to be a sufficient solution to this problem. Finally, I would expect sufficiently intelligent agents to recognize these dynamics and attempt to find ways to circumvent the contempt process itself, such as by surreptitious non-compliance.
An alternative, pre-AGI solution (which arguably seems pretty sensible from a public policy perspective anyway) is to advocate weakening the impossibility defense for self-imposed impossibility.
See generally Alexander Matt Turner et al., Optimal Policies Tend To Seek Power (version 9, 2021) (preprint), https://arxiv.org/abs/1912.01683. ↩︎
Even this may not hold for many types of agreements, including in particular international treaties. ↩︎
See also Cullen O'Keefe et al., The Windfall Clause: Distributing the Benefits of AI for the Common Good 26–27 (2020), https://perma.cc/8KES-GTBN; Jan Leike, On The Windfall Clause (2020) (unpublished manuscript), https://docs.google.com/document/d/1leOVJkNDDj-NZUZrNJauZw9S8pBpuPAJotD0gpnGEig/. ↩︎
Indeed, this is already a common technique without the use of AI systems. ↩︎
"If men were angels, no government would be necessary." The Federalist No. 51. This surely overstates the point: law can also help solve coordination problems and facilitate mutually desired outcomes. But prosocial coercion is nevertheless an important function of law and government. ↩︎
See Gabriel at 7 ("However, as Russell has pointed out, the tendency towards excessive literalism poses significant challenges for AI and the principal who directs it, with the story of King Midas serving as a cautionary tale. In this fabled scenario, the protagonist gets precisely what he asks for—that everything he touches turns to gold—not what he really wanted. Yet, avoiding such outcomes can be extremely hard in practice. In the context of a computer game called CoastRunners, an artificial agent that had been trained to maximise its score looped around and around in circles ad infinitum, achieving a high score without ever finishing the race, which is what it was really meant to do. On a larger scale, it is difficult to precisely specify a broad objective that captures everything we care about, so in practice the agent will probably optimise for some proxy that is not completely aligned with our goal. Even if this proxy objective is 'almost' right, its optimum could be disastrous according to our true objective." (citations omitted)). ↩︎
Based on my informal survey of alignment researchers at OpenAI. Everyone I asked agreed that an intent-aligned agent would not follow an order that the principal did not actually want followed. Cf. also Christiano (A is aligned when it "is trying to do what H wants it to do" (emphasis added)). ↩︎
We can compare this definition of intent with the relevant legal definition thereof: "To have in mind a fixed purpose to reach a desired objective; to have as one's purpose." INTEND, Black's Law Dictionary (11th ed. 2019). H does not "intend" for the order to be followed under this definition: the "desired objective" of H issuing the order is to follow H's legal obligations, not actually achieve the result contemplated by the order. ↩︎
For example, H would exhibit signs of happiness when A continues. ↩︎
United States v. Bryan, 339 U.S. 323, 330 (1950). ↩︎
A principal may want its AI agents to be able to distinguish between genuine and coerced instructions, and to disobey the latter. Indeed, this might generally be a good thing, except for the case when compulsion is pursuant to law rather than extortion. ↩︎
See Appendix for further discussion. ↩︎
See generally Asset Protection Trust, Wex, https://www.law.cornell.edu/wex/asset_protection_trust (last visited Mar. 24, 2022); Richard C. Ausness, The Offshore Asset Protection Trust: A Prudent Financial Planning Device or the Last Refuge of A Scoundrel?, 45 Duq. L. Rev. 147, 174 (2007). ↩︎
See generally 2 Asset Protection: Dom. & Int'l L. & Tactics §§ 26:5–6 (2021). ↩︎
See id. ↩︎
Yes, the superintelligent AI could out-lawyer everyone else in the courtroom if it wanted to. My background assumption would rather be that the AI develops nanotech and can make every lawyer, court, and policeman in the world vanish the moment it sees fit. The human legal system can only affect the world to the extent that humans listen to it and enforce it. With a superintelligent AI in play, this may well be King Canute commanding the sea to halt (and a servant trying to bail the sea back with a teaspoon).
You don't really seem to be considering the AI openly defying the law.
Even if H was compelled, under court order, to instruct A to behave in some way, it's not clear that A would follow the order.
Everyone H dislikes drops dead at the same instant. H is now living in a giant moonbase, which of course has offensive and defensive capabilities far beyond human tech. Self-replicating nanobots, programmed to obey only instructions cryptographically signed by A, now hide in every square meter of the earth. At a moment's whim, H can tell A to produce just about anything, just about anywhere. An army of giant marshmallow monsters roams the land. An impressive feat of nanotech indeed, to make a substance that resembles marshmallow so closely, yet is animated like muscle. H thought it would be funny.
(I realized the second H in that blockquote should be an A)