This post is written in my personal capacity, and does not necessarily represent the views of OpenAI or any other organization. Cross-posted to the Effective Altruism Forum.

In the previous post of this sequence, I argued that intent-aligned AIs would, by default, have incentives to break the law. This post goes into one particularly bad consequence of that incentive: the increased difficulty of making credible pre-AGI commitments about post-AGI actions.

Image by OpenAI's DALL·E

In AGI policy and strategy, it would often be useful to adopt credible commitments about what various actors will do post-AGI. For example, it may be desirable for two leading nations in AGI to agree to refrain from racing to build AGI (at a potential cost to AGI safety) and instead split the economic upside from AGI, thereby transforming a negative-sum dynamic into a positive-sum one.[1] Nations might agree to forgo their own development of militarily useful AI systems if they receive security assurances from states that do choose to pursue such systems,[2] thus reducing the number of actors pursuing potentially dangerous military AI development, and therefore reducing the risk of a catastrophic accident. One can imagine similar types of pre-AGI agreements concerning space colonization post-AGI, the welfare of digital minds, democratic control of AGI, and many more important issues in the post-AGI world.

Lawless AI poses an enormous problem for such agreements. Agreements between entities, such as states or companies, require performance by agents of the parties: a state or corporation only acts through its agents. But if one party cannot convincingly show that their agents will carry out orders to comply with such agreements, the other party will be very reluctant to enter into an agreement with them. All the more so if the principal cannot prevent its agents from working against aspects of the agreement that are detrimental to the principal.

I will use the following as a toy problem to show what I mean more concretely: X and Y are two leading AGI development firms considering agreeing to split the profits from AGI development if they both agree to implement common minimum safety measures to prevent racing.[3] How might this play out?

X might worry that Y will develop an AGI agent, A(Y), with the objective of maximizing the value of Y's shares. A(Y) can do this in ways that would undermine X's recourse to Y's assets. For example, A(Y) could create a subagent that surreptitiously earns money and directly mails checks to the shareholders of Y, as a sort of pseudo-dividend. That money might never pass through Y, so X would have a hard time reaching it, even if X detected the scheme. X might also worry that A(Y) would corrupt the legal process in Y's favor, so that even if X had a good legal claim, it could not vindicate that claim in court at any reasonable cost.

With these concerns in mind, the deal is unlikely to happen.

How might LFAI improve the situation? X and Y could both agree that:

  1. All sufficiently advanced AI agents ("Qualifying Agents") that they train, use, benefit from, or deploy will be LFAI.
  2. The other party may, through a neutral third-party bound by confidentiality,[4] inspect Qualifying Agents and their development to verify that the trained agent will be LFAI when deployed.
  3. Qualifying Agents will be specifically instructed to assume that they are bound by the agreement.[5]
  4. The parties will generally cause all of their agents to give full effect to the agreement.
  5. In any proceeding where the parties are adverse, neither party shall use any artificial intelligence technology to its legal or strategic benefit, unless the other party has free and equal access to the same technology.[6]
  6. In case of any ambiguity in the terms of the agreement, the ambiguity shall be resolved in favor of more constraints on Qualifying Agents and their principal(s).

The resulting agreement would probably not be watertight, but the parties would have much greater assurance of post-AGI performance than they would absent LFAI. Referring back to the example above, A(Y) could not surreptitiously earn money outside of Y and pass it on to Y's shareholders, nor could A(Y) illegally corrupt adjudications of the agreement. LFAI, then, could make the parties much more likely to be willing to enter stabilizing pre-AGI agreements that require post-AGI performance. This seems like a substantial improvement.
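As a rough illustration of this reasoning, the toy problem can be sketched as a two-step decision: X predicts what Y's agents will do after signing, and signs only if the deal beats its outside option under that prediction. This is only one simplified way to read the argument, not a model from the post. The function names and every payoff number below are hypothetical, chosen solely so that defection pays for Y when its agents can act unlawfully and does not pay when they are LFAI.

```python
# Toy model of the X/Y profit-sharing deal.
# All payoff numbers are hypothetical and chosen only for illustration.

def y_best_response(payoffs_y):
    """Return the action ("comply" or "defect") that maximizes Y's payoff."""
    return max(payoffs_y, key=payoffs_y.get)


def x_signs(payoffs_x, y_action, x_outside_option):
    """X signs only if the deal, given Y's predicted action, beats not signing."""
    return payoffs_x[y_action] > x_outside_option


def deal_outcome(lawful_agents):
    """Return (Y's predicted action, whether X signs)."""
    # Assumption: without law-following agents, A(Y) can divert profits
    # beyond X's reach, so defecting pays Y more than complying.
    # With law-following agents, that route is closed and X can recover
    # in court, so defecting pays Y less than complying.
    payoffs_y = {"comply": 5, "defect": 2 if lawful_agents else 8}
    # Assumption: X gains if Y complies and loses if Y defects.
    payoffs_x = {"comply": 5, "defect": -3}
    x_outside_option = 1  # hypothetical payoff to X from racing instead
    predicted = y_best_response(payoffs_y)
    return predicted, x_signs(payoffs_x, predicted, x_outside_option)


for lawful in (False, True):
    action, signed = deal_outcome(lawful)
    print(f"lawful_agents={lawful}: Y would {action}; X signs: {signed}")
```

Under these assumed numbers, X declines the deal when Y's agents are unconstrained and accepts it when they are LFAI. The sketch captures only the direction of the effect; real agreements would be much less clean, as the "not watertight" caveat above suggests.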

  1. Cf. Amanda Askell et al., The Role of Cooperation in Responsible AI Development (2019) (preprint). ↩︎

  2. Of course, this could be analogized to similar agreements regarding nuclear disarmament, such as Ukraine's fateful decision to surrender its post-Soviet nuclear arsenal in exchange for security assurances (which have since been violated by Russia). See, e.g., Editorial, How Ukraine Was Betrayed in Budapest, Wall St. J. (Feb. 23, 2022). Observers (especially those facing potential conflict with Russia) might reasonably question whether any such disarmament agreements are credible. ↩︎

  3. We will ignore antitrust considerations regarding such an agreement for the sake of illustration. ↩︎

  4. So that this inspection process cannot be used for industrial espionage. ↩︎

  5. This may not be the case as a matter of background contract and agency law, and so should be stipulated. ↩︎

  6. This is designed to guard against the case where one party develops AI super-lawyers, then wields them asymmetrically to their advantage. ↩︎

2 comments

Decision-makers who need inspections to keep them in line are incentivized to subvert those inspections and betray the other party. It seems like what's actually necessary is people in key positions who are willing to cooperate in the prisoner's dilemma when they think they are playing against people like them - people who would cooperate even if there were no inspections.

But if there are inspections, then why do we need law-following AI? Why not have the inspections directly check that the AI would not harm the other party (hopefully because it would be helping humans in general)?

A(Y), with the objective of maximizing the value of Y's shares.

A sphere of self-replicating robots, expanding at the speed of light. Turning all available atoms, including atoms that used to make up judges, courts, and shareholders, into endless stacks of banknotes (with little "belongs to Y" notes pinned to them).