This work was supported by the Monastic Academy for the Preservation of Life on Earth. You can support my work directly here.

I will give a short presentation of this work followed by discussion on Saturday (Dec 24) at 9am Pacific / 12pm Eastern. RSVP here.


  • This is a response to Holden Karnofsky’s plan for aligning powerful AI systems in the case that such systems are developed soon.

  • I give a summary of Holden’s plan, and then critique it.

  • My basic critique is that the tools Holden is proposing are too similar to what has been tried in the construction of human institutions, and I argue that we have failed to create large numbers of human institutions that alleviate rather than exacerbate existential risk.

The Karnofsky Plan

Holden has written about how we might align very powerful AI systems if such systems were to be developed soon. The basic approach Holden proposes is to use human supervision to train AI systems to be honest, legible, corrigible, and value-aligned. Honesty means not engaging in deceit or manipulation of the supervisor. Legibility means creating artifacts that help supervisors to accurately review decisions. Corrigibility means being willing to stop when the human supervisor says no, no matter how strongly the AI believes that it is doing the right thing. Value-alignment means pursuing the goals of the supervisor. To train such AI systems, Holden says we need to pay attention to the following properties of the training environment:

(1) The human supervisor should provide high-quality feedback. This means that if we are training an AI system to be honest then the human supervisor should give a positive reward if and only if the AI system is, in fact, being honest. It can be hard to know whether an AI system is being honest. Therefore we may design further tools to help the human assess whether the AI system is being honest. In order to do this we may train auxiliary AI systems to assess our primary AI system, or we may restrict the complexity of the primary AI systems, or we may set up an ecosystem of differently-trained AI systems in such a way that dishonesty is "weeded out" by ecological dynamics.

(2) The AI learner should correctly generalize what it has been taught. This means that if we are training an AI system to be honest then the AI system should be honest even when presented with situations very unlike those that it encountered during training. To accomplish this we may identify, during training, which situations are most likely to cause undesirable behavior given the current state of the AI, and then subject the AI to those situations and provide negative rewards until the undesirable behavior is corrected. We may train auxiliary AI systems to help with identifying the situations most likely to cause undesirable behavior given the current state of the AI (adversarial training).

(3) The AI learner should not break out of the training container. This means that if we are training an AI system to be honest, it should be possible for the human supervisor to hide certain things outside the training container, such as schemes for testing the AI system in clever ways. If the AI system escapes the training container, hacks the human supervisor’s cell phone, and listens to the human supervisors discussing the structure of the next test, then the AI system may be able to appear honest without actually being honest. More generally, the AI system should not escape the training container during training. To accomplish this we may build auxiliary AI systems that help to find and patch security holes in the software that holds the training container.

(4) The AI learner should be thoroughly tested before deployment. This means that after training is complete, the AI learner should be subjected to an even more intense battery of tests before it is deployed into the world. To do this we may make use of tools that make sense of the internal workings of the AI, or we may set up AIs that are limited in ways that make them simpler and safer to assess, or we may make use of an ecosystem of differently-trained AIs interacting according to a protocol that causes undesirable properties to be "weeded out".

In order to build a training setup that has these four properties, Holden says that we need progress on (1) decoding the internal states of AI systems, (2) training AI systems that are limited in ways that make them safer, and (3) setting up ecosystems of interacting AI systems according to protocols that "weed out" undesirable behavior. The overall Karnofsky plan, therefore, calls for development of these three technologies, in order to set up a training environment with the four properties listed above, in order to develop AI systems that are honest, legible, value-aligned, and corrigible, in order to reduce and not exacerbate overall existential risk.

My take

In human institutions we also seek qualities like honesty, legibility, corrigibility, and value-alignment. Corrigibility in human institutions looks like humans who are willing to stop when their supervisor says to stop, even if they really truly believe that what they are doing is the right thing for the organization. Value-alignment looks like humans who understand and pursue the goals of the overall organization. Honesty looks like humans scrutinizing each other’s claims. Legibility looks like decision-making processes that generate materials that help overseers to review decisions.

In order to cultivate these qualities, humans use tools similar to the ones that Holden proposes:

The analog of decoding the internal states of an AI system for human institutions is looking into the basic goals and beliefs held by a particular individual. This is not done via brain scans but by inference based on external behavior, including via sympathetic inference at the level of the nervous system. Most people believe that once they get to know what "makes a person tick", they can assess that person's trustworthiness more accurately than they could from a spreadsheet listing the situations the person had encountered and the choices they made (although such a spreadsheet is certainly also informative). The basic reason we trust this approach is that understanding the mechanics of a person's decision-making gives us more insight into their future behavior than mere observed inputs and outputs.

The analog in human institutions of limiting AI system complexity is that we don’t hire overly sophisticated people into inappropriate positions, and after hiring people we make blanket prohibitions against certain complex forms of interaction, such as sexual relations. It’s not that sexual relations among employees are always harmful; it’s that they are very hard to understand and govern, and are sometimes very harmful. Therefore we make blanket prohibitions in order to simplify the environment to the point that it can be governed. The basic reason we trust this approach is that we see our own inability to govern very complex situations, and we believe that these blanket limitations simplify the overall situation.

The analog in human institutions of setting up ecosystems of interacting AI systems is of course the many ways that we balance the influence of different players in human institutions. This is a significant aspect of management, governance, and institution design. We create systems of checks and balances where decisions are reviewed, where feedback is solicited by organizationally separated departments, where review boards are formed to investigate signs of misbehavior, and so on. The idea is to create an environment where the only rational course of action, even for partially selfish humans, is to "play it safe by following the rules". The fundamental premise is that by limiting the information and access of individuals, and by creating an overall system that rivals or exceeds the sophistication of individuals, no individual will be able to exert power over the whole. As a result we create bureaucratic mazes that impede a lot of helpful as well as unhelpful behavior, but we believe that this is still better than allowing individuals to have lots of power. Underlying this belief is the basic observation that the individuals themselves are not completely trustworthy.

So in the case of human institutions, we have used, among other tools, the three basic techniques that the Karnofsky plan proposes to use. Have we in fact created human institutions capable of reducing rather than exacerbating existential risk? In short: no. The net effect of all the human institutions that have been created since the dawn of our civilization is the present world situation in which very powerful technologies are being developed and deployed by institutions in a way that is exacerbating existential risk. There may be some institutions and technologies that are reducing existential risk, but the sum of all the human institutions in the world is pushing existential risk up year by year.

Whatever we humans have been doing these past few thousand years, and particularly these past few hundred years, and particularly these past few decades, the net effect of it all is the present high level of existential risk. If you were on a large ship that was gradually sinking into the ocean, and the people on that ship were organized into institutions that were governed in certain ways, and yet the holes in the ship were not being patched but were becoming larger with time, then you would have to conclude that the approach being taken by the institutions on that ship is insufficient to solve the problem. There might be some good aspects to the approach, and there might be some institutions doing good things, but it would be strange for the people on that ship to take on new highly intelligent passengers and to use the same basic approach to coordinate those new passengers as they had been using to coordinate themselves. You would think that the people on the ship would seriously reconsider their approach at its most basic level.

Fundamentally, Holden’s proposal says: let’s assume that we have a bunch of mildly-to-extremely untrustworthy AIs, and let’s set up a set of checks and balances and limitations and inspections such that we end up with an AI ecosystem that is trustworthy as a whole. I appreciate this approach very much, and with the right toolset, I think it could work. But the toolset being proposed is too similar to what has already been tried with human institutions. This toolset has failed to create trustworthy institutions out of mildly-to-extremely untrustworthy humans (where "trustworthy" means, at a minimum, "does not take actions that, on net, risk destroying all life on the planet"). What I sense we need is a theory of trustworthiness that gives us insight into building systems that are trustworthy. Such a theory ought to guide our efforts when building either individual AIs or ecosystems of AIs. In the absence of such a theory, we have only trial-and-error and copying of past approaches. Given our present predicament as a civilization, it seems dangerous to further replicate the approaches that led us here.





Will the discussion be recorded?

Wasn't able to record it - technical difficulties :(

Yes, I should be able to record the discussion and post a link in the comments here.