Koen Holtman

Computing scientist and Systems architect. Currently doing self-funded AI/AGI safety research.  I participate in AI standardization under the company name Holtman Systems Research: https://holtmansystemsresearch.nl/


Counterfactual Planning

Wiki Contributions


Note: This is presumably not novel, but I think it ought to be better-known.

This indeed ought to be better-known. The real question is: why is it not better-known?

What I notice in the EA/Rationalist based alignment world is that a lot of people seem to believe in the conventional wisdom that nobody knows how to build myopic agents, nobody knows how to build corrigible agents, etc.

When you then ask people why they believe that, you usually get some answer 'because MIRI', and then when you ask further it turns out these people did not actually read MIRI's more technical papers, they just heard about them.

The conventional wisdom 'nobody knows how to build myopic agents' is not true for the class of all agents, as your post illustrates. In the real world, applied AI practitioners use actually existing AI technology to build myopic agents, and corrigible agents, all the time. There are plenty of alignment papers showing how to do these things for certain models of AGI too: in the comment thread here I recently posted a list.

I speculate that the conventional rationalist/EA wisdom of 'nobody knows how to do this' persists because of several factors. One of them is just how social media works, Eternal September, and People Do Not Read Math, but two more interesting and technical ones are the following:

  1. It is popular to build analytical models of AGI where your AGI will have an infinite time horizon by definition. Inside those models, making the AGI myopic without turning it into a non-AGI is then of course logically impossible. Analytical models built out of hard math can suffer from this built-in problem, and so can analytical models built out of common-sense verbal reasoning, In the hard math model case, people often discover an easy fix. In verbal models, this usually does not happen.

  2. You can always break an agent alignment scheme by inventing an environment for the agent that breaks the agent or the scheme. See johnswentworth's comment elsewhere in the comment section for an example of this. So it is always possible to walk away from a discussion believing that the 'real' alignment problem has not been solved.

I think I agree to most of it: I agree that some form of optimization or policy search is needed to get many things you want to use AI for. But I guess you have to read the paper to find out the exact subtle way in which the AGIs inside can be called non-consequentialist. To quote Wikipedia:

In ethical philosophy, consequentialism is a class of normative, teleological ethical theories that holds that the consequences of one's conduct are the ultimate basis for judgment about the rightness or wrongness of that conduct.

I do not talk about this in the paper, but in terms of ethical philosophy, the key bit about counterfactual planning is that it asks: judge one's conduct by its consequences in what world exactly? Mind you, the problem considered is that we have to define the most appropriate ethical value system for a robot butler, not what is most appropriate for a human.

Hi Simon! You are welcome! By the way, I very much want to encourage you to be skeptical and make up your own mind.

I am guessing that by mentioning consequentialist, you are referring to this part of Yudkowsky's list of doom:

  1. Corrigibility is anti-natural to consequentialist reasoning

I am not sure how exactly Yudkowsky is defining the terms corrigibility or consequentalist here, but I might actually be agreeing with him on the above statement, depending on definitions.

I suggest you read my paper Counterfactual Planning in AGI Systems, because it is the most accessible and general one, and because it presents AGI designs which can be interpreted as non-consequentualist.

I could see consequentialist AGI being stably corrigible if it is placed in a stable game-theoretical environment where deference to humans literally always pays as a strategy. However, many application areas for AI or potential future AGI do not offer such a stable game-theoretical environment, so I feel that this technique has very limited applicability.

If we use the 2015 MIRI paper definition of corrigibility, the alignment tax (the extra engineering and validation effort needed) for implementing corrigibility in current-generation AI systems is low to non-existent. The TL;DR here is: avoid using a bunch of RL methods that you do not want to use anyway when you want any robustness or verifiability. As for future AGI, the size of the engineering tax is open to speculation. My best guess is that future AGI will be built, if ever, by leveraging ML methods that still resemble world model creation by function approximation, as opposed to say brain uploading. Because of this, and some other reasons, I estimate a low safety engineering tax to achieve basic corrigibility.

Other parts of AGI alignment may be very expensive. e.g. the part of actually monitoring an AGI to make sure its creativity is benefiting humanity, instead of merely finding and exploiting loopholes in its reward function that will hurt somebody somewhere. To the extent that alignment cannot be cheap, more regulation will be needed to make sure that operating a massively unaligned AI will always be more expensive for a company to do than operating a mostly aligned AI. So we are looking at regulatory instruments like taxation, fines, laws that threaten jail time, and potentially measures inside the semiconductor supply chain, all depending on what type of AGI will become technically feasible, if ever.

Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted.

To comment on your quick thoughts:

  • My later papers spell out the ML analog of the solution in `Corrigibility with' more clearly.

  • On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind or reasoning is at the core of Yudkowsky's arguments for pessimism.

  • On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. Partly this is because I am optimistic about future competent regulation of AGI-level AI by humans successfully banning certain dangerous AGI architectures outright, much more optimistic than Yudkowsky is.

  • I do not think I fully support my 2019 statement anymore that 'Part of this conclusion [of Soares et al. failing to solve corrigibility] is due to the use of a Platonic agent model'. Nowadays, I would say that Soares et al did not succeed in its aim because it used a conditional probability to calculate what should have been calculated by a Pearl counterfactual. The Platonic model did not figure strongly into it.

OK, Below I will provide links to few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

This list or links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019 trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property is indeed correctly there, or at least there during some part of the agent lifetime, or there under some boundary assumptions.

Not all of the papers below have actual mathematical proofs in them, some of them show correctness by construction. Correctness by construction is superior to having to have proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs.

Here is the list, with the bold headings describing different approaches to corrigibility.

Indifference to being switched off, or to reward function updates

Motivated Value Selection for Artificial Agents introduces Armstrong's indifference methods for creating corrigibility. It has some proofs, but does not completely work out the math of the solution to a this-is-how-to-implement-it level.

Corrigibility tried to work out the how-to-implement-it details of the paper above but famously failed to do so, and has proofs showing that it failed to do so. This paper somehow launched the myth that corrigibility is super-hard.

AGI Agent Safety by Iteratively Improving the Utility Function does work out all the how-to-implement-it details of Armstrong's indifference methods, with proofs. It also goes into the epistemology of the connection between correctness proofs in models and safety claims for real-world implementations.

Counterfactual Planning in AGI Systems introduces a different and more easy to interpret way for constructing a a corrigible agent, and agent that happens to be equivalent to agents that can be constructed with Armstrong's indifference methods. This paper has proof-by-construction type of math.

Corrigibility with Utility Preservation has a bunch of proofs about agents capable of more self-modification than those in Counterfactual Planning. As the author, I do not recommend you read this paper first, or maybe even at all. Read Counterfactual Planning first.

Safely Interruptible Agents has yet another take on, or re-interpretation of, Armstrong's indifference methods. Its title and presentation somewhat de-emphasize the fact that it is about corrigibility, by never even discussing the construction of the interruption mechanism. The paper is also less clearly about AGI-level corrigibility.

How RL Agents Behave When Their Actions Are Modified is another contribution in this space. Again this is less clearly about AGI.

Agents that stop to ask a supervisor when unsure

A completely different approach to corrigibility, based on a somewhat different definition of what it means to be corrigible, is to construct an agent that automatically stops and asks a supervisor for instructions when it encounters a situation or decision it is unsure about. Such a design would be corrigible by construction, for certain values of corrigibility. The last two papers above can be interpreted as disclosing ML designs that also applicable in the context of this stop when unsure idea.

Asymptotically unambitious artificial general intelligence is a paper that derives some probabilistic bounds on what can go wrong regardless, bounds on the case where the stop-and-ask-the-supervisor mechanism does not trigger. This paper is more clearly about the AGI case, presenting a very general definition of ML.

Anything about model-based reinforcement learning

I have yet to write a paper that emphasizes this point, but most model-based reinforcement learning algorithms produce a corrigible agent, in the sense that they approximate the ITC counterfactual planner from the counterfactual planning paper above.

Now, consider a definition of corrigibility where incompetent agents (or less inner-aligned agents, to use a term often used here) are less corrigible because they may end up damaging themselves, their stop buttons. or their operator by being incompetent. In this case, every convergence-to-optimal-policy proof for a model-based RL algorithm can be read as a proof that its agent will be increasingly corrigible under learning.


Cooperative Inverse Reinforcement Learning and The Off-Switch Game present yet another corrigibility method with enough math to see how you might implement it. This is the method that Stuart Russell reviews in Human Compatible. CIRL has a drawback, in that the agent becomes less corrigible as it learns more, so CIRL is not generally considered to be a full AGI-level corrigibility solution, not even by the original authors of the papers. The CIRL drawback can be fixed in various ways, for example by not letting the agent learn too much. But curiously, there is very little followup work from the authors of the above papers, or from anybody else I know of, that explores this kind of thing.

Commanding the agent to be corrigible

If you have an infinitely competent superintelligence that you can give verbal commands to that it will absolutely obey, then giving it the command to turn itself into a corrigible agent will trivially produce a corrigible agent by construction.

Giving the same command to a not infinitely competent and obedient agent may give you a huge number of problems instead of course. This has sparked endless non-mathematical speculation, but in I cannot think of a mathematical paper about this that I would recommend.

AIs that are corrigible because they are not agents

Plenty of work on this. One notable analysis of extending this idea to AGI-level prediction, and considering how it might produce non-corrigibility anyway, is the work on counterfactual oracles. If you want to see a mathematically unambiguous presentation of this, with some further references, look for the section on counterfactual oracles in the Counterfactual Planning paper above.


Myopia can also be considered to be feature that creates or improves or corrigibility. Many real-world non-AGI agents and predictive systems are myopic by construction: either myopic in time, in space, or in other ways. Again, if you want to see this type of myopia by construction in a mathematically well-defined way when applied to AGI-level ML, you can look at the Counterfactual Planning paper.

As one of the few AI safety researchers who has done a lot of work on corrigibility, I have mixed feelings about this.

First, great to see an effort that tries to draw more people to working on the corrigibility, because almost nobody is working on it. There are definitely parts of the solution space that could be explored much further.

What I also like is that you invite essays about the problem of making progress, instead of the problem of making more people aware that there is a problem.

However, the underlying idea that meaningful progress is possible by inviting people to work on a 500 word essay, which will then first be judged by 'approximately 10 Judges who are undergraduate and graduate students', seems to be a bit strange. I can fully understand Sam Bowman's comment that this might all look very weird to ML people. What you have here is an essay contest. Calling it a research contest may offend some people who are actual card-carrying researchers.

Also, the more experienced judges you have represent somewhat of an insular sub-community of AI safety researchers. Specifically, I associate both Nate and John with the viewpoint that alignment can only be solved by nothing less than an entire scientific revolution. This is by now a minority opinion inside the AI safety community, and it makes me wonder what will happen to submissions that make less radical proposals which do not buy into this viewpoint.

OK, I can actually help you with the problem of an unbalanced judging panel: I volunteer to join it. If you are interested, please let me know.

Corrigibility is both

  • a technical problem: inventing methods to make AI more corrigible

  • a policy problem: forcing people deploying AI to use those methods, even if this will hurt their bottom line, even if these people are careless fools, and even if they have weird ideologies.

Of these two problems, I consider the technical problem to be mostly solved by now, even for AGI.
The big open problem in corrigibility is the policy one. So I'd like to see contest essays that engage with the policy problem.

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs, rather than speculation or gut feelings. Of course, in the AI safety activism blogosphere, almost nobody wants to read or talk about these methods in the papers with the proofs, instead everybody bikesheds the proposals which have been stated in natural language and which have been backed up only by speculation and gut feelings. This is just how a blogosphere works, but it does unfortunately add more fuel to the meme that the technical side of corrigibility is mostly unsolved and that nobody has any clue.

Consider two common alignment design patterns: [...] (2) Fixing a utility function and then argmaxing over all possible plans.

Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the 'standard model' of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing something different that is not yet in the bog-standard agent design.

I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.

Though your use of the word doomed sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you change the terminology from alignment design pattern to agent design pattern.

At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.

I agree with you that this part of the AGI x-risk solution space, the part where one tries to design measures to lower the probability of collusion between AGIs, is very under-explored. However, I do not believe that the root cause of this lack of attention is a widely held 'tacit belief' that robust AGI collusion is inevitable.

It is easy to imagine the existence of a very intelligent person who nevertheless hates colluding with other people. It is easy to imagine the existence of an AI which approximately maximises a reward function which has a term in it that penalises collusion. So why is nobody working on creating or improving such an AI or penalty term?

My current x-risk community model is that the forces that channel people away from under-explored parts of the solution space have nothing to do with tacit assumptions about impossibilities. These forces operate at a much more pre-rational level of human psychology. Specifically: if there is no critical mass of people working in some part of the solution space already, then human social instincts will push most people away from starting to work there, because working there will necessarily be a very lonely affair. On a more rational level, the critical mass consideration is that if you want to do work that gets you engagement on your blog post, or citations on your academic paper, the best strategy is pick a part of the solution space that already has some people working in it.

TL;DR: if you want to encourage people to explore an under-visited part of the solution space, you are not primarily fighting against a tacit belief that this part of the space will be empty of solutions. Instead, you will need to win the fight against the belief that people will be lonely when they go into that part of the space.

I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.

Me too, but note how the analysis leading to the conclusion above is very open about excluding a huge number of failure modes leading to x-risk from consideration first:

[...] our focus here was on the most popular writings on threat models in which the main source of risk is technical, rather than through poor decisions made by humans in how to use AI.

In this context, I of course have to observe that any human decision, any decision to deploy an AGI agent that uses purely consequentialist planning towards maximising a simple metric, would be a very poor human decision to make indeed. But there are plenty of other poor decisions too that we need to worry about.

But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell's approach, and this seems bizarre.

Data point: I consider myself part to be part of the AI x-risk community, but like you am not very excited about mechanistic interpretability research in an x-risk context. I think there is somewhat of a filter bubble effect going on, where people who are more exited about interpretability post more on this forum.

Stuart Russell's approach is a broad agenda, and I am not on board with of all parts of it, but I definitely read his provable safety slogan as a call for more attention to the design approach where certain AI properties (like safety and interpretability properties) are robustly created by construction.

There is an analogy with computer programming here: a deep neural net is like a computer program written by an amateur without any domain knowledge, one that was carefully tweaked to pass all tests in the test suite. Interpreting such a program might be very difficult. (There is also the small matter that the program might fail spectacularly when given inputs not present in the test suite.) The best way to create an actually interpretable program is to build it from the ground up with interpretability in mind.

What is notable here is that the CS/software engineering people who deal with provable safety properties have long ago rejected the idea that provable safety should be about proving safe an already-existing bunch of spaghetti code that has passed a test suite. The problem of interpreting or reverse engineering such code is not considered a very interesting or urgent one in CS. But this problem seems to be exactly what a section of the ML community has now embarked on. As an intellectual quest, it is interesting. As a safety engineering approach for high-risk system components, I feel it has very limited potential.

Load More