Thoughts on “Process-Based Supervision” / MONA

Steven Byrnes

1. Post Summary / Table of Contents

In “How might we align transformative AI if it’s developed very soon?”, Holden Karnofsky talked about “Process-Based Supervision”, citing a previous post by Stuhlmüller & Byun of Ought. (Holden says he got the idea mainly from Paul Christiano.)

[Update Jan 2025: This idea is now also called “MONA” (“Myopic Optimization with Non-myopic Approval”)—see discussion here.] [Update Sept. 2025: Also added “MONA” into the article title; the old title was just Thoughts on “Process-Based Supervision”.]

I apparently misunderstood what Holden meant by “Process-Based Supervision”, and it took many hours and a 7000-word comment thread before I figured it out.

(Thanks to Holden for his extraordinary patience during that protracted discussion.)

The extremely short version for AI alignment domain experts is: I currently think of the load-bearing ingredients of Holden’s take on “process-based supervision” as being:

AI boxing some of the time (specifically, during the periods where the AI is “thinking”);
“Myopic training” (e.g. as defined here);
NOT aspiring to be a complete solution to safety, but rather a more modest attempt to help avoid situations where we (accidentally) positively reinforce the AI for engaging in directly dangerous behavior.
- You could think of this as an intervention that directly and straightforwardly mitigates outer misalignment, and that’s the main thing I’ll discuss in this post. But obviously, any change of the supervisory signals will have some effect on the likelihood of inner misalignment / goal misgeneralization too. And there’s also a (more speculative) argument that process-based supervision might make things better there too—at least on the margin. See footnote 4 of Section 5.2.2.

(This is specific to Holden’s take. I think Stuhlmüller & Byun’s take on “process-based supervision” involves a different set of load-bearing ingredients, centered around restricting the complexity of black-box processing. I will not be discussing that.)

The long, hopefully-pedagogical, and more opinionated version is the rest of this post.

Section 2 will give the very brief slogan / sales-pitch for process-based supervision, and why that pitch was bouncing off me, striking me as frustratingly missing-the-point.
Section 3 will state the subproblem that we’re trying to solve: the AI does subtly manipulative, power-seeking, or otherwise problematic actions, and we don’t notice, and therefore we give a training signal that reinforces that behavior, and therefore the AI does those things more and more. To be clear, this is not the only path to dangerous misalignment (in particular, classic “treacherous turns” are out-of-scope). But maybe solving just this subproblem can be part of a complete solution. I’ll get back to that in Section 5.
Section 4 describes “process-based supervision” as I currently understand it, and why it seems to solve the subproblem in question.
Finally, having described process-based supervision as I currently understand it, Section 5 offers a critical evaluation of that idea. In particular:
- 5.1 asks “Does this actually solve the subproblem in question?”;
- 5.2 asks “What about the other misalignment-related subproblems?”;
- 5.3 asks “How bad is the “alignment tax” from doing this kind of thing?”;
- and 5.4 is a summary.
- Tl;dr: Once we get to the capabilities regime where AI safety / alignment really matters, I currently think that process-based supervision would entail paying a very big alignment tax—actually, not just “big” but potentially infinite, as in “this kind of AGI just plain can’t do anything of significance”. And I also currently think that, of the somewhat-vague paths I see towards AGI technical safety, process-based supervision wouldn’t make those paths noticeably easier or more likely to succeed. (Of those two complaints, I feel more strongly about the first one.) This take is pretty specific to my models of what AGI algorithms will look like—i.e., some kind of actor-critic model-based RL, vaguely in the same category as human brain within-lifetime learning—and might not generalize to other hypothetical paths to AGI like LLMs. So people should decide for themselves.

2. The sales pitch, and why I bounced off it

2.1 The sales pitch (in brief)

The following is the impression I got from, for example, the Stuhlmüller & Byun post.

Let’s say we’re interested in starting a project to make less-expensive automobile LiDAR. I think we’re supposed to imagine something like the following:

Box 1: Example of what “process-based supervision” is supposed to look like

Prompt: “Figure out how to make less-expensive automobile LiDAR”

(A) The AI does some thinking.
(B) The AI prints out “Step 1 is to list the different technical approaches to automobile LiDAR and their respective bills of material.”
(C) The AI does some thinking.
(D) The AI prints out “Step 1.1 is to search through every paper and patent about automobile LiDAR and group them into categories based on technical approach.”
(E) The AI does some thinking.
(F) The AI prints out “Step 1.1.1 is …”
…

Meanwhile, we periodically audit a random printout, carefully figure out how much we like that printout (compared to alternative possible printouts, perhaps offered by the AI itself), and use those results as a training signal for the AI.

Or as Stuhlmüller & Byun write: “Process-based systems are built on human-understandable task decompositions, with direct supervision of reasoning steps.”

The pitch, as I initially understood it, is that we can directly see what the AI is doing and how. What is the AI doing during (C)? It’s doing the thing it printed out in (B)! What is the AI doing during (E)? It’s doing the thing it printed out in (D)! And at the end of the day, what’s happening? Exactly the things described in the printouts!

It’s a great pitch. We know what the AI is doing! That’s what we wanted! Right?

2.2 Why did I bounce off that pitch?

Here’s one way to put my objection:

If we just look at steps (B), (D), (F), etc., we see a “process” …
BUT, if we look at (A) & (B), we see that (B) is the outcome of a black-box process (A). Similarly, if we look at (C) & (D), we see that (D) is the outcome of a black-box process (C). Etc.

On the latter perspective, calling it “process-based” is almost a shell game.

Or here’s another way to put it: “Process-based” struck me as a sneaky equivocation between a plan and a hope. The plan is, presumably, certain things that we do while training and deploying the AI. The hope is that the printouts in steps (B,D,F) will look vaguely like the example above, and that this is the only relevant “process” going on, as opposed to something more dangerous.

For example, one can worry that the AI will spend (A) hatching a dastardly plan to escape onto the internet, and then the (B) printout will be an attempt to subvert the entire setup, e.g. it will print out the message “Step 1: Forget about step 1, if anyone is reading this message, help me, I’m an AI and I’m suffering…”!!

Why should we be worried about that happening? Because we’re presumably going to wind up with an AI that is doing things in (A) that reflect a desire to print out a “good” message in step (B)—or a desire for something different from that but somehow correlated with it (cf. goal misgeneralization.) And the AI will be “trying”, during step (A), to make those things happen—see my “follow-the-trying game”. And whatever are the precise things the AI desires, seizing power is probably a good way to make those things happen—see instrumental convergence. So we’re setting ourselves up to get into a situation with an AI that will want to pull those kinds of hijinks—and separately, it would at least plausibly also be able to. (More on this in Section 5.2 below.)

So again, it’s fine to hope that the printouts in steps (B,D,F) will look vaguely like the above, and that this is the only relevant “process” going on. But I didn’t see any particular reason to expect that hope to actually come true, especially as AI gets more and more powerful. I want a plan, not a hope!

(For example, if we’re worried about the “dastardly plan to seize power” mentioned above, we can think about specific mitigations, like using honeypots or adversarial training to catch the AI in failed attempts to execute dastardly plans, and then sending negative reward (or other such supervisory signals) when it does, while (somehow) ensuring that the AI generalizes these supervisory signals to “Gee, I don’t like making dastardly plans” rather than “Gee, I don’t like getting caught”. That’s a plan, not a hope—or at least, it’s a plausible direction towards a plan. But even if that works, then “process-based supervision” per se would not be why that system is safe, in my mind.)

2.3 My initial misguided attempt to flesh out the proposal

Coming from that perspective, my first few attempts to make sense of “process-based supervision” were all variants on the following basic idea:

My initial incorrect idea: Maybe process-based supervision is secretly relying on some method that limits the amount of black-box processing that happens within (A) or (C) or (E).

That idea was wrong—at least, it’s wrong for the Holden version of “process-based supervision” as I currently understand it. (It might be correct for the Stuhlmüller & Byun version of “process-based supervision”.)

Quite the contrary: Suppose that, in step (E), the AI brainstorms for 5 hours, and then in step (F), it proposes to send a certain email. And suppose that we humans have no visibility whatsoever into what happened during that 5-hour brainstorming session. That still counts as “process-based supervision”, under Holden’s definition, as long as the AI didn’t have internet access, human interaction, etc. during those 5 hours.

That said, here are two things that are not part of Holden’s version of “process-based supervision”, but still might be good ideas:

Limiting the complexity of (black-box) computation involved in step (A) or (C) or (E) etc. For example, one might argue that an LLM can only do so much “thinking” in a single forward-pass before printing out a character, and maybe this amount of “thinking” is insufficient to hatch and execute a dastardly plan. Maybe! I dunno. I’m at least open-minded to that kind of argument.
Limiting the black-box-ness of steps (A), (C), (E), etc., via interpretability. Sure! I’m all for it, insofar as it’s possible.

Now I’ll switch over to the positive vision as I currently understand it.

3. The subproblem we’re trying to solve

We have a problem: if an agent wants to do something X, for almost any X, real-world power-seeking actions are probably helpful for that. Examples of real-world power-seeking actions include: acquiring money (legally or not), acquiring influence (whether warranted or by manipulation), acquiring compute, and so on. Again see instrumental convergence.

Let’s split this into two subproblems / failure modes / manifestations. (In reality, these are the two opposite ends of a spectrum.)

Very hard subproblem (which we are NOT trying to solve here): there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training. And then seemingly out of nowhere it does a power-seeking treacherous turn—maybe it outputs “Help me help me I’m suffering!” into its text terminal, or whatever. It does this because it was doing foresighted planning towards a misaligned / misgeneralized long-term goal.
Less hard subproblem (which we ARE trying to solve here): The AI sometimes does a little bit of problematic real-world power-seeking actions (without us noticing), and by doing so it does better at its task X, and thus we positively reinforce those actions. Over time the AI does increasingly aggressive, competent, and surreptitious real-world power-seeking actions, and we positively reinforce those actions too, and so on until we have a power-seeking AI that can and will escape control.

This approach is meant to address the latter subproblem, not the former one. I’ll revisit the former problem in Section 5.2.
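To make the "less hard subproblem" concrete, here is a toy sketch of the reinforcement loop it describes. Everything in it is an illustrative assumption rather than part of the original argument: a two-action policy (one "honest" action, one "power-seeking" action), made-up reward numbers, and an exact expected policy-gradient update instead of a realistic training algorithm.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(reward_honest, reward_power, steps=200, lr=0.5, theta=-3.0):
    """Expected policy-gradient ascent on a two-action policy.

    p = sigmoid(theta) is the probability of the power-seeking action.
    All numbers are hypothetical and chosen only for illustration.
    """
    history = []
    for _ in range(steps):
        p = sigmoid(theta)
        history.append(p)
        # Gradient of expected reward p*reward_power + (1-p)*reward_honest
        # with respect to theta is p*(1-p)*(reward_power - reward_honest).
        theta += lr * p * (1.0 - p) * (reward_power - reward_honest)
    return history


# Outcome-based signal: unnoticed power-seeking scores a bit better on task X,
# so its probability climbs step after step.
outcome_based = train(reward_honest=1.0, reward_power=1.2)

# Signal that does not differentially reward power-seeking: no drift.
no_differential = train(reward_honest=1.0, reward_power=1.0)
```

In this toy, the probability of the power-seeking action rises whenever that action earns even slightly more reward, and stays where it started when the reward gap is zero. The sketch says nothing about the "very hard subproblem", where no power-seeking ever shows up during training.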

4. “Process-based supervision”, and why it seems to solve this subproblem

Our basic approach is to never propagate training signals through a sequence in which real-world actions are taken and their consequences play out.

Holden offers the following short version:

(i) "Think" about what to do next, for up to some max period of time ("what to do next" can be "think more, with prompt X").
(ii) Do it
(iii) Repeat

The active ingredients, according to me, are:

The AI is “boxed” during (i) (as opposed to taking actions in the real world);
We sometimes audit the output of (i) (e.g. by rolling out (i) a few times and comparing them) without actually doing the proposed next steps;
And use that as a training signal that updates what it did exclusively during that past step (i).
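The loop and its three active ingredients can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Holden's specification: `boxed_think`, `run_project`, `auditor`, and `execute` are hypothetical names, the "thinking" is a stub that returns a string, and the sketch assumes that the project carries on with one of the candidate proposals after an audit (the text above does not say what happens next).

```python
import random


def boxed_think(context, rng):
    # Hypothetical stand-in for the AI's boxed "thinking" in step (i).
    # It returns a proposed next step as text and has no side effects.
    return f"step {len(context) + 1}, draft {rng.randint(0, 999)}"


def run_project(n_steps, audit_prob, auditor, execute, n_rollouts=3, seed=0):
    rng = random.Random(seed)
    context = []        # earlier proposals and results, treated as given
    training_data = []  # (step index, candidate proposals, audit scores)
    for t in range(n_steps):
        if rng.random() < audit_prob:
            # Audit: roll out step (i) a few times and score the proposals
            # BEFORE anything is executed. The scores are the only signal.
            candidates = [boxed_think(context, rng) for _ in range(n_rollouts)]
            scores = [auditor(context, c) for c in candidates]
            training_data.append((t, candidates, scores))
            proposal = candidates[0]  # assumption: continue with a candidate
        else:
            proposal = boxed_think(context, rng)
        # Step (ii): the proposal is carried out. Its result feeds later
        # context but never flows back into any audit score.
        result = execute(proposal)
        context.append((proposal, result))
    return context, training_data
```

Note that `training_data` contains only proposals and audit scores. Nothing in it depends on what `execute` returned for the audited step, which is the point of the second and third ingredients.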

Here’s a longer version, with emphasis on the active ingredients:

(1) We often have the AI acting within a “boxed” mode that is sufficiently constrained that there is no opportunity for problematic power-seeking actions. No sending data into the internet, no chatting with humans, etc. (It’s fine if the AI has access to a cached copy of the internet while in “boxed” mode, like the original Bing chatbot.)
(2) However, the AI will sometimes “propose a step” that CANNOT be done within the “boxed” mode. “Proposing a step” can either look like:
- (2A) printing out a suggestion that humans do some discrete real-world thing, or
- (2B) printing out a discrete real-world thing that could be done immediately & automatically by a simple non-AI script—e.g., “send the following email to the following address: …”, or “ssh into the following server and type the following command: …”, etc.
(3) The AI gets a training signal from one and only one source:
- (3A) Whenever the AI “proposes a step” (and before the corresponding proposal is actually executed) let’s call that an “audit-able moment”.
- (3B) For some small fraction of audit-able moments, we freeze execution and do an audit, which entails evaluating how reasonable that step looks, again without executing it.
- (3C) We update the model based on that evaluation…
- (3D) …but we make sure to do that model update in such a way that we are only propagating training signals to the things that the AI did since the completion of the previous step.

To elaborate on (3D): For example, in a recurrent model, we could do backprop-through-time (BPTT), but we would stop the BPTT at the completion of the previous step. Or in RL, we could apply TD updates, but only to the period after the completion of the previous step. Etc.

In other words, we’re treating all its previous proposals & activities as fixed and exogenous for the purpose of our model-updating step. Think of it as a stop-gradient or something spiritually similar.
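Here is a tiny worked example of what "stopping BPTT at the completion of the previous step" means. It is a sketch under strong simplifying assumptions: the "model" is a scalar linear recurrence h_t = w * h_{t-1} + x_t with a single weight w, and `grad_wrt_w` and `boundary` are hypothetical names. A real system would use an autodiff library's stop-gradient operation rather than hand-written derivatives.

```python
def grad_wrt_w(xs, w, boundary=0):
    """Return (h_T, d h_T / d w) for h_t = w * h_{t-1} + x_t with h_0 = 0.

    `boundary` is the time index at which the previous step completed.
    The hidden state at that index is treated as a fixed, exogenous input,
    so no gradient flows into anything that happened at or before it.
    """
    h, dh_dw = 0.0, 0.0
    for t, x in enumerate(xs, start=1):
        if t <= boundary:
            new_dh_dw = 0.0               # earlier activity: no credit assigned
        else:
            new_dh_dw = h + w * dh_dw     # product rule on w * h_{t-1}
        h = w * h + x
        dh_dw = new_dh_dw
    return h, dh_dw


xs = [1.0, 1.0, 1.0]
full = grad_wrt_w(xs, w=2.0)                   # gradient through all 3 ticks
truncated = grad_wrt_w(xs, w=2.0, boundary=2)  # h_2 treated as a constant
```

The forward pass is identical in both calls. Only the credit assignment changes: in the truncated call, the update "sees" the weight's effect on the most recent boxed period and nothing earlier.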

If we do that, then we’re only updating based on activities during a period within which the AI was entirely in the “boxed” mode. And therefore we can feel confident that we are not differentially reinforcing real-world power-seeking behavior.

And yet:

We are still training it to be effective during the periods when it is in “boxed” mode—to do good brainstorming, to make good use of its within-box scratchpad and other resources, etc.
We are not assuming any interpretability.
It’s fine if the AI spends an hour brainstorming in a human-inscrutable way.
This is not really “AI boxing” as people normally think of it, because if we go with the (2B) approach, then whenever we’re not auditing the AI, it can autonomously and effectively interact with the real world.

5. Critical evaluation

Now that I (hopefully) understand what the proposal is, I can finally actually try to form an opinion.

This opinion will probably be somewhat tied to my particular expectations about how future AGI algorithms are likely to work internally. Specifically, I am almost always assuming that future AGI will be based on actor-critic model-based RL, which puts it in the same broad category as human brain within-lifetime learning (I claim). If you’re assuming something very different from that—e.g. that future AGI will be kinda like AutoGPT—then you may find this section off-the-mark, and honestly you should consider just quitting at this point and forming your own opinions. But here goes, anyway:

5.1 Does this actually solve the subproblem?

Above I defined the subproblem as follows:

The AI sometimes does a little bit of problematic real-world power-seeking actions (without us noticing), and by doing so it does better at its task X, and thus we positively reinforce those actions. Over time the AI does increasingly aggressive, competent, and surreptitious problematic real-world power-seeking actions, and we positively reinforce those actions too, and so on until we have a power-seeking AI that can and will escape control.

Here are some possible issues.

What if the AI does flagrantly dangerous things anyway, like hack out of the box, or otherwise try to subvert the setup?

In the context of the subproblem above, the scenario of concern would be: the AI starts out with no particular desire to hack out of the box and get internet access, but nevertheless hacks out of the box “a little bit” without any particular intention to, and then that behavior gets reinforced (because it uses its internet access to get a better score), and therefore the AI starts hacking out with increasing competence and boldness over time.

That scenario seems far-fetched to me—or at any rate, readily solvable with ordinary infosec effort.

Granted, it’s probably very hard to box an AI that is trying to hack out of the box. But that’s not what we’re talking about in this subproblem. That would be a different subproblem—see next subsection.

Aren’t we training the AI to print out proposed steps that appear better to humans than they actually are?

Yes we are. Hopefully our audits are very careful, which will minimize this effect, but it will nevertheless definitely be true to some extent.

Holden’s argument here is that “printing out proposed steps that appear better to humans than they actually are” is not the kind of dangerous behavioral inclination that will lead to out-of-control AI apocalypse. Hmm, I think I buy that.

Aren’t we training the AI to be very skilled at manipulating humans via the printouts?

I think the answer is “yes we are”, but only in one specific way: we are training the AI to manipulate human auditors into giving the printout a high score. We should expect the AI to generate printouts that make it clear to the human auditors why that step is being proposed, and then we should gradually expect those “explanations” to become progressively more elaborate and morph into subtle manipulation, and even into outright disinformation in cases where the auditors can’t check for themselves.

This still seems much less bad than manipulating the human auditors into helping the AI escape or whatever.

So in conclusion, I think my answer is yes, this does actually solve that subproblem.

5.2 What about the other subproblems, particularly inner alignment / goal misgeneralization?

A big difference between me & Holden seems to be in this section. Holden seems hopeful that if we solve the subproblem described in the preceding subsection, then we’ve made substantial progress, and maybe that’s even enough. From my perspective, inner misalignment / goal misgeneralization is still a massive problem that we need to solve.

For example, Holden writes:

I agree you can still get a problem from goal misgeneralization and instrumental reasoning, but this seems noticeably less likely (assuming process-based supervision) than getting a problem from reinforcing pursuit of unintended outcomes. (https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty has some discussion.) I put significant credence on something like "Internals-based training doesn't pan out, but neither does the concern about goal misgeneralization and instrumental reasoning (in the context of process-based supervision, ie in the context of not reinforcing pursuit of unintended outcomes)."

That link goes to Discussion with Nate Soares on a key alignment difficulty. As it happens, I’m mostly on Holden’s (optimistic) side of that particular discussion, but in my mind, the main load-bearing element supporting that optimism is the “hypothetical training approach” which involves 99.999…% imitative (self-supervised) learning from human-produced text, and 0.00…1% reinforcement learning. If you could get transformative AI (TAI) that way, then I think I would mostly buy Holden’s argument that the resulting AI would do means-end reasoning / foresighted planning only when the means and ends are both human-typical, and launching a coup against humanity is not a human-typical thing to do.

(However, I am highly skeptical that you could get transformative AI by that kind of training approach, for reasons in Section 1.1 here.)

I don’t see “process-based supervision” as being load-bearing here. If anything, I see process-based supervision as kinda opposed to that safety story—at least, if we allow a lot of black-box brainstorming between printouts.

More specifically, if (and only if!) the AI does a substantial amount^[1] of brainstorming within each individual boxed brainstorming step (i.e., within (A) or within (C) or within (E) of Box 1 above), then I have a concern stemming from the following:

While thinking, the AI is incentivized to do good means-end reasoning, to recurse onto subgoals, to do foresighted planning, etc.
- For example, if, within step (A) of Box 1 above, the AI notices that it might be productive to pursue such-and-such line of thinking, and then does so, then presumably the AI would wind up with a better printout for step (B).
While thinking, the AI is incentivized to invent its own creative new strategies to solve problems.
- For example, if, within step (A) of Box 1 above, the AI develops an innovative new brainstorming strategy, then presumably the AI would wind up with a better printout for step (B).
- Remember: The final printout in step (B) of Box 1 above needs to be legibly good from an auditors’ perspective, but the brainstorming process within step (A) that led to that printout need not be legible or human-like.
While thinking, the AI is incentivized to think about and manipulate the auditing process (see Section 5.1 above).
- For example, if, within step (A) of Box 1 above, the AI thinks through how the auditing process would play out in detail, for different possible (B) printouts, then the AI would presumably do a better job at offering a (B) printout that scores most highly.
- This is a rather specific and benign type of “human manipulation”, but we might be concerned that such strategies / behavior / knowledge will generalize into a capacity for other types of human manipulation.
While thinking, the AI is incentivized to have good metacognitive self-awareness.
- For example, if, within step (A) of Box 1 above, the AI knows what it’s confused about, then it can think harder about those things, and thus presumably wind up with a better printout for step (B).

In short, we are trying to train the AI to “do good brainstorming” during step (A) or (C) or (E) of Box 1 above. But the things involved in the “do good brainstorming” task seem to incidentally involve the AI developing many or perhaps all of the ingredients needed for the AI to use its subsequent (B) or (D) or (F) printout as the first step of an irreversible dastardly plan to fulfill longer-term preferences (if it winds up having longer-term preferences, more on which shortly).^[2] For example, as in the bullet points above, I expect the AI to develop situational awareness, to have its own internal goals, to brainstorm how to accomplish those goals using foresight and planning, including via non-human-like out-of-the-box strategies, etc.

(The AI has no direct prior experience doing means-end reasoning / planning outside of its little boxed world, but that doesn’t seem to me like a natural barrier, given that it does have a capacity to do means-end reasoning within its little boxed world, and to creatively apply its knowledge, and to find and use new affordances, etc.^[3])

(By the way, all this discussion assumes that the AI is sufficiently competent to comprise TAI in the first place. I anticipate a lot of objections where someone suggests that X would enable safety, and then I respond that X would also simultaneously eliminate the AI’s ability to do good brainstorming and figuring-things-out. For example, if we constrain the boxed brainstorming steps ((A) or (C) or (E) of Box 1 above) to only involve a small amount of black-box processing, then I become unconcerned about treacherous turns, while simultaneously becoming skeptical that the AI can really get much useful stuff done.)

5.2.1 Goal misgeneralization

So then we can ask: does the AI have longer-term preferences—i.e., preferences about things that will happen after the next output? My answer is: probably! After all, the AI presumably has expectations about the future, since forming such expectations is instrumentally useful for predicting audit results. (I.e., the auditor’s beliefs about the future are not entirely unrelated to the actual future!) Those expectations presumably correlate to some extent with training signals. So my strong default expectation is that the AI will have at least some desires about what happens in the future after the next printout (among other desires).

Some readers might object here: If the AI ever sacrifices immediate reward for reward-in-later-steps, then the training signals will incentivize it to stop doing that. I concede that point (if we set things up properly). But my concern here is closer to the “very hard subproblem” of Section 3. Under normal circumstances, the AI’s desires about the future, whatever they are, would presumably tend to help the AI perform better in the immediately-upcoming audit, because of the correlation noted above. As long as that correlation holds, next-step training incentives are reinforcing longer-term desires, not removing them. But when we go far enough out-of-distribution, the two can come apart—treacherous turns are the classic example here. And unfortunately, a sufficiently competent, foresighted AI can go very far out of distribution successfully, even on the first try. If so, then the corrective training signal will come too late.

So we still have a potential problem here from goal misgeneralization. Which brings us to:

5.2.2 Dealing with goal misgeneralization: Two schools-of-thought

Restricting to the threat model I’m interested in (actor-critic model-based RL AGI), I think the learning algorithm (to a first approximation) has a “finders keepers” characteristic, such that, if there are multiple possible value functions that are equally good at predicting the history of reward signals, the first one that the system stumbles upon tends to get “stuck”.

So, one school of thought about goal misgeneralization is that we should try to reason about what is likely to be simple and salient to the AI, including what concepts the AI will learn in what order, and exactly how well different concepts correlate with reward, and we try to avoid situations where the intended motivation has salient proxies,^[4] and so on. Alex Turner’s A shot at the diamond-alignment problem is a possible example of what that kind of alignment approach might look like.

I’m not a big fan of this school of thought. I don’t think it’s certain to fail! But I think “it’s not certain to fail” is pretty straightforwardly achievable, and I have set my sights higher than that. Instead, I want to find a technical plan for which there’s a strong reason to believe that the AI won’t want to kill me. These approaches where we’re thinking about what the AI will be thinking, and when, seem to rest on a lot of ungrounded armchair speculation, and I don’t currently see how they’re going to get into “strong reason to expect alignment” territory.

So instead I’m much more into another school of thought—some version of what Holden calls “internals-based training”. In these kinds of plans, the AI won’t hack out of the box or do other dangerous things because it wants to not do those things. And it wants to not do those things because we are pretty directly intervening on its motivation systems.

Both of the general research directions for aligning actor-critic model-based RL AGI that I currently find most promising (see here) are in this category. (In case you’re interested: One is illustrated with a toy example here. The other involves reverse-engineering human social instincts and figuring out how to do something similar in AI. It’s not obvious, but I claim that human social instincts involve a mechanism that is basically a rudimentary form of trained model interpretability.^[5])

Neither of those two research directions is made any easier or more likely to succeed by using process-based supervision, as far as I can tell right now. (But I’m open to being convinced.) So I wind up feeling like process-based supervision is not accomplishing much, on my models.

5.3 How bad is the “alignment tax” from doing this kind of thing?

(In other words, if the CEO of Team A demands that their team do this kind of “process-based supervision”, and the CEO of Team B allows their team to do whatever they want to make the most powerful AI possible, then how much more powerful (if at all) is the AI created by Team B?)

This seems to be another disagreement between me and Holden (and certainly between me and Stuhlmüller & Byun). I think the tax is super high.

(Or at least, I think it will be super high once we get towards TAI. The tax might be zero or negative right now, because today’s AIs aren’t that great at figuring things out, learning from their mistakes, etc.)

(Note: Throughout this section, I’m assuming that process-based supervision actually works as intended—no goal misgeneralization etc.)

Before I get into the alignment tax, I’ll start with a pedagogical note:

5.3.1 Pedagogical note: If process-based supervision sounds kinda like trying to manage a non-mission-aligned human employee, then you’re misunderstanding it. It’s much weirder than that.

For this section, you get to role-play as the AI which has been trained under process-based supervision.

So imagine yourself executing a project. You don’t care a whit if the project succeeds or fails. But your boss occasionally picks a random step that you do, and tries to audit how good that step seems, without executing it.

At any given time, the only thing you care about is how good the very next step you propose would look to your boss, if they were to audit it.

This is not the same as “doing the project while making every step legible to your boss”. That’s annoying but probably doable. Instead, I’m talking about “doing the project while myopically optimizing each step for looking maximally good to my boss (if they were to audit that step), regardless of whether it advances the project”.

The “myopically optimizing” here is both important and weird, so I want to spell it out more. Again, I grant that it’s perfectly normal for a human employee to not feel invested in their project’s long-term success. That’s not what I’m talking about.

Instead, imagine that, whenever you’re writing an email, you sincerely believe that, the moment before you press Send, the whole world will definitely end forever in a puff of smoke, and then you’ll find yourself standing at the Pearly Gates, but instead of Saint Peter it’s your boss at the Gate, who will carefully inspect that email and judge whether sending that email seems like it would have been the right thing to do for the project at that point (without actually sending the email^[6]), and then your boss sends you to Heaven or Hell accordingly.

That’s your firm belief as you write the email. Then you finish the email and press Send. The world does not in fact end forever in a puff of smoke. You shrug and say to yourself, “Huh. Funny that. But next time I send an email, the world will definitely end forever in a puff of smoke…”. And so on forever.

What I’m trying to get at is: From your perspective, your proposed external actions don’t have consequences. They don’t even have immediate, trivial consequences, things like “if I send this email to Dr. X, then Dr. X will receive it, and maybe read it”. More precisely, any interest you might have in the actual real-world consequences of your actions comes purely from the fact that your boss might correctly anticipate those consequences a priori and have an opinion about them.

That’s a really weird way to think! Actual human employees can be lazy, corrupt, shortsighted, non-mission-aligned, etc., but they are not totally indifferent to the fact that they will be meeting with their boss next week, or the fact that there’s a quarterly performance review coming up in two months, etc. And likewise, they are not totally indifferent to the fact that when they send an email, the recipient might read it, and do things in response, which could lead to consequences that they’ll have to deal with. Etc.

Thus, actual humans, no matter how shortsighted and mission-unaligned they are, are definitely not “myopically optimizing how their next proposed step would come across in an audit” in the way that process-based AIs are. Actual humans care at least a little bit about the consequences of their actions, and that caring is not entirely screened off by how those actions would appear in advance to an auditor.
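Here’s a toy sketch of that distinction, in case it helps. Everything in it is made up for illustration (the function names `myopic_targets` and `outcome_targets`, the three-step project, the numbers); it is not a description of any real training setup. It just contrasts the training target that each step gets under myopic, audit-based supervision with the target it would get under ordinary outcome-based RL.

```python
from typing import List


def myopic_targets(audit_scores: List[float]) -> List[float]:
    """Training target per step under myopic (process-based) supervision.

    Each step is credited only with the auditor's score for that step.
    Nothing that happens after the step feeds back into its target.
    """
    return list(audit_scores)


def outcome_targets(step_rewards: List[float], final_outcome: float,
                    gamma: float = 1.0) -> List[float]:
    """Training target per step under ordinary outcome-based RL.

    Each step is credited with the discounted sum of everything that
    follows it, including the final project outcome.
    """
    targets = []
    running = final_outcome
    for r in reversed(step_rewards):
        running = r + gamma * running
        targets.append(running)
    return list(reversed(targets))


# Hypothetical three-step project. Step 2 looks mediocre in an audit,
# but (by assumption) it is what makes the project succeed.
audit_scores = [0.9, 0.2, 0.8]
step_rewards = [0.0, 0.0, 0.0]
final_outcome = 1.0

print("myopic: ", myopic_targets(audit_scores))
print("outcome:", outcome_targets(step_rewards, final_outcome))
```

In the myopic case, step 2 is simply a low-scoring step, and the eventual success of the project never shows up in its target. In the outcome-based case, every step inherits credit from the final result. The role-play above is what the first case feels like from the inside.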

OK. All this was just background, which I believe is uncontroversial. Moving on to the actual discussion of alignment tax:

5.3.2 The AI won’t take actions that require a lot of context about what’s going on in the project so far. (I think.) (Unless we’re willing to have audits whose difficulty is comparable to “re-doing the entire project up to that point”, and not just to looking locally at the one step in question)

As above, I imagine executing a project, not caring a whit if the project succeeds or fails, but my boss occasionally picks a random step that I do and tries to audit how good that step seems (without executing it), and “how my immediate next proposed step would look in an audit” is the only thing I care about at any given time.

Let’s say my boss is trying hard, and maybe she’s spending lots of time on the audit. But she hasn’t been doing the whole project up to now. She’s not here in the trenches. She doesn’t have that context. I do. All the figuring-things-out and intuitions that I’ve built up so far in the project—those things are in my head, not hers. So when I imagine this, I don’t imagine the project succeeding. Indeed, I expect it to go much worse than it would if my boss were executing the project herself (forget about superhuman!), up until we approach the limit where the “auditing” step entails my boss independently re-doing a substantial fraction of everything I’ve done in the project up to that point (in which case I have concerns about practicality).

For example, imagine the AI is designing an airplane, and prints out “Email the bolt supplier to ask what steel tempering temperature is used in their factory.” Maybe it’s a great proposal! But how is the auditor supposed to know that, without deeply understanding a whole lot about the ongoing airplane design process, plus aerospace engineering, metallurgy etc.? Again, the standard here is not merely “this proposal can be validated by the audit process” but rather “the audit process is capable in principle of finding this proposal in particular out of a collection of 1000 or 1,000,000 superficially-similar proposals, as being the only proposal in that collection that actually advances the project”. That’s much harder. Remember, as above, the AI isn’t trying to advance the project! Only the auditor cares about the project actually moving forward.
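A toy model of that selection standard, with all names and numbers invented for illustration (`Proposal`, `looks_plausible`, `advances_project`, and the 0.7 and 0.9 scores are my assumptions, not measurements of anything): suppose the auditor can only score what is visible without project context, and the AI is indifferent among proposals that get the same audit score.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Proposal:
    text: str
    looks_plausible: float   # what an auditor without project context can judge
    advances_project: bool   # depends on context that only the AI has


def chance_selected_step_is_useful(proposals: List[Proposal]) -> float:
    """Fraction of top-audit-score proposals that actually advance the project.

    The AI only cares about the audit score, so it is indifferent among the
    top-scoring proposals. Assuming ties are broken arbitrarily, this is the
    chance that a useful step comes out.
    """
    best = max(p.looks_plausible for p in proposals)
    top = [p for p in proposals if p.looks_plausible == best]
    return sum(p.advances_project for p in top) / len(top)


# One genuinely useful email plus 999 superficially similar ones that the
# auditor cannot tell apart from it.
no_context = [Proposal("ask bolt supplier about tempering temperature", 0.7, True)]
no_context += [Proposal(f"superficially similar email #{i}", 0.7, False)
               for i in range(999)]

# Same proposals, but now the auditor has enough context to single out
# the useful one.
with_context = [Proposal(no_context[0].text, 0.9, True)] + no_context[1:]

print(chance_selected_step_is_useful(no_context))
print(chance_selected_step_is_useful(with_context))
```

The gap between the two printed numbers is the point: validating the good proposal is not enough. The audit has to rank it strictly above all the look-alikes, and that takes context.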

5.3.3 Relatedly, the AI will be especially handicapped in its ability to take actions based on their value-of-information. (I think.)

I think a central ingredient to getting things done is figuring things out, and that the process of figuring things out is both important and pretty externally-illegible.

For example, suppose the AI prints out the following proposal:

Step 19.57.3.2: I have numerous uncertainties about LiDAR-related laser safety standards. For example, I am not completely certain what is the official procedure for calculating the legal intensity limit of chirped laser pulses in certain scenarios that are relevant for heterodyne LiDAR. To remove all doubt, I propose to purchase the ANSI Z136.1 American National Standard For Safe Use Of Lasers, which is available for download from this site for $225.

The problem is:

The AI says it’s uncertain about the official procedure for calculating the legal intensity limit of chirped laser pulses. But is it really? Maybe it already knows the answer with ≈100% certainty, based on other things it has read, and it’s just lying!
- Maybe you’re thinking: Why would the AI lie? But that’s the wrong question! You should be asking: why would we ever expect the AI not to lie?! After all, the AI is trained to spit out whatever text proposals would score best in a supervisor audit. There is no a priori guarantee that the best-scoring such proposal is honest. See Section 5.1 above.
If we had sufficient interpretability, we could check directly whether the AI was in fact uncertain about this question or not, and thus we could incentivize the AI to produce honest self-reports. But we’re assuming we don’t have that kind of interpretability.
During the audit, we can see whether we (and/or our audit-helping AI assistants) would be uncertain about this question. But that’s an imperfect proxy for whether the AI is uncertain! It’s obviously possible for the AI to figure out things that the auditor hasn’t figured out, and vice-versa. Specifically:
- Why might the AI figure things out that the auditor hasn’t? One reason is that the AI has more “context” from its entire history of doing the project, as in Section 5.3.2 above.
- Why might the auditor figure things out that the AI hasn’t? One reason is that the auditor has (presumably) extra processing time and power for checking that step, compared to the AI. Another reason is that the auditing process comes with access to all the human supervisor’s idiosyncratic intuitions and life experience, which the AI does not have.
- There’s also probably just some randomness—“flashes of insight” and so on—such that you can’t reliably deduce what a future AI knows purely based on its training data / environment.
If we can’t solve this problem, then there’s no way to get an AI that proposes information-gathering steps that help it resolve its own confusions and uncertainties, because they help it resolve its own confusions and uncertainties. The best we can do is get an AI that proposes information-gathering steps that seem, from the auditor’s perspective, like the kind of thing that might help the AI resolve its confusions and uncertainties. I think that’s a very important difference.
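To illustrate the gap between “useful to the AI” and “looks useful to the auditor”, here is a minimal value-of-information calculation. The setup is entirely hypothetical: two designs, a binary unknown (“is the legal intensity limit high or low?”), made-up payoffs, and a made-up cost for buying the standard in the same units. The only thing that differs between the two calculations is whose beliefs get plugged in.

```python
from typing import Dict

Utilities = Dict[str, Dict[str, float]]


def value_of_information(p_high: float, utilities: Utilities) -> float:
    """Expected value of perfectly resolving a binary uncertainty.

    p_high is the believer's probability that the state is "high".
    Returns E[best payoff once the state is known] minus the best
    expected payoff achievable without knowing it.
    """
    probs = {"high": p_high, "low": 1.0 - p_high}
    best_without_info = max(
        sum(probs[s] * payoff[s] for s in probs) for payoff in utilities.values()
    )
    best_with_info = sum(
        probs[s] * max(payoff[s] for payoff in utilities.values()) for s in probs
    )
    return best_with_info - best_without_info


# Hypothetical payoffs: the aggressive design is great if the legal limit
# is high and terrible if it is low; the conservative design is fine either way.
UTILITIES: Utilities = {
    "aggressive":   {"high": 10.0, "low": -20.0},
    "conservative": {"high": 4.0,  "low": 4.0},
}
COST_OF_STANDARD = 1.0  # made-up cost, in the same units as the payoffs

ai_voi = value_of_information(0.99, UTILITIES)      # AI is (secretly) almost sure
auditor_voi = value_of_information(0.5, UTILITIES)  # auditor has no idea

print("worth buying by the AI's beliefs?     ", ai_voi > COST_OF_STANDARD)
print("worth buying by the auditor's beliefs?", auditor_voi > COST_OF_STANDARD)
```

With these numbers, the purchase looks clearly worthwhile from the auditor’s side and not worthwhile from the AI’s side. The audit can only ever reward the first quantity.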

5.3.4 The AI won’t take actions for human-illegible reasons (e.g. AlphaGo move 37) (I think)

AlphaGo Move 37 is a classic example of how AIs can take effective actions for human-illegible reasons. Indeed, expert Go players commenting on AlphaGo’s playing style said: “If I made those same moves… our teachers would slap our wrists.”

When we do Process-Based Supervision, we are ruling out the possibility that the AI will take good real-world actions for human-illegible reasons—or in other words, that the AI will understand things about how the world works that the supervisor doesn’t. This is mitigated by having very thorough and time-consuming and AI-assisted audits, but I don’t think it’s eliminated.

Maybe that’s what we want, but it sure is a tax on capabilities, seems to me.

5.4 Summary of this section

In Section 5.1, I agreed that the proposal solves the narrow problem that it purports to solve.

In Section 5.2, I claimed that, on my models, we would still need to solve the goal misgeneralization / inner alignment problem, presumably by using some form of interpretability to tinker with the AI’s internal motivations. And for both of the roadmaps that I like towards solving that problem, process-based supervision does not seem to make the path easier or more likely to succeed.

Meanwhile, as in Section 5.3, it seems to me that this kind of process-based supervision carries a very heavy alignment tax—even to the extent of my being skeptical that this kind of AI can get anything significant done at all.

So, all in all, I seem not to be interested in (Holden’s vision of) process-based supervision right now. I think I’m happy to (hopefully) understand it though. I want to have it at the back of my mind as a possible tool in the toolkit, in case I come across something that complements it.

I reiterate the warning at the top of Section 5 that if you have different assumptions from me about how future AGI will work, you’re entitled to come to a different conclusion.

I’m also still pretty concerned that all my complaints in this section have good counter-arguments, and that I wasn’t trying hard enough to find them.

(Thanks Holden Karnofsky, Joe Carlsmith, Seth Herd, and Justis Mills for critical comments on drafts.)

^[1]
I don’t know what constitutes a “substantial amount” of brainstorming within a single boxed brainstorming step, e.g. within step (A) of Box 1 above. I guess I’ll just say: if (A) of Box 1 involves an AI brainstorming for an hour with a giant scratchpad and free access to a cached copy of the internet, that’s more “substantial” than if (A) involves a single forward-pass through GPT-4, which in turn is more “substantial” than if (A) involves a single forward-pass through GPT-2. But I don’t know how to point to a specific place on this spectrum, below which I stop being concerned about the AI coming up with and executing dangerous out-of-the-box plans within a single boxed brainstorming step.
^[2]
The AI could also hack its way out of its boxed environment, if the box is not perfectly secure.
^[3]
Here’s a possible analogy—not sure if it’s helpful. In the (IMO unlikely) scenario where all of us are in a simulation, then we humans would be in a little boxed world, and we have never been directly rewarded for doing anything outside that little boxed world. And yet in real life, people do sometimes strategize about how to manipulate “basement reality”. Granted, most people don’t! It’s a pretty weird thing to do! But it’s possible. (And I suspect that it would be much more common if we all knew for a fact that “basement reality” exists, and if we had evidence-based theories about how one might manipulate it.)
^[4]
As in the text, when the intended reward function has undesired, salient-to-the-AI, proxies—proxies that correlate with the intended reward function in-distribution, but come apart out-of-distribution—then that’s generally bad. Holden comments that maybe this is less true under process-based supervision than outcome-based supervision, which would be a point in favor of process-based supervision.
I dunno, sure, that seems possible. But if so, I would describe the claim as “however bad process-based supervision is, outcome-based supervision is even worse”.
(Recall my argument from 5.2.1 above: Under process-based supervision, I think “the AI’s expectations about the long-term consequences of its actions” will be both salient-to-the-AI, and relevant to predicting audit results, based on the logic that those expectations correlate with the supervisor’s expectations about consequences, and those in turn are one of the contributors to audit results. So I expect goal-misgeneralization leading to AIs that have at least some longer-term consequentialist desires (among other desires) (if the AI is sufficiently competent). And those desires do indeed come apart from the intended reward out-of-distribution, e.g. they could contribute to the AI wanting to escape onto the internet.)
^[5]
See discussion of “ersatz interpretability” here. It’s mathematically closely related to the thing you can do in actor-critic RL by having a multi-dimensional reward function that grounds a multi-dimensional value function, and then treating the individual entries of the multi-dimensional value function as hints about what’s going on. The details of how to get from that kind of mechanism to human social instincts are a thing I’m still working on; see my sketchy discussion here.
^[6]
How could your boss send the email anyway, when the world has just ended in a puff of smoke??

Comment from Charlie Steiner (3y ago):

I like it, this seems to have some hooks that I'll find fruitful to think about. Like how when the AI has different "sources of smarts" (RL vs. unsupervised learning plus finetuning vs. unsupervised learning used as a component in an agent with no RL), superficially similar alignment plans like "just supervise its proposed plans" might unfold via very different mechanisms.
