# 22

Another stab at explaining Don't design agents which exploit adversarial inputs. This is not the follow-up post mentioned therein. That post will come next.

### More precise title: "Don't try directing a superintelligence to maximize your valuations of their plans using a consequentialist procedure."

After asking several readers for their understandings, I think that I didn't successfully communicate my points to many readers. I'm now trying again, because I think these points are deeply important. In particular, I think that my arguments rule out many target AI motivational structures, including approval-directed agents (over a rich action space), approval-based amplification (if the trained agent is supposed to be terminally motivated by the amplified overseer's ratings), and some kinds of indirect normativity.

# Background material

One motif in some AI alignment proposals is:

• An actor which proposes plans, and
• A grader which evaluates them.

For simplicity, imagine we want the AI to find a plan where it makes an enormous number of diamonds. We train an actor to propose plans which the grading procedure predicts lead to lots of diamonds.

In this setting, here's one way of slicing up the problem:

Outer alignment: Find a sufficiently good grader.

Inner alignment: Train the actor to propose plans which the grader rates as highly possible (ideally argmaxing on grader output, but possibly just intent alignment with high grader output).

This "grader optimization" paradigm ordains that the AI find plans which make the grader output good evaluations. An inner-aligned actor is singlemindedly motivated to find plans which are graded maximally well by the grader. Therefore, for any goal by which the grader may grade, an inner-aligned actor is positively searching for adversarial inputs which fool the grader into spitting out a high number!

In the diamond case, if the actor is inner-aligned to the grading procedure, then the actor isn't actually aligned towards diamond-production. The actor is aligned towards diamond-production as quoted via the grader's evaluations. In the end, the actor is aligned to the evaluations.

## Clarifications

1. Grader-optimization is about the intended agent motivational structure. It's about a trained agent which is trying to find plans which grade highly according to some criterion.
1. Grader-optimization is not about grading agents when you give them reward during training. EG "We watch the agent bump around and grade it on whether it touches a diamond; when it does, we give it +1 reward." This process involves the agent's cognition getting reshaped by policy gradients, e.g. upon receipt of +1 reward.
2. In policy gradient methods, reward chisels cognitive circuits into the agent. Therefore, the agent is being optimized by the reward signals, but the agent is not necessarily optimizing for the reward signals or for any grader function which computes those signals.
2. Grader-optimization is not about the actor physically tampering with e.g. the plan-diamondness calculator. The grading rule can be, "How highly would Albert Einstein rate this plan if he thought about it for a while?". Albert Einstein doesn't have to be alive in reality for that.

These will be elaborated later in the essay.

I'm going to try saying things, hoping to make something land. While I'll mostly discuss grader-optimization, I'll sometimes discuss related issues with argmaxing over all plans.

An agent which desperately and monomaniacally wants to optimize the mathematical (plan/state/trajectory)  (evaluation) "grader" function is not aligned to the goals we had in mind when specifying/training the grader (e.g. "make diamonds"), the agent is aligned to the evaluations of the grader (e.g. "a smart person's best guess as to how many diamonds a plan leads to").

Don't align an agent to evaluations which are only nominally about diamonds, and then expect the agent to care about diamonds! You wouldn't align an agent to care about cows and then be surprised that it didn't care about diamonds. Why be surprised here?

Grader-optimization fails because it is not the kind of thing that has any right to work. If you want an actor to optimize  but align it with evaluations of , you shouldn't be surprised if you can't get  out of that. In that situation, the actor doesn't give a damn about diamonds,[1] it cares about evaluations.

Rounding grader-optimization off to "Goodhart" might be descriptively accurate, but it also seems to miss useful detail and structure by applying labels too quickly. More concretely, "grade plans based on expected diamonds" and "diamonds" are not even close to each other. The former is not a close proxy for the latter, it's not that you're doing something which almost works but not quite, it's just not a sensible thing to even try to align an AI on.

We can also turn to thought experiments:

1. Consider two people who are fanatical about diamonds. One prefers pink diamonds, and one prefers white diamonds. AFAICT, their superintelligent versions both make diamonds.
2. Consider an AI aligned to evaluations of diamonds, versus the person who prefers white diamonds. AFAICT, the AI's superintelligent version will not make diamonds, while the person will.

Why? There's "goal divergence from 'true diamond-motivation'" in both cases, no? "The proxies are closer in case 1" is a very lossy answer. Better to ask "why do I believe what I believe? What, step-by-step, happens in case 1, compared to case 2? What mechanisms secretly generate my anticipations for these situations?"

We should not be constructing a computation that is trying to hurt us. At the point that computation is running, we've already done something foolish--willfully shot ourselves in the foot. Even if the AI doesn't find any way to do the bad thing, we are, at the very least, wasting computing power.

[...] If you're building a toaster, you don't build one element that heats the toast and then add a tiny refrigerator that cools down the toast.

In the intended motivational structure, the actor tries to trick the grader, and the grader tries to avoid being tricked. I think we can realize massive alignment benefits by not designing motivational architectures which require extreme robustness properties and whose parts work at internal cross-purposes. As I wrote to Wei Dai:

Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you're also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I'd have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I'd have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.

It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren't argmaxing. You're using resources effectively.

For example, some infohazardous thoughts exist (like hyper-optimized-against-you basilisks) which are dangerous to think about (although most thoughts are probably safe). But an agent which plans its next increment of planning using a reflective self-model is IMO not going to be like "hey it would be predicted-great if I spent the next increment of time thinking about an entity which is trying to manipulate me." So e.g. a reflective agent trying to actually win with the available resources, wouldn't do something dumb like "run argmax" or "find the plan which some part of me evaluates most highly.

Strong violation of the non-adversarial principle suggests that grader-optimization and argmax-over-all-plans are deeply and fundamentally unwise.

This isn't to say that argmaxing over all plans can't be safe, even in theory. There exist robust Platonic grader functions which assign highest expected utility to a non-bogus plan which we actually want. There might exist utility functions which are safe for AIXI to argmax.[2]

We are not going to find those globally-safe Platonic functions. We should not try to find them. It doesn't make sense to align an agent that way. Committing to this design pattern means committing to evaluate every possible plan the AI might come up with. In my opinion, that's a crazy commitment.

It's like saying, "What if I made a superintelligent sociopath who only cares about making toasters, and then arranged the world so that the only possible way they can make toasters is by making diamonds?". Yes, possibly there do exist ways to arrange the world so as to satisfy this strange plan. But it's just deeply unwise to try to do! Don't make them care about making toasters, or about evaluations of how many diamonds they're making.

If we want an agent to produce diamonds, then I propose we make it care about producing diamonds. How?[3] I have suggested one simple baseline approach which I do not presently consider to be fundamentally blocked.

But I suspect that, between me and other readers, what differs is more our models of intelligence. Perhaps some people have reactions like:

Sure, we know alignment is hard, it's hard to motivate agents without messing up their motivations. Old news. And yet you seem to think that that's an "artifact" of grader-optimization? What else could a smart agent be doing, if not optimizing some expected-utility function over all possible plans?

On my end, I have partial but detailed working models of how intelligence works and how values work, such that I can imagine cognition which is planning-based, agentic, and also not based on grader-optimization or global argmax over all plans. You'll read a detailed story in the next subsection.

## And people aren't grader-optimizers, either

Imagine someone who considers a few plans, grades them (e.g. "how good does my gut say this plan is?"), and chooses the best. They are not a grader-optimizer. They are not trying to navigate to the state where they propose and execute a plan which gets maximally highly rated by some evaluative submodule. They use a grading procedure to locally rate and execute plans, and may even locally think "what would make me feel better about this plan?", but the point of their optimization isn't "find the plan which makes me feel as good as globally possible."

Let's dive into concrete detail. Here's a story of how value-child might think:

An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don't design agents which exploit adversarial inputs, I wrote about two possible mind-designs:

Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.

1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as "working hard" and "behaving well."
2. Value-child: The mother makes her kid care about working hard and behaving well.

I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior

However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/suspect that many people have read shard theory posts without internalizing new ideas about how cognition can work, about how real-world caring can work on a mechanistic level. Where effective real-world cognition doesn't have to (implicitly) be about optimizing an expected utility function over all possible plans. This last sentence might have even seemed bizarre to you.

Here, then, is an extremely detailed speculative story for value-child's first day at school. Well, his first day spent with his newly-implanted "work hard" and "behave well" value shards.

Value-child gets dropped off at school. He recognizes his friends (via high-level cortical activations previously formed through self-supervised learning) and waves at them (friend-shard was left intact). They rush over to greet him. They start talking about Fortnite. Value-child cringes slightly as he predicts he will be more distracted later at school and, increasingly, put in a mental context where his game-shard takes over decision-making, which is reflectively-predicted to lead to him daydreaming during class. This is a negative update on the primary shard-relevant features for the day.

His general-purpose planning machinery generates an example hardworking-shard-desired terminal state: Paying rapt attention during Mr. Buck’s math class (his first class today). He currently predicts that while he is in Mr. Buck’s class later, he will still be somewhat distracted by residual game-related cognition causing him to loop into reward-predicted self-reinforcing thoughts.

He notices a surprisingly low predicted level for a variable (amount of game-related cognition predicted for future situation: Mr. Buck’s class) which is important to a currently activated shard (working hard). This triggers a previously learned query to his WM: “why are you making this prediction for this quantity?”. The WM responds with a few sources of variation, including how value-child is currently near his friends who are talking about Fortnite. In more detail, the WM models the following (most of it not directly translatable to English):

His friends’ utterances will continue to be about Fortnite. Their words will be processed and then light up Fortnite-related abstractions, which causes both prediction of more Fortnite-related observations and also increasingly strong activation of the game-shard. Due to previous reward events, his game-shard is shaped so as to bid up game-related thoughts, which are themselves rewarding events, which causes a positive feedback loop where he slightly daydreams about video games while his friends talk.

When class is about to start, his “get to class”-related cognition will be activated by his knowledge of the time and his WM indicating “I’m at school.” His mental context will slightly change, he will enter the classroom and sit down, and he will take out his homework. He will then pay token attention due to previous negative social-reward events around being caught off guard

[Exception thrown! The world model was concurrently coarsely predicting what it thinks will happen given his current real values (which include working hard). The coarse prediction clashes with the above cached prediction that he will only pay token attention in math class!

The WM hiccups on this point, pausing to more granularly recompute its predictions. It squashes the cached prediction that he doesn’t strongly care about paying attention in class. Since his mom installed a hard-working-shard and an excel-at-school shard, he will actively try to pay attention. This prediction replaces the cached prior prediction.]

However, value-child will still have game-related cognition activated, and will daydream. This decreases value-relevant quantities, like “how hard he will be working” and “how much he will excel” and “how much he will learn.”

This last part is antithetical to the new shards, so they bid down “Hang around friends before heading into school.” Having located a predicted-to-be-controllable source of negative influence on value-relevant outcomes, the shards bid for planning to begin. The implied causal graph is:

Continuing to hear friends talk about Fortnite
|
v
Distracted during class

So the automatic causality-noticing algorithms bid to knock out the primary modeled cause of the negative value-relevant influence. The current planning subgoal is set to: make causal antecedent false and reduce level of predicted distraction. Candidate concretization set to: get away from friends.

(The child at this point notices they want to get away from this discussion, that they are in some sense uncomfortable. They feel themselves looking for an excuse to leave the conversation. They don't experience the flurry of thoughts and computations described above. Subconscious computation is subconscious. Even conscious thoughts won't introspectively reveal their algorithmic underpinnings.)

“Hey, Steven, did you get problem #3 for math? I want to talk about it.” Value-child starts walking away.

Crucially, in this story, value-child cares about working hard in that his lines of cognition stream together to make sure he actually works hard in the future. He isn't trying to optimize his later evaluation of having worked hard. He isn't ultimately and primarily trying to come up with a plan which he will later evaluate as being a maximally hard-work-involving plan.

Value-child comes up with a hard-work plan as an effect of his cognition, not as a motivating cause—not because he only wants to come up with plans he himself will rate highly. He values working hard.

As a corollary, grader-optimization is not synonymous with planning. Grader-optimization is when high plan-evaluations are the motivating cause of planning, where "I found a plan which I think leads to diamond" is the terminal goal, and not just a side effect of cognition (as it is for values-child).

# Intended takeaways

I feel confident [~95%] that we will not train a grader which is "secured" against actor-level intelligences. Even if the grader is reasonably smarter than the actor [~90%].

That said, I think this pattern is extremely unwise, and alternative patterns AFAICT cleanly avoid incentivizing the agent to exploit adversarial inputs to the grader. Thus, I bid that we:

1. Give up on all schemes which involve motivating the agent to get high outputs from a grader function, including:
1. Approval-based amplification (if the trained agent is supposed to be terminally motivated by the amplified overseer's ratings),
2. Approval-directed agents,[4]
1. Although approval-directed agents are only searching over actions and not plans; action space is exponentially smaller than plan space. However, if the action space is rich and expressive enough to include e.g. 3-paragraph English descriptions, I think that there will be seriously adversarial actions which will be found and exploited by smart approval-directed agents.
2. Given a very small action space (e.g. ), the adversarial input issue should be pretty tame (which is strictly separate from other issues with this approach).
3. Indirect normativity in any form which points the AI's motivations so that it optimizes an idealized grader's evaluations.
2. This doesn't include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]
4. "Does the superintelligent ELK direct reporter say the diamond is in the room?"[6]
2. Don't try to make the actor/grader scheme more complicated in hopes of resolving the issue via that frame, via some clever-seeming variant of actor/grader. Don't add more graders, or try to ensure the grader is just really smart, or...
3. Give up on any scheme which requires you to adequately evaluate every single plan the AI is able to come up with. That's an optimizer's curse-maximizing design pattern. Find a better way to do things.
4. Stop thinking about argmax over all plans according to some criterion. That's not a limiting model of realistic embedded intelligence, and it also ensures that that the criterion has to penalize all of the worst adversarial inputs.

# Conclusion

I strongly hope that this essay clarifies my thoughts around grader-optimization and its attendant unwisdom. The design patterns of "care about evaluations of plans" and "optimize a utility function over all possible futures" seem unnatural and lead to enormous, apparently avoidable difficulties. I think there are enormous benefits to be reaped by considering a wider, more realistic range of possible minds.

While this essay detailed how value-child might think, I haven't yet focused on why I think value-child does better, or what the general principles may be. I'll speculate on that in the next essay.

Thanks to Charles Foster, Thomas Kwa, Garrett Baker, and tailcalled for thoughts.

## The point isn't "any argmax=bad"

Someone messaged me:

I was more commenting out of a feeling that your argument proved too much. As a stupid example, a grader can use the scoring rubric "score=1 if the plan is to sit on the chair and chew bubble gum in this extremely specific way, score=0 for every other possible plan in the universe", and then if you argmax, you get that specific thing.

And you can say "That’s not a central example", but I wasn't seeing what assumption you made that would exclude silly edge-cases like that.

I replied:

This is fair and I should have clarified. In fact, Evan Hubinger pointed out something like this a few months back but I... never got around to adding it to this article?

I agree that you can program in one or more desired action sequences into the utility function

My current guess at the rule is: We don't know how to design an argmax agent, operating in reality with a plan space over plans in reality, such that the agent chooses a plan which a) we ourselves could not have specified and b) does what we wanted. EG picking 5 flowers, or making 10 diamonds.

If you're just whitelisting a few desired plans, then of course optimizer's curse can't hurt you. The indicator function has hardcoded and sparsely defined support, there is nothing to dupe, no nontrivial grading rule to hack via adversarial inputs. But if you're trying to verify good outcomes which you couldn't have brought about yourself, I claim that that protection will evaporate and you will get instantly vaporized by the optimizer's curse at max intensity

Does that make more sense?

Like, consider the proposal "you grade whether the AI picked 5 flowers", and the AI optimizes for that evaluation. it's not that you "don't know what it means" to pick 5 flowers. It's not that you don't contain enough of the True Name of Flowers. It's that, in these design patterns, you aren't aligning the AI to flowers, you're aligning it to your evaluations, and your evaluations can be hacked to hell and back by plans which have absolutely nothing to do with flowers

I separately privately commented to tailcalled:

my point wasn't meant to be "argmax always bad", it's meant to be "argmax over all plans instantly ensures you have to grade the worst possible adversarial inputs." And so for any given cognitive setup, we can ask "what kinds, if any, of adversarial examples might this run into, and with what probability, and in what situations?"

EG if value-child is being fed observations by a hard-work-minimizer, he's in an adversarial regime and i do expect his lines of thought to hit upon adversarial inputs relative to his decision-making procedures. Such that he gets fooled.

But values-child is not, by his own purposes, searching for these adversarial inputs.

## Value-child is still vulnerable to adversarial inputs

In private communication (reproduced with permission), tailcalled wrote:

imagine value-child reads some pop-neuroscience, and gets a model of how distractions work in the brain

his WM might then end up with a "you haven't received neurosurgery to make you more hardworking" as a cause of getting distracted in class

and then he might request one of his friends to do neurosurgery on him, and then he would die because his friend can't do that safely

If I'm not misunderstanding value-child, then this is something that value-child could decide to do? And if I'm not misunderstanding the problem you are pointing at with argmax, then this seems like an instance of the problem? I.e. value-child's world-model overestimates the degree to which he can be made more-hardworking and avoid dying by having his friend poke around with sharp objects at his brain. So in using the world-model to search for a plan, he decides to ask his friend to poke around with sharp objects in his brain

I replied:

Yeah, I agree that he could be mistaken and take a dumb course of action. This is indeed an upwards evaluation error, so to speak. It's not that I think eg shard-agents can freely avoid serious upwards errors, it's that they aren't seeking them out on purpose. As I wrote to Daniel K in a recent comment:

One of the main threads is Don't design agents which exploit adversarial inputs. The point isn't that people can't or don't fall victim to plans which, by virtue of spurious appeal to a person's value shards, cause the person to unwisely pursue the plan. The point here is that (I claim) intelligent people convergently want to avoid this happening to them.

A diamond-shard will not try to find adversarial inputs to itself. That was my original point, and I think it stands.

Furthermore, I think that, in systems with multiple optimizers (eg shards), some optimizers can feed the other optimizers adversarial inputs. (Adversarial inputs are most common in the presence of an adversary, after all!)

A very rough guess at what this looks like: A luxury-good-shard proposes a golden-laptop buying plan, while emphasizing how this purchase stimulates the economy and so helps people. This plan was optimized to positively activate e.g. the altruism-shard, so as to increase the plan's execution probability. In humans, I think this is more commonly known as motivated reasoning.

So, even in value-child, adversarial inputs can still crop up, but via a different mechanism which should disappear once the agent gets smart enough to e.g. do an internal values handshake. As I said to Wei Dai:

I agree that humans sometimes fall prey to adversarial inputs...

However, this does not seem important for my (intended) original point. Namely, if you're trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer's curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of "utility function over observation/universe histories."

# Appendix B: Prior work

Abram Demski writes about Everitt et al.'s Self-Modification of Policy and Utility Function in Rational Agents:

As a first example, consider the wireheading problem for AIXI-like agents in the case of a fixed utility function which we know how to estimate from sense data. As discussed in Daniel Dewey's Learning What to Value and other places, if you try to implement this by putting the utility calculation in a box which rewards an AIXI-like RL agent, the agent can eventually learn to modify or remove the box, and happily does so if it can get more reward by doing so. This is because the RL agent predicts, and attempts to maximize, reward received. If it understands that it can modify the reward-giving box to get more reward, it will.

We can fix this problem by integrating the same reward box with the agent in a better way. Rather than having the RL agent learn what the output of the box will be and plan to maximize the output of the box, we use the box directly to evaluate possible futures, and have the agent plan to maximize that evaluation. Now, if the agent considers modifying the box, it evaluates that future with the current box. The box as currently configured sees no advantage to such tampering. This is called an observation-utility maximizer (to contrast it with reinforcement learning). Daniel Dewey goes on to show that we can incorporate uncertainty about the utility function into observation-utility maximizers, recovering the kind of "learning what is being rewarded" that RL agents were supposed to provide[...]

Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

The point of this post isn't just that e.g. value-child evaluates the future with his own values, as opposed to putting the utility calculation in a box. I'm not describing a failure of tampering with the grader. I'm describing a failure of optimizing the output of a box/grader, even if the box is directly evaluating possible futures. After all, evaluation-child uses the box to directly evaluate possible futures! Evaluation-child wants to maximize the evaluation of his model of his mother!

As described above, value-child is steered by his values. He isn't optimizing for the output of some module in his brain.

Grader optimization is about how the agent thinks, it's about the way in which they are motivated

### Scenario 1

Bill looks around the workshop. The windows are shattered. The diamonds—where are they..?!

Should he allocate more time to meta-planning—what thoughts should he think next? No. Time is very limited, and spending more time thinking now would lead to fewer expected-diamonds. He decides to simply wield the cognitive habits which his past mental training drilled to activate in this kind of mental context.

Police? Promising, but spend a few more seconds generating ideas to avoid automatic opportunity cost from prematurely committing to the first idea. [After all, doing otherwise historically led to fewer diamonds, which produced less cognition-update-quantity (i.e. "reward") than expected, and so his credit assignment chipped away at the impulse to premature action in this kind of situation.]

Generate alternate explanations for where the diamonds went? No, Bill's self-model expects this to slightly decrease probability of inferring in time where the diamonds went, and so Bill feels like avoiding that next thought.

...

No! Bill's cognition is shaped towards acquiring diamonds, his cognition reliably pulls him into futures where he has more diamonds. This is not grader-optimization. This is Bill caring about diamonds, not about his own evaluations of whether a plan will acquire diamonds.

### Scenario 2

Bill flops down on his bed. Finally, he has private time to himself. All he wants, all he's ever wanted, is to think that he's finally made it—that he can finally believe himself to have acquired real diamonds. He doesn't care how he does it. He just wants to believe, and that's it.

Bill has always been different, somehow. When he was a kid, Bill would imagine plans like "I go to school and also have tons of diamonds", and that would initially trick him into thinking that he'd found a plan which led to tons of diamonds.

But as he got older and smarter, he thought maybe he could do better. He started learning about psychology and neuroscience. He started guessing how his brain worked, how to better delude himself (the ultimate human endeavor).

...

Yes! Bill's optimizing for either his future physical evaluation of plan quality, or some Platonic formalization of "Did I come up with a plan I think is promising?". Which? The story is ambiguous. But the mark of grader-optimization is quite plain, as given by a plan-generator stretching its wits to maximize the output of a grader.

1. ^

The actor may give an instrumental damn about diamonds, because diamond-producing plans sometimes produce high evaluations. But in actor/grader motivational setups, an inner-aligned actor only gives a terminal damn about the evaluations.

2. ^

Although AIXI's epistemic prior is malign and possibly unsafe...

3. ^

But, you don't have to have another approach in mind in order to abandon grader-optimization. Here are some things I would ask myself, were I confused about how non-grader-optimizing agents might be motivated:

- "Hey, I realize some strangeness about this thing (grader-optimization) which I was trying to do. I wonder whether there are other perspectives or frame-shifts which would make this problem go away?"

- "I notice that I don't expect a paperclip-AI to resort to grader-optimization in order to implement its own unaligned values. What do I anticipate would happen, internally, to an AI as I trained it via some RL curriculum? If it cared about paperclips, how would that caring be implemented, mechanistically?"

- "Hm, this way of caring about things seems weird. In what ways is grader-optimization similar and dissimilar to the suspected ways in which human beings care about things?"

4. ^

Contrast with a quote from the original article:

Similarly, if [the actor] is smarter than [the grader] expects, the only problem is that [the actor] won’t be able to use all of his intelligence to devise excellent plans. This is a serious problem, but it can be fixed by trial and error—rather than leading to surprising failure modes.

5. ^

Not that I think this has a snowflake's chance in hell of working in time. But it seemed important to show that not all indirect normativity is grader-optimization.

6. ^

Earlier this year, I analyzed how brute-force plan search might exploit this scheme for using an ELK direct translator.

# 22

New Comment

We're building intelligent AI systems that help us do stuff. Regardless of how the AI's internal cognition works, it seems clear that the plans / actions it enacts have to be extremely strongly selected. With alignment, we're trying to ensure that they are strongly selected to produce good outcomes, rather than being strongly selected for something else. So for any alignment proposal I want to see some reason that argues for "good outcomes" rather than "something else".

In nearly all of the proposals I know of that seem like they have a chance of helping, at a high level the reason is "human(s) are a source of information about what is good, and this information influences what the AI's plans are selected for". (There are some cases based on moral realism.)

This is also the case with value-child: in that case, the mother is a source of info on what is good, she uses this to instill values in the child, those values then influence which plans value-child ends up enacting.

All such stories have a risk: what if the process of using [info about what is good] to influence [that which plans are selected for] goes wrong, and instead plans are strongly selected for some slightly-different thing? Then because optimization amplifies and value is fragile, the plans will produce bad outcomes.

I view this post as instantiating this argument for one particular class of proposals: cases in which we build an AI system that explicitly searches over a large space of plans, predicts their consequences, rates the consequences according to a prediction of what is "good", and executes the highest-scoring plan. In such cases, you can more precisely restate "plans are strongly selected for some slightly-different thing" to "the agent executes plans that cause upwards-errors in the prediction of what is good".

It's an important argument! If you want to have an accurate picture of how likely such plans are too work, you really need to consider this point!

The part where I disagree is where the post goes on to say "and so we shouldn't do this". My response: what is the alternative, and why does it avoid or lessen the more abstract risk above?

I'd assume that the idea is that you produce AI systems that are more like "value-child". Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn't we instead get "almost-value-child", who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?

So far, this is a bit unfair to the post(s). It does have some additional arguments, which I'm going to rewrite in totally different language which I might be getting horribly wrong:

An AI system with a "direct (object-level) goal" is better than one with "indirect goals". Specifically, you could imagine two things: (a) plans are selected for a direct goal (e.g. "make diamonds") encoded inside the AI system, vs. (b) plans are selected for being evaluated as good by something encoded outside the AI system (e.g. "Alice's approval"). I think the idea is that indirect goals clearly have issues (because the AI system is incentivized to trick the evaluator), while the direct goal has some shot at working, so we should aim for the direct goal.

I don't buy this as stated; just as "you have a literally perfect overseer" seems theoretically possible but unrealistic, so too does "you instill the direct goal literally exactly correctly". Presumably one of these works better in practice than the other, but it's not obvious to me which one it is.

Separately, I don't see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal, I'd still be interested in iterated amplification, debate, interpretability, etc, because all of those seem particularly useful for instilling direct goals (given the deep learning paradigm). In particular even with a shard lens I'd be thinking about "how do I notice if my agent grew a shard that was subtly different from what I wanted" and I'd think of amplifying oversight as an obvious approach to tackle this problem. Personally I think it's pretty likely that most of the AI systems we build and align in the near-to-medium term will have direct goals, even if we use techniques like iterated amplification and debate to build them.

Plan generation is safer. One theme is that with realistic agent cognition you only generate, say, 2-3 plans, and choose amongst those, which is very different from searching over all possible plans. I don't think this inherently buys you any safety; this just means that you now have to consider how those 2-3 plans were generated (since they are presumably not random plans). Then you could make other arguments for safety (idk if the post endorses any of these):

1. Plans are selected based on historical experience. Instead of considering novel plans where you are relying more on your predictions of how the plans will play out, the AI could instead only consider plans that are very similar to plans that have been tried previously (by humans or AIs), where we have seen how such plans have played out and so have a better idea of whether they are good or not. I think that if we somehow accomplished this it would meaningfully improve safety in the medium term, but eventually we will want to have very novel plans as well and then we'd be back to our original problem.
2. Plans are selected from amongst a safe subset of plans. This could in theory work, but my next question would be "what is this safe subset, and why do you expect plans to be selected from it?" That's not to say it's impossible, just that I don't see the argument for it.
3. Plans are selected based on values. In other words we've instilled values into the AI system, the plans are selected for those values. I'd critique this the same way as above, i.e. it's really unclear how we successfully instilled values into the AI system and we could have instilled subtly wrong values instead.
4. Plans aren't selected strongly. You could say that the 2-3 plans aren't strongly selected for anything, so they aren't likely to run into these issues. I think this is assuming that your AI system isn't very capable; this sounds like the route of "don't build powerful AI" (which is a plausible route).

In summary:

1. Intelligence => strong selection pressure => bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into "what if the agent tricks the evaluator".
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into "what if the values / shards are different from what we wanted".
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
1. Intelligence => strong selection pressure => bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into "what if the agent tricks the evaluator".
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into "what if the values / shards are different from what we wanted".
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.

One thing that is not clear to me from your comment is what you make of Alex's argument (as I see it) to the extent that "evaluation goals" are further away from "direct goals" than "direct goals" are between themselves. If I run with this, it seems like an answer to your point 4 would be:

• with directly instilled goals, there will be some risk of discrepancy that can explode due to selection pressure;
• with evaluation based goals, there is the same discrepancy than between directly instilled goals (because it's hard to get your goal exactly right) plus an additional discrepancy between valuing "the evaluation of X" and valuing "X".

I'm curious what you think of this claim, and if that influences at all your take.

Sounds right. How does this answer my point 4?

I guess maybe you see two discrepancies vs one and conclude that two is worse than one? I don't really buy that, seems like it depends on the size of the discrepancies.

For example, if you imagine an AI that's optimizing for my evaluation of good, I think the discrepancy between "Rohin's directly instilled goals" and "Rohin's CEV" is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I'd conclude misalignment risk was small and move on to some other area.) So the only one that matters in this case of grader optimization is the discrepancy between "plans Rohin evaluates as good" and "Rohin's directly instilled goals".

I interpret Alex as making an argument such that there is not just two vs one difficulties, but an additional difficulty. From this perspective, having two will be more of an issue than one, because you have to address strictly more things.

This makes me wonder though if there is not just some sort of direction question underlying the debate here. Because if you assume the "difficulties" are only positive numbers, then if the difficulty for the direct instillation is  and the one for the grader optimization is  , then there's no debate that the latter is bigger than the former.

But if you allow directionality (even in one dimension), then there's the risk that the sum leads to less difficulty in total (by having the  move in the opposite direction in one dimension). That being said, these two difficulties seem strictly additive, in the sense that I don't see (currently) how the difficulty of evaluation could partially cancel the difficulty of instillation.

Two responses:

1. Grader-optimization has the benefit that you don't have to specify what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
2. Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems because fundamentally they are shadows of the same problem, so I'd estimate d_evaluation at close to 0 in your equations after you have dealt with d_instillation.

I understand you to have just said:

Having direct-values of "Rohin's current values" and "Rohin's CEV" both seem fine. There is, however, significant discrepancy between "grader-optimize the plans Rohin evaluates as good" and "directly-value Rohin's current values."

In particular, the first line seems to speculate that values-AGI is substantially more robust to differences in values. If so, I agree. You don't need "perfect values" in an AGI (but probably they have to be pretty good; just not adversarially good). Whereas strongly-upwards-misspecifying the Rohin-grader on a single plan of the exponential-in-time planspace will (almost always) ruin the whole ballgame in the limit.

the first line seems to speculate that values-AGI is substantially more robust to differences in values

The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes that are similar in value to that which would be done by that agent's CEV. In particular, I believe this because the agent is "trying" to do what its CEV would do, it has the power to do what its CEV would do, and so it will likely succeed at this.

I don't know what you mean by "values-AGI is more robust to differences in values". What values are different in this hypothetical?

I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI

It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).

Strong-upvoted and strong-disagreevoted. Thanks so much for the thoughtful comment.

I'm rushing to get a lot of content out, so I'm going to summarize my main reactions now & will be happy to come back later.

• I wish you wouldn't use IMO vague and suggestive and proving-too-much selection-flavor arguments, in favor of a more mechanistic analysis.
• I consider your arguments to blur nearly-unalignable design patterns (e.g. grader optimization) with shard-based agents, and then comment that both patterns pose challenges, so can we really say one is better? More on this later.
• As Charles and Adam seem to say, you seem to be asking "how did you specify the values properly?" without likewise demanding "how do we inner-align the actor? How did we specify the grader?".
• Given an inner-aligned actor and a grader which truly cares about diamonds, you don't get an actor/grader which makes diamonds.
• Given a value-AGI which truly cares about diamonds, the AGI makes diamonds.
• If anything, the former seems to require more specification difficulty, and yet it still horribly fails.

just as "you have a literally perfect overseer" seems theoretically possible but unrealistic, so too does "you instill the direct goal literally exactly correctly". Presumably one of these works better in practice than the other, but it's not obvious to me which one it is.

You do not need an agent to have perfect values. As you commented below, a values-AGI with Rohin's current values seems about as good as a values-AGI with Rohin's CEV. Many foundational arguments are about grader-optimization, so you can't syntactically conclude "imperfect values means doom." That's true in the grader case, but not here.

That reasoning is not immediately applicable to "how stable is diamond-producing behavior to various perturbations of the agent's initial decision-influences (i.e. shards)?". That's all values are, on my terminology. Values are contextually activated influences on decision-making. That's it. Values are not the optimization target of the agent with those values. If you drop out or weaken the influence of IF plan can be easily modified to incorporate more diamonds, THEN do it, that won't necessarily mean the AI makes some crazy diamond-less universe. It means that it stops tailoring plans in a certain way, in a certain situation.

This is also why more than one person has "truly" loved their mother for more than a single hour (else their values might change away from true perfection). It's not like there's an "literally exactly correct" value-shard for loving someone.

This is also why values can be seriously perturbed but still end up OK. Imagine a value-shard which controls all decision-making when I'm shown a certain QR code, but which is otherwise inactive. My long-run outcomes probably wouldn't differ, and I expect the same for an AGI.

The value shards aren't getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent's cognition (e.g. the world model, the general-purpose planning API).

So, I'm basically asking that you throw an error and recheck your "selection on imperfection -> doom" arguments, as I claim many of these arguments reflect grader-specific problems.

Separately, I don't see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal,

It is extremely relevant, unless we want tons of our alignment theory to be predicated on IMO confused ideas about how agent motivations work, or what values we want in an agent, or the relative amount of time we spend researching "objective robustness" (often unhelpful IMO) vs interpretability vs cognitive-update dynamics (e.g. what reward shaping does mechanistically to a network in different situations) vs... If we stay in the grader-optimization frame, I think we're going to waste a bunch of time figuring out how to get inexploitable graders.

It would be quite stunning if, after renouncing one high-level world-view of how agent motivations work, the optimal research allocation remained the same.

I agree that if you do IDA or debate or whatever, you get agents with direct goals. Which invalidates a bunch of analysis around indirect goals -- not only do I think we shouldn't design grader-optimizers, I think we thankfully won't get them.

I wish you wouldn't use IMO vague and suggestive and proving-too-much selection-flavor arguments, in favor of a more mechanistic analysis.

Can you name a way in which my arguments prove too much? That seems like a relatively concrete thing that we should be able to get agreement on.

You do not need an agent to have perfect values.

I did not claim (nor do I believe) the converse.

Many foundational arguments are about grader-optimization, so you can't syntactically conclude "imperfect values means doom." That's true in the grader case, but not here.

I disagree that this is true in the grader case. You can have a grader that isn't fully robust but is sufficiently robust that the agent can't exploit any errors it would make.

If you drop out or weaken the influence of IF plan can be easily modified to incorporate more diamonds, THEN do it, that won't necessarily mean the AI makes some crazy diamond-less universe.

The difficulty in instilling values is not that removing a single piece of the program / shard that encodes it will destroy the value. The difficulty is that when you were instilling the value, you accidentally rewarded a case where the agent tried a plan that produced pictures of diamonds (because you thought they were real diamonds), and now you've instilled a shard that upweights plans that produce pictures of diamonds. Or that you rewarded the agent for thoughts like "this will make pretty, transparent rocks" (which did lead to plans that produced diamonds), leading to shards that upweight plans that produce pretty, transparent rocks, and then later the agent tiles the universe with clear quartz.

The value shards are the things which optimize hard, by wielding the rest of the agent's cognition (e.g. the world model, the general-purpose planning API).

So, I'm basically asking that you throw an error and recheck your "selection on imperfection -> doom" arguments, as I claim many of these arguments reflect grader-specific problems.

I think that the standard arguments work just fine for arguing that "incorrect value shards -> doom", precisely because the incorrect value shards are the things that optimize hard.

(Here incorrect value shards means things like "the value shards put their influence towards plans producing pictures of diamonds" and not "the diamond-shard, but without this particular if clause".)

It is extremely relevant [...]

This doesn't seem like a response to the argument in the paragraph that you quoted; if it was meant to be then I'll need you to rephrase it.

We need to apply extremely strong selection to get the kind of agent we want, and the agent we want will itself need to be making decisions that are extremely optimized in order to achieve powerfully good outcomes. The question is about in what way that decision-making algorithm should be structured, not whether it should be optimized/optimizing at all. As a fairly close analogy, IMO a point in the Death With Dignity post was something like "for most people, the actually consequentialist-correct choice is NOT to try explicitly reasoning about consequences". Similarly, the best way for an agent to actually produce highly-optimized good-by-its-values outcomes through planning may not be by running an explicit search over the space of ~all plans, sticking each of them into its value-estimator, & picking the argmax plan.

I think there still may be some mixup between:

A. How does the cognition-we-intend-the-agent-to-have operate? (for ex. a plan grader + an actor that tries to argmax the grader, or a MuZero-like heuristic tree searcher, or a chain-of-thought LLM steered by normative self-talk, or something else)

B. How we get the agent to have the intended cognition?

By contrast, "What if the values/shards are different from what we wanted" (your summary point 3) is a question about B! Note that we have to confront B-like questions no matter how we answer A. If A = grader-optimization, there's an analogous question of "What if the grader is different from what we wanted? / What if the trained actor is different from what we wanted?". I don't really see an issue with this post focusing exclusively on the A-like dimension of the problem and ignoring the B-like dimension temporarily, especially if we expect there to be general purpose methods that work across different answers to A.

In the post TurnTrout is focused on A [...] He explicitly disclaims that he is not making arguments about B

I agree that's what the post does, but part of my response is that the thing we care about is both A and B, and the problems that arise for grader-optimization in A (highlighted in this post) also arise for value-instilling in B in slightly different form, and so if you actually want to compare the two proposals you need to think about both.

I'd be on board with a version of this post where the conclusion was "there are some problems with grader-optimization, but it might still be the best approach; I'm not making a claim on that one way or the other".

grader-optimization is a kind of cognition that works at cross purposes with itself, one that is an anti-pattern, one that an agent (even an unaligned agent) should discard upon reflection because it works against its own interests.

I didn't actually mention this in my comment, but I don't buy this argument:

Case 1: no meta cognition. Grader optimization only "works at cross purposes with itself" to the extent that the agent thinks that the grader might be mistaken about things. But it's not clear why this is the case: if the agent thinks "my grader is mistaken" that means there's some broader meta-cognition in the agent that does stuff based on something other than the grader. That meta-cognition could just not be there and then the agent would be straightforwardly optimizing for grader-outputs.

As a concrete example, AIXI seems to me like an example of grader-optimization (since the reward signal comes from outside the agent). I do not think AIXI would "do better according to its own interests" if it "discarded" its grader-optimization.

You can say something like "from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes itself", but then we get back to the response "but what is the alternative".

Case 2: with meta cognition. If we instead assume that there is some meta cognition reflecting on whether the grader might be mistaken, then it's not clear to me that this failure mode only applies to grader optimization; you can similarly have meta cognition reflecting on whether values are mistaken.

Suppose you instill diamond-values into an AI. Now the AI is thinking about how it can improve the efficiency of its diamond-manufacturing, and has an idea that reduces the necessary energy requirements at the cost of introducing some impurities. Is this good? The AI doesn't know; it's unsure what level of impurities is acceptable before the thing it is making is no longer a diamond. Efficiency is very important, even 0.001% improvement is a massive on an absolute scale given its fleets of diamond factories, so it spends some time reflecting on the concept of diamonds to figure out whether the impurities are acceptable.

It seems like you could describe this as "the AI's plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack". So what's the difference between this issue and the issue with grader optimization?

You can say something like "from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes itself", but then we get back to the response "but what is the alternative".

This is what I did intend, and I will affirm it. I don't know how your response amounts to "I don't buy this argument." Sounds to me like you buy it but you don't know anything else to do?

then we get back to the response "but what is the alternative".

In this post, I have detailed an alternative which does not work at cross-purposes in this way.

It seems like you could describe this as "the AI's plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack". So what's the difference between this issue and the issue with grader optimization?

1. Values-execution. Diamond-evaluation error-causing plans exist and are stumble-upon-able, but the agent wants to avoid errors.
2. Grader-optimization. The agent seeks out errors in order to maximize evaluations.

Sounds to me like you buy it but you don't know anything else to do?

Yes, and in particular I think direct-goal approaches do not avoid the issue. In particular, I can make an analogous claim for them:

"From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is "hardened" to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human)."

Like, once you broaden to the human-AI system overall, I think this claim is just "A principal-agent problem / Goodhart problem involves two parts of a system working at cross purposes with each other", which is both (1) true and (2) unavoidable (I think).

It seems like you could describe this as "the AI's plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack". So what's the difference between this issue and the issue with grader optimization?

1. Values-execution. Diamond-evaluation error-causing plans exist and are stumble-upon-able, but the agent wants to avoid errors.
2. Grader-optimization. The agent seeks out errors in order to maximize evaluations.

The part of my response that you quoted is arguing for the following claim:

If you are analyzing the AI system in isolation (i.e. not including the human), I don't see an argument that says [grader-optimization would violate the non-adversarial principle] and doesn't say [values-execution would violate the non-adversarial principle]".

As I understand it you are saying "values-execution wants to avoid errors but grader-optimization does not". But I'm not seeing it. As far as I can tell the more correct statements are "agents with metacognition about their grader / values can make errors, but want to avoid them" and "it is a type error to talk about errors in the grader / values for agents without metacognition about their grader / values".

(It is a type error in the latter case because what exactly are you measuring the errors with respect to? Where is the ground truth for the "true" grader / values? You could point to the human, but my understanding is that you don't want to do this and instead just talk about only the AI cognition.)

For reference, in the part that you quoted, I was telling a concrete story of a values-executor with metacognition, and saying that it too had to "harden" its values to avoid errors. I do agree that it wants to avoid errors. I'd be interested in a concrete example of a grader-optimizer with metacognition that that doesn't want to avoid errors in its grader.

Like, in what sense does Bill not want to avoid errors in his grader?

I don't mean that Bill from Scenario 2 in the quiz is going to say "Oh, I see now that actually I'm tricking myself about whether diamonds are being created, let me go make some actual diamonds now". I certainly agree that Bill isn't going to try making diamonds, but who said he should? What exactly is wrong with Bill's desire to think that he's made a bunch of diamonds? Seems like a perfectly coherent goal to me; it seems like you have to appeal to some outside-Bill perspective that says that actually the goal was making diamonds (in which case you're back to talking about the full human-AI system, rather than the AI cognition in isolation).

What I mean is that Bill from Scenario 2 might say "Hmm, it's possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won't really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn't backfire on me".

Imagine someone who considers a few plans, grades them (e.g. “how good does my gut say this plan is?”), and chooses the best. They are not a grader-optimizer. They are not trying to navigate to the state where they propose and execute a plan which gets maximally highly rated by some evaluative submodule. They use a grading procedure to locally rate and execute plans, and may even locally think “what would make me feel better about this plan?”, but the point of their optimization isn’t “find the plan which makes me feel as good as globally possible.”

The way I think about this situation for myself as a human is that the more plans I consider and the wider / more global my search process is, the more likely it is that I hit upon an especially good "out of the box" plan, but also the more likely it is that I hit upon some "adversarial input" (in quotes because I'm not sure what you or I mean by this) and end up doing something really bad. It seems there are two things I can do about this:

1. Try to intuitively or quantitatively optimize the search process itself, as far as how many plans to consider, where to direct the search, etc., to get the best trade off between the two outcomes.
2. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Do you have any objections/disagreements with this? Secondarily, if as a result of 1 and 2 I'm doing a fairly wide search and considering many plans, doesn't it stop making sense at some point to say "They are not a grader-optimizer."?

2. This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]

But isn't 1 here is at least as good as 2, since the CEV-universe-simulation could always compute X=[the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] then return 1 iff input-plan = 'run X then shuts itself off without doing anything else' (by doing a simple text match), 0 otherwise, so there's no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but just that your argument seems to be proving too much, if it's saying that 2 is safer/better than 1.

1. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

• The agent's allegiance is to some idealized utility function  (like CEV).  The agent's internal evaluator  is "trying" to approximate  by reasoning heuristically.  So now we ask Eval to evaluate the plan "do argmax w.r.t. Eval over a bunch of plans".  Eval reasons that, due to the the way that Eval works, there should exist "adversarial examples" that score very highly on Eval but low on .  Hence, Eval concludes that  is low, where plan = "do argmax w.r.t. Eval".  So the agent doesn't execute the plan "search widely and argmax".
• "Improving " makes sense because Eval will gladly replace itself with  if it believes that  is a better approximation for  (and hence replacing itself will cause the outcome to score better on )

Are there other distinct frameworks which make sense here?  I look forward to seeing what design Alex proposes for "value child".

This is tempting, but the problem is that I don't know what my idealized utility function is (e.g., I don't have a specification for CEV that I think would be safe or ideal to optimize for), so what does it mean to try to approximate it? Or consider that I only read about CEV one day in a blog, so what was I doing prior to that? Or if I was supposedly trying to approximate CEV, I can change my mind about it if I realized that it's a bad idea, but how does that fit into the framework?

My own framework is something like this:

• The evaluation process is some combination of gut, intuition, explicit reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.
• I think there are "adversarial inputs" because I've previously done things that I later regretted, due to evaluating them highly in ways that I no longer endorse. I can also see other people sometimes doing obviously crazy things (which they may or may not later regret). I can see people (including myself) being persuaded by propaganda / crazy memes, so there must be a risk of persuading myself with my own bad ideas.
• I can try to improve my evaluation process by doing things like
1. look for patterns in my and other people's mistakes
2. think about ethical dilemmas / try to resolve conflicts between my evaluative subprocesses
3. do more philosophy (think/learn about ethical theories, metaethics, decision theory, philosophy of mind, etc.)
4. talk (selectively) to other people
5. try to improve how I do explicit reasoning or philosophy

The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:

• Do you think that you are internally trying to approximate your own ?
• Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues for example)?
• Can you think of concrete instances where you improved your own Eval?
• Can you think of concrete instances where you thought you improved you own Eval but then regretted it later?
• Do you think that your own changes to your eval have been moving in the direction of your ?

Vivek -- I replied to your comment in appendix C of today's follow-up post, Alignment allows imperfect decision-influences and doesn't require robust grading

Do you have any objections/disagreements with this? Secondarily, if as a result of 1 and 2 I'm doing a fairly wide search and considering many plans, doesn't it stop making sense at some point to say "They are not a grader-optimizer."?

I wrote in the post:

Value-child comes up with a hard-work plan as an effect of his cognition, not as a motivating cause—not because he only wants to come up with plans he himself will rate highly. He values working hard.

As a corollary, grader-optimization is not synonymous with planning. Grader-optimization is when high plan-evaluations are the motivating cause of planning, where "I found a plan which I think leads to diamond" is the terminal goal, and not just a side effect of cognition (as it is for values-child).

Sorry if I'm just repeating something you read and understood, but I do feel like this criterion answers "no, this is still not grader-optimization; the effective search over lots of plans is still a side-effect of your cognition, not the terminal end."

In particular, note that the strategy you described would not strongly want to be given the actual-highest-rated plan--or maybe it would want to know more about the plan as a curiosity, but not in order to evaluate and execute that plan. That's one way in which saying "your strategy is not grader-optimization" constrains my anticipations in a useful-seeming way.

But isn't 1 here is at least as good as 2, since the CEV-universe-simulation could always compute X=[the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] then return 1 iff input-plan = 'run X then shuts itself off without doing anything else' (by doing a simple text match), 0 otherwise, so there's no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but just that your argument seems to be proving too much, if it's saying that 2 is safer/better than 1.

This is a good point. I'm wondering about the type of the (presumably Cartesian) interface between the CEV-sim and the actor. First, CEV-sim shouldn't be affectable by the input-plan unless and until they run some stats on it. Otherwise the actor could (maybe?) side-channel attack them via whatever computer registers the input-plan shows up in. And CEV-sim does have to infer what they're being used for, at each invocation of the grader (since they don't retain memory across counterfactuals).

a. That aside, if CEV-sim can just syntactically check whether the input-plan runs that, then your argument seems good.

b. If CEV-sim has to understand the actor's latent state context (is it all in the plan?), in order to make sure that the purported X-running plan isn't just running dangerous Y in another programming language... Seems like they can't do this.

I feel like we're dealing with (b) more than (a), so I'd say "no, 2 is safer than 1" tentatively.

> This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]

But isn't 1 here is at least as good as 2, since the CEV-universe-simulation could always compute X=[the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] then return 1 iff input-plan = 'run X then shuts itself off without doing anything else' (by doing a simple text match), 0 otherwise, so there's no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but just that your argument seems to be proving too much, if it's saying that 2 is safer/better than 1.

Is your issue here that there exist a specific CEV-universe-simulation that makes 1 just as safe as 2, by basically emulating the latter situation? If so, why do you think this is a point against Alex's claim(which strikes me more as saying "there are a lot more cases of 2. being safe than of 1.")?

Some notes based on a conversation with Alex (this comment is about conveying my viewpoint, not his):

Note that, while I will talk about systems with a planner and evaluator in this comment; I don't necessarily think that's how advanced AI systems will look; it's just a useful example that helps with illustration (and seems to be the model that Alex is using, based on this pseudocode).

Similarly note that, while I'm going to constantly say "it seems like a good grader-optimizer would not have this problem", I'm not imagining that we specifically try to build a grader-optimizer -- rather, I'm imagining that we use a robust grader to oversee the model during training, and let the motivations shake out however they shake out. I then claim that this plan seems reasonable, and as part of that I am arguing "if it turns out that this leads to a grader-optimizer for the robust grader we used during training, that seems okay".

Here's a bunch of arguments one could give for the superiority of values-executors:

One is doomed, the other isn't. I think the overall intuitive argument for Alex's position is "look, grader-optimizers seem obviously doomed, because the AI tries to and succeeds at tricking the grader. Values-executors might be hard but they don't seem obviously doomed".

Response: Grader-optimizers aren't doomed; common sense intuitions will lead you astray here -- you need to be thinking about sufficiently intelligent and competent graders which the AI can't trick. The rest of this comment is basically applying this same pattern to various different more precise arguments.

Type signature of graders. One thing that came up was that Alex was imagining that the grader looked only at finished plans (e.g. "synthesize DNA sequence <sequence> and release it in a pond"); in this case it seems particularly easy for the AI to trick the grader into accepting a plan that looks good but isn't actually good. Values-executors are less likely to have this failure mode, because the plan-evaluators (aka world model + shards) contains the information that the planner used to come up with the plan (e.g. in the example above they would likely know what the purpose of the synthesized DNA sequence is).

Response: I agree this is true as stated; this is basically the informed oversight problem. However, an actually good grader-optimizer would involve a grader that gets to access much more information -- not just the finished plan, but all of the reasoning that went into producing the plan, or even the entire training history of the model.

(You could still say "to get a good grader-optimizer we have to deal with the informed oversight problem, but for a values-executor you've just agreed that is automatically solved, so that's a reason to think values-executors are easier". Here I'd agree that once you have successfully built a values-executor, you don't have to solve informed oversight, but I think you need to deal with informed oversight to build the values-executors in the first place. If during training, your supposed diamond maximizer puts a picture of diamonds in front of its camera, and you reward it because you didn't know that's what it did, then you are accidentally building a diamond-picture-maximizer instead of a diamond-maximizer.)

Robustness during deployment. Here, the argument is that for a values-executor, you only need to give it robust rewards during training -- after that, you can deploy it in the world and don't need to be robust to any new plans it comes up with. In the diamond case, you can allow it to learn new knowledge, become smarter, etc and eventually it will tile the universe with diamonds. In contrast, with grader-optimizers, since the AI is maximizing the output of the grader, the grader has to be robust against all the plans the AI ever comes up with, including when the AI becomes way smarter. So the robustness requirements on the grader are way higher in the grader-optimizer case than in the values-executor case.

Response: It's not obviously fine to simply let the values-executor become smarter. (To see this, imagine giving humans the ability to edit their own brains -- I think several people wouldn't have enough caution / care to avoid destroying themselves with this ability.) To the extent the values-executor does well at this, I expect it is because it did things like (a) philosophical reflection on its values in the face of new knowledge and capabilities, and (b) thinking ahead of time about potential consequences of plans that could make it smarter, and avoiding plans which might change its values (like indiscriminate brain editing).

Similarly, an actually good grader-optimizer would involve a grader that also does these things. It wouldn't approve of plans that let the AI scale up its intelligence massively without correspondingly scaling up the grader. It would approve of plans where the AI thinks about how to improve the wisdom and intelligence of the grader, or about how to solve the alignment problem for a future successor AI.

Is it easier to instill these properties into a values-executor rather than a grader-optimizer? I don't think so; it seems like the easiest way to instill these properties is to learn them from humans / existing knowledge; that doesn't seem to advantage values-executors over grader-optimizers.

Possible modifications to values-executors. Following Alex's pseudocode for a diamond values-executor, imagine a system made up of a planner that suggests plans by querying a generative model, and an evaluator that evaluates plans by predicting the consequences of the plan and evaluating how good they are. (In the pseudocode a plan is evaluated as diamondShard(WM.getConseq(plan)).) Consider possible modifications to this system:

1. Instead of having potential plan modifications sampled from a generative model, the plans could be found by black-box search (or white-box gradient ascent) over which inputs would maximize the output of the evaluator. Response: I agree this makes it more likely to find some adversarial input to the evaluator, but that's not because you "turned it into a grader-optimizer", it's because you changed the planning algorithm (and in particular changed the distribution of plans considered by the planner so that adversarial plans were more likely to be considered).
2. Instead of having the evaluator rate plans based on how many diamonds are predicted to be created, we now have the evaluator rate plans based on its prediction of what the evaluator will say in the future about how many diamonds there are. This makes the agent worse. In particular, consider the plan "study the evaluator to find a way to trigger an underflow error that causes its output to wrap around to +inf" -- this plan would be rejected by the original agent, but would plausibly be executed by the new agent. (This is pretty similar to the wireheading literature, e.g. Learning What to Value and current-RF optimization.) Response: I think you have created a bad grader and that's why you're in trouble (just as you would be in trouble if you made a values-executor with bad values). When I imagine a grader-optimizer that is actually good, I imagine a grader that disapproves of plans of the form "search for errors in the grader" (and so those plans don't happen).

Conclusion. Putting it all together, here are a bunch of different things that can make your [planner + evaluator]-system safer:

1. Informed oversight. Ensure that the evaluator knows and understands the details about how the plan was generated. (Cf. the plagiarism example.)
2. Avoid being pushed OOD. Have the planner consider plans that are similar to those that the evaluator has had historical experience with, rather than searching over totally novel plans.
3. Current-RF optimization. The evaluator should evaluate plans using its current beliefs rather than the prediction of its future beliefs. (The latter means it will approve of plans that foreseeably manipulate it.)
4. Scaling safely. Ensure that as your AI system becomes more capable, there is also the wisdom to handle those new capabilities well. (As a particular example, the evaluator should reject plans that make the planner way more powerful without also making the evaluator a lot more powerful.)

I don't think these are specific to values-executors rather than grader-optimizers. If you do build a grader-optimizer, you do want to take care that your grader has all of these properties; but you will need to do similar things when building a values-executor.

Strong-upvote, strong disagreevote, thanks so much for writing this out :) I'm not going to reply more until I've cleared up this thread with you (seems important, if you think the pseudocode was a grader-optimizer, well... I think that isn't obviously doomed).

A few questions to better understand your frame:

• You mostly mention two outcomes for the various diamond-maximizer architectures: maximizing the number of diamonds produced and creating hypertuned-fooling-plans for the evaluator. If I could magically ensure that plan-space only contains plans that are not hypertuned-fooling-plans (they might try, but will most likely be figured out), would you say that then grader-optimization gives us an aligned AI? Or are there other failures modes that you see?
• Intuitively if maximizing the number of diamonds and maximizing the evaluation of the number of diamonds are not even close, I would expect multiple distinct failure modes "in-between".
• In your response to Wei Dai, I interpret you as making an uncompetitiveness claim for grader-optimization: that it will need to pay a cost in compute for both generating and pruning the adversarial examples that will make it cost more than alternative architectures. Why do you think that this cost isn't compensated by the fact that you're searching over more plans and so have access to more "good options" too?
• You're making strong claims about us needing to avoid as much as possible going on the route of grader optimization. Why do you expect that there is no clean/clear cut characterization of the set of adversarial plans (or a superset) that we could just forbid and then go on our merry way building grader optimizers?

Really appreciate the good questions!

If I could magically ensure that plan-space only contains plans that are not hypertuned-fooling-plans (they might try, but will most likely be figured out), would you say that then grader-optimization gives us an aligned AI? Or are there other failures modes that you see?

No, there are other failure modes due to unnaturality. Here's something I said in private communication:

Some of my unease comes from putting myself in the shoes of the grader.

>Be me
>time-limited simulation forced to grade how happy the world will be under some plan proposed by this monomaniacal "actor" AI that only wants my number to come out high
>tfw
>ok, whatever
>time to get to work
>...
>looks like this plan is pretty good
>helping at a soup kitchen
>holding doors open
>working with kids
>what number do I give it?
>Uh... 20?
>What if my other invocations got anchored on a higher number right before this process
>what's the shelling procedure
>...
>Maybe my credence that this is the most kind plan the actor can come up with
>hell, man, I don't know.
>.1?
>(The variance and path-dependence on the schelling procedure is going to be crazy.)

This seems like another way the grader gets "outmaneuvered", where a similarly sophisticated actor can abstractly peer down many execution paths and select for unwisdom and upwards errors

that it will need to pay a cost in compute for both generating and pruning the adversarial examples that will make it cost more than alternative architectures. Why do you think that this cost isn't compensated by the fact that you're searching over more plans and so have access to more "good options" too?

So, clarification: if I (not a grader-optimizer) wanted to become a grader-optimizer while pursuing my current goals, I'd need to harden my own evaluation procedures to keep up with my plan-search now being directed towards adversarial plan generation.

Furthermore, for a given designer-intended task (e.g. "make diamonds"), to achieve that with grader-optimization, the designer pays in the extra effort they need to harden the grader relative to just... not evaluating adversarial plans to begin with.

Given an already pointed-to/specified grader, the hardening is already baked in to that grader, and so both evaluation- and values-child should come out about the same in terms of compute usage.

I think that a values-executing AGI can also search over as many plans which actually make sense, I don't think its options are limited or anything. But it'll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. "don't think about basilisks").

You're making strong claims about us needing to avoid as much as possible going on the route of grader optimization. Why do you expect that there is no clean/clear cut characterization of the set of adversarial plans (or a superset) that we could just forbid and then go on our merry way building grader optimizers?

1. I don't see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide. EDIT: I'm most confident in this point if you want your AI to propose plans which you can't generate but can maybe verify.
2. See the last 10+years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.

Thanks for taking time to answer my questions in detail!

Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability for the actor to "obfuscate" its plan in order to get high reward?

If so, it's not clear to me why this is valuable for the actor to do? How is it supposed to get better reward from confusion only? If it has another agenda (making paperclips instead of diamonds for example), then the obfuscation is clearly valuable to allow it to work on its main goal. But here its goal is to improve evaluation, and so confusion doesn't seem like it helps.

I think that a values-executing AGI can also search over as many plans which actually make sense, I don't think its options are limited or anything. But it'll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. "don't think about basilisks").

This is the part where I currently cannot reconstruct your model (maybe because I haven't read shard theory in a while). Because from an abstract perspective, the trick part of starting from significantly more limited plans is how to expand the range of plans without allowing adversarial ones. And I don't have a model of what mechanism you think make it easier to go from the bottom up (from few plans accepted to more plans accepted) safely rather than from the top down (from all plans accepted to less plans accepted) safely.

About the "negative result" of not even trying to do grader optimization

1. I don't see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide.
2. See the last 10+years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.

But why don't you see that it should exist? What are the parts of your model that make this prediction? Because by definition, there is not a single innovation or progress that most people expected to exist. That's what it means to figure out something new. So you need an actual reason for why you're not falling into that trap of only being able to imagine what already exists.

Here are some guesses:

• Specific scientific claim that there should be no true name/clean concept of what is adversarial for a large class of cognitive architectures (in opposition to Natural Abstraction Hypothesis I guess?)
• Human limitation claim that even though there might be such a true name/clear concept, we have neither the time not the ability to get it right under the current conditions
• Security mindset claim that because the actor is allowed to get arbitrary smart, you should expect that if it optimizes against you, it wins.

For the second point, what do you make of the claim that there are ridiculously few researchers who even attempted to work on the problem in the last 10 years, compared to almost all easier fields of knowledge? Do you think that this should not alter the evidence of absence you get from the absence of evidence?

Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability for the actor to "obfuscate" its plan in order to get high reward?

No, the point is that the grader can only grade the current plan; it doesn't automatically know what its counterfactual branches output. The grader is scope-limited to its current invocation. This makes consistent grading harder (e.g. the soup-kitchen plan vs political activism, neither invocation knows what would be given by the other call to the grader, so they can't trivially agree on a consistent scale).

I think that grader-optimization is likely to fail catastrophically when the grader is (some combination of):

• more like “built / specified directly and exogenously by humans or other simple processes”, less like e.g. “a more and more complicated grader getting gradually built up through some learning process as the space-of-possible-plans gets gradually larger”
• more like “looking at the eventual consequences of the plan”, less like “assessing plans for deontology and other properties” (related post) (e.g. “That plan seems to pattern-match to basilisk stuff” could be a strike against a plan, but that evaluation is not based solely on the plan’s consequences.)
• more like “looking through tons of wildly-out-of-the-box plans”, less like “looking through a white-list of a small number of in-the-box plans”

Maybe we agree so far?

But I feel like this post is trying to go beyond that and say something broader, and I think that’s where I get off the boat.

I claim that maybe there’s a map-territory confusion going on. In particular, here are two possible situations:

• (A) Part of the AGI algorithm involves listing out multiple plans, and another part of the algorithm involves a “grader” that grades the plans.
• (B) Same as (A), but also assume that the high-scoring plans involve a world-model (“map”), and somewhere on that map is an explicit (metacognitive / reflective) representation of the “grader” itself, and the (represented) grader’s (represented) grade outputs (within the map) are identical to (or at least close to) the actual grader’s actual grades within the territory.

I feel like OP equivocates between these. When it’s talking about algorithms it seems to be (A), but when it’s talking about value-child and appendix C and so on, it seems to be (B).

In the case of people, I want to say that the “grader” is roughly “valence” / “the feeling that this is a good idea”.

I claim that (A), properly understood, should seem/feel almost tautological—like, it should be impossible to introspectively imagine (A) being false! It’s kinda the claim “People will do things that they feel motivated to do”, or something like that. By contrast, (B) is not tautological, or even true in general—it describes hedonists: “The person is thinking about how to get very positive valence on their own thoughts, and they’re doing whatever will lead to that”.

I think this is related to Rohin’s comment (“An AI system with a "direct (object-level) goal" is better than one with "indirect goals"”)—the AGI has a world-model / map, its “goals” are somewhere on the map (inevitably, I claim), and we can compare the option of “the goals are in the parts of the map that correspond to object-level reality (e.g. diamonds)”, versus “the goals are in the parts of the map that correspond to a little [self-reflective] portrayal of the AGI’s own evaluative module (or some other represented grader) outputting a high score”. That’s the distinction between (not-B) vs (B) respectively. But I think both options are equally (A).

(Sidenote: There are obvious reasons to think that (A) might lead to (B) in the context of powerful model-based RL algorithms. But I claim that this is not inevitable. I think OP would agree with that.)

As I read your comment, I kept expecting to find the point where we disagreed, but... I didn't really find one? I'm not saying "don't have (A) in the training goal" nor am I saying "don't let (A) be present in the AI's mind."

ETA tweaked for clarity.

For your quiz, could you give an example of something that is grader-optimization but which is not wireheading?

Alignment with platonic grader-output isn't wireheading. (I mentioned this variant in the second spoiler, for reference.)