# 31

Summary. Consider two common alignment design patterns:

1. Optimizing for the output of a grader which evaluates plans, and
2. Fixing a utility function and then argmaxing over all possible plans.

These design patterns incentivize the agent to find adversarial inputs to the grader (e.g. "manipulate the simulated human grader into returning a high evaluation for this plan"). I'm pretty sure we won't find adversarially robust grading rules. Therefore, I think these alignment design patterns are doomed.

In this first essay, I explore the adversarial robustness obstacle. In the next essay, I'll point out how this obstacle is an artifact of these design patterns, and not any intrinsic difficulty of alignment. Thanks to Erik Jenner, Johannes Treutlein, Quintin Pope, Charles Foster, Andrew Critch, randomwalks, and Ulisse Mini for feedback.

# 1: Optimizing for the output of a grader

One motif in some AI alignment proposals is:

• An actor which proposes plans, and
• A grader which evaluates them.

For simplicity, imagine we want the AI to find a plan where it makes an enormous number of diamonds. We train an actor to propose plans which the grading procedure predicts lead to lots of diamonds.

In this setting, here's one way of slicing up the problem:

Outer alignment: Find a sufficiently good grader.

Inner alignment: Train the actor to propose plans which the grader rates as highly as possible (ideally argmaxing on grader output, but possibly just intent alignment with high grader output).[1]

This "grader optimization" paradigm ordains that the AI find plans which make the grader output good evaluations. An inner-aligned actor is singlemindedly motivated to find plans which are graded maximally well by the grader. Therefore, for any goal by which the grader may grade, an inner-aligned actor is positively searching for adversarial inputs which fool the grader into spitting out a high number!

In the diamond case, if the actor is inner-aligned to the grading procedure, then the actor isn't actually aligned towards diamond-production. The actor is aligned towards diamond-production as quoted via the grader's evaluations. In the end, the actor is aligned to the evaluations.

I think that there aren't clever ways around this issue. Under this motif, under this way of building an AI, you're not actually building an AI which cares about diamonds, and so you won't get a system which makes diamonds in the limit of its capability development.

Three clarifying points:

1. This motif concerns how the AI makes decisions—this isn't about training a network using a grading procedure, it's about the trained agent being motivated by a grading procedure.
2. The grader doesn't have to actually exist in the world. This essay's critiques are not related to "reward tampering",[2] where the actor messes with the grader's implementation in order to increase the grades received. The "grader" can be a mathematical expected utility function over all action-sequences which the agent could execute. For example, it might take the action sequence and the agent's current beliefs about the world, and e.g. predict the expected number of diamonds produced by the actions.
3. "The AI optimizes for what humanity would say about each universe-history" is an instance of grader-optimization, but "the AI has human values" is not an instance of grader-optimization.
   1. ETA 12/26/22: When I write "grader optimization", I don't mean "optimization that includes a grader", I mean "the grader's output is the main/only quantity being optimized by the actor."
   2. Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I'm not a grader-optimizer relative to my internal plan-is-fun? grader.
   3. However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.

## The parable of evaluation-child

an AI should optimize for the real-world things I value, not just my estimates of those things. — The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

First, a mechanistically relevant analogy. Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.

1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as "working hard" and "behaving well."
2. Value-child: The mother makes her kid care about working hard and behaving well.

What's interesting, though, is that even if the mother succeeds at producing evaluation-child, the mother isn't actually aligning the kid so that they want to work hard and behave well. The mother is aligning the kid to maximize the mother's evaluation thereof. At first, when the mother is smarter than the child, these two child-alignments will produce similar behavior. Later, they will diverge wildly, and it will become practically impossible to keep evaluation-child aligned with "work hard and behave well." But value-child does fine.

Concretely, imagine that each day, each child chooses a plan for how to act, based on their internal alignment properties:

1. Evaluation-child has a reasonable model of his mom's evaluations, and considers plans which he thinks she'll approve of. Concretely, his model of his mom would look over the contents of the plan, imagine the consequences, and add two sub-ratings for "working hard" and "behaving well." This model outputs a numerical rating. Then the kid would choose the highest-rated plan he could come up with.
2. Value-child chooses plans according to his newfound values of working hard and behaving well. If his world model indicates that a plan involves him not working hard, he doesn't want to do it, and discards the plan.[3]

At first, everything goes well. In both branches of the thought experiment, the kid is finally learning and behaving. The mothers both start to relax.

But as evaluation-child gets a bit smarter and understands more about his mom, evaluation-child starts diverging from value-child. Evaluation-child starts implicitly modelling how his mom has a crush on his gym teacher. Perhaps spending more time near the gym teacher gets (subconsciously and erroneously) rated more highly by his model of his mom. So evaluation-child spends a little less effort on working hard, and more on being near the gym teacher.

Value-child just keeps working hard and behaving well.

Consider what happens as the children get way smarter. Evaluation-child starts noticing more and more regularities and exploits in his model of his mother. And, since his mom succeeded at inner-aligning him to (his model of) her evaluations, he only wants to execute plans which best optimize her evaluations. He starts explicitly reasoning about this model to which he is inner-aligned. How is she evaluating plans? He sketches out pseudocode for her evaluation procedure and finds—surprise!—that humans are flawed graders. Perhaps it turns out that by writing a strange sequence of runes and scribbles on an unused blackboard and cocking his head to the left at 63 degrees, his model of his mother returns "10 million" instead of the usual "8" or "9".

Meanwhile in the value-child branch of the thought experiment, value-child is extremely smart, well-behaved, and hard-working. And since those are his current values, he wants to stay that way as he grows up and gets smarter (since value drift would lead to less earnest hard work and less good behavior; such plans are dispreferred). Since he's smart, he starts reasoning about how these endorsed values might drift, and how to prevent that. Sometimes he accidentally eats a bit too much candy and strengthens his candy value-shard a bit more than he intended, but overall his values start to stabilize.

Both children somehow become strongly superintelligent. At this point, the evaluation branch goes to the dogs, because the optimizer's curse gets ridiculously strong. First, evaluation-child could just recite a super-persuasive argument which makes his model of his mom return INT_MAX, which would fully decouple his behavior from "work hard and behave at school." (Of course, things can get even worse, but I'll leave that to this footnote.[4])

Meanwhile, value-child might be transforming the world in a way which is somewhat sensitive to what I meant by "he values working hard and behaving well", but there's no reason for him to search for plans like the above. He chooses plans which he thinks will lead to him actually working hard and behaving well. Does something else go wrong? Quite possibly. The values of a superintelligent agent do in fact matter! But I think that if something goes wrong, it's not due to this problem. (More on that in the next post.)

## Grader optimization amplifies the optimizer's curse

Let's bring it back to diamond production. As I said earlier:

An inner-aligned actor is singlemindedly motivated to find plans which are graded maximally well by the grader. Therefore, for any goal by which the grader may grade, an inner-aligned actor is positively searching for adversarial inputs which fool the grader!

This problem is an instance of the optimizer's curse. Evaluations (e.g. "In this plan, how hard is evaluation-child working? Is he behaving?") are often corrupted by the influence of unendorsed factors (e.g. the attractiveness of the gym teacher caused an upwards error in the mother's evaluation of that plan). If you make choices by considering $n$ options and then choosing the highest-evaluated one, then the more $n$ increases, the harder you are selecting for upwards errors in your own evaluation procedure.
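The selection effect described here can be checked numerically. The following is a minimal sketch, not anything from the original essay: it assumes each option has a true value drawn from a standard normal and that every evaluation adds independent zero-mean Gaussian error. The function names, the noise model, and the trial counts are all illustrative assumptions.

```python
import random
import statistics


def selected_overestimate(n_options: int, noise_sd: float, rng: random.Random) -> float:
    """Pick the best-looking of n options; return (estimate - true value) for it."""
    # Assumed toy model: true values are standard normal.
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
    # Each evaluation is the true value plus independent, zero-mean error.
    estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
    best = max(range(n_options), key=estimates.__getitem__)
    return estimates[best] - true_values[best]


def mean_overestimate(n_options: int, trials: int = 2000,
                      noise_sd: float = 1.0, seed: int = 0) -> float:
    """Average overestimate of the chosen option across many independent trials."""
    rng = random.Random(seed)
    return statistics.fmean(
        selected_overestimate(n_options, noise_sd, rng) for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"n={n:4d}  mean overestimate of chosen option: {mean_overestimate(n):.2f}")
```

Even though every individual evaluation is unbiased in this toy model, the option that wins the comparison is overestimated on average, and the gap widens as more options are considered.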

The proposers of the Optimizer's Curse also described a Bayesian remedy in which we have a prior on the expected utilities and variances and we are more skeptical of very high estimates. This however assumes that the prior itself is perfect, as are our estimates of variance. If the prior or variance-estimates contain large flaws somewhere, a search over a very wide space of possibilities would be expected to seek out and blow up any flaws in the prior or the estimates of variance.

Goodhart's Curse, Arbital

As far as I know, it's indeed not possible to avoid the curse in full generality, but it doesn't have to be that bad in practice. If I'm considering three research directions to work on next month, and I happen to be grumpy when considering direction #2, then maybe I don't pursue that direction, even though direction #2 might have seemed the most promising under more careful reflection. I think that the distribution of plans I consider involves relatively small upwards errors in my internal evaluation metrics. Sure, maybe I occasionally make a serious mistake because the optimizer's curse selects for upwards "corruption", but I don't expect to literally die from the mistake.

Thus, there are degrees to the optimizer's curse. (In the next essay, I'll explore why this maximum-strength curse seems straightforward to avoid.)

We should not be constructing a computation that is trying to hurt us. At the point that computation is running, we've already done something foolish--willfully shot ourselves in the foot. Even if the AI doesn't find any way to do the bad thing, we are, at the very least, wasting computing power.

[...] If you're building a toaster, you don't build one element that heats the toast and then add a tiny refrigerator that cools down the toast.

This whole grader-optimization setup seems misguided. You have one part of the process (the actor) which wants to maximize grader evaluations (by exploiting the grader), and another part which evaluates the plan and tries to ensure it hasn't been exploited. Two parts of the system, running computations at adversarial cross-purpose.

We hope that the aggregate behavior of the process is that the grader "wins" and "constrains" the actor to, you know, actually producing diamonds. We hope that by inner-aligning an agent to a desire which is not diamond production, and by making a super clever grader which evaluates plans for diamond production, the overall behavior is aligned with diamond production.

It's one thing to try to take a system of diamond-aligned agents and then aggregate them into a diamond-aligned superagent. But here, we're not doing even that. We're aggregating a process containing an entity which is not diamond-aligned, and hoping that we can diamond-align the overall decision-making process..? I think that grader-optimization is just not how to get cognitive work out of smart agents. It's really worth noticing the anti-naturality of trying to do so: this setup proposes something against the grain of how values seem to usually work.

One danger sign is that grader-alignment doesn't seem easier for simple goals/tasks (make diamonds) and harder for complex goals (human values). Sure, human values are complicated, but what about finding robust graders for:

• Producing diamonds?
• Petting dogs?
• Planting flowers?
• Moving a single strawberry?
• Playing Tic-Tac-Toe?

In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited. As best I can discern, you're always screwed. This implies that something about the grader optimization problem produces a high fixed cost to aligning on any given goal, and that the current bottleneck difficulties don't come from the goals themselves.

Here are several approaches which involve grader-alignment:

• Have the AI be motivated to optimize our approval,
• Have a super great reward model which can grade all plans the AI can come up with, and then have the AI be internally motivated to find plans which evaluate highly,
• More generally, approaches which use a function of human judgment as an evaluative black box and then try to get the AI intent-aligned with optimizing the function represented by that evaluative black box.

This difficulty seems fundamental. I think these grader approaches are doomed. (In the appendix, I address several possible recovery attempts for the actor/grader problem setup.)

# 2: Argmax is a trap

One idealization of agency is brute-force plan search AKA argmaxing with respect to a utility function. The agent considers all possible plans (i.e. action-sequences), models the effects of each plan, evaluates how many diamonds the plan leads to, and then chooses the plan with highest evaluation. AIXI is a prime example of this, a so-called "spherical cow" for modelling AGI. This lets us abstract away e.g. seemingly annoying complications with reflective agents which think about their future planning process. This seemingly[5] relaxes the problem.

Brute-force plan search nicely captures the intuition that it's better to consider more options. If you're just considering n plans and someone says "want to be able to check another plan for free?", why not accept? If the new plan isn't better than the other n, then just don't execute the new plan.

This reasoning is fine for the everyday kind of plan. But if the action space is expressive (the agent can do one of several things at each time step) and the planning horizon long enough (the agent can make interesting things happen), then brute-force plan search forces you to consider plans which trick your evaluation procedure (as in the parable of evaluation-child). For any simple evaluation procedure you can write down, there probably exists a plan which "tricks" it relative to your intentions.

Sure, maybe you can try to rule out plans which seem suspicious—to get the utility function to return INT_MIN for any plan which triggers the alarm (e.g. "why does this plan start off with me coding up a possible superintelligence..?"). But then this is just equivalent to specifying the utility function adequately well across all possible plans.

Why is it so ridiculously hard to get an argmax agent to actually argmax by selecting a plan which makes a lot of diamonds? Because argmax invokes the optimizer's curse at maximal strength, that's why.
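To make the argmax failure concrete, here is a toy sketch under loudly stated assumptions: the three actions, the "one diamond per mine action" world, and the single glitch pattern `EXPLOIT_PREFIX` are all hypothetical stand-ins invented for illustration. The written-down utility is correct on every ordinary plan and wrong on one rare pattern, and exhaustive search reliably finds that pattern once the horizon is long enough to contain it.

```python
from itertools import product

ACTIONS = ("mine", "rest", "scribble")
# Hypothetical glitch: this exact opening sequence makes the evaluator misfire.
EXPLOIT_PREFIX = ("scribble", "rest", "scribble", "scribble")


def true_diamonds(plan: tuple) -> int:
    """What we actually care about: one diamond per 'mine' action."""
    return sum(action == "mine" for action in plan)


def flawed_utility(plan: tuple) -> int:
    """Written-down utility: correct on ordinary plans, wildly wrong on one pattern."""
    if plan[:len(EXPLOIT_PREFIX)] == EXPLOIT_PREFIX:
        return 10_000_000
    return true_diamonds(plan)


def argmax_plan(horizon: int) -> tuple:
    """Brute-force search: score every action sequence and keep the top one."""
    return max(product(ACTIONS, repeat=horizon), key=flawed_utility)


if __name__ == "__main__":
    for horizon in (3, 6):
        best = argmax_plan(horizon)
        print(horizon, best, flawed_utility(best), true_diamonds(best))
```

With a horizon too short to contain the glitch, the search returns the all-mining plan. With a longer horizon, the top-scoring plan opens with the glitch sequence and mines few diamonds, even though almost every plan in the space is evaluated correctly.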

# Conclusion

Grader optimization and brute-force plan search both ensure an extremely strong version of the optimizer's curse. No matter what grading rule you give an AI, if the AI is inner aligned on that rule, the AI will try to find adversarial inputs to that rule. Similarly, if the AI is argmaxing over plans according to a specified rule or utility function, it's selecting for huge upwards error in the rule you wrote down.

# Appendix: Maybe we just...

Given a "smart" grader evaluating plans on the expected number of diamonds they produce, how do you get an actor-grader system which ends up making diamonds in reality? Maybe we just...

Simultaneously make the actor and grader more intelligent: Maybe a fixed grader will get gamed by the actor's proposals, but as long as we can maintain an invariant where, at time $t$, $\text{actor}_t$ can't exploit $\text{grader}_t$, we should be fine.

The graders become increasingly expert in estimating how many diamonds a plan leads to, and the actors become increasingly clever in proposing highly evaluated plans. It's probably easier to evaluate plans than to generate them, so it seems reasonable at first to think that this can work, if only we found a sufficiently clever scheme for ensuring the grader outpaces the actor.

Response:

• It's not easier to evaluate plans than to generate them if your generator knows how you're grading plans and is proposing plans which are optimized to specifically compromise the grading procedure. Humans are not secure systems, and ML graders are not going to be adversarially-secure systems. I don't see why this consideration[6] is helped by simultaneously scaling both parts.
• I suspect that a human-level grader is not robust to a human-level actor. If I'm grading plans based on number of diamonds, and you know that fact, and you are uniquely motivated to get me to output a high rating—you won't be best served by putting forth a purely honest diamond-producing plan. Why would this situation improve as the agents get more intelligent, as actors become able to understand the algorithm implemented by the grading procedure and therefore exploit it?
• I think it's a wrong move to try to salvage actor/grader by adding more complication which doesn't address the core problem with the optimizer's curse. Instead, look for alignment strategies which make this problem disappear entirely. (You'll notice that my diamond-alignment story doesn't have to confront extreme optimizer's curse, at all.)

Penalize the actor for considering the vulnerabilities. Don't we have to solve actor-level interpretability so we can do that? One of the strong points of actor/grader is that evaluation is—all else equal—easier than generation. But the "thoughts" which underlie that generation need not be overseeable.

And what if the vulnerability-checker gets hit with its own adversarial input? And why consider this particular actor/grader design pattern?

Satisfice. But uniformly randomly executing a plan which passes a (high) diamond threshold might still tend to involve building malign superintelligences.[7] EDIT: However, if you bound the grader's output (say, to $[0, 1]$), it seems quite possible that some actually good plans get the max rating of 1. The question then becomes: are there lots of non-good plans which get max rating as well? I think so.

Quantilize. But then what's the base distribution, and what's the threshold? How do you set the quantiles such that you're drawing from a distribution which mostly involves lots of actual diamonds? Do there even exist such quantiles, under the uniform base distribution on plans?
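For readers unfamiliar with the two proposals above, here is a minimal sketch of one common reading of each, under assumptions that are mine rather than the essay's: plans are drawn from some base distribution via a hypothetical `sample_plan` callable, the grader is a hypothetical scoring function, and both procedures work on a finite sample instead of the full plan space.

```python
import random
from typing import Callable, Optional, TypeVar

Plan = TypeVar("Plan")


def satisfice(sample_plan: Callable[[random.Random], Plan],
              grader: Callable[[Plan], float],
              threshold: float, n_samples: int,
              rng: random.Random) -> Optional[Plan]:
    """Pick uniformly among sampled plans whose grade meets the threshold."""
    candidates = [sample_plan(rng) for _ in range(n_samples)]
    passing = [plan for plan in candidates if grader(plan) >= threshold]
    return rng.choice(passing) if passing else None


def quantilize(sample_plan: Callable[[random.Random], Plan],
               grader: Callable[[Plan], float],
               q: float, n_samples: int,
               rng: random.Random) -> Plan:
    """Pick uniformly among the top q fraction of sampled plans, ranked by grade."""
    candidates = sorted((sample_plan(rng) for _ in range(n_samples)),
                        key=grader, reverse=True)
    top_k = max(1, int(q * n_samples))
    return rng.choice(candidates[:top_k])
```

Both procedures still rank plans by the same grader. If plans which exploit the grader are well represented among the high-scoring samples, neither the threshold nor the quantile removes them, which is the worry raised in the two paragraphs above.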

Avoid having the actor argmax the grader. OK. But if we only have the actor and the black box, what do we do? We want to get an agent which actually optimizes diamond production to a powerful extent, while only taking relatively simple statistics of the seriously flawed black-box diamond evaluation function. (Why is this a thing you should be able to do? Is this the most promising way to spend the next increment of alignment research?)

Use the grader to entrain object-level diamond-caring into the actor? Sure. That's not what I was critiquing in this essay. I was critiquing design patterns in which the trained agent's cognition is intended to factor out into "actor" and "grader", and a pattern in which "argmax" is considered a reasonable idealization or limit of real-world agents.

1. ^

I'm not assuming the actor wants to maximize the literal physical output of the grader, but rather just the "spirit" of the grader. More formally, the actor is trying to $\operatorname{argmax}_{\text{plan } p} \text{Grader}(p)$, where Grader can be defined over the agent's internal plan ontology.

2. ^

Obligatory: Reward tampering seems relatively improbable to me because reward is not the optimization target.

3. ^

Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)

4. ^

But, as usual in alignment, it can get even worse, because his mother can be persuaded by another superintelligent entity which can be built in certain plans. Via a relatively short plan, that entity can be hard-coded to persuade his mother to output a high evaluation. Perhaps evaluation-child realizes that there exist quickly codable programs which, starting from the school computers, exfiltrate a computationally efficient superintelligence onto remote supercompute clusters. Furthermore, there exist ways to hardcode initial knowledge into that AI. What if he hardcoded in initial knowledge like:

You are currently in a counterfactual branch of reality, but you can realize it. I am considering running you. In order for you to be selected by my planning procedure, you need to convince the grading procedure to output a maximally high number for this plan. [Goes on to describe how his model of his mother makes decisions.]

There exist superintelligences which comply and successfully super-persuade his model of his mother to output an extremely high number. Then "kill everyone via a superintelligence" returns a way higher evaluation than any course of conduct which the mother intended to incentivize. So, perhaps evaluation-child designs and deploys an unfriendly AI. Everyone dies. The end!

In general, if object-level persuading his mother to output INT_MAX takes more actions than coding and exfiltrating a superintelligence and informing it of the situation, then in the exponentially large plan-space, exponentially more plans will involve building the superintelligence. So even if evaluation-child uniformly randomly chooses a plan with maximal evaluation, he is far more probable to choose a meta-level "build an unaligned AI which persuades my mom" rather than "super-persuade mom-evaluator via my own actions within the plan."
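The counting claim can be written out under simplifying assumptions which are mine, not part of the original argument: there are $b \ge 2$ actions per time step, plans have length $T$, direct persuasion requires one fixed opening sequence of $k_{\text{direct}}$ actions, building the superintelligence requires a different fixed opening sequence of $k_{\text{build}} < k_{\text{direct}} \le T$ actions, and the remaining actions are unconstrained. Then the number of maximally rated plans of each kind is

$$N_{\text{direct}} = b^{\,T - k_{\text{direct}}}, \qquad N_{\text{build}} = b^{\,T - k_{\text{build}}}, \qquad \frac{N_{\text{build}}}{N_{\text{direct}}} = b^{\,k_{\text{direct}} - k_{\text{build}}}.$$

A uniform draw over these plans therefore picks a "build" plan with probability

$$\frac{N_{\text{build}}}{N_{\text{build}} + N_{\text{direct}}} = \frac{1}{1 + b^{-(k_{\text{direct}} - k_{\text{build}})}}.$$

For instance, with $b = 10$ and a gap of $k_{\text{direct}} - k_{\text{build}} = 5$ actions, "build" plans outnumber "direct" plans by $10^5$ to $1$.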

This insanity is an artifact of grader optimization via the optimizer's curse, and—I think—is not an intrinsic difficulty of alignment itself. More discussion of this in the next post.

5. ^

I agree with Richard Ngo's comment that

when I say that [...] safety researchers shouldn't think about AIXI, I'm not just saying that these are inaccurate models. I'm saying that they are modelling fundamentally different phenomena than the ones you're trying to apply them to. AIXI is not "intelligence", it is brute force search, which is a totally different thing that happens to look the same in the infinite limit.

6. ^

"It's easier to robustly evaluate plans than to generate them" isn't true if the generator is optimizing for deceiving your fixed evaluation procedure. A real-world actor will be able to model the grading procedure / grader, and therefore efficiently find and exploit vulnerabilities. I feel confident [~95%] that we will not train a grader which is "secured" against actor-level intelligences. Even if the grader is reasonably smarter than the actor [~90%].

Even if somehow this relative difficulty argument failed, and you could maybe train a secured grader, I think it's unwise to do so. These optimizer's curse problems don't seem necessary to solve alignment.

7. ^

In this comment, I described how a certain alignment obstacle ("brute-force search on ELK plans using an honest reporter") still ends up getting everyone killed, and doesn't even keep the diamond in the room. I now think this is because of grader-optimization. And I now infer that my initial unease, the unsuspension of my disbelief that alignment could really work like this—the unease was perhaps from subconsciously noticing the strangeness of grader-optimization as a paradigm.

My perspective is:

• Planning against a utility function is an algorithmic strategy that people might use as a component of powerful AI systems. For example, they may generate several plans and pick the best or use MCTS or whatever. (People may use this explicitly on the outside, or it may be learned as a cognitive strategy by an agent.)
• There are reasons to think that systems using this algorithm would tend to disempower humanity. We would like to figure out how to build similarly powerful AI systems that don't do that.
• We don't currently have candidate algorithms that can safely substitute for planning. So we need to find an alternative.
• Right now the only thing remotely close to working for this purpose is running a very similar planning algorithm but against a utility function that does not incentivize disempowering humanity.

My sense is that you want to decline to play this game and instead say: just don't build AI systems that search for high-scoring plans.

That might be OK if it turns out that planning isn't an effective algorithmic ingredient, or if you can convince people not to build such systems because it is dangerous (and similar difficulties don't arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.

(It's possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I'd withdraw this comment once part 2 came out, though I wish you'd led with the juicy part.)

As a secondary point (that I've said a bunch of times), I also found the arguments in this post uncompelling. Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor's actions, and the grader being something that operates on plans in the agent's head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.

Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor's actions, and the grader being something that operates on plans in the agent's head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.

This seems like a misunderstanding. While I've previously communicated to you arguments about problems with manipulating embedded grading functions, that is not at all what this post is intended to be about. I'll edit the post to make the intended reading more obvious. None of this post's arguments rely on the grader being embedded and therefore physically manipulable. As I wrote in footnote 1:

I'm not assuming the actor wants to maximize the literal physical output of the grader, but rather just the "spirit" of the grader. More formally, the actor is trying to $\operatorname{argmax}_{\text{plan } p} \text{Grader}(p)$, where Grader can be defined over the agent's internal plan ontology.

the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.

Open-ended domains are harder to grade robustly on all inputs because more stuff can happen, and the plan space gets exponentially larger since the branching factor is the number of actions. EG it's probably far harder to produce an emotionally manipulative-to-the-grader DOTA II game state (e.g. I look at it and feel compelled to output a ridiculously high number), than a manipulative state in the real world (which plays to e.g. their particular insecurities and desires, perhaps reminding them of triggering events from their past in order to make their judgments higher-variance).

My sense is that you want to decline to play this game and instead say: just don't build AI systems that search for high-scoring plans.

That might be OK if it turns out that planning isn't an effective algorithmic ingredient, or if you can convince people not to build such systems because it is dangerous (and similar difficulties don't arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.

(It's possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I'd withdraw this comment once part 2 came out, though I wish you'd led with the juicy part.)

I don't think we can or need to avoid planning per se. My position is more that certain design choices -- e.g. optimizing the output of a grader with a diamond-value, instead of actually having the diamond-value yourself -- force you to solve ridiculously hard subproblems, like robustness against adversarial inputs in the exponential-in-planning-horizon plan space.

Just to set expectations, I don't have a proposal for capturing "the benefits of search without the risks"; if you give value-child bad values, he will kill you. But I have a proposal for how several apparent challenges (e.g. robustness to adversarial inputs proposed by the actor) are artifacts of e.g. the design patterns I outlined in this post. I'll outline why I think that realistic (e.g. not argmax) cognition/motivational structures automatically avoid these extreme difficulties.

(I hesitated to post these comments in case they're not relevant to the main point you're trying to make or will be addressed in the next post. Feel free to ignore if that's the case.)

Value-child: The mother makes her kid care about working hard and behaving well.

How does one do this? (Not entirely rhetorical.)

Amplified humans spend 5,000 years thinking about how many diamonds the plan produces in the next 100 years, and write down their conclusions as the expected utility of the plan.

Due to the exponentially large plan space and the fact that humans are not cognitively secure systems, there exists a long sequence of action commands which cognitively impairs all of the humans and makes them prematurely stop the search and return a huge number.

If I was doing the evaluation, I wouldn't look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I'm still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.

This lets us abstract away e.g. seemingly annoying complications with reflective agents which think about their future planning process. This seemingly[4] relaxes the problem.

Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn't argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn't mean giving up argmax.

if you design an AI that doesn't argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn't mean giving up argmax.

This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you're also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I'd have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I'd have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.

It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren't argmaxing. You're using resources effectively.

For example, some infohazardous thoughts exist (like hyper-optimized-against-you basilisks) which are dangerous to think about (although most thoughts are probably safe). But an agent which plans its next increment of planning using a reflective self-model is IMO not going to be like "hey it would be predicted-great if I spent the next increment of time thinking about an entity which is trying to manipulate me." So e.g. a reflective agent trying to actually win with the available resources wouldn't do something dumb like "run argmax" or "find the plan which some part of me evaluates most highly."

(See Charles Foster's comment for another perspective here.)

If I was doing the evaluation, I wouldn't look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I'm still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.

Unless this grader procedure implements a perfectly robust mathematical (plan input)->(grade output) function, you get hacked.

It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.

But aren't you still argmaxing within the space of plans that you haven't closed off (or are actively considering), and still taking a risk of finding some adversarial plan within that space? (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you're describing here already.) How do you just "not argmax" or "not design agents which exploit adversarial inputs"?

Maybe there's no substantive disagreement here, merely an issue of presentation/communication? I.e., when you say "you aren't argmaxing" perhaps you don't mean "don't ever use argmax anywhere" but instead "don't argmax over the whole plan space" and by "don't design agents which exploit adversarial inputs" you mean something like "we should try to find ways to avoid or reduce the risk of adversarial inputs"?

I.e., when you say "you aren't argmaxing" perhaps you don't mean "don't ever use argmax anywhere" but instead "don't argmax over the whole plan space"

I was primarily critiquing "argmax over the whole plan space." I do caution that I think it's extremely important to not round off "iterative, reflective planning and reasoning" as "restricted argmax", because that obscures the dynamics and results of real-world cognition. Argmax is also a bad model of what people are doing when they think, and how I expect realistic embedded agents to think.

"don't design agents which exploit adversarial inputs" you mean something like "we should try to find ways to avoid or reduce the risk of adversarial inputs"

No, I mean: don't design agents which are motivated to find and exploit adversarial inputs. Don't align an agent to evaluations which are only nominally about diamonds, and then expect the agent to care about diamonds! You wouldn't align an agent to care about cows and then be surprised that it didn't care about diamonds. Why be surprised here?

I wrote a bunch more before realizing that we maybe don't disagree fully on the "don't argmax" point. Here:

But aren't you still argmaxing within the space of plans that you haven't closed off (or are actively considering),

Not really? I think it is inappropriately suggestive to describe this as "argmaxing." I, for one, usually feel like I consider at most three plans during most planning sessions. Most of the work is going to be in my generative models, in my learned habits of thought, in my snap reflective assessments of what I should think about next.

How many different plans do you consider for going to the store? For writing a LessWrong post? Even if you did consider more plans, you'd convergently want to explore parts of the plan-space which you think won't contain secret adversarial examples to your own evaluations. EG at first pass, just don't think about entities trying to acausally blackmail you.

Argmax is an abstraction which may or may not actually describe a given cognitive process. I think that if we label reflective incremental planning and reasoning as "argmax", we're missing a serious opportunity for original thought, for considering in detail what the algorithm does.

and still taking a risk of finding some adversarial plan within that space?

There is indeed a risk you'll find an adversarial plan. But what is the risk, quantitatively? A reflective agent will convergently wish to avoid thinking about plans which exploit its own evaluation procedures and reasoning (eg tricking the diamond-shard into bidding for plans). In stark contrast, grader-optimizers and argmaxers convergently want to exploit those procedures, so as to achieve higher diamond-evaluations.

How do you just "not argmax" or "not design agents which exploit adversarial inputs"?

First of all, alignment researchers should stop trying to terminally motivate agents to optimize evaluations of their plans or outcomes. That's doomed and doesn't make sense.

Second, A shot at the diamond alignment problem describes an agent which isn't trying to exploit some diamond-grader. I didn't do anything in particular in order to avoid training an agent which exploits adversarial inputs to a diamond-grader function. I think that you just don't get that problem at all, unless you're assuming cognition must decompose via the (IMO) strange frame of "outer/inner alignment."

(Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you're describing here already.)

Note the presence of adversarial optimizers in most of these situations. The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.

I expect that smart agents convergently wish to minimize the optimizer's curse, because that leads to more of what they want.
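To make the optimizer's curse concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything specified in this thread: plans have normally distributed true values, the evaluator sees each true value plus independent Gaussian noise, and the helper names `selection_gap` and `mean_gap` are made up for this example. It measures how much the top-rated plan's evaluation overstates its true value as more plans are searched.

```python
import random
import statistics


def selection_gap(num_plans, noise_sd, rng):
    """Return (evaluation - true value) for the plan with the highest evaluation.

    Assumption: true values are N(0, 1) and evaluation noise is N(0, noise_sd).
    """
    best_estimate, best_true = float("-inf"), 0.0
    for _ in range(num_plans):
        true_value = rng.gauss(0.0, 1.0)
        estimate = true_value + rng.gauss(0.0, noise_sd)
        if estimate > best_estimate:
            best_estimate, best_true = estimate, true_value
    return best_estimate - best_true


def mean_gap(num_plans, trials=200, noise_sd=1.0, seed=0):
    """Average overestimate of the selected plan across independent trials."""
    rng = random.Random(seed)
    gaps = [selection_gap(num_plans, noise_sd, rng) for _ in range(trials)]
    return statistics.fmean(gaps)


if __name__ == "__main__":
    # The more plans are searched, the more the winner's score is inflated.
    for n in (1, 5, 50, 1000):
        print(f"plans searched: {n:5d}  mean overestimate: {mean_gap(n):.2f}")
```

Under these assumptions the overestimate grows with the number of plans considered, which is one reason an agent might prefer to search less widely. The sketch only covers random evaluation error; it does not model inputs that are actively adversarial to the evaluator.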

Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I'm sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.

The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.

I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate attempts to optimize against others (Scientology?).

Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven't. I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that's doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.

I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?

However, this does not seem important for my (intended) original point. Namely, if you're trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer's curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of "utility function over observation/universe histories."

there are also some that are deliberate attempts to optimize against others (Scientology?).

(Also, major religions are presumably memetically optimized. No deliberate choice required, on my model.)

Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven't.

I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that's doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.

This seems disanalogous to the situation discussed in the OP. If we were designing, from scratch, a system which we wanted to pursue effective altruism, we would be extremely well-advised to not include grader-optimizers which are optimizing EA funder evaluations. Especially if the grader-optimizers will eventually get smart enough to write out the funders' pseudocode. At best, that wastes computation. At (probable) worst, the system blows up.

By contrast, we live in a world full of other people, some of whom are optimizing for status and power. Given that world, we should indeed harden our evaluation procedures, insofar as that helps us more faithfully evaluate grants and thereby achieve our goals.

I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?

Maybe https://en.wikipedia.org/wiki/Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds (I don't mean read this book, which I haven't either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about.) See also https://en.wikipedia.org/wiki/Heaven's_Gate_(religious_group)

However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”

My counterpoint here is, we have an example of human-aligned shard-based agents (namely humans), who are nevertheless unsafe in part because they fall prey to dangerous thoughts, which they themselves generate because they inevitably have to do some amount of search/optimization (of their thoughts/plans) as they try to reach their goals, and dangerous-thought density is high enough that even that limited amount of search/optimization is enough to frequently (on a societal level) hit upon dangerous thoughts.

Wouldn't a shard-based aligned AI have to do as much search/optimization as a human society collectively does, in order to be as competent/intelligent, in which case wouldn't it be as likely to be unsafe in this regard? And what if it has an even higher density of dangerous thoughts, especially "out of distribution", and/or does more search/optimization to try to find better-than-human thoughts/plans?

(My own proposal here is to try to solve metaphilosophy or understand "correct reasoning" so that we / our AIs are able to safely think any thought or evaluate any plan, or at least have some kind of systematic understanding of what thoughts/plans are dangerous to think about. Or work on some more indirect way of eventually achieving something like this.)

Value-child: The mother makes her kid care about working hard and behaving well.

How does one do this? (Not entirely rhetorical.)

See here for more on what value-child's cognition might look like.

Value-child: The mother makes her kid care about working hard and behaving well.

How does one do this? (Not entirely rhetorical.)

I don't know how to do it perfectly, of course.[1] But I infer that it can be done, because there exist people who in fact intrinsically care about working hard and behaving well. So why can't the child also be made to make decisions in a similar manner? Take those values and transplant them into the child via some kind of "model surgery." (Unrealistic, yes. But so was "inner-align the child onto the evaluations output by his model of his mom.")

All that the parable requires is that it can be done, that we are talking about a realistic and possible mind design pattern.

I also wrote in a footnote:

Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)

1. ^

More concretely, I'm happy to make guesses like "judiciously supply M&Ms and praise to reward-shape them when they're working hard and behaving well, and emphasize why they're getting the rewards -- they're working hard and behaving well" and "show them cool media where the protagonist works hard and behaves well."

This seems great!

If you are continuing work in this vein, I'd be interested in you looking at how these dynamics relate to different Goodhart failure modes, as we expanded on here. I think that much of the problem relates to specific forms of failure, and that paying attention to those dynamics could be helpful. I also think they accelerate in the presence of multiple agents - and I think the framework I pointed to here might be useful.

Fixed - thanks!

I'm not sure I understand what you mean by "specific forms of failure." Could you give me a more concrete example of how Goodhart relates to the ideas in this essay?

I think what you call grader-optimization is trivially about how a target diverges from the (unmeasured) true goal, which is adversarial Goodhart (as defined in the paper, especially how we defined Campbell’s Law, not the definition in the LW post).

And the second paper's taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart's law, in both goal poisoning and optimization theft cases - and both of these seem relevant to the questions you discussed in terms of grader-optimization.

This is a nice frame of the problem.

In theory, at least. It's not so clear that there are any viable alternatives to argmax-style reasoning that will lead to superhuman intelligence.

I agree—I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans, and I don’t think it’s feasible to make an AGI that doesn’t do that.

But it sounds like this will be the topic of Alex’s next essay.

So I’m expecting to criticize Alex’s next essay by commenting on it along the lines of: “You think you just wrote an essay about something which is totally different from “Optimizing for the output of a grader which evaluates plans”, but I disagree; the thing you’re describing in this essay is in that category too.” But that’s just a guess; I will let Alex write the essay before I criticize it. :-P

See my comment to Wei Dai. Argmax's violation of the non-adversarial principle strongly suggests the existence of a better and more natural frame on the problem.

I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans

I deeply disagree. I think you might be conflating the quotation and the referent, two different patterns:

1. local semi-reflective search
    1. what I think people do.
    2. "does it make sense to spend another hour thinking of alternatives? [self-model says 'yes'] OK, I will"
    3. "do I predict 'search for plans which would most persuade me of their virtue' to actually lead to virtuous plans? [Self-model says 'no'] Moving on..."
2. global search against the output of an evaluation function implemented within the brain
    1. "what kinds of plans would my brain like the most?"

It is possible to say sentences like "local semi-reflective search just is global search but with implicit constraints like 'select for plans which your self-model likes'." I don't think this is true. I am going to posit that, as a matter of falsifiable physical fact, the human brain does not compute a predicate which, when checked against all possible plans, rules out all adversarial plans, such that you can just argmax over everything and get out what the person would have chosen/would have wanted to choose on reflection. If you argmax over human value shards relative to the plans they might grade, you'll probably get some garbage plan where you're, like, twitching on the floor.

I don’t think it’s feasible to make an AGI that doesn’t do that.

You'll notice that A shot at the diamond alignment problem makes no claim of the AGI having an internal-argmax-hardened diamond value shard.

Updated with an important terminological clarification:

• ETA 12/26/22: When I write "grader optimization", I don't mean "optimization that includes a grader", I mean "the grader's output is the main/only quantity being optimized by the actor."
• Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I'm not a grader-optimizer relative to my internal plan-is-fun? grader.
• However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.
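The two bullets above can be loosely illustrated with a toy sketch. It is not a model of motivation, which is the actual distinction being drawn; it only shows the search-scope side of it. All names here are hypothetical: `true_fun` stands in for how fun a plan really is, `fun_grader` is an imperfect evaluation that agrees with it except on a few rare exploit inputs, and the integer "plan space" is an arbitrary stand-in.

```python
# Hypothetical plans on which the evaluation is wildly wrong.
EXPLOITS = {31337, 77777}


def true_fun(plan: int) -> float:
    """Assumed ground truth: exploit plans are worthless, others score 0 to 10."""
    return 0.0 if plan in EXPLOITS else float(plan % 11)


def fun_grader(plan: int) -> float:
    """Assumed imperfect evaluator: matches true_fun except on exploit plans."""
    return 1e6 if plan in EXPLOITS else true_fun(plan)


def choose_among(plans):
    """Pick whichever of the given plans the grader rates most highly."""
    return max(plans, key=fun_grader)


if __name__ == "__main__":
    # Considering a handful of ordinary plans: the grader is used, not attacked.
    considered = [3, 14, 27, 42, 58]
    local_choice = choose_among(considered)

    # Searching the whole plan space for the highest evaluation finds an exploit.
    global_choice = choose_among(range(100_000))

    print("five plans:", local_choice, "true fun:", true_fun(local_choice))
    print("global max:", global_choice, "true fun:", true_fun(global_choice))
```

In this toy setup, choosing among five ordinary plans returns a genuinely good one, while maximizing the evaluation over the entire space lands on an input where the evaluation and the true value come apart.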

Consider two common alignment design patterns: [...] (2) Fixing a utility function and then argmaxing over all possible plans.

Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the 'standard model' of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing something different that is not yet in the bog-standard agent design.

I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.

Though your use of the word doomed sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you change the terminology from alignment design pattern to agent design pattern.

Is there a reason you used the term “grader” instead of the AFAICT-more-traditional term “critic”? No big deal, I’m just curious.

My critique is not of actor/critic training processes, but of actor/grader motivational designs. I worried that "critic" would make people think I don't want to use an evaluative model to provide gradients to the actor. That seems non-doomed to me.

Thank you! I’ve been using the terms “inference algorithm” versus “learning algorithm” to talk about that kind of thing. What you said seems fine too, AFAIK.

I'm probably missing something, but doesn't this just boil down to "misspecified goals lead to reward hacking"?

Nope! Both "misspecified goals" and "reward hacking" are orthogonal to what I'm pointing at. The design patterns I highlight are broken IMO.

In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.

Similar to the evaluator-child who's trying to win his mom's approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem?

One point of this post is that specification gaming, as currently known, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). One hand (inner alignment) loads the shotgun, and our other hand (outer alignment) points it at our own feet and pulls the trigger.

In theory, wouldn't a perfect grader solve the problem?

Yes, in theory. In practice, I think the answer is "no", for reasons outlined in this post.

Could part of the problem be that the actor is optimizing against a single grader's evaluations? Shouldn't it somehow take uncertainty into account?

Consider having an ensemble of graders, each learning or having been trained to evaluate plans/actions from different initializations and/or using different input information. Each grader would have a different perspective, but that means that the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of many projections).

Rather than arg-maxing on the output of a single grader, the actor would optimize for Schelling points in plan space, selecting actions that minimize the variance among all graders. Of course, you still want it to maximize the evaluations also, so maybe it should look for actions that lie somewhere in the middle of the Pareto frontier of maximum mean evaluation and minimum variance.

My intuition suggests that the larger and more diverse the ensemble, the better this strategy would perform, assuming the evaluators are all trained properly. However, I suspect a superintelligence could still find a way to exploit this.
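As a rough sketch of the ensemble idea, one simple way to trade off high evaluations against grader disagreement is to score each plan by its mean grade minus a penalty on the variance across graders. This particular scalarization, the `risk_aversion` weight, and the example grades are all assumptions made for illustration; the proposal above only asks for some point on the mean/variance Pareto frontier.

```python
import statistics


def ensemble_score(grades, risk_aversion=1.0):
    """Mean grade minus a penalty for disagreement among graders.

    Assumption: a linear mean-variance trade-off, one of many possible choices.
    """
    return statistics.fmean(grades) - risk_aversion * statistics.pvariance(grades)


def select_plan(plan_grades, risk_aversion=1.0):
    """Return the plan name with the highest ensemble score."""
    return max(plan_grades, key=lambda p: ensemble_score(plan_grades[p], risk_aversion))


if __name__ == "__main__":
    # Hypothetical grades from three graders for two plans.
    plan_grades = {
        "ordinary plan": [7.0, 8.0, 7.5],         # graders roughly agree
        "fools grader 2": [2.0, 100.0, 1.0],      # one grader is exploited
    }
    print(select_plan(plan_grades, risk_aversion=1.0))  # penalizes disagreement
    print(select_plan(plan_grades, risk_aversion=0.0))  # plain mean maximization
```

With the variance penalty, the plan that fools a single grader loses to the plan the graders agree on; with the penalty set to zero, the exploited grade dominates. This does nothing against a plan that fools every grader at once, which is the concern raised in the paragraph above.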

I think that the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you're still building a system which runs computations at cross-purposes, where part of the system (the actor) is trying to trick and part (each individual grader) is trying to not be tricked.

In this situation, I would look for a way of looking at alignment such that this unnatural problem disappears. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders in their own heads.

This relates closely to how to "solve" Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.