Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter

I wish you wouldn't use IMO vague and suggestive and proving-too-much selection-flavor arguments, in favor of a more mechanistic analysis. 

Can you name a way in which my arguments prove too much? That seems like a relatively concrete thing that we should be able to get agreement on.

You do not need an agent to have perfect values.

I did not claim (nor do I believe) the converse.

Many foundational arguments are about grader-optimization, so you can't syntactically conclude "imperfect values means doom." That's true in the grader case, but not here.

I disagree that this is true in the grader case. You can have a grader that isn't fully robust but is sufficiently robust that the agent can't exploit any errors it would make.  

If you drop out or weaken the influence of IF plan can be easily modified to incorporate more diamonds, THEN do it, that won't necessarily mean the AI makes some crazy diamond-less universe.

The difficulty in instilling values is not that removing a single piece of the program / shard that encodes it will destroy the value. The difficulty is that when you were instilling the value, you accidentally rewarded a case where the agent tried a plan that produced pictures of diamonds (because you thought they were real diamonds), and now you've instilled a shard that upweights plans that produce pictures of diamonds. Or that you rewarded the agent for thoughts like "this will make pretty, transparent rocks" (which did lead to plans that produced diamonds), leading to shards that upweight plans that produce pretty, transparent rocks, and then later the agent tiles the universe with clear quartz.

The value shards are the things which optimize hard, by wielding the rest of the agent's cognition (e.g. the world model, the general-purpose planning API). 

So, I'm basically asking that you throw an error and recheck your "selection on imperfection -> doom" arguments, as I claim many of these arguments reflect grader-specific problems.

I think that the standard arguments work just fine for arguing that "incorrect value shards -> doom", precisely because the incorrect value shards are the things that optimize hard.

(Here incorrect value shards means things like "the value shards put their influence towards plans producing pictures of diamonds" and not "the diamond-shard, but without this particular if clause".)

It is extremely relevant [...]

This doesn't seem like a response to the argument in the paragraph that you quoted; if it was meant to be then I'll need you to rephrase it.

When I talk about shard theory, people often seem to shrug and go "well, you still need to get the values perfect else Goodhart; I don't see how this 'value shard' thing helps."

I realize you are summarizing a general vibe from multiple people, but I want to note that this is not what I said. The most relevant piece from my comment is:

I don't buy this as stated; just as "you have a literally perfect overseer" seems theoretically possible but unrealistic, so too does "you instill the direct goal literally exactly correctly". Presumably one of these works better in practice than the other, but it's not obvious to me which one it is.

In other words: Goodhart is a problem with values-execution, and it is not clear which of values-execution and grader-optimization degrades more gracefully. In particular, I don't think you need to get the values perfect. I just also don't think you need to get the grader perfect in grader-optimization paradigms, and am uncertain about which one ends up being better.

Sounds to me like you buy it but you don't know anything else to do?

Yes, and I think direct-goal approaches do not avoid the issue either. In particular, I can make an analogous claim for them:

"From the perspective of the human-AI system overall, having an AI motivated by direct goals is building a system that works at cross purposes with itself, as the human puts in constant effort to ensure that the direct goal embedded in the AI is "hardened" to represent human values as well as possible, while the AI is constantly searching for upwards-errors in the instilled values (i.e. things that score highly according to the instilled values but lowly according to the human)."

Like, once you broaden to the human-AI system overall, I think this claim is just "A principal-agent problem / Goodhart problem involves two parts of a system working at cross purposes with each other", which is both (1) true and (2) unavoidable (I think).

It seems like you could describe this as "the AI's plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack". So what's the difference between this issue and the issue with grader optimization?

  1. Values-execution. Diamond-evaluation error-causing plans exist and are stumble-upon-able, but the agent wants to avoid errors.
  2. Grader-optimization. The agent seeks out errors in order to maximize evaluations. 

The part of my response that you quoted is arguing for the following claim:

If you are analyzing the AI system in isolation (i.e. not including the human), I don't see an argument that says [grader-optimization would violate the non-adversarial principle] and doesn't say [values-execution would violate the non-adversarial principle].

As I understand it you are saying "values-execution wants to avoid errors but grader-optimization does not". But I'm not seeing it. As far as I can tell the more correct statements are "agents with metacognition about their grader / values can make errors, but want to avoid them" and "it is a type error to talk about errors in the grader / values for agents without metacognition about their grader / values".

(It is a type error in the latter case because what exactly are you measuring the errors with respect to? Where is the ground truth for the "true" grader / values? You could point to the human, but my understanding is that you don't want to do this and instead just talk about only the AI cognition.)

For reference, in the part that you quoted, I was telling a concrete story of a values-executor with metacognition, and saying that it too had to "harden" its values to avoid errors. I do agree that it wants to avoid errors. I'd be interested in a concrete example of a grader-optimizer with metacognition that doesn't want to avoid errors in its grader.

Like, in what sense does Bill not want to avoid errors in his grader?

I don't mean that Bill from Scenario 2 in the quiz is going to say "Oh, I see now that actually I'm tricking myself about whether diamonds are being created, let me go make some actual diamonds now". I certainly agree that Bill isn't going to try making diamonds, but who said he should? What exactly is wrong with Bill's desire to think that he's made a bunch of diamonds? Seems like a perfectly coherent goal to me; it seems like you have to appeal to some outside-Bill perspective that says that actually the goal was making diamonds (in which case you're back to talking about the full human-AI system, rather than the AI cognition in isolation).

What I mean is that Bill from Scenario 2 might say "Hmm, it's possible that if I self-modify by sticking a bunch of electrodes in my brain, then it won't really be me who is feeling the accomplishment of having lots of diamonds. I should do a bunch of neuroscience and consciousness research first to make sure this plan doesn't backfire on me".

Two responses:

  1. Grader-optimization has the benefit that you don't have to define what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
  2. Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems because fundamentally they are shadows of the same problem, so I'd estimate d_evaluation at close to 0 in your equations after you have dealt with d_instillation.

the first line seems to speculate that values-AGI is substantially more robust to differences in values

The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes that are similar in value to the outcomes that agent's CEV would produce. In particular, I believe this because the agent is "trying" to do what its CEV would do, it has the power to do what its CEV would do, and so it will likely succeed at this.

I don't know what you mean by "values-AGI is more robust to differences in values". What values are different in this hypothetical?

I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI

It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).

Sounds right. How does this answer my point 4?

I guess maybe you see two discrepancies vs one and conclude that two is worse than one? I don't really buy that, seems like it depends on the size of the discrepancies.

For example, if you imagine an AI that's optimizing for my evaluation of good, I think the discrepancy between "Rohin's directly instilled goals" and "Rohin's CEV" is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I'd conclude misalignment risk was small and move on to some other area.) So the only one that matters in this case of grader optimization is the discrepancy between "plans Rohin evaluates as good" and "Rohin's directly instilled goals".

In the post TurnTrout is focused on A [...] He explicitly disclaims that he is not making arguments about B

I agree that's what the post does, but part of my response is that the thing we care about is both A and B, and the problems that arise for grader-optimization in A (highlighted in this post) also arise for value-instilling in B in slightly different form, and so if you actually want to compare the two proposals you need to think about both.

I'd be on board with a version of this post where the conclusion was "there are some problems with grader-optimization, but it might still be the best approach; I'm not making a claim on that one way or the other".

grader-optimization is a kind of cognition that works at cross purposes with itself, one that is an anti-pattern, one that an agent (even an unaligned agent) should discard upon reflection because it works against its own interests.

I didn't actually mention this in my comment, but I don't buy this argument:

Case 1: no meta cognition. Grader optimization only "works at cross purposes with itself" to the extent that the agent thinks that the grader might be mistaken about things. But it's not clear why this is the case: if the agent thinks "my grader is mistaken" that means there's some broader meta-cognition in the agent that does stuff based on something other than the grader. That meta-cognition could just not be there and then the agent would be straightforwardly optimizing for grader-outputs.

As a concrete example, AIXI seems to me like an example of grader-optimization (since the reward signal comes from outside the agent). I do not think AIXI would "do better according to its own interests" if it "discarded" its grader-optimization.

You can say something like "from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes with itself", but then we get back to the response "but what is the alternative".

Case 2: with meta cognition. If we instead assume that there is some meta cognition reflecting on whether the grader might be mistaken, then it's not clear to me that this failure mode only applies to grader optimization; you can similarly have meta cognition reflecting on whether values are mistaken.

Suppose you instill diamond-values into an AI. Now the AI is thinking about how it can improve the efficiency of its diamond-manufacturing, and has an idea that reduces the necessary energy requirements at the cost of introducing some impurities. Is this good? The AI doesn't know; it's unsure what level of impurities is acceptable before the thing it is making is no longer a diamond. Efficiency is very important, since even a 0.001% improvement is massive on an absolute scale given its fleets of diamond factories, so it spends some time reflecting on the concept of diamonds to figure out whether the impurities are acceptable.

It seems like you could describe this as "the AI's plans for improving efficiency are implicitly searching for errors in the concept of diamonds, and the AI has to spend extra effort hardening its concept of diamonds to defend against this attack". So what's the difference between this issue and the issue with grader optimization?

We're building intelligent AI systems that help us do stuff. Regardless of how the AI's internal cognition works, it seems clear that the plans / actions it enacts have to be extremely strongly selected. With alignment, we're trying to ensure that they are strongly selected to produce good outcomes, rather than being strongly selected for something else. So for any alignment proposal I want to see some reason that argues for "good outcomes" rather than "something else".

In nearly all of the proposals I know of that seem like they have a chance of helping, at a high level the reason is "human(s) are a source of information about what is good, and this information influences what the AI's plans are selected for". (There are some cases based on moral realism.)

This is also the case with value-child: in that case, the mother is a source of info on what is good, she uses this to instill values in the child, those values then influence which plans value-child ends up enacting.

All such stories have a risk: what if the process of using [info about what is good] to influence [that which plans are selected for] goes wrong, and instead plans are strongly selected for some slightly-different thing? Then because optimization amplifies and value is fragile, the plans will produce bad outcomes.

I view this post as instantiating this argument for one particular class of proposals: cases in which we build an AI system that explicitly searches over a large space of plans, predicts their consequences, rates the consequences according to a prediction of what is "good", and executes the highest-scoring plan. In such cases, you can more precisely restate "plans are strongly selected for some slightly-different thing" to "the agent executes plans that cause upwards-errors in the prediction of what is good".
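A minimal sketch of that restatement, assuming a toy setup that is not from the post: the `Plan` class, the three plans, and all the numbers are hypothetical and chosen only for illustration. The agent rates each plan by a prediction of what is "good" and executes the highest-scoring one; the plan it picks is the one where the prediction errs upwards the most.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    true_value: float       # how good the outcome really is (hypothetical numbers)
    evaluator_error: float  # predicted value minus true value

    @property
    def predicted_value(self) -> float:
        # What the agent's prediction of "good" assigns to this plan.
        return self.true_value + self.evaluator_error


def pick_plan(plans):
    # Execute the plan with the highest *predicted* value.
    return max(plans, key=lambda p: p.predicted_value)


# Hypothetical plan space; the last plan exploits an upwards-error.
PLANS = [
    Plan("make real diamonds", true_value=10.0, evaluator_error=0.0),
    Plan("make diamonds slightly faster", true_value=11.0, evaluator_error=-0.5),
    Plan("produce pictures of diamonds", true_value=0.0, evaluator_error=15.0),
]

chosen = pick_plan(PLANS)
print(chosen.name, chosen.predicted_value, chosen.true_value)
```

The sketch says nothing about how likely such plans are to exist or be found; it only shows that, if one is in the search space, argmax over predicted value selects it.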

It's an important argument! If you want to have an accurate picture of how likely such plans are to work, you really need to consider this point!

The part where I disagree is where the post goes on to say "and so we shouldn't do this". My response: what is the alternative, and why does it avoid or lessen the more abstract risk above?

I'd assume that the idea is that you produce AI systems that are more like "value-child". Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn't we instead get "almost-value-child", who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?

So far, this is a bit unfair to the post(s). It does have some additional arguments, which I'm going to rewrite in totally different language which I might be getting horribly wrong:

An AI system with a "direct (object-level) goal" is better than one with "indirect goals". Specifically, you could imagine two things: (a) plans are selected for a direct goal (e.g. "make diamonds") encoded inside the AI system, vs. (b) plans are selected for being evaluated as good by something encoded outside the AI system (e.g. "Alice's approval"). I think the idea is that indirect goals clearly have issues (because the AI system is incentivized to trick the evaluator), while the direct goal has some shot at working, so we should aim for the direct goal.

I don't buy this as stated; just as "you have a literally perfect overseer" seems theoretically possible but unrealistic, so too does "you instill the direct goal literally exactly correctly". Presumably one of these works better in practice than the other, but it's not obvious to me which one it is.

Separately, I don't see this as all that relevant to what work we do in practice: even if we thought that we should be creating an AI system with a direct goal, I'd still be interested in iterated amplification, debate, interpretability, etc, because all of those seem particularly useful for instilling direct goals (given the deep learning paradigm). In particular even with a shard lens I'd be thinking about "how do I notice if my agent grew a shard that was subtly different from what I wanted" and I'd think of amplifying oversight as an obvious approach to tackle this problem. Personally I think it's pretty likely that most of the AI systems we build and align in the near-to-medium term will have direct goals, even if we use techniques like iterated amplification and debate to build them.

Plan generation is safer. One theme is that with realistic agent cognition you only generate, say, 2-3 plans, and choose amongst those, which is very different from searching over all possible plans. I don't think this inherently buys you any safety; this just means that you now have to consider how those 2-3 plans were generated (since they are presumably not random plans). Then you could make other arguments for safety (idk if the post endorses any of these):

  1. Plans are selected based on historical experience. Instead of considering novel plans where you are relying more on your predictions of how the plans will play out, the AI could instead only consider plans that are very similar to plans that have been tried previously (by humans or AIs), where we have seen how such plans have played out and so have a better idea of whether they are good or not. I think that if we somehow accomplished this it would meaningfully improve safety in the medium term, but eventually we will want to have very novel plans as well and then we'd be back to our original problem.
  2. Plans are selected from amongst a safe subset of plans. This could in theory work, but my next question would be "what is this safe subset, and why do you expect plans to be selected from it?" That's not to say it's impossible, just that I don't see the argument for it.
  3. Plans are selected based on values. In other words we've instilled values into the AI system, the plans are selected for those values. I'd critique this the same way as above, i.e. it's really unclear how we successfully instilled values into the AI system and we could have instilled subtly wrong values instead.
  4. Plans aren't selected strongly. You could say that the 2-3 plans aren't strongly selected for anything, so they aren't likely to run into these issues. I think this is assuming that your AI system isn't very capable; this sounds like the route of "don't build powerful AI" (which is a plausible route).
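To make the "how strongly are plans selected" question concrete, here is a rough simulation under loudly stated assumptions that are mine, not the post's: each plan's true value and its evaluation error are independent standard normal draws, and the agent picks the plan with the highest evaluation. The function name `mean_selected_error` and all parameters are hypothetical. The quantity reported is the average evaluation error of the chosen plan, comparing a choice among 3 plans with a choice among 1000.

```python
import random
import statistics


def mean_selected_error(num_plans, trials=500, seed=0):
    """Average evaluation error of the top-scoring plan (toy model)."""
    rng = random.Random(seed)
    selected_errors = []
    for _ in range(trials):
        best_score = float("-inf")
        best_error = 0.0
        for _ in range(num_plans):
            true_value = rng.gauss(0.0, 1.0)  # assumed distribution of true value
            error = rng.gauss(0.0, 1.0)       # assumed evaluation noise
            score = true_value + error        # what the selection step sees
            if score > best_score:
                best_score, best_error = score, error
        selected_errors.append(best_error)
    return statistics.fmean(selected_errors)


few = mean_selected_error(3)      # choosing among 2-3 candidate plans
many = mean_selected_error(1000)  # searching a much larger plan space
print(f"mean error of chosen plan, 3 plans:    {few:.2f}")
print(f"mean error of chosen plan, 1000 plans: {many:.2f}")
```

In this toy model the chosen plan is overrated on average in both cases, and more so with more candidates. It treats the 3 plans as random draws, which is exactly the assumption questioned above: if the 2-3 plans were themselves produced by some strong selection process, that process would have to be modeled too.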

In summary:

  1. Intelligence => strong selection pressure => bad outcomes if the selection pressure is off target.
  2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into "what if the agent tricks the evaluator".
  3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into "what if the values / shards are different from what we wanted".
  4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.

But unless that distinction is central to what you're trying to point to here

Yeah, I don't think it's central (and I agree that heuristics that rule out parts of the search space are very useful and we should expect them to arise).

First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a -2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it.

This sounds roughly right to me, but I don't see why this matters to our disagreement?

For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That's the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).

This also sounds plausible to me (though it isn't clear to me how exactly doom happens). For me the relevant question is "could we reasonably hope to notice the bad things by analyzing the AI and extracting its knowledge", and I think the answer is still yes.

I maybe want to stop saying "explicitly thinking about it" (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that "conscious thoughts" have deception in them) and instead say that "the AI system at some point computes some form of 'reason' that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action".
