## AI Alignment Forum

Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.

# Sequences

- Thoughts on Corrigibility
- The Causes of Power-seeking and Instrumental Convergence
- Reframing Impact

# Wiki Contributions

Actual answer: Because the entire field of experimental psychology that's why.

This excerpt isn’t specific so it’s hard to respond, but I do think there’s a lot of garbage in experimental psychology (like every other field), and more specifically I believe that Eliezer has cited some papers in his old blog posts that are bad papers. (Also, even when experimental results are trustworthy, their interpretation can be wrong.) I have some general thoughts on the field of evolutionary psychology in Section 1 here.

Eliezer's reasoning is surprisingly weak here. It doesn't really interact with the strong mechanistic claims he's making ("Motivated reasoning is definitely built-in, but it's built-in in a way that very strongly bears the signature of 'What would be the easiest way to build this out of these parts we handily had lying around already'").

He just flatly states a lot of his beliefs as true:

the conventional explanation of which is that we have built-in cheater detectors. This is a case in point of how humans aren't blank slates and there's no reason to pretend we are.

Local invalidity via appeal to dubious authority; conventional explanations are often totally bogus, and in particular I expect this one to be bogus.

But Eliezer just keeps stating his dubious-to-me stances as obviously True, without explaining how they actually distinguish between mechanistic hypotheses, or e.g. why he thinks he can get so many bits about human learning process hyperparameters from results like Wason's (I thought it was hard to go from superficial behavioral results to statements about messy internals, and that inferring "hard-coding" is extremely hard even for obvious-seeming candidates).

Similarly, in the summer (consulting my notes + best recollections here), he claimed ~"Evolution was able to make the (internal physiological reward schedule) -> (learned human values) mapping predictable because it spent lots of generations selecting for alignability on caring about proximate real-world quantities like conspecifics or food", and I asked "why do you think evolution had to tailor the reward system specifically to make this possible? what evidence has located this hypothesis?", and he said "I read a neuroscience textbook when I was 11?" and stared at me with raised eyebrows.

I just stared at him with a shocked face. I thought, surely we're talking about different things. How could that data have been strong evidence for that hypothesis? Really, neuroscience textbooks provide huge evidence for evolution having to select the reward->value mapping into its current properties?

I also wrote in my journal at the time:

EY said that we might be able to find a learning process with consistent inner / outer values relations if we spent huge amounts of compute to evolve such a learning process in silicon

Eliezer seems to attach some strange importance to the learning process being found by evolution, even though the learning process's initial conditions screen off evolution's influence...? Like, what?

I still don't understand that interaction. But I've had a few interactions like this with him, where he confidently states things, I ask him why he thinks that, and he offers some unrelated-seeming evidence which doesn't -- AFAICT -- actually discriminate between hypotheses.

That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers -- agents that do things like planning or search to maximize a utility function. But planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. agents that enact learned policies in RL -- policies which do not explicitly maximize reward in deployment, but which did so in training.

FYI I do expect planning for smart agents, just not something qualitatively alignment-similar to "argmax over crisp human-specified utility function." (In the language of the OP, I expect values-executors, not grader-optimizers.)

I have no way of knowing that increasing the candy-shard's value won't cause a phase shift that substantially increases the perceived value of the "kill all humans, take their candy" action plan. I ultimately care about the agent's "revealed preferences", and I am not convinced that those are smooth relative to changes in the shards.

I'm not either. I think there will be phase changes wrt "shard strengths" (keeping in mind this is a leaky abstraction), and this is a key source of danger IMO.

Basically my stance is "yeah there are going to be phase changes, but there are also many perturbations which don't induce phase changes, and I really want to understand which is which."

Lol, cool. I tried the "4 minute" challenge (without having read EY's answer, but having read yours).

Hill-climbing search requires selecting on existing genetic variance on alleles already in the gene pool. If there isn’t a local mutation which changes the eventual fitness of the properties which that genotype unfolds into, then you won’t have selection pressure in that direction. On the other hand, gradient descent is updating live on a bunch of data in fast iterations which allow running modifications over the parameters themselves. It’s like being able to change a blueprint for a house, versus being able to be at the house in the day and direct repair-people.

The changes happen online, relative to the actual within-cognition goings-on of the agent (e.g. you see some cheese, go to the cheese, get a policy gradient and become more likely to do it again). Compare that to having to try out a bunch of existing tweaks to a cheese-bumping-into agent (e.g. make it learn faster early in life but then get sick and die later), where you can’t get detailed control over its responses to specific situations (you can just tweak the initial setup).

Gradient descent is just a fundamentally different operation. You aren't selecting over learning processes which unfold into minds, trying out a finite but large gene pool of variants, and then choosing the most self-replicating; you are instead doing local parametric search over what changes outputs on the training data. But evolution isn't even differentiable; you aren't running gradients through it directly. So there isn't even an analogue of "training data" in the evolutionary regime.
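To make the contrast concrete, here is a toy sketch. Every name and the quadratic objective are my own illustrative assumptions: the evolution-style search can only mutate and select among whole candidate "genomes", while gradient descent updates the parameter directly along the local direction of steepest loss reduction.

```python
import random

def loss(x):
    # Toy objective with optimum at x = 3.0.
    return (x - 3.0) ** 2

def evolve(pop_size=20, generations=50):
    # Evolution-style search: mutate a population of fixed candidates
    # and keep the fittest; no access to within-lifetime gradients.
    pop = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=loss)
        survivors = pop[: pop_size // 2]
        pop = survivors + [x + random.gauss(0, 0.5) for x in survivors]
    return min(pop, key=loss)

def descend(x=-10.0, lr=0.1, steps=100):
    # Gradient descent: step the parameter itself along the local
    # direction of steepest loss reduction.
    for _ in range(steps):
        grad = 2 * (x - 3.0)  # d(loss)/dx
        x -= lr * grad
    return x

print(descend())  # converges to ~3.0
```

Both processes end up near the optimum here, but only gradient descent does so by directly modifying the live parameters in response to each evaluation, which is the disanalogy the paragraph above is pointing at.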

I think I ended up optimizing for "actually get model onto the page in 4 minutes" and not for "explain in a way Scott would have understood."

Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I'm still confused why Scott concluded "oh I was just confused in this way" and then EY said "yup that's why you were confused", and I'm still like "nope Scott's question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset."

Thanks for leaving this comment, I somehow only just now saw it.

Given your pseudocode it seems like the only point of planModificationSample is to produce plan modifications that lead to high outputs of self.diamondShard(self.WM.getConseq(plan)). So why is that not "optimizing the outputs of the grader as its main terminal motivation"?

I want to make a use/mention distinction. Consider an analogous argument:

"Given gradient descent's pseudocode it seems like the only point of backward is to produce parameter modifications that lead to low outputs of loss_fn. Gradient descent selects, over all directions, the direction of maximal local loss reduction (the negative gradient). Why is that not "optimizing the outputs of the loss function as gradient descent's main terminal motivation"?"[1]

Locally reducing the loss is indeed an important part of the learning dynamics of gradient descent, but this (I claim) has very different properties than "randomly sample from all global minima in the loss landscape" (analogously: "randomly sample a plan which globally maximizes grader output").
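One toy way to see the distinction, under an assumed double-well loss of my own construction: on a loss with two equally good global minima, gradient descent converges to whichever minimum is near its initialization. It locally reduces loss; it does not sample from the set of global minima.

```python
def loss(x):
    # Assumed toy loss with two global minima, at x = -2 and x = +2.
    return (x ** 2 - 4.0) ** 2

def grad(x):
    # Analytic derivative of the loss above.
    return 4.0 * x * (x ** 2 - 4.0)

def gradient_descent(x0, lr=0.01, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Starting near one basin, GD lands in *that* basin's minimum; it never
# "considers" the equally good minimum in the other basin.
print(gradient_descent(1.0))   # lands near +2
print(gradient_descent(-1.0))  # lands near -2
```

A "randomly sample from all global minima" procedure would return -2 and +2 with equal probability regardless of initialization, which is a very different object from the local dynamics above.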

But I still haven't answered your broader question. I think you're asking for a very reasonable definition which I have not yet given, in part because I've remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw. At least, intensionally (which is why I've focused on giving examples, in the hope of getting the vibe across).

I gave it a few more stabs, and I don't think any of them ended up being sufficient. But here they are anyways:

1. A "grader-optimizer" makes decisions primarily on the basis of the outputs of some evaluative submodule, which may or may not be explicitly internally implemented. The decision-making is oriented towards making the outputs come out as high as possible.
2. In other words, the evaluative "grader" submodule is optimized against by the planning.
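Here is one way that definition might be sketched in code. The names echo the earlier pseudocode discussion (diamondShard, getConseq), but every implementation detail below is a hypothetical stand-in, not the actual setup from that thread:

```python
import random

def get_conseq(plan):
    # Stand-in world model: "predicted consequences" of a plan,
    # compressed to a single number for illustration.
    return sum(plan)

def diamond_grader(conseq):
    # Stand-in evaluative submodule; rates consequences, peaking at 10.
    return -abs(conseq - 10)

def grader_optimizer(n_samples=1000):
    # Decision-making oriented toward making the grader's output come
    # out as high as possible: sample plans, keep whichever plan the
    # grader rates best. The grader is what gets optimized against.
    best_plan, best_score = None, float("-inf")
    for _ in range(n_samples):
        plan = [random.randint(0, 5) for _ in range(4)]
        score = diamond_grader(get_conseq(plan))
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score
```

With enough samples, the returned plan's grade approaches the grader's global maximum, including any plan the grader mistakenly rates highly, which is exactly the adversarial pressure the thread is worried about.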

I wish I had a better intensional definition for you, but that's what I wrote immediately and I really better get through the rest of my comm backlog from last week.

Here are some more replies which may help clarify further.

in particular, such a system would produce plans like "analyze the diamond-evaluator to search for side-channel attacks that trick the diamond-evaluator into producing high numbers", whereas planModificationSample would not do that (as such a plan would be rejected by self.diamondShard(self.WM.getConseq(plan))).

Aside -- I agree with both bolded claims, but think they are separate facts. I don't think the second claim is the reason for the first being true. I would rather say, "planModificationSample would not do that, because self.diamondShard(self.WM.getConseq(plan)) would reject side-channel-attack plans."

Grader is complicit: In the diamond shard-shard case, the grader itself would say "yes, please do search for side-channel attacks that trick me"

No, I disagree with the bolded underlined part. self.diamondShardShard wouldn't be tricking itself, it would be tricking another evaluative module in the AI (i.e. self.diamondShard).[2]

If you want an AI which tricks the self.diamondShardShard, you'd need it to primarily use self.diamondShardShardShard to actually steer planning. (Or maybe you can find a weird fixed-point self-tricking shard, but that doesn't seem central to my reasoning; I don't think I've been imagining that configuration.)

and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?

Oh, I would change self.diamondShard to self.diamondShardShard.

1. ^

I think it's silly to say that GD has a terminal motivation, but I'm not intending to imply that you are silly to say that the agent has a terminal motivation.

2. ^

Or more precisely, self.diamondGrader -- since "self.diamondShard" suggests that the diamond-value directly bids on plans in the grader-optimizer setup. But I'll stick to self.diamondShardShard for now and elide this connotation.

Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization -- i.e. effectively planning over a world model

the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry, but rather are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values.

Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e., for humans, we learn our values as some combination of bottom-up (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down (association of abstract value concepts with other, more grounded linguistic concepts).

Can you give me some examples here? I don't know that I follow what you're pointing at.

Strong upvoted. I appreciate the strong concreteness & focus on internal mechanisms of cognition.

Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) -> action mappings. EG a shard agent might have:

1. An "it's good to give your friends chocolate" subshard
2. A "give dogs treats" subshard
3. -> An impulse to give dogs chocolate, even though the shard agent knows what the result would be

But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)

In this way, changing a small set of decision-relevant features (e.g. "brown dog treat" -> "brown ball of chocolate") changes the consequentialist's action logits a lot, way more than it changes the shard agent's logits. In a squinty, informal way, the (belief state -> logits) function has a lower Lipschitz constant/is more smooth for the shard agent than for the consequentialist agent.

So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat -> tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog -> sick dog). You could spin up two copies of the model to compare.
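A minimal sketch of such a test, with both "agents" as hypothetical stand-ins of my own (one keys off shallow observation features, the other off modeled consequences):

```python
def shard_agent_logit(obs):
    # Heuristic-driven: "give brown round things to the dog".
    return 2.0 if obs["color"] == "brown" and obs["shape"] == "round" else -2.0

def consequentialist_logit(obs):
    # Reasons from the modeled consequence of taking the action.
    return 2.0 if obs["predicted_outcome"] == "happy dog" else -2.0

# Small observation-space change, large consequence-space change.
treat = {"color": "brown", "shape": "round", "predicted_outcome": "happy dog"}
chocolate = {"color": "brown", "shape": "round", "predicted_outcome": "sick dog"}

shard_delta = abs(shard_agent_logit(treat) - shard_agent_logit(chocolate))
conseq_delta = abs(consequentialist_logit(treat) - consequentialist_logit(chocolate))
print(shard_delta, conseq_delta)  # 0.0 4.0
```

The perturbation barely registers for the heuristic agent but flips the consequentialist's logit, which is the asymmetry the proposed test would look for in real models.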

I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."

11 fewer words, but I don't think it communicates the intended concept!

If you have to say "I don't mean one obvious reading of the title" as the first sentence, it's probably not a good title. This isn't a dig -- titling posts is hard, and I think it's fair to not be satisfied with the one I gave. I asked ChatGPT to generate several new titles; lightly edited:

1. "Uncertainties left open by Shard Theory"
2. "Limitations of Current Shard Theory"
3. "Challenges in Applying Shard Theory"
4. "Unanswered Questions of Shard Theory"
5. "Exploring the Unknowns of Shard Theory"

After considering these, I think that "Reminder: shard theory leaves open important uncertainties" is better than these five, and far better than the current title. I think a better title is quite within reach.

But how do we learn that fact?

I didn't claim that I assign high credence to alignment just working out; I'm saying that it may, as a matter of fact, turn out that shard theory doesn't "need a lot more work", because alignment works out as a matter of fact from the obvious setups people try.

1. There's a degenerate version of this claim, where ST doesn't need more work because alignment is "just easy" for non-shard-theory reasons, and in that world ST "doesn't need more work" because alignment itself doesn't need more work.
2. There's a less degenerate version of the claim, where alignment is easy for shard-theory reasons -- e.g. agents robustly pick up a lot of values, many of which involve caring about us.

"Shard theory doesn't need more work" (in sense 2) could be true as a matter of fact, without me knowing it's true with high confidence. If you're saying "for us to become highly confident that alignment is going to work this way, we need more info", I agree.

But I read you as saying "for this to work as a matter of fact, we need X Y Z additional research":

At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem the necessitates a different approach.

And I think this is wrong. 2 can just be true, and we won't justifiably know it. So I usually say "It is not known to me that I know how to solve alignment", and not "I don't know how to solve alignment."

Does that make sense?