Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.
That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers -- agents that do things like planning or search to maximize a utility function. But planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. RL agents which, in deployment, do not explicitly maximize reward but instead execute learned policies that were reinforced during training.
FYI I do expect planning for smart agents, just not something qualitatively alignment-similar to "argmax over crisp human-specified utility function." (In the language of the OP, I expect values-executors, not grader-optimizers.)
I have no way of knowing that increasing the candy-shard's value won't cause a phase shift that substantially increases the perceived value of the "kill all humans, take their candy" action plan. I ultimately care about the agent's "revealed preferences", and I am not convinced that those are smooth relative to changes in the shards.
I'm not either. I think there will be phase changes wrt "shard strengths" (keeping in mind this is a leaky abstraction), and this is a key source of danger IMO.
Basically my stance is "yeah there are going to be phase changes, but there are also many perturbations which don't induce phase changes, and I really want to understand which is which."
Lol, cool. I tried the "4 minute" challenge (without having read EY's answer, but having read yours).
Hill-climbing search requires selecting on existing genetic variance -- alleles already present in the gene pool. If there isn't a local mutation which changes the eventual fitness of the properties which that genotype unfolds into, then you won't have selection pressure in that direction. On the other hand, gradient descent is updating live on a bunch of data, in fast iterations which allow running modifications over the parameters themselves. It's like being able to change a blueprint for a house, versus being able to be at the house during the day and direct the repair-people.
The changes happen online, relative to the actual within-cognition goings-on of the agent (e.g. you see some cheese, go to the cheese, get a policy gradient and become more likely to do it again). Compare that to having to try out a bunch of existing tweaks to a cheese-bumping-into agent (e.g. make it learn faster early in life but then get sick and die later), where you can’t get detailed control over its responses to specific situations (you can just tweak the initial setup).
Gradient descent is just a fundamentally different operation. You aren't selecting over learning processes which unfold into minds, trying out a finite but large gene pool of variants, and then choosing the most self-replicating; you are instead doing local parametric search over what changes outputs on the training data. But RL isn't even differentiable; you aren't running gradients through it directly. So there isn't even an analogue of "training data" in the evolutionary regime.
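To make the contrast concrete, here is a minimal sketch (toy, made-up fitness and reward functions; not anything from the exchange itself) of selection over a pool of pre-existing variants versus an online policy-gradient update keyed to a specific experience:

```python
import numpy as np

rng = np.random.default_rng(0)

# Evolution-style outer loop: you can only select among genotypes that already
# exist in the pool, then mutate the survivors. There is no way to reach in and
# edit a genotype in response to one specific situation the agent encountered.
def evolution_step(gene_pool, fitness):
    scores = np.array([fitness(g) for g in gene_pool])
    top_half = [gene_pool[i] for i in np.argsort(scores)[len(gene_pool) // 2:]]
    children = [g + 0.1 * rng.standard_normal(g.shape) for g in top_half]
    return top_half + children

# Gradient-descent / policy-gradient inner loop: the update is computed from
# what the agent just did (saw cheese, went to cheese, got reinforced), and it
# directly modifies the parameters that produced that behavior.
def policy_gradient_step(theta, obs, action, reward, lr=0.1):
    # REINFORCE-style update for a toy linear-Gaussian policy:
    # grad log pi(action | obs) = outer(action - theta @ obs, obs)
    grad_log_pi = np.outer(action - theta @ obs, obs)
    return theta + lr * reward * grad_log_pi
```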
I think I ended up optimizing for "actually get model onto the page in 4 minutes" and not for "explain in a way Scott would have understood."
Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I'm still confused why Scott concluded "oh I was just confused in this way" and then EY said "yup that's why you were confused", and I'm still like "nope Scott's question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset."
Thanks for leaving this comment, I somehow only just now saw it.
Given your pseudocode it seems like the only point of `planModificationSample` is to produce plan modifications that lead to high outputs of `self.diamondShard(self.WM.getConseq(plan))`. So why is that not "optimizing the outputs of the grader as its main terminal motivation"?
I want to make a use/mention distinction. Consider an analogous argument:
"Given gradient descent's pseudocode it seems like the only point of backward
is to produce parameter modifications that lead to low outputs of loss_fn
. Gradient descent selects over all directional derivatives for the gradient, which is the direction of maximal loss reduction. Why is that not "optimizing the outputs of the loss function as gradient descent's main terminal motivation"?"[1]
Locally reducing the loss is indeed an important part of the learning dynamics of gradient descent, but this (I claim) has very different properties than "randomly sample from all global minima in the loss landscape" (analogously: "randomly sample a plan which globally maximizes grader output").
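To gesture at the distinction with a toy sketch (the helper names `sample_modification` and `get_consequences` are my hypothetical stand-ins, not the post's pseudocode): local, shard-filtered plan modification looks quite different from maximizing the grader's output over the whole plan space.

```python
def values_executor_step(plan, world_model, diamond_shard, n_samples=10):
    """Locally improve `plan`: sample a few nearby modifications and keep a
    candidate only when the diamond shard scores its predicted consequences
    higher. This is local improvement over nearby plans, not a search over
    the whole plan space for whatever maximizes the score."""
    best = plan
    best_score = diamond_shard(world_model.get_consequences(best))
    for _ in range(n_samples):
        candidate = world_model.sample_modification(best)  # local tweak
        score = diamond_shard(world_model.get_consequences(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best


def grader_optimizer_step(all_plans, world_model, grader):
    """Globally maximize the grader's output: whichever plan makes the number
    largest wins. This is the analogue of "randomly sample a plan which
    globally maximizes grader output", and it is systematically pulled toward
    the grader's upwards errors (e.g. side-channel attacks on the evaluator)."""
    return max(all_plans, key=lambda p: grader(world_model.get_consequences(p)))
```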
But I still haven't answered your broader question. I think you're asking for a very reasonable definition which I have not yet given, in part because I've remained somewhat confused about the exact grader-optimizer / non-grader-optimizer distinction I want to draw -- at least, intensionally. (Which is why I've focused on giving examples, in the hope of getting the vibe across.)
I gave it a few more stabs, and I don't think any of them ended up being sufficient. But here they are anyways:
I wish I had a better intensional definition for you, but that's what I wrote immediately and I really better get through the rest of my comm backlog from last week.
Here are some more replies which may help clarify further.
in particular, such a system would produce plans like "analyze the diamond-evaluator to search for side-channel attacks that trick the diamond-evaluator into producing high numbers", whereas `planModificationSample` would not do that (as such a plan would be rejected by `self.diamondShard(self.WM.getConseq(plan))`).
Aside -- I agree with both bolded claims, but think they are separate facts. I don't think the second claim is the reason for the first being true. I would rather say, "`planModificationSample` would not do that, because `self.diamondShard(self.WM.getConseq(plan))` would reject side-channel-attack plans."
Grader is complicit: In the diamond shard-shard case, the grader itself would say "yes, please do search for side-channel attacks that trick me"
No, I disagree with the bolded underlined part. `self.diamondShardShard` wouldn't be tricking itself; it would be tricking another evaluative module in the AI (i.e. `self.diamondShard`).[2]
If you want an AI which tricks the `self.diamondShardShard`, you'd need it to primarily use `self.diamondShardShardShard` to actually steer planning. (Or maybe you can find a weird fixed-point self-tricking shard, but that doesn't seem central to my reasoning; I don't think I've been imagining that configuration.)
and what modification would you have to make in order to make it a grader-optimizer with the grader `self.diamondShard(self.WM.getConseq(plan))`?
Oh, I would change `self.diamondShard` to `self.diamondShardShard`.
I think it's silly to say that GD has a terminal motivation, but I'm not intending to imply that you are silly to say that the agent has a terminal motivation.
Or more precisely, `self.diamondGrader` -- since "`self.diamondShard`" suggests that the diamond-value directly bids on plans in the grader-optimizer setup. But I'll stick to `self.diamondShardShard` for now and elide this connotation.
Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization -- i.e. effectively planning over a world model
FWIW I don't consider myself to be arguing against planning over a world model.
the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry, but rather are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations, learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values.
Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e., for humans we learn our values as some combination of bottom-up (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down (association of abstract value concepts with other, more grounded linguistic concepts).
Can you give me some examples here? I don't know that I follow what you're pointing at.
Strong upvoted. I appreciate the strong concreteness & focus on internal mechanisms of cognition.
Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) -> action mappings. EG a shard agent might have:
But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)
In this way, changing a small set of decision-relevant features (e.g. "Brown dog treat" -> "brown ball of chocolate") changes the consequentialist's action logits a lot, way more than it changes the shard agent's logits. In a squinty, informal way, the (belief state -> logits) function has a lower Lipschitz constant / is more smooth for the shard agent than for the consequentialist agent.
So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat -> tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog -> sick dog). You could spin up two copies of the model to compare.
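A minimal sketch of what such a test could look like, assuming a hypothetical `model(obs)` interface that returns action logits and a hypothetical `swap_treat_for_chocolate` perturbation helper; KL divergence between the two action distributions is just one way to operationalize "sensitivity":

```python
import torch
import torch.nn.functional as F

def decision_sensitivity(model, obs, perturbed_obs):
    """How much does the action distribution shift when a small set of
    decision-relevant observation features is perturbed?"""
    with torch.no_grad():
        logits = model(obs)                      # (batch, n_actions)
        perturbed_logits = model(perturbed_obs)  # (batch, n_actions)
    # KL(original action distribution || perturbed action distribution)
    return F.kl_div(
        F.log_softmax(perturbed_logits, dim=-1),
        F.softmax(logits, dim=-1),
        reduction="batchmean",
    ).item()

# Hypothetical usage: small observation change, large consequence change.
# A more consequentialist policy should show a much larger shift here than a
# more heuristic/shard-like policy does.
# kl_a = decision_sensitivity(model_copy_a, obs, swap_treat_for_chocolate(obs))
# kl_b = decision_sensitivity(model_copy_b, obs, swap_treat_for_chocolate(obs))
```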
I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."
11 fewer words, but I don't think it communicates the intended concept!
If you have to say "I don't mean one obvious reading of the title" as the first sentence, it's probably not a good title. This isn't a dig -- titling posts is hard, and I think it's fair to not be satisfied with the one I gave. I asked ChatGPT to generate several new titles; lightly edited:
- "Uncertainties left open by Shard Theory"
- "Limitations of Current Shard Theory"
- "Challenges in Applying Shard Theory"
- "Unanswered Questions of Shard Theory"
- "Exploring the Unknowns of Shard Theory"
After considering these, I think that "Reminder: shard theory leaves open important uncertainties" is better than these five, and far better than the current title. I think a better title is quite within reach.
But how do we learn that fact?
I didn't claim that I assign high credence to alignment just working out; I'm saying that it may turn out, as a matter of fact, that shard theory doesn't "need a lot more work," because alignment just works out from the obvious setups people try.
"Shard theory doesn't need more work" (in sense 2) could be true as a matter of fact, without me knowing it's true with high confidence. If you're saying "for us to become highly confident that alignment is going to work this way, we need more info", I agree.
But I read you as saying "for this to work as a matter of fact, we need X Y Z additional research":
At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem that necessitates a different approach.
And I think this is wrong. Sense 2 can just be true, and we won't justifiably know it. So I usually say "It is not known to me that I know how to solve alignment", and not "I don't know how to solve alignment."
Does that make sense?
Eliezer's reasoning is surprisingly weak here. It doesn't really interact with the strong mechanistic claims he's making ("Motivated reasoning is definitely built-in, but it's built-in in a way that very strongly bears the signature of 'What would be the easiest way to build this out of these parts we handily had lying around already'").
He just flatly states a lot of his beliefs as true:
Local invalidity via appeal to dubious authority; conventional explanations are often totally bogus, and in particular I expect this one to be bogus.
But Eliezer just keeps stating his dubious-to-me stances as obviously True, without explaining how they actually distinguish between mechanistic hypotheses, or e.g. why he thinks he can get so many bits about human learning process hyperparameters from results like Wason (I thought it's hard to go from superficial behavioral results to statements about messy internals? & inferring "hard-coding" is extremely hard even for obvious-seeming candidates).
Similarly, in the summer (consulting my notes + best recollections here), he claimed ~"Evolution was able to make the (internal physiological reward schedule) ↦ (learned human values) mapping predictable because it spent lots of generations selecting for alignability on caring about proximate real-world quantities like conspecifics or food" and I asked "why do you think evolution had to tailor the reward system specifically to make this possible? what evidence has located this hypothesis?" and he said "I read a neuroscience textbook when I was 11?", and stared at me with raised eyebrows.
I just stared at him with a shocked face. I thought, surely we're talking about different things. How could that data have been strong evidence for that hypothesis? Really, neuroscience textbooks provide huge evidence for evolution having to select the reward->value mapping into its current properties?
I also wrote in my journal at the time:
Eliezer seems to attach some strange importance to the learning process being found by evolution, even though the learning initial conditions screen off evolution's influence..? Like, what?
I still don't understand that interaction. But I've had a few interactions like this with him, where he confidently states things, and then I ask him why he thinks that, and he offers some unrelated-seeming evidence which doesn't -- AFAICT -- actually discriminate between hypotheses.