# Wiki Contributions

I gave this explanation at the start of the UDT1.1 post:

When describing UDT1 solutions to various sample problems, I've often talked about UDT1 finding the function S* that would optimize its preferences over the world program P, and then return what S* would return, given its input. But in my original description of UDT1, I never explicitly mentioned optimizing S as a whole, but instead specified UDT1 as, upon receiving input X, finding the optimal output Y* for that input, by considering the logical consequences of choosing various possible outputs. I have been implicitly assuming that the former (optimization of the global strategy) would somehow fall out of the latter (optimization of the local action) without having to be explicitly specified, due to how UDT1 takes into account logical correlations between different instances of itself. But recently I found an apparent counter-example to this assumption.

Then it would repeat the same process for t=1 and the copy. Conditioned on “I will see C” at t=1, it will conclude “I will see CO” with probability 1/2 by the same reasoning as above. So overall, it will assign: p(“I will see OO”) = 1/2, p(“I will see CO”) = 1/4, p(“I will see CC”) = 1/4.

1. If we look at the situation in 0P, the three versions of you at time 2 all seem equally real and equally you, yet in 1P you weigh the experiences of the future original twice as much as each of the copies.
2. Suppose we change the setup slightly so that copying of the copy is done at time 1 instead of time 2. And at time 1 we show O to the original and C to the two copies, then at time 2 we show them OO, CO, CC like before. With this modified setup, your logic would conclude P(“I will see O”)=P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3 and P(“I will see C”)=2/3. Right?
3. Similarly, if we change the setup from the original so that no observation is made at time 1, the probabilities also become P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3.
4. Suppose we change the setup from the original so that at time 1, we make 999 copies of you instead of just 1 and show them all C before deleting all but 1 of the copies. Then your logic would imply P(“I will see C”)=0.999 and therefore P(“I will see CO”)=P(“I will see CC”)=0.4995, and P(“I will see O”)=P(“I will see OO”)=0.001.
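For concreteness, here is a minimal sketch of how I read the rule being applied in these cases: at each observation stage, the weight is split among the versions that exist at that stage in proportion to how many are shown each label, and the next stage is conditioned on the label seen. The function name `anticipation` and the tree encoding are my own illustrative choices, and modeling (3) as a single stage with three versions is an assumption about how the rule handles a skipped observation.

```python
from fractions import Fraction

def anticipation(tree, weight=Fraction(1)):
    """Return {label: probability} for every observation in `tree`.

    `tree` maps an observation label to (count, subtree): `count` is how many
    versions are shown that label at this stage, and `subtree` describes what
    those versions can see next. Weight is split in proportion to `count`.
    This encoding is hypothetical, chosen only to illustrate the rule.
    """
    total = sum(count for count, _ in tree.values())
    probs = {}
    for label, (count, subtree) in tree.items():
        p = weight * Fraction(count, total)
        probs[label] = p
        probs.update(anticipation(subtree, p))
    return probs

LEAF = {}
original_branch = {"OO": (1, LEAF)}               # the original is never copied
copy_branch = {"CO": (1, LEAF), "CC": (1, LEAF)}  # the copy ends up as two versions

setups = {
    "original": {"O": (1, original_branch), "C": (1, copy_branch)},
    "2": {"O": (1, original_branch), "C": (2, copy_branch)},
    # Assumption: with no observation at time 1 there is a single stage.
    "3": {"OO": (1, LEAF), "CO": (1, LEAF), "CC": (1, LEAF)},
    "4": {"O": (1, original_branch), "C": (999, copy_branch)},
}

for name, tree in setups.items():
    print(name, {label: str(p) for label, p in anticipation(tree).items()})
```

Under this reading, P(“I will see OO”) comes out as 1/2 in the original setup, 1/3 in (2) and (3), and 1/1000 in (4).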

This all makes me think there's something wrong with the 1/2, 1/4, 1/4 answer and with the way you define probabilities of future experiences. More specifically, suppose OO wasn't just two letters but an unpleasant experience, and CO and CC are both pleasant experiences, so you prefer "I will experience CO/CC" to "I will experience OO". Then at time 0 you would be willing to pay to switch from the original setup to (2) or (3), and pay even more to switch to (4). But that seems pretty counterintuitive, i.e., why are you paying to avoid making observations in (3), or paying to make and delete copies of yourself in (4)? Both of these seem at best pointless in 0P.

But every other approach I've seen or thought of also has problems, so maybe we shouldn't dismiss this one too easily based on these issues. I would be interested to see you work out everything more formally and address the above objections (to the extent possible).

Defining the semantics and probabilities of anticipation seems to be a hard problem. You can see some past discussions of the difficulties at The Anthropic Trilemma and its back-references (posts that link to it). (I didn't link to this earlier in case you already found a fresh approach that solved the problem. You may also want to consider not reading the previous discussions to avoid possibly falling into the same ruts.)

We can assign meanings to statements like “my sensor sees red” by picking out subsets of experiences, just as before.

How do you assign meaning to statements like "my sensor will see red"? (In the OP you mention "my sensors will see the heads side of the coin" but I'm not sure what your proposed semantics of such statements are in general.)

Also, here's an old puzzle of mine that I wonder if your line of thinking can help with: At time 1 you will be copied and the original will be shown "O" and the copy will be shown "C", then at time 2 the copy will be copied again, and the three of you will be shown "OO" (original), "CO" (original of copy), "CC" (copy of copy) respectively. At time 0, what are your probabilities for "I will see X" for each of the five possible values of X?

Yeah, this makes sense, thanks. I think I've read one or maybe both of your posts, which is probably why I started having second thoughts about my comment soon after posting it. :)

I agree with this point as stated, but think the probability is more like 5% than 0.1%.

How do you define or think about "light" in "light RLHF" when you make a statement like this, and how do you know that you're thinking about it the same way that Alex is? Is it a term of art that I don't know about, or has it been defined in a previous post?

In any case, I'd be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.

Thinking over this question myself, I think I've found a reasonable answer. Still interested in your thoughts but I'll write down mine:

It seems like evolution "wanted" us to be (in part) reward-correlate maximizers (i.e., being a reward-correlate maximizer was adaptive in our ancestral environment), and "implemented" this by having our brains internally do "heavy RL" throughout our life. So we become reward-correlate maximizing agents early in life, and then when our parents do something like RLHF on top of that, we become schemers pretty easily.

So the important difference is that with "pretraining + light RLHF" there's no "heavy RL" step.

The strongest argument for reward-maximization which I’m aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that’s evidence that “learning setups which work in reality” can come to care about their own training signals.

Isn't there a similar argument for "plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer"? Namely the way we train our kids seems pretty similar to "pretraining + light RLHF" and we often do end up with scheming/deceptive kids. (I'm speaking partly from experience.) ETA: On second thought, maybe it's not that similar? In any case, I'd be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.

Also, in this post you argue against several arguments for high risk of scheming/deception from this kind of training but I can't find where you talk about why you think the risk is so low ("not plausible"). You just say 'Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.' but why is your prior for it so low? I would be interested in whatever reasons/explanations you can share. The same goes for others who have indicated agreement with Alex's assessment of this particular risk being low.

How many alignment techniques presuppose an AI being motivated by the training signal (e.g. AI Safety via Debate)

It would be good to get a definitive response from @Geoffrey Irving or @paulfchristiano, but I don't think AI Safety via Debate presupposes an AI being motivated by the training signal. Looking at the paper again, there is some theoretical work that assumes "each agent maximizes their probability of winning" but I think the idea is sufficiently well-motivated (at least as a research approach) even if you took that section out, and simply viewed Debate as a way to do RL training on an AI that is superhumanly capable (and hence hard or unsafe to do straight RLHF on).

BTW what is your overall view on "scalable alignment" techniques such as Debate and IDA? (I guess I'm getting the vibe from this quote that you don't like them, and want to get clarification so I don't mislead myself.)

In reality, the problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter. In virtue of their aligned behavior, these systems will be “aimed at the right things” in every sense that matters in practice.

I find this passage remarkable, given that so many people are choosing to have few or no children that fertility has fallen to 0.78 in Korea and 1.0 in China. Presumably you're aware of these (or similar) facts and intended the meaning of this passage to be compatible with them, but I'm having trouble figuring out how...

By contrast, goal realism leads only to unfalsifiable speculation about an “inner actress” with utterly alien motivations.

In order for such speculation to be unfalsifiable, it seemingly has to be the case that we're unable to ever develop good enough interpretability tools to definitively say whether the AI in question has such internal motivations. This could well turn out to be true, but I don't understand how you're able to predict this now. (Or maybe you mean something else by "unfalsifiable" but I can't see what it could be. ETA: Maybe you mean "unfalsifiable with existing methods"?)

On the other hand, with your own proposed alignment method, we have to speculate about what scenarios an AI is likely to encounter. You could say that this is falsifiable (we just have to wait for the future to unfold), but is this actually an advantage?