Comments

Does anyone know why security amplification and meta-execution are rarely talked about these days? I did a search on LW and found just 1 passing reference to either phrase in the last 3 years. Is the problem not considered an important problem anymore? The problem is too hard and no one has new ideas? There are too many alignment problems/approaches to work on and not enough researchers?

Wei Dai1mo30

I gave this explanation at the start of the UDT1.1 post:

When describing UDT1 solutions to various sample problems, I've often talked about UDT1 finding the function S* that would optimize its preferences over the world program P, and then return what S* would return, given its input. But in my original description of UDT1, I never explicitly mentioned optimizing S as a whole, but instead specified UDT1 as, upon receiving input X, finding the optimal output Y* for that input, by considering the logical consequences of choosing various possible outputs. I have been implicitly assuming that the former (optimization of the global strategy) would somehow fall out of the latter (optimization of the local action) without having to be explicitly specified, due to how UDT1 takes into account logical correlations between different instances of itself. But recently I found an apparent counter-example to this assumption.
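As a rough illustration of the distinction drawn above, here is a minimal sketch of "optimizing S as a whole": enumerate every map from inputs to outputs and keep the one the world program scores highest. The toy `world_program`, the input and output sets, and all function names are assumptions made up for this sketch; they are not taken from the UDT1 or UDT1.1 posts, and the payoff is not claimed to be the counter-example mentioned there.

```python
from itertools import product

# Hypothetical toy setup (an assumption for illustration only).
INPUTS = (1, 2)
OUTPUTS = ("A", "B")


def world_program(strategy):
    """Toy stand-in for preferences over P: two instances of the agent
    receive inputs 1 and 2, and the payoff is 1 only if they answer differently."""
    return 1 if strategy[1] != strategy[2] else 0


def best_global_strategy(inputs, outputs, utility):
    """Search over whole input->output maps S and return the best one (S*)."""
    candidates = (
        dict(zip(inputs, outs))
        for outs in product(outputs, repeat=len(inputs))
    )
    return max(candidates, key=utility)


S_STAR = best_global_strategy(INPUTS, OUTPUTS, world_program)


def act(x):
    """On input x, return what S* would return for that input."""
    return S_STAR[x]
```

The point of the sketch is only that the search variable is the entire map S, so the outputs for different inputs are chosen jointly rather than one input at a time.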

Wei Dai1mo31

Then it would repeat the same process for t=1 and the copy. Conditioned on “I will see C” at t=1, it will conclude “I will see CO” with probability 1/2 by the same reasoning as above. So overall, it will assign: p(“I will see OO”) = 1/2, p(“I will see CO”) = 1/4, p(“I will see CC”) = 1/4.

  1. If we look at the situation in 0P, the three versions of you at time 2 all seem equally real and equally you, yet in 1P you weigh the experiences of the future original twice as much as each of the copies.
  2. Suppose we change the setup slightly so that copying of the copy is done at time 1 instead of time 2. And at time 1 we show O to the original and C to the two copies, then at time 2 we show them OO, CO, CC like before. With this modified setup, your logic would conclude P(“I will see O”)=P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3 and P(“I will see C”)=2/3. Right?
  3. Similarly, if we change the setup from the original so that no observation is made at time 1, the probabilities also become P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3.
  4. Suppose we change the setup from the original so that at time 1, we make 999 copies of you instead of just 1 and show them all C before deleting all but 1 of the copies. Then your logic would imply P(“I will see C”)=0.999 and therefore P(“I will see CO”)=P(“I will see CC”)=0.4995, and P(“I will see O”)=P(“I will see OO”)=0.001.
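The numbers in the quoted answer and in variations (2) to (4) can be reproduced with a small sketch. It encodes one reading of the rule being criticized: at each observation time, the probability of the current observation is split equally among all versions that exist at the next observation time and descend from someone who saw it. The function name `anticipation` and the way each setup is encoded are assumptions of this sketch, not a formalism from the post.

```python
from fractions import Fraction


def anticipation(stages):
    """Return P("I will see X") for every observation X.

    `stages` is a list with one dict per observation time. Each dict maps an
    observation seen so far ("" at time 0) to the list of observations shown
    at the next time to all versions descended from those who saw it.
    Probability mass is split equally among the entries of that list.
    """
    current = {"": Fraction(1)}
    result = {}
    for stage in stages:
        following = {}
        for seen, p in current.items():
            successors = stage[seen]
            for obs in successors:
                share = p / len(successors)
                following[obs] = following.get(obs, Fraction(0)) + share
        result.update(following)
        current = following
    return result


second_step = {"O": ["OO"], "C": ["CO", "CC"]}

# Original puzzle: one copy at time 1, the copy is copied again at time 2.
original = [{"": ["O", "C"]}, second_step]
# (2): the copy is copied at time 1, so three versions exist when O/C is shown.
setup_2 = [{"": ["O", "C", "C"]}, second_step]
# (3): no observation at time 1, so only the final observations count.
setup_3 = [{"": ["OO", "CO", "CC"]}]
# (4): 999 copies see C at time 1; one survives and is copied at time 2.
setup_4 = [{"": ["O"] + ["C"] * 999}, second_step]

for name, setup in [("original", original), ("(2)", setup_2),
                    ("(3)", setup_3), ("(4)", setup_4)]:
    print(name, {k: str(v) for k, v in anticipation(setup).items()})
```

Under this encoding, the deletion in (4) appears only through the list of versions that exist at time 2, which is why the extra copies raise P(“I will see C”) without changing the final population.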

This all makes me think there's something wrong with the 1/2, 1/4, 1/4 answer and with the way you define probabilities of future experiences. More specifically, suppose OO wasn't just two letters but an unpleasant experience, and CO and CC are both pleasant experiences, so you prefer "I will experience CO/CC" to "I will experience OO". Then at time 0 you would be willing to pay to switch from the original setup to (2) or (3), and pay even more to switch to (4). But that seems pretty counterintuitive: why are you paying to avoid making observations in (3), or paying to make and delete copies of yourself in (4)? Both of these seem at best pointless in 0P.

But every other approach I've seen or thought of also has problems, so maybe we shouldn't dismiss this one too easily based on these issues. I would be interested to see you work out everything more formally and address the above objections (to the extent possible).

Wei Dai1mo40

Defining the semantics and probabilities of anticipation seems to be a hard problem. You can see some past discussions of the difficulties at The Anthropic Trilemma and its back-references (posts that link to it). (I didn't link to this earlier in case you already found a fresh approach that solved the problem. You may also want to consider not reading the previous discussions to avoid possibly falling into the same ruts.)

Wei Dai2mo60

We can assign meanings to statements like “my sensor sees red” by picking out subsets of experiences, just as before.

How do you assign meaning to statements like "my sensor will see red"? (In the OP you mention "my sensors will see the heads side of the coin" but I'm not sure what your proposed semantics of such statements are in general.)

Also, here's an old puzzle of mine that I wonder if your line of thinking can help with: At time 1 you will be copied and the original will be shown "O" and the copy will be shown "C", then at time 2 the copy will be copied again, and the three of you will be shown "OO" (original), "CO" (original of copy), "CC" (copy of copy) respectively. At time 0, what are your probabilities for "I will see X" for each of the five possible values of X?

Wei Dai2mo41

Yeah, this makes sense, thanks. I think I've read one or maybe both of your posts, which is probably why I started having second thoughts about my comment soon after posting it. :)

Wei Dai2mo67

I agree with this point as stated, but think the probability is more like 5% than 0.1%.

How do you define or think about "light" in "light RLHF" when you make a statement like this, and how do you know that you're thinking about it the same way that Alex is? Is it a term of art that I don't know about, or has it been defined in a previous post?

Wei Dai2mo20

In any case, I'd be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.

Thinking over this question myself, I think I've found a reasonable answer. Still interested in your thoughts but I'll write down mine:

It seems like evolution "wanted" us to be (in part) reward-correlate maximizers (i.e., being a reward-correlate maximizer was adaptive in our ancestral environment), and "implemented" this by having our brains internally do "heavy RL" throughout our life. So we become reward-correlate maximizing agents early in life, and then when our parents do something like RLHF on top of that, we become schemers pretty easily.

So the important difference is that with "pretraining + light RLHF" there's no "heavy RL" step.

Wei Dai2mo811

The strongest argument for reward-maximization which I’m aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that’s evidence that “learning setups which work in reality” can come to care about their own training signals.

Isn't there a similar argument for "plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer"? Namely the way we train our kids seems pretty similar to "pretraining + light RLHF" and we often do end up with scheming/deceptive kids. (I'm speaking partly from experience.) ETA: On second thought, maybe it's not that similar? In any case, I'd be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.

Also, in this post you argue against several arguments for high risk of scheming/deception from this kind of training but I can't find where you talk about why you think the risk is so low ("not plausible"). You just say 'Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.' but why is your prior for it so low? I would be interested in whatever reasons/explanations you can share. The same goes for others who have indicated agreement with Alex's assessment of this particular risk being low.

Wei Dai2mo1312

How many alignment techniques presuppose an AI being motivated by the training signal (e.g. AI Safety via Debate)

It would be good to get a definitive response from @Geoffrey Irving or @paulfchristiano, but I don't think AI Safety via Debate presupposes an AI being motivated by the training signal. Looking at the paper again, there is some theoretical work that assumes "each agent maximizes their probability of winning" but I think the idea is sufficiently well-motivated (at least as a research approach) even if you took that section out, and simply view Debate as a way to do RL training on an AI that is superhumanly capable (and hence hard or unsafe to do straight RLHF on).

BTW what is your overall view on "scalable alignment" techniques such as Debate and IDA? (I guess I'm getting the vibe from this quote that you don't like them, and want to get clarification so I don't mislead myself.)
