Thanks, that's very helpful. It still feels to me like there's a significant issue here, but I need to think more. At present I'm too confused to get much beyond handwaving. A few immediate thoughts (mainly for clarification; not sure anything here merits response):
I'd be interested in your thoughts on human motivation in HCH and amplification schemes. Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?
Specifically, it concerns me that every H will have preferences valued more highly than [completing whatever task we assign], so would be expected to optimise its output for its own values rather than the assigned task, where these objectives diverged. In general, output needn't relate to the question/task. [I don't think you've addressed this at all recently - I've on... (read more)
Unless I've confused myself badly (always possible!), I think either's fine here. The | version just takes out a factor that'll be common to all hypotheses: [p(e+) / p(e-)]. (since p(Tk & e+) ≡ p(Tk | e+) * p(e+))
Since we'll renormalise, common factors don't matter. Using the | version felt right to me at the time, but whatever allows clearer thinking is the way forward.
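Spelled out, for any two hypotheses $T_j$, $T_k$ (my notation, just to make the cancellation explicit):

$$\frac{p(T_j \mid e^+)}{p(T_k \mid e^+)} = \frac{p(T_j \wedge e^+)/p(e^+)}{p(T_k \wedge e^+)/p(e^+)} = \frac{p(T_j \wedge e^+)}{p(T_k \wedge e^+)}$$

So the | and & versions give the same ratios between hypotheses, and renormalising yields the same distribution either way.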
Taking your last point first: I entirely agree on that. Most of my other points were based on the implicit assumption that readers of your post don't think something like "It's directly clear that 9 OOM will almost certainly be enough, by a similar argument".
Certainly if they do conclude anything like that, then it's going to massively drop their odds on 9-12 too. However, I'd still make an argument of a similar form: for some people, I expect that argument may well increase the 5-8 range more (than proportionately) than the 1-4 range.
On (1), I agree that ... (read more)
[[ETA, I'm not claiming the >12 OOM mass must all go somewhere other than the <4 OOM case: this was a hypothetical example for the sake of simplicity. I was saying that if I had such a model (with zwomples or the like), then a perfectly good update could leave me with the same posterior credence on <4 OOM. In fact my credence on <4 OOM was increased, but only very slightly.]]
First I should clarify that the only point I'm really confident on here is the "In general, you can't just throw out the >12 OOM and re-normalise, without further assumpti... (read more)
Yes, we're always renormalising at the end - it amounts to saying "...and the new evidence will impact all remaining hypotheses evenly". That's fine when it's true.
I think perhaps I wasn't clear about what I meant by saying "This doesn't say anything...". I meant that it may say nothing in absolute terms - i.e. that I may put the same probability on [TAI at 4 OOM] after seeing the evidence as before.
This means that it does say something relative to other not-ruled-out hypotheses: if I'm saying the new evidence rules out >12 OOM, and I'm also saying that th... (read more)
We do gain evidence on at least some alternatives, but not on all the factors which determine the alternatives. If we know something about those factors, we can't usually just renormalise. That's a good default, but it amounts to an assumption of ignorance.
Here's a simple example: We play a 'game' where you observe the outcome of two fair coin tosses, x and y. You score:
- 1 if x is heads
- 2 if x is tails and y is heads
- 3 if x is tails and y is tails

So your score predictions start out at:
- 1 : 50%
- 2 : 25%
- 3 : 25%
We look at y and see that it's heads. This rules out 3. Re... (read more)
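For concreteness, here's a minimal sketch of that game in Python (my own illustration; same numbers as above), contrasting naive renormalisation with the correct conditional update:

```python
# Toy version of the coin game above: two fair coins x, y;
# score 1 if x=H, 2 if x=T and y=H, 3 if x=T and y=T.
# Observing y=H rules out score 3; naive renormalisation over the
# remaining scores disagrees with the correct Bayesian update.
from itertools import product

worlds = list(product("HT", repeat=2))  # (x, y) pairs, each with probability 1/4

def score(x, y):
    if x == "H":
        return 1
    return 2 if y == "H" else 3

prior = {1: 0.0, 2: 0.0, 3: 0.0}
for x, y in worlds:
    prior[score(x, y)] += 0.25
# prior == {1: 0.5, 2: 0.25, 3: 0.25}

# Naive: drop score 3 and renormalise the rest proportionally.
naive = {s: p / (1 - prior[3]) for s, p in prior.items() if s != 3}
# naive == {1: 2/3, 2: 1/3}

# Correct: condition on the actual observation y == "H".
p_yH = sum(0.25 for _, y in worlds if y == "H")  # 0.5
posterior = {1: 0.0, 2: 0.0}
for x, y in worlds:
    if y == "H":
        posterior[score(x, y)] += 0.25 / p_yH
# posterior == {1: 0.5, 2: 0.5} -- not the renormalised (2/3, 1/3)
```

Renormalising gives (2/3, 1/3); actually conditioning on y = heads gives (1/2, 1/2).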
If you have a bunch of hypotheses (e.g. "It'll take 1 more OOM," "It'll take 2 more OOMs," etc.) and you learn that some of them are false or unlikely (e.g. only a 10% chance of it taking more than 12), then you should redistribute the mass over all your remaining hypotheses, preserving their relative strengths.
This depends on the mechanism by which you assigned the mass initially - in particular, whether it's absolute or relative. If you start out with specific absolute probability estimates as the strongest evidence for some hypotheses, then you can't just renor... (read more)
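To pin down the distinction with hypothetical numbers: suppose the prior over [1-4, 5-8, 9-12, >12] OOMs is $(0.2, 0.3, 0.2, 0.3)$, and we learn that >12 is ruled out. Proportional renormalisation gives

$$\left(\tfrac{0.2}{0.7},\ \tfrac{0.3}{0.7},\ \tfrac{0.2}{0.7}\right) \approx (0.29,\ 0.43,\ 0.29)$$

but if the 0.2 on 1-4 came from a specific absolute argument that the new evidence doesn't touch, a better update keeps it at ~0.2 and moves the freed 0.3 onto the middle ranges.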
I don't see a good reason to exclude agenda-style posts, but I do think it'd be important to treat them differently from more here-is-a-specific-technical-result posts.
Broadly, we'd want to be improving the top-level collective AI alignment research 'algorithm'. With that in mind, I don't see an area where more feedback/clarification/critique of some kind wouldn't be helpful. The questions seem to be:
- What form should feedback/review... take in a given context?
- Where is it most efficient to focus our efforts?
Productive feedback/clarification on high-level age... (read more)
I understand what you mean by the CCC (and I realise this seems a bit of a nit-pick!), but I think the wording could usefully be clarified.
As you suggest here, the following is what you mean:
CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking"
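In conditional form (my formalisation, purely to pin down the intended reading):

$$\forall G\ \text{non-evil}:\quad \text{Catastrophic}(\pi^*_G) \implies \text{the catastrophe stems from power-seeking}$$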
However, that's not what the CCC currently says. E.g. compare:

[Unaligned goals] tend to [have catastrophe-inducing optimal policies] because of [power-seeking incentives].

[People teleported to the moon] tend to [die] because of [lack of oxygen].
The latter doesn't lead t... (read more)
I think things are already fine for any spike outside S, e.g. paperclip maximiser, since non-obstruction doesn't say anything there.
I actually think saying "our goals aren't on a spike" amounts to a stronger version of my [assume humans know what the AI knows as the baseline]. I'm now thinking that neither of these will work, for much the same reason. (see below)
The way I'm imagining spikes within S is like this: We define a pretty broad S, presumably implicitly, hoping to give ourselves a broad range of non-obstruction. For all P in U we later conclude that... (read more)
Thinking of corrigibility, it's not clear to me that non-obstruction is quite what I want. Perhaps a closer version would be something like:

A non-obstructive AI on S needs to do no worse for each P in S than pol(P | off & humans have all the AI's knowledge)

This feels a bit patchy, but in principle it'd fix the most common/obvious issue of the kind I'm raising: that the AI would often otherwise have an incentive to hide information from the users so as to avoid 'obstructing' them when they change their minds.
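In symbols, reusing the pol/V notation from this thread (my paraphrase, with $K$ standing for "humans get all the AI's knowledge"):

$$\forall P \in S:\quad V_P\big(\mathrm{pol}(P \mid \text{on})\big) \;\ge\; V_P\big(\mathrm{pol}(P \mid \text{off},\ K)\big)$$

i.e. the baseline is no longer what we'd manage in ignorance, but what we'd manage if fully informed.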
I think this is more in the spirit of non-obst... (read more)
If pol(P) sucks, even if the AI is literally corrigible, we still won't reach good outcomes.
If pol(P) sucks by default, a general AI (corrigible or otherwise) may be able to give us information I which:
- Makes Vp(pol(P)) much higher, by making pol(P) given I suck a whole lot less.
- Makes Vq(pol(Q)) a little lower, by making pol(Q) given I make concessions to allow pol(P) to perform better.
A non-obstructive AI can't do that, since it's required to maintain the AU for pol(Q).
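With hypothetical numbers: suppose by default $V_P(\mathrm{pol}(P)) = 1$ and $V_Q(\mathrm{pol}(Q)) = 10$, and revealing $I$ would shift these to

$$V_P:\ 1 \to 8, \qquad V_Q:\ 10 \to 9.5$$

Non-obstruction with $Q \in S$ forbids revealing $I$, since $V_Q$ would dip below its baseline, even though the trade looks clearly worth taking.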
A simple example is where P and Q currently look the same to us - so our pol(P) and pol(... (read more)
Ok, I think I'm following you (though I am tired, so who knows :)). For me the crux seems to be:

We can't assume in general that pol(P) isn't terrible at optimising for P. We can "do our best" and still screw up catastrophically.
If assuming "pol(P) is always a good optimiser for P" were actually realistic (and I assume you're not!), then we wouldn't have an alignment problem: we'd be assuming away any possibility of making a catastrophic error.
If we just assume "pol(P) is always a good optimiser for P" for the purpose of non-obstruction definitions/cal... (read more)
I just saw this recently. It's very interesting, but I don't agree with your conclusions (quite possibly because I'm confused and/or overlooking something). I posted a response here. The short version being:
- Either I'm confused, or your green lines should be spiky.
- Any extreme green line spikes within S will be a problem.
- Pareto is a poor approach if we need to deal with default tall spikes.
Oh, it's possible to add up a load of spikes [ETA: suboptimal optimisations], many of which hit the wrong target, but miraculously cancel out to produce a flat landscape [ETA: "spikes" was just wrong; what I mean here is that you could e.g. optimise for A, accidentally hit B, and only get 70% of the ideal value for A... and counterfactually optimise for B, accidentally hit C, and only get 70% of the ideal value for B... and counterfactually aim for C, hit D, etc., so things end up miraculously flat; this seems silly because there's no reason to expect all ... (read more)
Sure - there are many ways for debate to fail with extremely capable debaters. Though most of the more exotic mind-hack-style outcomes seem a lot less likely once you're evaluating local nodes with ~1000 characters for each debater. However, all of this comes under my:

I'll often omit the caveat "If debate works as intended aside from this issue…"

There are many ways for debate to fail. I'm pointing out what happens even if it works. I.e. I'm claiming that question-ignoring will happen even if the judge is only ever persuaded of true statements, gets a balance... (read more)
Oh yes - I didn't mean to imply otherwise.
My point is only that there'll be many ways to slide an answer pretty smoothly between [direct answer] and [useful information]. Splitting into [Give direct answer with (k - x) bits] [Give useful information with x bits] and sliding x from 0 to k is just the first option that occurred to me.
In practice, I don't imagine the path actually followed would look like that. I was just sanity-checking by asking myself whether a discontinuous jump is necessary to get to the behaviour I'm suggesting: I'm pretty confident it's not.
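A toy sketch of the kind of interpolation I mean (hypothetical, in Python; `direct` and `useful` are stand-ins for the two kinds of content):

```python
# Hypothetical: an answerer with a k-character budget, splitting it between
# directly answering the question and giving generically useful information.
def compose_answer(direct: str, useful: str, k: int, x: int) -> str:
    """Spend (k - x) characters on the direct answer and x on extra info."""
    assert 0 <= x <= k
    return direct[: k - x] + useful[:x]

# Sliding x from 0 to k moves smoothly from [direct answer] to
# [useful information]; no discontinuous jump is needed anywhere.
```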
A sufficiently intelligent agent would understand that, after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.
This is a problem if it's using FDT/UDT. Conditions for the myopic approach to work seem to require CDT (or something similar). Then there's no automatic desire for future versions to succeed or expectation that past versions will have acted to release the ... (read more)
That seems right.
I'd been primarily thinking about more simple-minded escape/uplift/signal-to-simulators influence (via this us), rather than UDT-influence. If we were ever uplifted or escaped, I'd expect it'd be into a world-like-ours. But of course you're correct that UDT-style influence would apply immediately.
Opportunity costs are a consideration, though there may be behaviours that'd increase expected value in both direct-embeddings and worlds-like-ours. Win-win behaviours could be taken early.
Personally, I'd expect this ... (read more)
Ah yes - I was confusing myself at some point between forming and using a model (hence "incentives").
I think you're correct that "perfectly useful" isn't going to happen. I'm happy to be wrong.
"the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual"
I don't think you'd be able to formalize this in general, since I imagine it's not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch... (read more)
Just obvious and mundane concerns:
You might want to make clearer that "As long as the door is closed, information cannot leave the room" isn't an assumption but a requirement of the setup. I.e. you're not assuming, based on your description, that opening the door is the only means for an operator to get information out; you're assuming every other means of information escape has been systematically accounted for and ruled out (on the assumption that the operator has been compromised by the AI).
[Quite possibly I'm confused, but in case I'm not:] I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).
- The AI has an incentive to understand the operator's mind, since this bears directly on its reward.
- Better understanding the operator's mind might be achieved in part by running simulations including the operator.
- One specific simulation would involve simulating the operator's environment and actions after he leaves the room.
- Here this isn't done to understand the implications of his acti... (read more)
Thanks. I agree with your overall conclusions.
On the specifics, Bostrom's simulation argument is more than just a parallel here: it has an impact on how rich we might expect our direct parent simulator to be. The simulation argument applies similarly to one base world like ours, or to an uncountable number of parallel worlds embedded in Tegmark IV structures. Either way, if you buy case 3, the proportion of simulated-by-a-world-like-ours worlds rises close to 1 (I'm counting worlds "depth-first", since it seems most intuitive, and infini... (read more)