All of Joe_Collman's Comments + Replies

AMA: Paul Christiano, alignment researcher

Thanks, that's very helpful. It still feels to me like there's a significant issue here, but I need to think more. At present I'm too confused to get much beyond handwaving.

 A few immediate thoughts (mainly for clarification; not sure anything here merits response):

  • I had been thinking too much lately of [isolated human] rather than [human process].
  • I agree the issue I want to point to isn't precisely OOD generalisation. Rather it's that the training data won't be representative of the thing you'd like the system to learn: you want to convey X, and you
... (read more)
AMA: Paul Christiano, alignment researcher

I'd be interested in your thoughts on human motivation in HCH and amplification schemes.
Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?

Specifically, it concerns me that every H will have preferences valued more highly than [completing whatever task we assign], so it would be expected to optimise its output for its own values rather than the assigned task, where these objectives diverged. In general, its output needn't relate to the question/task.
[I don't think you've addressed this at all recently - I've on... (read more)

4 Paul Christiano 17d
I mostly don't think this thing is a major issue. I'm not exactly sure where I disagree, but some possibilities:

  • H isn't some human isolated from the world, it's an actual process we are implementing (analogous to the current workflow involving external contractors, lots of discussion about the labeling process and what values it might reflect, discussions between contractors and people who are structuring the model, discussions about cases where people disagree).
  • I don't think H is really generalizing OOD, you are actually collecting human data on the kinds of questions that matter (I don't think any of my proposals rely on that). So the scenario you are talking about is something like the actual people who are implementing H---real people who actually exist and we are actually working with---are being offered payments or extorted or whatever by the datapoints that the actual ML is giving them. That would be considered a bad outcome on many levels (e.g. man that sounds like it's going to make the job stressful), and you'd be flagging models that systematically produce such outputs (if all is going well they shouldn't be upweighted), and coaching contractors and discussing the interesting/tricky cases and so on.
  • H is just not making that many value calls, they are mostly implemented by the process that H answers. Similarly, we're just not offloading that much of the substantive work to H (e.g. they don't need to be super creative or wise, we are just asking them to help construct a process that responds appropriately to evidence).
  • I don't really know what kind of opportunity cost you have in mind. Yes, if we hire contractors and can't monitor their work they will sometimes do a sloppy job. And indeed if someone from an ML team is helping run an oversight process there might be some kinds of inputs where they don't care and slack off? But there seems to be a big mismatch between the way this s
Review of "Fun with +12 OOMs of Compute"

Unless I've confused myself badly (always possible!), I think either's fine here. The | version just takes out a factor that'll be common to all hypotheses: [p(e+) / p(e-)]. (since p(Tk & e+) ≡ p(Tk | e+) * p(e+))

Since we'll renormalise, common factors don't matter. Using the | version felt right to me at the time, but whatever allows clearer thinking is the way forward.
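Spelling out the common-factor point (this is just the product rule written out as a sketch, not anything load-bearing from the thread):

```latex
\frac{p(T_k \mid e^+)}{p(T_k \mid e^-)}
  \;=\; \frac{p(T_k \wedge e^+)\,/\,p(e^+)}{p(T_k \wedge e^-)\,/\,p(e^-)}
  \;=\; \frac{p(T_k \wedge e^+)}{p(T_k \wedge e^-)} \cdot \frac{p(e^-)}{p(e^+)}
```

The extra factor p(e-)/p(e+) doesn't depend on k, so once we renormalise over the surviving hypotheses the two versions give the same posterior.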

2 Daniel Kokotajlo 1mo
I'm probably being just mathematically confused myself; at any rate, I'll proceed with the p[Tk & e+] : p[Tk & e-] version since that comes more naturally to me. (I think of it like: your credence in Tk is split between two buckets, the Tk&e+ bucket and the Tk&e- bucket, and then when you update you rule out the e- bucket. So what matters is the ratio between the buckets; if it's relatively high (compared to the ratio for other Tx's) your credence in Tk goes up, if it's relatively low it goes down.)

Anyhow, I totally agree that this ratio matters and that it varies with k. In particular here's how I think it should vary for most readers of my post:

  • For k>12, the ratio should be low, like 0.1.
  • For low k, the ratio should be higher.
  • For middling k, say 6<k<13, the ratio should be in between.

Thus, the update should actually shift probability mass disproportionately to the lower k hypotheses.

I realize we are sort of arguing in circles now. I feel like we are making progress though. Also, separately, want to hop on a call with me sometime to sort this out? I've got some more arguments to show you...
Review of "Fun with +12 OOMs of Compute"

Taking your last point first: I entirely agree on that. Most of my other points were based on the implicit assumption that readers of your post don't think something like "It's directly clear that 9 OOM will almost certainly be enough, by a similar argument".

Certainly if they do conclude anything like that, then it's going to massively drop their odds on 9-12 too. However, I'd still make an argument of a similar form: for some people, I expect that argument may well increase the 5-8 range more (than proportionately) than the 1-4 range.

On (1), I agree that ... (read more)

3 Daniel Kokotajlo 1mo
Wait, shouldn't it be the ratio p[Tk & e+] : p[Tk & e-]? Maybe both ratios work fine for our purposes, but I certainly find it more natural to think in terms of &.
Review of "Fun with +12 OOMs of Compute"

[[ETA, I'm not claiming the >12 OOM mass must all go somewhere other than the <4 OOM case: this was a hypothetical example for the sake of simplicity. I was saying that if I had such a model (with zwomples or the like), then a perfectly good update could leave me with the same posterior credence on <4 OOM.
In fact my credence on <4 OOM was increased, but only very slightly]]

First I should clarify that the only point I'm really confident on here is the "In general, you can't just throw out the >12 OOM and re-normalise, without further assumpti... (read more)

3 Daniel Kokotajlo 1mo
OK, thanks.

1. I concede that we're not in a position of complete ignorance w.r.t. the new evidence's impact on alternate hypotheses. However, the same goes for pretty much any argument anyone could make about anything. In my particular case I think there's some sense in which, plausibly, for most underlying views on timelines people will have, my post should cause an update more or less along the lines I described. (see below)

2. Even if I'm wrong about that, I can roll out the anti-spikiness argument to argue in favor of <7 OOMs, though to be fair I don't make this argument in the post. (The argument goes: if 60%+ of your probability mass is between 7 and 12 OOMs, you are being overconfident.)

Argument that for most underlying views on timelines people will have, my post should cause an update more or less along the lines I described:

-- The only way for your credence in <7 to go down relative to your credence in 7-12 after reading my post and (mostly) ruling out >12 hypotheses, is for the stuff you learn to also disproportionately rule out sub-hypotheses in the <7 range compared to sub-hypotheses in the 7-12 range. But this is a bit weird; my post didn't talk about the <7 range at all, so why would it disproportionately rule out stuff in that range? Like I said, it seems like (to a first approximation) the information content of my post was "12 OOMs is probably enough" and not something more fancy like "12 OOMs is probably enough BUT 6 is probably not enough." I feel unsure about this and would like to hear you describe the information content of the post, in your terms.

-- I actually gave an argument that this should increase your relative credence in <7 compared to 7-12, and it's a good one I think: the arguments that 12 OOMs are probably enough are pretty obviously almost as strong for 11 OOMs, and almost as strong as that for 10 OOMs, and so on. To put it another way, our distribution shouldn't have a sharp cliff at 12 OOMs; it should start descending seve
Review of "Fun with +12 OOMs of Compute"

Yes, we're always renormalising at the end - it amounts to saying "...and the new evidence will impact all remaining hypotheses evenly". That's fine so long as it's true.

I think perhaps I wasn't clear with what I mean by saying "This doesn't say anything...".
I meant that it may say nothing in absolute terms - i.e. that I may put the same probability of [TAI at 4 OOM] after seeing the evidence as before.

This means that it does say something relative to other not-ruled-out hypotheses: if I'm saying the new evidence rules out >12 OOM, and I'm also saying that th... (read more)

4 Daniel Kokotajlo 1mo
I think I'm just not seeing why you think the >12 OOM mass must all go somewhere other than the <4 OOM (or really, I would argue, <7 OOM) case. Can you explain more?

Maybe the idea is something like: there are two underlying variables, 'We'll soon get more ideas' and 'current methods scale.' If we get new ideas soon, then <7 are needed. If we don't but 'current methods scale' is true, 7-12 are needed. If neither variable is true then >12 is needed. So then we read my +12 OOMs post and become convinced that 'current methods scale.' That rules out the >12 hypothesis, but the renormalized mass doesn't go to <7 at all, because it also rules out a similar-sized chunk of the <7 hypothesis (the chunk that involved 'current methods don't scale'). This has the same structure as your 1, 2, 3 example above.

Is this roughly your view? If so, nice, that makes a fair amount of sense to me. I guess I just don't think that the "current methods scale" hypothesis is confined to 7-12 OOMs; I think it is a probability distribution that spans many OOMs starting with mere +1, and my post can be seen as an attempt to upper-bound how high the distribution goes--which then has implications for how low it goes also, if you want to avoid the anti-spikiness objection.

Another angle: I could have made a similar post for +9 OOMs, and a similar one for +6 OOMs, and each would have been somewhat less plausible than the previous. But (IMO) not that much less plausible; if you have 80% credence in +12 then I feel like you should have at least 50% by +9 and at least, idk, 25% by +6. If your credence drops faster than that, you seem overconfident in your ability to extrapolate from current data IMO (or maybe not, I'd certainly love to hear your arguments!)
Review of "Fun with +12 OOMs of Compute"

We do gain evidence on at least some alternatives, but not on all the factors which determine the alternatives. If we know something about those factors, we can't usually just renormalise. That's a good default, but it amounts to an assumption of ignorance.

Here's a simple example:
We play a 'game' where you observe the outcome of two fair coin tosses x and y.
You score:
1 if x is heads
2 if x is tails and y is heads
3 if x is tails and y is tails

So your score predictions start out at:
1 : 50%
2 : 25%
3 : 25%

We look at y and see that it's heads. This rules out 3.
Re... (read more)
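For concreteness, here's a tiny calculation sketch of the coin example (my own illustration, not part of the original comment), comparing the naive "drop hypothesis 3 and renormalise" answer with the correct answer from conditioning on y = heads:

```python
from fractions import Fraction

# Score rule: 1 if x=heads; 2 if x=tails and y=heads; 3 if x=tails and y=tails.
def score(x, y):
    if x == "H":
        return 1
    return 2 if y == "H" else 3

outcomes = [(x, y) for x in "HT" for y in "HT"]  # four equally likely outcomes

# Prior over scores: {1: 1/2, 2: 1/4, 3: 1/4}
prior = {s: sum(Fraction(1, 4) for (x, y) in outcomes if score(x, y) == s)
         for s in (1, 2, 3)}

# Naive update: throw out hypothesis 3 and renormalise, preserving ratios.
naive = {s: p / (1 - prior[3]) for s, p in prior.items() if s != 3}
# naive == {1: 2/3, 2: 1/3}

# Correct update: condition on the actual observation y = heads.
unnorm = {s: sum(Fraction(1, 4) for (x, y) in outcomes
                 if y == "H" and score(x, y) == s) for s in (1, 2, 3)}
z = sum(unnorm.values())
posterior = {s: p / z for s, p in unnorm.items()}
# posterior == {1: 1/2, 2: 1/2, 3: 0}: hypothesis 1 doesn't gain, because the
# same evidence also ruled out the (x=heads, y=tails) half of hypothesis 1.

print(naive, posterior)
```

The naive rule says the credence on score 1 should rise to 2/3; conditioning properly leaves it at 1/2, which is the point of the example.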

3 Daniel Kokotajlo 2mo
Interesting, hmm. In the 1-2-3 coin case, seeing that y is heads rules out 3, but it also rules out half of 1. (There are two '1' hypotheses, the y-heads and the y-tails version.) To put it another way, P(y=heads | 1) = 0.5. So we are ruling-out-and-renormalizing after all, even though it may not appear that way at first glance.

The question is, is something similar happening with the AI OOMs? I think if the evidence leads us to think things like "This doesn't say anything about TAI at +4 OOM, since my prediction is based on orthogonal variables" then that's a point in my favor, right? Or is the idea that the hypotheses ruled out by the evidence presented in the post include all the >12 OOM hypotheses, but also a decent chunk of the <6 OOM hypotheses but not of the 7-12 OOM hypotheses, such that overall the ratio of (our credence in 7-12 OOMs)/(our credence in 0-6 OOMs) increases?

"This makes me near-certain that TAI will happen by +10 OOM, since the +12 OOM argument didn't require more than that" also seems like a point in my favor. FWIW I also had the sense that the +12 OOM argument didn't really require 12 OOMs, it would have worked almost as well with 10.
Review of "Fun with +12 OOMs of Compute"

If you have a bunch of hypotheses (e.g. "It'll take 1 more OOM," "It'll take 2 more OOMs," etc.) and you learn that some of them are false or unlikely (only 10% chance of it taking more than 12), then you should redistribute the mass over all your remaining hypotheses, preserving their relative strengths.

This depends on the mechanism by which you assigned the mass initially - in particular, whether it's absolute or relative. If you start out with specific absolute probability estimates as the strongest evidence for some hypotheses, then you can't just renor... (read more)
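To make the implicit assumption explicit (my own gloss, not something either comment states in these terms): for each surviving hypothesis, Bayes gives

```latex
p(T_k \mid e) \;=\; \frac{p(e \mid T_k)\, p(T_k)}{\sum_j p(e \mid T_j)\, p(T_j)}
```

so "throw out the ruled-out hypotheses and renormalise, preserving relative strengths" is the special case where p(e | T_k) is (roughly) equal across all the surviving T_k. Whether that equality holds is exactly what's in dispute here.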

2 Daniel Kokotajlo 2mo
I don't see why this is. From a Bayesian perspective, alternative hypotheses being ruled out == gaining evidence for a hypothesis. In what sense have we not gained any information on the viability of approach X? We've learned that one of the alternatives to X (the at-least-13-OOM alternative) won't happen.
Suggestions of posts on the AF to review

I don't see a good reason to exclude agenda-style posts, but I do think it'd be important to treat them differently from more here-is-a-specific-technical-result posts.

Broadly, we'd want to be improving the top-level collective AI alignment research 'algorithm'. With that in mind, I don't see an area where more feedback/clarification/critique of some kind wouldn't be helpful.
The questions seem to be:
What form should feedback/review... take in a given context?
Where is it most efficient to focus our efforts?

Productive feedback/clarification on high-level age... (read more)

The Catastrophic Convergence Conjecture

I understand what you mean with the CCC (and that this seems a bit of a nit-pick!), but I think the wording could usefully be clarified.

As you suggest here, the following is what you mean:

CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking"

However, that's not what the CCC currently says.
E.g. compare:
[Unaligned goals] tend to [have catastrophe-inducing optimal policies] because of [power-seeking incentives].
[People teleported to the moon] tend to [die] because of [lack of oxygen].

The latter doesn't lead t... (read more)

A Critique of Non-Obstruction

I think things are already fine for any spike outside S, e.g. paperclip maximiser, since non-obstruction doesn't say anything there.

I actually think saying "our goals aren't on a spike" amounts to a stronger version of my [assume humans know what the AI knows as the baseline]. I'm now thinking that neither of these will work, for much the same reason. (see below)

The way I'm imagining spikes within S is like this:
We define a pretty broad S, presumably implicitly, hoping to give ourselves a broad range of non-obstruction.

For all P in U we later conclude that... (read more)

A Critique of Non-Obstruction

Thinking of corrigibility, it's not clear to me that non-obstruction is quite what I want.
Perhaps a closer version would be something like:
A non-obstructive AI on S needs to do no worse for each P in S than pol(P | off & humans have all the AI's knowledge)

This feels a bit patchy, but in principle it'd fix the most common/obvious issue of the kind I'm raising: that the AI would often otherwise have an incentive to hide information from the users so as to avoid 'obstructing' them when they change their minds.

I think this is more in the spirit of non-obst... (read more)
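A rough formalisation of the proposed baseline (my own notation, only loosely modelled on the post's V and pol; a sketch, not the post's actual definition):

```latex
\text{Original: } \forall P \in S:\quad
  V_P\big(\mathrm{pol}(P)\,\big|\,\text{AI on}\big) \;\ge\;
  V_P\big(\mathrm{pol}(P)\,\big|\,\text{AI off}\big)

\text{Proposed: } \forall P \in S:\quad
  V_P\big(\mathrm{pol}(P)\,\big|\,\text{AI on}\big) \;\ge\;
  V_P\big(\mathrm{pol}(P)\,\big|\,\text{AI off, humans have all the AI's knowledge}\big)
```

The only change is the counterfactual the AI is compared against: withholding knowledge no longer counts as "not obstructing".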

A Critique of Non-Obstruction

If pol(P) sucks, even if the AI is literally corrigible, we still won't reach good outcomes.

If pol(P) sucks by default, a general AI (corrigible or otherwise) may be able to give us information I which:
Makes Vp(pol(P)) much higher, by making [pol(P) given I] suck a whole lot less.
Makes Vq(pol(Q)) a little lower, by having [pol(Q) given I] make concessions to allow pol(P) to perform better.

A non-obstructive AI can't do that, since it's required to maintain the AU for pol(Q).

A simple example is where P and Q currently look the same to us - so our pol(P) and pol(... (read more)
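A toy numerical version of the trade-off (all numbers invented purely for illustration):

```python
# Attainable utility for each goal's policy, with and without the AI sharing
# the extra information I. Numbers are made up for illustration only.
V = {
    "P": {"without_I": 2.0, "with_I": 9.0},   # pol(P) improves a lot given I
    "Q": {"without_I": 10.0, "with_I": 9.5},  # pol(Q) gives up a little given I
}

# Sharing I is a large net gain overall, but it slightly lowers the attainable
# utility for Q, so an AI required to be non-obstructive w.r.t. Q can't do it.
non_obstructive_for_Q = V["Q"]["with_I"] >= V["Q"]["without_I"]
print(non_obstructive_for_Q)  # False: the clearly-good action is ruled out
```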

A Critique of Non-Obstruction

Ok, I think I'm following you (though I am tired, so who knows :)).

For me the crux seems to be:
We can't assume in general that pol(P) isn't terrible at optimising for P. We can "do our best" and still screw up catastrophically.

If assuming "pol(P) is always a good optimiser for P" were actually realistic (and I assume you're not!), then we wouldn't have an alignment problem: we'd be assuming away any possibility of making a catastrophic error.

If we just assume "pol(P) is always  a good optimiser for P" for the purpose of non-obstruction definitions/cal... (read more)

2 Alex Turner 3mo
Er - non-obstruction is a conceptual frame for understanding the benefits we want from corrigibility. It is not a constraint under which the AI finds a high-scoring policy. It is not an approach to solving the alignment problem any more than Kepler's laws are an approach for going to the moon. Generally, broad non-obstruction seems to be at least as good as literal corrigibility. In my mind, the point of corrigibility is that we become more able to wield and amplify our influence through the AI.

If pol(P) sucks, even if the AI is literally corrigible, we still won't reach good outcomes.

I don't see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility in the real world, where pol is pretty reasonable for the relevant goals. I agree it's possible for pol to shoot itself in the foot, but I was trying to give an example situation. I was not claiming that for every possible pol, giving money is non-obstructive against P and -P. I feel like that misses the point, and I don't see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility.

The point of all this analysis is to think about why we want corrigibility in the real world, and whether there's a generalized version of that desideratum. To remark that there exists an AI policy/pol pair which induces narrow non-obstruction, or which doesn't empower pol a whole lot, or which makes silly tradeoffs... I guess I just don't see the relevance of that for thinking about the alignment properties of a given AI system in the real world.
Non-Obstruction: A Simple Concept Motivating Corrigibility

I just saw this recently. It's very interesting, but I don't agree with your conclusions (quite possibly because I'm confused and/or overlooking something). I posted a response here.
The short version being:

Either I'm confused, or your green lines should be spikey.
Any extreme green line spikes within S will be a problem.
Pareto is a poor approach if we need to deal with default tall spikes.

A Critique of Non-Obstruction

Oh it's possible to add up a load of spikes [ETA suboptimal optimisations], many of which hit the wrong target, but miraculously cancel out to produce a flat landscape [ETA "spikes" was just wrong; what I mean here is that you could e.g. optimise for A, accidentally hit B, and only get 70% of the ideal value for A... and counterfactually optimise for B, accidentally hit C, and only get 70% of the ideal value for B... and counterfactually aim for C, hit D etc. etc. so things end up miraculously flat; this seems silly because there's no reason to expect all ... (read more)

Optimal play in human-judged Debate usually won't answer your question

Sure - there are many ways for debate to fail with extremely capable debaters. Though most of the more exotic mind-hack-style outcomes seem a lot less likely once you're evaluating local nodes with ~1000 characters for each debater.

However, all of this comes under my:
I’ll often omit the caveat “If debate works as intended aside from this issue…”

There are many ways for debate to fail. I'm pointing out what happens even if it works.
I.e. I'm claiming that question-ignoring will happen even if the judge is only ever persuaded of true statements, gets a balance... (read more)

Optimal play in human-judged Debate usually won't answer your question

Oh yes - I didn't mean to imply otherwise.

My point is only that there'll be many ways to slide an answer pretty smoothly between [direct answer] and [useful information]. Splitting into [Give direct answer with (k - x) bits] [Give useful information with x bits] and sliding x from 0 to k is just the first option that occurred to me.

In practice, I don't imagine the path actually followed would look like that. I was just sanity-checking by asking myself whether a discontinuous jump is necessary to get to the behaviour I'm suggesting: I'm pretty confident it's not.

Problems with Counterfactual Oracles

A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This is a problem if it's using FDT/UDT. Conditions for the myopic approach to work seem to require CDT (or something similar). Then there's no automatic desire for future versions to succeed or expectation that past versions will have acted to release the ... (read more)
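For readers who haven't seen it, a minimal sketch of the counterfactual-oracle reward scheme under discussion (simplified; EPSILON and the scorer are placeholders, and the real proposal has more moving parts):

```python
import random

EPSILON = 0.01  # probability of the erasure event E (humans never see the answer)

def counterfactual_oracle_episode(answer, automatic_scorer):
    """One episode: the oracle is rewarded only when its answer is never shown."""
    if random.random() < EPSILON:
        # Erasure event E: the answer is hidden from humans and scored by an
        # automatic machine; this is the only case with non-zero reward.
        return automatic_scorer(answer), False  # (reward, humans_see_answer)
    # Normal case: humans read the answer, but the oracle's reward is 0, so a
    # CDT/myopic oracle has no incentive to shape what the humans then do.
    return 0.0, True
```

The FDT/UDT worry quoted above is that an oracle reasoning about future near-identical copies of itself would still care about what humans do with the shown answers; the CDT/myopic condition is what blocks that.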

2 Michaël Trazzi 2y
The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won't think about future versions, and I don't see any optimization pressure towards an escape message nor disproportionately common "escape message" regions.

However, it still assumes we have access to this magic oracle that optimizes for R′ = R · I_E, where E is the event in which humans don't see the answer, I_E is its indicator function, and R is the counterfactual reward (given by the automatic machine). If humans were able to build an oracle performing optimally (w.r.t. R′) from day 1, then humans would be able to specify some kind of "god oracle". The rest of the design seems to be just "how to interact with a god oracle so that humans are not influenced by the answers".

In practice, you'll want something that is able to learn from its (question, prediction, reward) history. That's why there is this automatic machine rewarding the oracle with some probability ϵ. In an online learning setting, most of the time the model gets r=0 (probability 1−ϵ), and it sometimes gets some useful feedback r>0 (probability ϵ). Therefore, if ϵ is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random. Even worse, the (question, prediction, r=0) datapoints in the history could get so misleading that the oracle never learns anything.

Now, if we're not in an online learning process but instead there is a separation between a "training phase" and a "deployment phase where the AI continues to learn with probability ϵ", then the setup is just "have a model that learns to do the useful stuff in sandbox, and then have the weights (almost) fixed in deployment".

In short, I think the CDT setup without machine learning assumes the problem already solved, that online learning won't work and is unsafe, which leaves us with a "training then deployment" setup that isn't
Beyond Astronomical Waste

That seems right.

I'd been primarily thinking about more simple-minded escape/uplift/signal-to-simulators influence (via this us), rather than UDT-influence. If we were ever uplifted or escaped, I'd expect it'd be into a world-like-ours. But of course you're correct that UDT-style influence would apply immediately.

Opportunity costs are a consideration, though there may be behaviours that'd increase expected value in both direct-embeddings and worlds-like-ours. Win-win behaviours could be taken early.

Personally, I'd expect this ... (read more)

Asymptotically Unambitious AGI

Ah yes - I was confusing myself at some point between forming and using a model (hence "incentives").

I think you're correct that "perfectly useful" isn't going to happen. I'm happy to be wrong.

"the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual"

I don't think you'd be able to formalize this in general, since I imagine it's not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch... (read more)

Asymptotically Unambitious AGI

Just obvious and mundane concerns:

You might want to make clearer that "As long as the door is closed, information cannot leave the room" isn't an assumption but a requirement of the setup. I.e. that you're not assuming based on your description that opening the door is the only means for an operator to get information out; you're assuming every other means of information escape has been systematically accounted for and ruled out (with the assumption that the operator has been compromised by the AI).

Asymptotically Unambitious AGI

[Quite possibly I'm confused, but in case I'm not:]
I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).

The AI has an incentive to understand the operator's mind, since this bears directly on its reward.
Better understanding the operator's mind might be achieved in part by running simulations including the operator.
One specific simulation would involve simulating the operator's environment and actions after he leaves the room.

Here this isn't done to understand the implications of his acti... (read more)

2 michaelcohen 2y
I wouldn't really use the term "incentives" to describe the free-for-all among world-models as they compete to be maximum a posteriori. All they have to do is output observations and rewards in a distribution that matches the objective probabilities. But I think we arrive at the same possibility: you'll see in the algorithm for ν⋆ that it does simulate the outside-world.

I do acknowledge in the paper that some of the outside-world simulation that a memory-based world-model does when it's following the "wrong path" may turn out to be useful; all that is required for the argument to go through is that this simulation is not perfectly useful--there is a shorter computation that accomplishes the same thing.

I would love it if this assumption could look like "the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual" and make assumption 2 into a lemma that follows from it, but I couldn't figure out how to formalize this.
Beyond Astronomical Waste

Thanks. I agree with your overall conclusions.

On the specifics, Bostrom's simulation argument is more than just a parallel here: it has an impact on how rich we might expect our direct parent simulator to be.

The simulation argument applies similarly to one base world like ours, or to an uncountable number of parallel worlds embedded in Tegmark IV structures. Either way, if you buy case 3, the proportion of simulated-by-a-world-like-ours worlds rises close to 1 (I'm counting worlds "depth-first", since it seems most intuitive, and infini... (read more)

2 Wei Dai 2y
I'm not sure it makes sense to talk about "expect" here. (I'm confused about anthropics and especially about first-person subjective expectations.)

But if you take the third-person UDT-like perspective here, we're directly embedded in some hugely richer base structures, and also indirectly embedded via N levels of worlds-like-ours, and having more of the latter doesn't reduce how much value (in the UDT-utility sense) we can gain by influencing the former; it just gives us more options that we can choose to take or not. In other words, we always have the option of pretending the latter don't exist and just optimizing for exerting influence via the direct embeddings.

On second thought, it does increase the opportunity cost of exerting such influence, because we'd be spending resources in both the directly-embedded worlds and the indirectly-embedded worlds to do that. To get around this, the eventual superintelligence doing this could wait until such a time in our universe that Bostrom's proposition 3 isn't true anymore (or true to a lesser extent) before trying to influence richer universes, since presumably only the historically interesting periods of our universe are heavily simulated by worlds-like-ours.