Matthew "Vaniver" Graves


Worst-case thinking in AI alignment

When you’re considering between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the derivative of P(doom) with respect to alignment progress is maximized at 50%.

Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.

And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected.

This section seems reversed to me, unless I'm misunderstanding it. If "things as I expect" are P(doom) 99%, and "I'm pleasantly wrong about the usefulness of natural abstractions" is P(doom) 50%, the first paragraph suggests I should work on the "better than expected" / "surprisingly good" world, because the marginal impact of effort is higher in that world.

[Another way to think about it: being surprised in the direction you already expect is extremizing, but logistic success has its highest derivative in the middle, i.e. it is a moderating force.]
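
[To make the arithmetic concrete, here is a minimal sketch, assuming P(success) = 1 - P(doom) is a logistic function of some "alignment progress" variable x. The function names and the three P(doom) values are just illustrations, not anything from the post. Under that assumption the slope is p * (1 - p), which peaks at 50% and is tiny at both extremes.]

```python
import math

def logistic(x):
    """Assumed model: P(success) as a logistic function of progress x."""
    return 1.0 / (1.0 + math.exp(-x))

def marginal_gain(p_success):
    """Slope of the logistic curve at the point where it equals p_success.

    For p = logistic(x), dp/dx = p * (1 - p).
    """
    return p_success * (1.0 - p_success)

def numeric_slope(x, h=1e-6):
    """Central finite difference, used only as a sanity check on the formula."""
    return (logistic(x + h) - logistic(x - h)) / (2 * h)

# Marginal value of a unit of progress in worlds with different P(doom).
gains = {p_doom: marginal_gain(1.0 - p_doom) for p_doom in (0.01, 0.50, 0.99)}

for p_doom, gain in gains.items():
    print(f"P(doom) = {p_doom:.2f}: marginal gain = {gain:.4f}")
```

So if my mainline is 99%, the higher-leverage worlds are the ones closer to 50%, i.e. the ones that are better than I expected, and the same holds mirrored for someone whose mainline is 1%.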

There is essentially one best-validated theory of cognition.

Why do they separate out the auditory world and the environment?

Christiano, Cotra, and Yudkowsky on AI progress

So it looks like the R-7 (which launched Sputnik) was the first ICBM, and the range is way longer than the V-2s of ~15 years earlier, but I'm not easily finding a graph of range over those intervening years. (And the R-7 range is only about double the range of a WW2-era bomber, which further smooths the overall graph.)

[And, implicitly, the reason we care about ICBMs is because the US and the USSR were on different continents; if the distance between their major centers was comparable to England and France's distance instead, then the same strategic considerations would have been hit much sooner.]

Christiano, Cotra, and Yudkowsky on AI progress

presumably we saw a discontinuous jump in flight range when Sputnik entered orbit.

While I think orbit is the right sort of discontinuity for this, I think you need to specify 'flight range' in a way that clearly favors orbits for this to be correct, mostly because about a month earlier a manhole cover was launched/vaporized by a nuke.

[But in terms of something like "altitude achieved", I think Sputnik is probably part of a continuous graph, and probably not the most extreme member of the graph?]

Yudkowsky and Christiano discuss "Takeoff Speeds"

your point is simply that it's hard to predict when that will happen when you just look at the Penn Treebank trend.

This is a big part of my point; a smaller elaboration is that it can be easy to trick yourself into thinking that, because you understand what will happen with PTB, you'll understand what will happen with economics/security/etc., when in fact you don't have much understanding of the connection between those, and there might be significant discontinuities. [To be clear, I don't have much understanding of this either; I wish I did!]

For example, I imagine that, by thirty years from now, we'll have language/code models that can do significant security analysis of the code that was available in 2020, and that this would have been highly relevant/valuable to people in 2020 interested in computer security. But when will this happen in the 2020-2050 range that seems likely to me? I'm pretty uncertain, and I expect this to look a lot like 'flicking a switch' in retrospect, even tho the leadup to flicking that switch will probably look like smoothly increasing capabilities on 'toy' problems.

[My current guess is that Paul / people in "Paul's camp" would mostly agree with the previous paragraph, except for thinking that it's sort of weird to focus on specifically AI computer security productivity, rather than the overall productivity of the computer security ecosystem, and this misplaced focus will generate the 'flipping the switch' impression. I think most of the disagreements are about 'where to place the focus', and this is one of the reasons it's hard to find bets; it seems to me like Eliezer doesn't care much about the lines Paul is drawing, and Paul doesn't care much about the lines Eliezer is drawing.]

However, I suspect that the revenue curve will look pretty continuous, now that it's gone from zero to one. Do you disagree?

I think I agree in a narrow sense and disagree in a broad sense. For this particular example, I expect OpenAI's revenues from GPT-3 to look roughly continuous now that they're selling/licensing it at all (until another major change happens; like, the introduction of a competitor would likely cause the revenue trend to change).

More generally, suppose we looked at something like "the total economic value of horses over the course of human history". I think we would see mostly smooth trends plus some implied starting and stopping points for those trends. (Like, "first domestication of a horse" probably starts a positive trend, "invention of stirrups" probably starts another positive trend, "introduction of horses to America" starts another positive trend, "invention of the automobile" probably starts a negative trend that ends with "last horse that gets replaced by a tractor/car".)

In my view, 'understanding the world' looks like having a causal model that you can imagine variations on (and have those imaginations be meaningfully grounded in reality), and I think the bits that are most useful for building that causal model are the starts and stops of the trends, rather than the smooth adoption curves or mostly steady equilibria in between. So it seems sort of backwards to me to say that for most of the time, most of the changes in the graph are smooth, because what I want out of the graph is to figure out the underlying generator, where the non-smooth bits are the most informative. The graph itself only seems useful as a means to that end, rather than an end in itself.

Yudkowsky and Christiano discuss "Takeoff Speeds"

it seems like extrapolating from the past still gives you a lot better of a model than most available alternatives.

My impression is that some people are impressed by GPT-3's capabilities, whereas your response is "ok, but it's part of the straight-line trend on Penn Treebank; maybe it's a little ahead of schedule, but nothing to write home about." But clearly you and they are focused on different metrics! 

That is, suppose it's the case that GPT-3 is the first successfully commercialized language model. (I think in order to make this literally true you have to throw on additional qualifiers that I'm not going to look up; pretend I did that.) So on a graph of "language model of type X revenue over time", total revenue is static at 0 for a long time and then shortly after GPT-3's creation departs from 0.

It seems like the fact that GPT-3 could be commercialized in this way when GPT-2 couldn't is a result of something that Penn Treebank perplexity is sort of pointing at. (That is, it'd be hard to get a model with GPT-3's commercializability but GPT-2's Penn Treebank score.) But what we need in order for the straight line on PTB to be useful as a model for predicting revenue is to know ahead of time what PTB threshold you need for commercialization. 

And so this is where the charge of irrelevancy is coming from: yes, you can draw straight lines, but they're straight lines in the wrong variables. In the interesting variables (from the "what's the broader situation?" worldview), we do see discontinuities, even if there are continuities in different variables.

[As an example of the sort of story that I'd want, imagine we drew the straight line of ELO ratings for Go-bots, had a horizontal line of "human professionals" on that line, and were able to forecast the discontinuity in "number of AI wins against human grandmasters" by looking at straight-line forecasts in ELO.]
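
[A minimal sketch of that story, with made-up numbers. The Elo expected-score formula is the standard one; the human-professional rating, the bot's starting rating, and the points-per-year slope are all hypothetical placeholders, not real data. The point is that a straight line in rating plus a known threshold turns into a sharp switch in win probability at a date you can forecast in advance.]

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical values, chosen only for illustration.
HUMAN_PRO_RATING = 3500   # horizontal "human professionals" line
BOT_START_RATING = 2000   # bot rating at year 0
BOT_POINTS_PER_YEAR = 300 # slope of the straight-line trend

def bot_rating(year):
    """Straight-line extrapolation of the bot's rating."""
    return BOT_START_RATING + BOT_POINTS_PER_YEAR * year

def crossover_year():
    """Year at which the trend line meets the human-professional line."""
    return (HUMAN_PRO_RATING - BOT_START_RATING) / BOT_POINTS_PER_YEAR

# Forecast win probability against a human professional, year by year.
win_prob = [expected_score(bot_rating(year), HUMAN_PRO_RATING) for year in range(9)]

for year, p in enumerate(win_prob):
    print(f"year {year}: bot rating {bot_rating(year)}, P(win vs pro) = {p:.3f}")
print("forecast crossover year:", crossover_year())
```

What makes this work is that the threshold (the human line) is known ahead of time and lives on the same axis as the trend; that's exactly the piece that's missing when going from PTB perplexity to revenue.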

Ngo and Yudkowsky on AI capability gains

The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not adapted. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. 

Yeah, this seems reasonable to me. I think "how could you tell that theory is relevant to this domain?" seems like a reasonable question in a way that "what predictions does that theory make?" seems like it's somehow coming at things from the wrong angle.

Ngo and Yudkowsky on AI capability gains

And even if I feel what you're gesturing at, this sounds/looks like you're saying "even if my prediction is false, that doesn't mean that my theory would be invalidated". 

So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is "you can't make an engine more efficient than a Carnot engine." Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be "oh, thermodynamics is wrong", and instead it's going to be "oh, this engine is making use of some unseen source."

[Of course, you can show me enough such engines that I end up convinced, or show me the different theoretical edifice that explains both the old observations and these new engines.]
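
[For concreteness, a small sketch of the check I have in mind. The Carnot limit for an engine running between a hot reservoir at T_hot and a cold reservoir at T_cold (in kelvin) is 1 - T_cold / T_hot; the helper names and the example temperatures are hypothetical.]

```python
def carnot_efficiency(t_hot_k, t_cold_k):
    """Maximum efficiency of a heat engine between two reservoirs (kelvin)."""
    if not 0 < t_cold_k < t_hot_k:
        raise ValueError("need 0 < t_cold_k < t_hot_k, both in kelvin")
    return 1.0 - t_cold_k / t_hot_k

def exceeds_carnot(claimed_efficiency, t_hot_k, t_cold_k):
    """True if a claimed efficiency is above the Carnot limit.

    A True result is a reason to go looking for an unaccounted-for
    energy source or a measurement error, not to discard thermodynamics.
    """
    return claimed_efficiency > carnot_efficiency(t_hot_k, t_cold_k)

# Example: reservoirs at 600 K and 300 K give a limit of 0.5.
print(carnot_efficiency(600.0, 300.0))
print(exceeds_carnot(0.6, 600.0, 300.0))
```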

What I want is a way of finding the parts of the theory/model/prediction that could actually invalidate it, because that's what we should be discussing really. (A difficulty might be that such theories are so fundamental and powerful that being able to see them makes it really hard to find any way they could go wrong and endanger the theory)

So, later Eliezer gives "addition" as an example of a deep fundamental theory. And... I'm not sure I can imagine a universe where addition is wrong? Like, I can say "you would add 2 and 2 and get 5" but that sentence doesn't actually correspond to any universes.

Like, similarly, I can imagine universes where evolution doesn't describe the historical origin of species in that universe. But I can't imagine universes where the elements of evolution are present and evolution doesn't happen.

[That said, I can imagine universes with Euclidean geometry and different universes with non-Euclidean geometry, so I'm not trying to claim this is true of all deep fundamental theories, but maybe the right way to think about this is "geometry except for the parallel postulate" is the deep fundamental theory.]

Ngo and Yudkowsky on AI capability gains

It's taking a massive massive failure and trying to find exactly the right abstract gloss to put on it that makes it sound like exactly the right perfect thing will be done next time.

I feel like Ngo didn't really respond to this?

Like, later he says: 

Right, I'm not endorsing this as my mainline prediction about what happens. Mainly what I'm doing here is highlighting that your view seems like one which cherrypicks pessimistic interpretations.

But... Richard, are you endorsing it as 'at all in line with the evidence?' Like, when I imagine living in that world, it doesn't have gain-of-function research, which our world clearly does. [And somehow this seems connected to Eliezer's earlier complaints, where it's not obvious to me that when you wrote the explanation, your next step was to figure out what that would actually imply and check if it were true or not.]

Adele Lopez's Shortform

So we would need to figure out how to robustly get an honest signal from such an experiment, which still seems quite hard. But perhaps it's easier than solving the full alignment problem before the first shot.

IMO this is an 'additional line of defense' boxing strategy rather than a simplification.

Note that in the traditional version, the 'dud' bit of the bomb can only be the trigger; a bomb that absorbs the photon but then explodes isn't distinguishable from a bomb that absorbs the photon and then doesn't explode (because of an error deeper in the bomb).

But let's suppose the quantum computing folks can come up with something like this, where we keep some branches entangled and run analysis of the AI code in only one branch, causing an explosion there but affecting the total outcome in all branches. [It seems pretty implausible to me that you could maintain entanglement despite that much impact on the external world, but maybe it's possible.] Then 1) as you point out, we need to ensure that the AI doesn't realize what it needs to output in that branch and 2) we need some sort of way to evaluate "did the AI pass our checks or not?".

But, 2 is "the whole problem"!
