Christiano, Cotra, and Yudkowsky on AI progress

My understanding is that Sputnik was a big discontinuous jump in "distance which a payload (e.g. a nuclear bomb) can be delivered" (or at least it was a conclusive proof-of-concept of a discontinuous jump in that metric). That metric was presumably under heavy optimization pressure at the time, and was the main reason for strategic interest in Sputnik, so it lines up very well with the preconditions for the continuous view.

Christiano, Cotra, and Yudkowsky on AI progress

My version of it (which may or may not be Paul's version) predicts that in domains where people are putting in lots of effort to optimize a metric, that metric will grow relatively continuously. In other words, the more effort put in to optimize the metric, the more you can rely on straight lines for that metric staying straight (assuming that the trends in effort are also staying straight).

This is super helpful, thanks. Good explanation.

With this formulation of the "continuous view", I can immediately think of places where I'd bet against it. The first which springs to mind is aging: I'd bet that we'll see a discontinuous jump in achievable lifespan of mice. The gears here are nicely analogous to AGI too: I expect that there's a "common core" (or shared cause) underlying all the major diseases of aging, and fixing that core issue will fix all of them at once, in much the same way that figuring out the "core" of intelligence will lead to a big discontinuous jump in AI capabilities. I can also point to current empirical evidence for the existence of a common core in aging, which might suggest analogous types of evidence to look at in the intelligence context.

Thinking about other analogous places... presumably we saw a discontinuous jump in flight range when Sputnik entered orbit. That one seems extremely closely analogous to AGI. There it's less about the "common core" thing, and more about crossing some critical threshold. Nuclear weapons and superconductors both stand out a priori as places where we'd expect a critical-threshold-related discontinuity, though I don't think people were optimizing hard enough in superconductor-esque directions for the continuous view to make a strong prediction there (at least for the original discovery of superconductors).

Christiano, Cotra, and Yudkowsky on AI progress

Some thinking-out-loud on how I'd go about looking for testable/bettable prediction differences here...

I think my models overlap mostly with Eliezer's in the relevant places, so I'll use my own models as a proxy for his, and think about how to find testable/bettable predictions with Paul (or Ajeya, or someone else in their cluster).

One historical example immediately springs to mind where something-I'd-consider-a-Paul-esque-model utterly failed predictively: the breakdown of the Phillips curve. The original Phillips curve was based on just fitting a curve to inflation-vs-unemployment data; Friedman and Phelps both independently came up with theoretical models for that relationship in the late sixties ('67-'68), and Friedman correctly forecast that the curve would break down in the next recession (i.e. the "stagflation" of '73-'75). This all led up to the Lucas Critique, which I'd consider the canonical case-against-what-I'd-call-Paul-esque-worldviews within economics. The main idea which seems transportable to other contexts is that surface relations (like the Phillips curve) break down under distribution shifts in the underlying factors.

So, how would I look for something analogous to that situation in today's AI? We need something with an established trend, but where a distribution shift happens in some underlying factor. One possible place to look: I've heard that OpenAI plans to make the next generation of GPT not actually much bigger than the previous generation; they're trying to achieve improvement through strategies other than Stack More Layers. Assuming that's true, it seems like a naive Paul-esque model would predict that the next GPT is relatively unimpressive compared to e.g. the GPT-2 -> GPT-3 delta? Whereas my models (or I'd guess Eliezer's models) would predict that it's relatively more impressive, compared to the expectations of Paul-esque models (derived by e.g. extrapolating previous performance as a function of model size and then plugging in actual size of the next GPT)? I wouldn't expect either view to make crisp high-certainty predictions here, but enough to get decent Bayesian evidence.
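To make the "extrapolate performance as a function of model size" step concrete, here's a minimal sketch of that kind of naive extrapolation: fit a power law loss(N) = a * N^(-b) to earlier models' (size, loss) pairs in log-log space, then plug in the next model's parameter count. The sizes and losses below are made-up placeholder numbers for illustration, not actual GPT benchmark results.

```python
import numpy as np

# hypothetical (parameter count, eval loss) pairs for two earlier models
sizes  = np.array([1.5e9, 1.75e11])   # roughly GPT-2-scale and GPT-3-scale
losses = np.array([3.5, 2.5])         # illustrative losses, not real data

# a power law loss = a * N**slope is linear in log-log space:
# log(loss) = log(a) + slope * log(N), so fit a degree-1 polynomial there
slope, log_a = np.polyfit(np.log(sizes), np.log(losses), 1)

def predict(n):
    """Extrapolated loss for a model with n parameters."""
    return np.exp(log_a) * n ** slope

# predicted loss if the next model is only ~2x bigger than the last one
print(predict(3e11))
```

The "Paul-esque" prediction is then whatever this curve says for the actual (reportedly modest) size of the next model; the disagreement is about whether the real system beats that extrapolation.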

Other than distribution shifts, the other major place I'd look for different predictions is in the extent to which aggregates tell us useful things. The post got into that in a little detail, but I think there's probably still room there. For instance, I recently sat down and played with some toy examples of GDP growth induced by tech shifts, and I was surprised by how smooth GDP was even in scenarios with tech shifts which seemed very impactful to me. I expect that Paul would be even more surprised by this if he were to do the same exercise. In particular, this quote seems relevant:

the point is that housing and healthcare are not central examples of things that scale up at the beginning of explosive growth, regardless of whether it's hard or soft

It is surprisingly difficult to come up with a scenario where GDP growth looks smooth AND housing+healthcare don't grow much AND GDP growth accelerates to a rate much faster than now. If everything except housing and healthcare are getting cheaper, then housing and healthcare will likely play a much larger role in GDP (and together they're 30-35% already), eventually dominating GDP. This isn't a logical necessity; in principle we could consume so much more of everything else that the housing+healthcare share shrinks, but I think that would probably diverge from past trends (though I have not checked). What I actually expect is that as people get richer, they spend a larger fraction on things which have a high capacity to absorb marginal income, of which housing and healthcare are central examples.

If housing and healthcare aren't getting cheaper, and we're not spending a smaller fraction of income on them (by buying way way more of the things which are getting cheaper), then that puts a pretty stiff cap on how much GDP can grow.
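The arithmetic behind that cap can be seen in a toy two-sector model: one sector ("housing+healthcare") grows slowly with flat prices, the other grows fast in quantity while its price collapses. Chain-weighted real GDP growth is roughly the nominal-share-weighted average of the sectors' quantity growth rates, so as the cheap sector's nominal share shrinks, aggregate growth gets dragged down toward the slow sector's rate. All the parameter values here are illustrative assumptions, not empirical estimates.

```python
def simulate(years=50, share_a=0.33, gq_a=0.02, gq_b=0.25, gp_b=-0.25):
    """Toy two-sector model.

    Sector A ("housing+healthcare"): quantity grows gq_a per year, price flat.
    Sector B ("everything else"): quantity grows gq_b, price falls by gp_b.
    Returns the path of chain-weighted real GDP growth rates.
    """
    nom_a, nom_b = share_a, 1 - share_a   # nominal outputs, normalized to 1
    growth_path = []
    for _ in range(years):
        s_a = nom_a / (nom_a + nom_b)     # A's nominal share of GDP
        # real GDP growth ~ share-weighted quantity growth
        growth_path.append(s_a * gq_a + (1 - s_a) * gq_b)
        nom_a *= (1 + gq_a)                       # A: flat price, slow quantity
        nom_b *= (1 + gq_b) * (1 + gp_b)          # B: price falls as it gets cheap
    return growth_path

path = simulate()
print(f"early growth: {path[0]:.3f}")   # fast while B's nominal share is large
print(f"late growth:  {path[-1]:.3f}")  # capped near A's slow quantity growth
```

Even with the fast sector's quantities exploding, the headline growth rate converges toward the housing+healthcare rate once that sector dominates nominal GDP, which is the "stiff cap" in the argument above.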

Zooming out a meta-level, I think GDP is a particularly good example of a big aggregate metric which approximately-always looks smooth in hindsight, even when the underlying factors of interest undergo large jumps. I think Paul would probably update toward that view if he spent some time playing around with examples (similar to this post).

Similarly, I've heard that during training of GPT-3, while aggregate performance improves smoothly, performance on any particular task (like e.g. addition) is usually pretty binary - i.e. performance on any particular task tends to jump quickly from near-zero to near-maximum-level. Assuming this is true, presumably Paul already knows about it, and would argue that what matters-for-impact is ability at lots of different tasks rather than one (or a few) particular tasks/kinds-of-tasks? If so, that opens up a different line of debate, about the extent to which individual humans' success today hinges on lots of different skills vs a few, and in which areas.

Yudkowsky and Christiano discuss "Takeoff Speeds"

FWIW, I did not find this weirdly uncharitable, only mildly uncharitable. I have extremely wide error bars on what you have and have not read, and "Eliezer has not read any of the things on that list" was within those error bars. It is really quite difficult to guess your epistemic state w.r.t. specific work when you haven't been writing about it for a while.

(Though I guess you might have been writing about it on Twitter? I have no idea, I generally do not use Twitter myself, so I might have just completely missed anything there.)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I feel like the debate between EY and Paul (and the broader debate about fast vs. slow takeoff) has been frustratingly much reference class tennis and frustratingly little gears-level modelling.

So, there's this inherent problem with deep gearsy models, where you have to convey a bunch of upstream gears (and the evidence supporting them) before talking about the downstream questions of interest, because if you work backwards then peoples' brains run out of stack space and they lose track of the whole multi-step path. But if you just go explaining upstream gears first, then people won't immediately see how they're relevant to alignment or timelines or whatever, and then lots of people just wander off. Then you go try to explain something about alignment or timelines or whatever, using an argument which relies on those upstream gears, and it goes right over a bunch of peoples' heads because they don't have that upstream gear in their world-models.

For the sort of argument in this post, it's even worse, because a lot of people aren't even explicitly aware that the relevant type of gear is a thing, or how to think about it beyond a rough intuitive level.

I first ran into this problem in the context of takeoff arguments a couple years ago, and wrote up this sequence mainly to convey the relevant kinds of gears and how to think about them. I claim that this (i.e. constraint slackness/tautness) is usually a good model for gear-type in arguments about reference-classes in practice: typically an intuitively-natural reference class is a set of cases which share some common constraint, and the examples in the reference class then provide evidence for the tautness/slackness of the constraint. For instance, in this post, Paul often points to market efficiency as a taut constraint, and Eliezer argues that constraint is not very taut (at least not in the way needed for the slow takeoff argument). Paul's intuitive estimates of tautness are presumably driven by things like e.g. financial markets. On the other side, Eliezer wrote Inadequate Equilibria to talk about how taut market efficiency is in general, including gears "further up" and more examples.

If you click through the link in the post to Intelligence Explosion Microeconomics, there's a lot of this sort of reasoning in it.

Ngo and Yudkowsky on AI capability gains

I'm guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:

  • do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
  • does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
  • does the bucket become simpler and more elegant with each new idea that fits into it?

Sounds like you should try writing it.

Corrigibility Can Be VNM-Incoherent

Does broad corrigibility imply VNM-incoherence?

Yes, unless the state reward function is constant and we only demand weak corrigibility to all policies.

Given that this is the main result, I feel like the title "Corrigibility Can Be VNM-Incoherent" is rather dramatically understating the case. Maybe something like "Corrigibility Is Never Nontrivially VNM-Coherent In MDPs" would be closer. Or maybe just drop the hedging and say "Corrigibility Is Never VNM-Coherent In MDPs", since the constant-utility case is never interesting anyway.

Ngo and Yudkowsky on AI capability gains

Potentially important thing to flag here: at least in my mind, expected utility theory (i.e. the property Eliezer was calling "laser-like" or "coherence") and consequentialism are two distinct things. Consequentialism will tend to produce systems with (approximately) coherent expected utilities, and that is one major way I expect coherent utilities to show up in practice. But coherent utilities can in principle occur even without consequentialism (e.g. conservative vector fields in physics), and consequentialism can in principle fail to be very coherent (e.g. if the system has tons of resources and doesn't have to be very efficient to achieve a goal-state).

(I'm not sure whether Eliezer would agree with this. The thing-I-think-Eliezer-means-by-consequentialism does not yet have a good mathematical formulation which I know of, which makes it harder to check that two people even mean the same thing when pointing to the concept.)

Ngo and Yudkowsky on AI capability gains

To be clear, this part:

It's one of those predictions where, if it's false, then we've probably discovered something interesting - most likely some place where an organism is spending resources to do something useful which we haven't understood yet.

... is also intended as a falsifiable prediction. Like, if we go look at the anomaly and there's no new thing going on there, then that's a very big strike against expected utility theory.

This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but "there's a phenomenon which breaks one of the modelling assumptions in a way noncentral to the main theory" is a major way the predictions can fail. But then we expect to be able to go look and find the violation of that noncentral modelling assumption, which would itself yield some interesting information. If we don't find such a violation, it's a big strike against the theory.

Ngo and Yudkowsky on alignment difficulty

I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.

(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way humans do it; I'd guess at most 1000 things on the order of complexity of a very fuzzy face detector are required, and probably fewer than 100.)

The reason it's less learnable than competence is not that alignment is much more complex, but that it's harder to generate a robust reward signal for alignment. Basically any sufficiently-complex long-term reward signal should incentivize competence. But the vast majority of reward signals do not incentivize alignment. In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.

(I'll note that the departure from talking about Hidden Complexity here is mainly because competence in particular is a special case where "complexity" plays almost no role, since it's incentivized by almost any reward. Hidden Complexity is still usually the right tool for talking about why any particular reward-signal will not incentivize alignment.)

I suspect that Eliezer's answer to this would be different, and I don't have a good guess what it would be.
