Edouard Harris

Independent researcher.


AI Tracker: monitoring current and near-future risks from superscale models

Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields.

Thanks for the clarification — definitely agree with this.

If you'd like to visualize trends though, you'll need more historical data points, I think.

Yeah, you're right. Our thinking was that we'd be able to do this with future data points or by increasing the "density" of points within the post-GPT-3 era, but ultimately it will probably be necessary (and more compelling) to include somewhat older examples too.

AI Tracker: monitoring current and near-future risks from superscale models

Interesting; I hadn't heard of DreamerV2. From a quick look at the paper, it looks like one might describe it as a step on the way to something like EfficientZero. Does that sound roughly correct?

it would be great to see older models incorporated as well

We may extend this to older models in the future. But our goal right now is to focus on these models' public safety risks as standalone (or nearly standalone) systems. And prior to GPT-3, it's hard to find models whose public safety risks were meaningful on a standalone basis — while an earlier model could have been used as part of a malicious act, for example, it wouldn't be as central to such an act as a modern model would be.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Yeah, these are interesting points.

Isn't it a bit suspicious that the thing-that's-discontinuous is hard to measure, but the-thing-that's-continuous isn't? I mean, this isn't totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.

I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I'm not sure I entirely agree that discontinuous capabilities are necessarily hard to measure: for example, there are benchmarks available for things like arithmetic, which one can train on and make quantitative statements about.

I think the key to the discontinuity question is rather that 1) model scaling happens in discrete jumps; and 2) everything is S-curves, and even a discontinuity has a linear regime if you zoom in enough. Those two things together mean that, while a capability like arithmetic might have a continuous performance regime on some domain, in practice you can find yourself halfway up the performance curve in a single scaling jump (and this is in fact what happened with arithmetic and GPT-3). So the risk, as I understand it, is that you end up surprisingly far up the scale of "world-ending" capability from one generation to the next, with no detectable warning shot beforehand.
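To illustrate the point numerically, here's a toy sketch. All numbers are hypothetical (the midpoint and steepness are made up, not fit to any real benchmark): the capability curve is continuous everywhere, but because model generations sample it at discrete ~10x scaling jumps, a single jump can land you most of the way up the curve.

```python
import math

def capability(log10_params, midpoint=11.0, steepness=4.0):
    """Hypothetical S-curve mapping log10(parameter count) to capability.
    Illustrative parameters only, not fit to any real benchmark."""
    return 1.0 / (1.0 + math.exp(-steepness * (log10_params - midpoint)))

# Model generations arrive in discrete ~10x scaling jumps:
for log_n in [9.0, 10.0, 11.0, 12.0]:  # 1B, 10B, 100B, 1T parameters
    print(f"10^{log_n:.0f} params -> capability {capability(log_n):.3f}")
```

Under these made-up parameters, the jump from 10^10 to 10^11 parameters takes performance from roughly 2% to 50% of the way up the curve: nothing discontinuous happened in the underlying function, but the discrete sampling makes it look like a phase change, with essentially no warning at the previous generation.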

"No one predicted X in advance" is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul's worldview. But -- and maybe I missed something -- I didn't see that. Did you?

No, you're right as far as I know; at least I'm not aware of any such attempted predictions. And in fact, the very absence of such prediction attempts is interesting in itself. One would imagine that correctly predicting the capabilities of an AI from its scale ought to be a phenomenally valuable skill — not just from a safety standpoint, but from an economic one too. So why, indeed, didn't we see people make such predictions, or at least try to?

There could be several reasons. For example, perhaps Paul (and other folks who subscribe to the "continuum" world-model) could have done it, but they were unaware of the enormous value of their predictive abilities. That seems implausible, so let's assume they knew the value of such predictions would be huge. But if you know the value of doing something is huge, why aren't you doing it? Well, if you're rational, there's only one reason: you aren't doing it because it's too hard, or otherwise too expensive compared to your alternatives. So we are forced to conclude that this world-model — by its own implied self-assessment — has, so far, proved inadequate to generate predictions about the kinds of capabilities we really care about.

(Note: you could make the argument that OpenAI did make such a prediction, in the approximate yet very strong sense that they bet big on a meaningful increase in aggregate capabilities from scale, and won. You could also make the argument that Paul, having been at OpenAI during the critical period, deserves some credit for that decision. I'm not aware of Paul ever making this argument, but if made, it would be a point in favor of such a view and against my argument above.)

Yudkowsky and Christiano discuss "Takeoff Speeds"

I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like "code a simple video game" and "summarize movies with emojis", they also include things like "break out of confinement and kill everyone". It's the latter capability, and not PTB performance, that you'd need to predict if you wanted to reliably stay out of the x-risk regime — and the fact that we can't currently do so is, I imagine, what brought to mind the analogy between scaling and Russian roulette.

I.e., a straight line in domain X is indeed not surprising; what's surprising is the way in which that straight line maps to the things we care about more than X.

(Usual caveats apply here that I may be misinterpreting folks, but that is my best read of the argument.)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Good catch! I didn't check the form. Yes, you're right: the spoiler should say (1=Paul, 9=Eliezer), but the conclusion is the right way round.

Yudkowsky and Christiano discuss "Takeoff Speeds"

(Not being too specific to avoid spoilers) Quick note: I think the direction of the shift in your conclusion might be backwards, given the statistics you've posted and that 1=Eliezer and 9=Paul.

AI Tracker: monitoring current and near-future risks from superscale models

Thanks for the kind words and thoughtful comments.

You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other.

I think we will probably include some measure of efficiency, as you've suggested, though I'm not sure exactly what it will be: efficiency measures tend to be benchmark-dependent, which makes apples-to-apples comparisons hard for a variety of reasons (e.g., differences in modalities, differences in how papers record their results, and the fact that benchmarks tend to get smashed pretty quickly these days, so newer models end up being compared on a different basis than older ones were). Did you have any specific thoughts about this? To be honest, this is an area we're still figuring out.

On the ROI side: while this is definitely the most important metric, it's also the one with by far the widest error bars. The reason is that it's impossible to predict all the creative ways people will use these models for economic ends — even GPT-3 by itself might spawn entire industries that don't yet exist. So the best one could hope for here is something like a lower bound with the accuracy of a startup's TAM estimate: more art than science, and very liable to be proven massively wrong in either direction. (Disclosure: I'm a modestly prolific angel investor, and I've spoken to — though not invested in — several companies being built on GPT-3's API.)

There's another reason we're reluctant to publish ROI estimates: at the margin, these estimates themselves bolster the case for increased investment in scaling, which is concerning from a risk perspective. This probably wouldn't be a huge effect in absolute terms, since it's not really the sort of thing effective allocators weigh heavily as decision inputs, but there are scenarios where it matters and we'd rather not push our luck.

Thanks again!

A positive case for how we might succeed at prosaic AI alignment

Gotcha. Well, that seems right—certainly in the limit case.

A positive case for how we might succeed at prosaic AI alignment

Thanks, that helps. So actually this objection says: "No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you've built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours."

Is that an accurate interpretation?

A positive case for how we might succeed at prosaic AI alignment

This is a great thread. Let me see if I can restate the arguments here in different language:

  1. Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob's brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, "You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it." Bob's a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
  2. My understanding of Evan's argument at this point would be: "Okay; so we don't have the technology to directly simulate Bob's brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob's brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn't itself introduce novel long-term dependencies that leave room for deception."
  3. Note that myopia — i.e., the purging of long-term dependencies from the training feedback signal — isn't really conceptually central to the above scheme. Rather, it's just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob's brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it without otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
  4. At this point my understanding of Eliezer's counterargument would be: "Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that." And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the "belling the cat" bit, I believe.)
  5. And at this point, maybe (?) Evan says, "But wait; the Bob-copy isn't actually a consequentialist because it was trained myopically." And if that's what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.
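To make the myopia point in step 3 concrete, here's a toy sketch (everything here is hypothetical: "bob" is a stand-in trusted policy, and the dynamics are invented) contrasting a myopic per-step imitation loss with a non-myopic, trajectory-level loss. The myopic loss scores the imitator only on matching Bob's action at each individual step, so the training signal itself introduces no new long-horizon dependencies; the non-myopic loss rewards where the state ends up many steps later, which is exactly the kind of feedback the myopia hack is meant to purge.

```python
def bob(state):
    """Stand-in for the trusted reference policy (hypothetical)."""
    return -state  # toy dynamics: Bob always pushes the state toward 0

def myopic_imitation_loss(imitator, states):
    """Per-step loss: each term depends only on matching Bob *now*.
    No term couples the imitator's output to future outcomes."""
    return sum((imitator(s) - bob(s)) ** 2 for s in states) / len(states)

def nonmyopic_loss(imitator, state, horizon=10):
    """Trajectory-level loss: penalizes the state reached after `horizon`
    steps of acting, i.e., feedback with long-horizon dependencies."""
    for _ in range(horizon):
        state = state + imitator(state)
    return state ** 2  # scores the final outcome, not per-step fidelity
```

Note that even a perfect imitator trained only on the myopic loss still *acts* like Bob, a consequentialist, at deployment time; the myopia lives in the training signal, not in the copied policy's behavior.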

Is this roughly right? Or have I missed something?
