Jacob Steinhardt


Yudkowsky and Christiano discuss "Takeoff Speeds"

My basic take is that there will be lots of empirical examples where increasing model size by a factor of 100 leads to nonlinear increases in capabilities (and perhaps to qualitative changes in behavior). On median, I'd guess we'll see at least 2 such examples in 2022 and at least 100 by 2030.

At the point where there's a "FOOM", such examples will be commonplace and happening all the time. Foom will look like one particularly large phase transition (maybe 99th percentile among examples so far) that chains into more and more. It seems possible (though not certain--maybe 33%?) that once you have the right phase transition to kick off the rest, everything else happens pretty quickly (within a few days).

Is this take more consistent with Paul's or Eliezer's? I'm not totally sure. I'd guess closer to Paul's, but maybe the "1 day" world is consistent with Eliezer's?

(One candidate for the "big" phase transition would be if the model figures out how to go off and learn on its own, so that number of SGD updates is no longer the primary bottleneck on model capabilities. But I could also imagine us getting that even when models are still fairly "dumb".)

Discussion with Eliezer Yudkowsky on AGI interventions

Thanks. For time/brevity, I'll just say which things I agree / disagree with:

> sufficiently capable and general AI is likely to have property X as a strong default [...] 

I generally agree with this, although for certain important values of X (such as "fooling humans for instrumental reasons") I'm probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I'm also probably more optimistic (but not certain) that those efforts will succeed.

[inside view, modest epistemology]: I don't have a strong take on either of these. My main take on inside views is that they are great for generating interesting and valuable hypotheses, but usually wrong on the particulars.

> less weight on reasoning like 'X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it's developed'

I agree; see my post "On the Risks of Emergent Behavior in Foundation Models". In the past I think I put too much weight on this type of reasoning, and I also think most people in ML put too much weight on it.

> MIRI thinks AGI is better thought of as 'a weird specific sort of AI', rather than as 'like existing AI but more so'.

Probably disagree but hard to tell. I think there will both be a lot of similarities and a lot of differences.

> AGI is mostly insight-bottlenecked (we don't know how to build it), rather than hardware-bottlenecked

Seems pretty wrong to me. We probably need both insight and hardware, but the insights themselves are hardware-bottlenecked: once you can easily try lots of stuff and see what happens, insights are much easier. See Crick on x-ray crystallography for historical support (ctrl+f for Crick).

> I'd want to look at more conceptual work too, where I'd guess MIRI is also more pessimistic than you

I'm more pessimistic than MIRI about HRAD, though that has selection effects. I've found conceptual work to be pretty helpful for pointing to where problems might exist, but usually relatively confused about how to address them or how specifically they're likely to manifest. (Which is to say, overall highly valuable, but consistent with my take above on inside views.)

[experiments are either predictable or uninformative]: Seems wrong to me. As a concrete example: Do larger models have better or worse OOD generalization? I'm not sure if you'd pick "predictable" or "uninformative", but my take is:
 * The outcome wasn't predictable: within ML there are many people who would have taken each side. (I personally was on the wrong side, i.e. predicting "worse".)
 * It's informative, for two reasons: (1) It shows that NNs "automatically" generalize more than I might have thought, and (2) Asymptotically, we expect the curve to eventually reverse, so when does that happen and how can we study it?

See also my take on Measuring and Forecasting Risks from AI, especially the section on far-off risks.

> Most ML experiments either aren't about interpretability and 'cracking open the hood', or they're not approaching the problem in a way that MIRI's excited by.

Would agree with "most", but I think you probably meant something like "almost all", which seems wrong. There are lots of people working on interpretability, and some of the work seems quite good to me (aside from Chris, I think Noah Goodman, Julius Adebayo, and some others are doing pretty good work).

Discussion with Eliezer Yudkowsky on AGI interventions

Not sure if this helps, and haven't read the thread carefully, but my sense is your framing might be eliding distinctions that are actually there, in a way that makes it harder to get to the bottom of your disagreement with Adam. Some predictions I'd have are that:

 * For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
 * For almost all experimental results you would think they were so much less informative as to not be worthwhile.
 * There's a small subset of experimental results that we would think are comparably informative, and also some that you would find much more informative than I would.

(I'd be willing to take bets on these or pick candidate experiments to clarify this.)

In addition, a consequence of these beliefs is that you think we should be spending way more time sitting around thinking about stuff, and way less time doing experiments, than I do.

I would agree with you that "MIRI hates all experimental work" / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.

Experimentally evaluating whether honesty generalizes

Actually, another issue is that unsupervised translation isn't "that hard" relative to supervised translation--I think that you can get pretty far with simple heuristics, such that I'd guess making the model 10x bigger matters more than making the objective more aligned with getting the answer right (and that this will be true for at least a couple more 10x-ings of model size, although at some point the objective will matter more).

This might not matter as much if you're actually outputting explanations and not just translating from one language to another. Although it is probably true that for tasks that are far away from the ceiling, "naive objective + 10x larger model" will outperform "correct objective".

Experimentally evaluating whether honesty generalizes

Thanks Paul, I generally like this idea.

Aside from the potential concerns you bring up, here is the most likely way I could see this experiment failing to be informative: rather than having checks and question marks in your tables above, really the model's ability to solve each task is a question of degree--each table entry will be a real number between 0 and 1. For, say, tone, GPT-3 probably doesn't have a perfect model of tone, and would get <100% performance on a sentiment classification task, especially if done few-shot.

The issue, then, is that the "fine-tuning for correctness" and "fine-tuning for coherence" processes are not really equivalent--fine-tuning for correctness is in fact giving GPT-3 additional information about tone, which improves its capabilities. In addition, GPT-3 might not "know" exactly what humans mean by the word tone, and so fine-tuning for correctness also helps GPT-3 to better understand the question.

Given these considerations, my modal expectation is that fine-tuning for correctness will provide moderately better results than just doing coherence, but it won't be clear how to interpret the difference--maybe in both cases GPT-3 provides incoherent outputs 10% of the time, and then additionally coherent but wrong outputs 10% of the time when fine-tuned for correctness, but 17% of the time when fine-tuned only for coherence. What would you conclude from a result like that? I would still have found the experiment interesting, but I'm not sure I would be able to draw a firm conclusion.
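To make the hypothetical numbers above concrete, here is a minimal sketch of the arithmetic. It uses only the made-up rates from the previous paragraph (10% incoherent in both cases, 10% versus 17% coherent-but-wrong); the function name `outcome_breakdown` and the dictionary keys are illustrative assumptions, not part of the proposed experiment.

```python
from fractions import Fraction


def outcome_breakdown(incoherent, coherent_wrong):
    """Split outputs into a correct rate and a wrong-given-coherent rate.

    Both arguments are fractions of all outputs (hypothetical values).
    """
    coherent = 1 - incoherent
    correct = coherent - coherent_wrong
    return {
        "correct": correct,
        # Error rate among the outputs that were at least coherent.
        "wrong_given_coherent": coherent_wrong / coherent,
    }


# Hypothetical rates taken from the paragraph above, not real results.
correctness = outcome_breakdown(Fraction(10, 100), Fraction(10, 100))
coherence_only = outcome_breakdown(Fraction(10, 100), Fraction(17, 100))

gap = correctness["correct"] - coherence_only["correct"]

print(float(correctness["correct"]))     # 0.8
print(float(coherence_only["correct"]))  # 0.73
print(float(gap))                        # 0.07
```

Under these assumed rates the two procedures differ by 7 percentage points (80% versus 73% correct), and the error rate among coherent outputs is 1/9 versus 17/90. The arithmetic alone does not say whether that gap comes from the model already "knowing" the answer or from the extra information that fine-tuning for correctness supplies, which is the interpretive problem described above.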

So perhaps my main feedback would be to think about how likely you think such an outcome is, how much you mind that, and if there are alternative tasks that avoid this issue without being significantly more complicated.

AI x-risk reduction: why I chose academia over industry

This doesn't seem so relevant to capybaralet's case, given that he was choosing whether to accept an academic offer that was already extended to him.