Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, now works at Center on Long-Term Risk. Research interests include acausal trade, timelines, takeoff speeds & scenarios, decision theory, history, and a bunch of other stuff. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Sequences

AI Timelines
Takeoff and Takeover in the Past and Future

Comments

[AN #156]: The scaling hypothesis: a plan for building AGI

OK cool, sorry for the confusion. Yeah, I think ESRogs' interpretation had you making a bit stronger claim than you actually were.

[AN #156]: The scaling hypothesis: a plan for building AGI

Huh, it seems I misunderstood you then! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made; in fact, that argument seems to support the opposite. The argument was:

There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there's a corresponding model on which the resulting GPT-N would "try" to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by claim 3 it doesn't matter much which pretraining objective you use, so most of these models would be wrong.

Seems to me the conclusion of this argument is that "In general it's not true that the AI is trying to achieve its training objective." The natural corollary is: We have no idea what the AI is trying to achieve, if it is trying to achieve anything at all. So instead of concluding "It'll probably just keep filling in missing words as in training" we should conclude "we have no idea what it'll do; treacherous turn is a real possibility because that's what'll happen for most goals it could have, and it may have a goal for all we know."
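(For concreteness, here's a rough sketch of the two pretraining objectives mentioned in the quoted argument, in PyTorch-style Python. The `lm` module and `mask_id` token are placeholders I'm assuming for illustration; the point is just that both objectives are ordinary cross-entropy losses over different (input, target) pairings, so "the thing in the pretraining objective" differs between them even though, per claim 3, the resulting models behave similarly on distribution.)

```python
import torch
import torch.nn.functional as F

def next_token_loss(lm, tokens):
    """GPT-style causal objective: predict token t+1 from tokens up to t."""
    logits = lm(tokens[:, :-1])                      # (batch, seq-1, vocab)
    targets = tokens[:, 1:]                          # inputs shifted by one position
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def masked_lm_loss(lm, tokens, mask_id, mask_prob=0.15):
    """BERT-style objective: corrupt random positions, predict the originals."""
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = lm(corrupted)                           # (batch, seq, vocab)
    return F.cross_entropy(logits[mask], tokens[mask])
```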

Open Problems with Myopia

Also: I think making sure our agents are DDT is probably going to be approximately as difficult as making them aligned. Relatedly, your way of handling anthropic uncertainty is:

never reason about anthropic uncertainty. DDT agents always think they know who they are.

"Always think they know who they are" doesn't cut it; you can think you know you're in a simulation. I think a more accurate version would be something like "Always think that you are on an original planet, i.e. one in which life appeared 'naturally,' rather than a planet in the midst of some larger interstellar civilization, or a simulation of a planet, or whatever. Basically, you need to believe that you were created by humans but that no intelligence played a role in the creation and/or arrangement of the humans who created you. Or... no role other than the "normal" one in which parents create offspring, governments create institutions, etc. I think this is a fairly specific belief, and I don't think we have the ability to shape our AIs beliefs with that much precision, at least not yet.

rohinmshah's Shortform

OK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed.

That said, I guess I don't think this is that likely... probably AI will either be unable to do even three such posts, or it'll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.

rohinmshah's Shortform

Ah right, good point, I forgot about cherry-picking. I guess we could make it something like "And the blog post wasn't cherry-picked; the same system could be asked to make 2 additional posts on rationality and you'd like both of them too." I'm not sure what credence I'd give to this, but it would probably be a lot higher than 10%.

Website prediction: Nice, I think that's like 50% likely by 2030.

Major research area: What counts as a major research area? Suppose I go calculate that AlphaFold 2 has already sped up the field of protein structure prediction by 100x (no need to do actual experiments anymore!), would that count? If you hadn't heard of AlphaFold yet, would you say it counted? Perhaps you could give examples of the smallest and easiest-to-automate research areas that you think have only a 10% chance of being automated by 2030.

20,000 LW karma: Holy shit, that's a lot of karma for one year. I feel like it's possible that would happen before it's too late (narrow AI good at writing but not good at talking to people and/or not agenty), but unlikely. Insofar as I think it'll happen before 2030, it doesn't serve as a good forecast, because it'll be too late by that point IMO.

Productivity tool UIs obsolete thanks to assistants: This is a good one too. I think that's 50% likely by 2030.

I'm not super certain about any of these things, of course; these are just my wild guesses for now.

rohinmshah's Shortform

Nice! I really appreciate that you are thinking about this and making predictions. I want to do the same myself.

I think I'd put something more like 50% on "Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post." That's just a wild guess, very unstable.

Another potential prediction generation methodology: Name something that you think won't happen, but you think I think will.

BASALT: A Benchmark for Learning from Human Feedback

Going from zero to "produce an AI that learns the task entirely from demonstrations and/or a natural language description" is really hard for the modern AI research hive mind. You instead have to give it a shaped reward: easier breadcrumbs along the way (such as allowing handcrafted heuristics, or allowing knowledge of a particular target task) to get the hive mind started making progress.
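To illustrate what I mean by breadcrumbs, here's a toy sketch (the reward terms, field names, and weights are made up for illustration, not anything from BASALT itself): the sparse signal is what BASALT-style tasks deliberately withhold, while the shaped version hands the agent hand-coded heuristics for one specific target task.

```python
def sparse_reward(episode) -> float:
    # Pays off only if a human judge says the task (e.g. "build a house") was accomplished.
    return 1.0 if episode.human_judged_success else 0.0

def shaped_reward(state) -> float:
    # Dense, task-specific breadcrumbs: hand-coded credit for intermediate steps.
    reward = 0.0
    reward += 0.1 * state.logs_collected   # gathering building materials
    reward += 0.5 * state.walls_placed     # partial progress on the structure
    reward -= 0.01                         # small per-step cost to discourage dawdling
    return reward
```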

Experimentally evaluating whether honesty generalizes

Thanks!

I like your breakdown of A-E; let's use that going forward.

It sounds like your view is: For "dumb" AIs that aren't good at reasoning, it's more likely that they'll just do B "directly" rather than do E-->D-->C-->B, because the latter involves a lot of tricky reasoning which they are unable to do. But as we scale up our AIs and make them smarter, eventually the E-->D-->C-->B thing will be more likely than doing B "directly", because it works for approximately any long-term consequence (e.g. paperclips) and thus probably works for some extremely simple/easy-to-have goals, whereas doing B directly requires having B itself as a goal, which is arbitrary/complex/specific and thus unlikely.

(1) What I was getting at with the "Steps for Dummies" example is that maybe the kind of reasoning required is actually pretty basic/simple/easy and we are already in the regime where E-->D-->C-->B dominates doing B directly. One way it could be easy is if the training data spells it out nicely for the AI. I'd be interested to hear more about why you are confident that we aren't in this regime yet. Relatedly, what sorts of things would you expect to see AIs doing that would convince you that maybe we are in this regime?

(2) What about A? Doesn't the same argument for why E-->D-->C-->B dominates B eventually also work to show that it dominates A eventually?

Experimentally evaluating whether honesty generalizes

I'm not willing to bet yet; I feel pretty ignorant and confused about the issue. :) I'm trying to get more understanding of your model of how all this works. We've discussed:

A. "Do things you've read are instrumentally convergent"

B. "Tell the humans what they want to hear."

C. "Try to win at training."

D. "Build a model of the training process and use it to make predictions."

It sounds like you are saying A is the most complicated, followed by B and C, and then D is the least complicated. (And in this case the AI will know that winning at training means telling the humans what they want to hear. Though you also suggested the AI wouldn't necessarily understand the dynamics of the training process, so idk.)

To my fresh-on-this-problem eyes, all of these things seem equally likely to be the simplest. And I can tell a just-so story for why A would actually be the simplest; it'd be something like this: Suppose that somewhere in the training data there is a book titled "How to be a successful language model: A step-by-step guide for dummies." The AI has read this book many times, and understands it. In this case, perhaps rather than having mental machinery that thinks "I should try to win at training. How do I do that in this case? Let's see... given what I know of the situation... by telling the humans what they want to hear!" it would instead have mental machinery that thinks "I should follow the Steps for Dummies. How do I do that in this case? Let's see... given what I know of the situation... by telling the humans what they want to hear!" Because maybe "follow the Steps for Dummies" is a simpler, more natural concept for this dumb AI (given how prominent the book was in its training data) than "try to win at training." The just-so story would be that maybe something analogous to this actually happens, even though there isn't literally a Steps for Dummies book in the training data.
