All of Daniel Kokotajlo's Comments + Replies

The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument

I disagree that (B) is not decision-relevant and that (C) is. I'm not sure; I haven't thought through all this yet, but that's my initial reaction at least.

AI takeoff story: a continuation of progress by other means

Thanks, this was a load of helpful clarification and justification!

APS-AI means Advanced, Planning, Strategically-Aware AI. Advanced means superhuman at some set of tasks (such as persuasion, strategy, etc.) that combine to enable the acquisition of power and resources, at least in today's world. The term & concept are due to Joe Carlsmith (see his draft report on power-seeking AI; he blogged about it a while ago).

Edouard Harris (17d, +2): No problem, glad it was helpful! And thanks for the APS-AI definition, I wasn't aware of the term.
AI takeoff story: a continuation of progress by other means

Awesome! This is exactly the sort of thing I was hoping to inspire with this and this. In what follows, I'll list a bunch of thoughts & critiques:

1. It would be great if you could pepper your story with dates, so that we can construct a timeline and judge for ourselves whether we think things are happening too quickly or not.

2. Auto-generated articles and auto-generated videos being so popular that they crowd out most human content creators... this happens at the beginning of the story? I think already this is somewhat implausible and also very interes... (read more)

Hey Daniel — thanks so much for taking the time to write this thoughtful feedback. I really appreciate you doing this, and very much enjoyed your "2026" post as well. I apologize for the delay and lengthy comment here, but wanted to make sure I addressed all your great points.

1. It would be great if you could pepper your story with dates, so that we can construct a timeline and judge for ourselves whether we think things are happening too quickly or not.

I've intentionally avoided referring to absolute dates, other than by indirect implication (e.g. "iOS 19... (read more)

Raven (19d, +2): I interpreted the Medallion stuff as a hint that AGI was already loose and sucking up resources (money) to buy more compute for itself. But I'm not sure that actually makes sense, now that I think about it.
Paths To High-Level Machine Intelligence

I think this sequence of posts is underrated/underappreciated. I think this is because (A) it's super long and dense and (B) mostly a summary/distillation/textbook thingy rather than advancing new claims or arguing for something controversial. As a result of (A) and (B), perhaps it struggles to hold people's attention all the way through.

But that's a shame because this sort of project seems pretty useful to me. It seems like the community should be throwing money and interns at you, so that you can build a slick interactive website that displays the full gr... (read more)

Aryeh Englander (20d, +3): Thanks Daniel for that strong vote of confidence! The full graph is in fact expandable / collapsible, and it does have the ability to display the relevant paragraphs when you hover over a node (although the descriptions are not all filled in yet). It also allows people to enter in their own numbers and spit out updated calculations, exactly as you described. We actually built a nice dashboard for that - we haven't shown it yet in this sequence because this sequence is mostly focused on phase 1 and that's for phase 2. Analytica does have a web version, but it's a bit clunky and buggy so we haven't used it so far. However, I was just informed that they are coming out with a major update soon that will include a significantly better web version, so hopefully we can do all this then. I certainly don't think we'd say no to additional funding or interns! We could certainly use them - there are quite a few areas that we have not looked into sufficiently because all of our team members were focused on other parts of the model. And we haven't gotten yet to much of the quantitative part (phase 2 as you called it), or the formal elicitation part.
[AN #164]: How well can language models write code?
I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

Haha, good point -- yes. I guess what I should say is: Since humans would have performed just as poorly on this experiment, it doesn't count as evidence that e.g. "current methods are fundamentally limited" or "artificial neural nets can't truly understand concepts in the ways humans can" or "what goes on inside ANNs is fundamentally a different kind of cognition from what goes on inside biological neural nets" or whatnot.

Rohin Shah (1mo, +4): Oh yeah, I definitely agree that this is not strong evidence for typical skeptic positions (and I'd guess the authors would agree).
[AN #164]: How well can language models write code?

Thanks again for these newsletters and summaries! I'm excited about the flagship paper.


First comment: I don't think their experiment about code execution is much evidence re "true understanding."

Recall that GPT-3 has 96 layers, and the biggest model used in this paper was smaller than GPT-3. Each pass through the network is therefore loosely equivalent to less than one second of subjective time, by comparison to the human brain, which typically goes through something like 100 serial operations per second, I think? Could be a lot more; I'm not sure. https://a... (read more)
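To spell out the arithmetic, here is a minimal sketch; the 96-layer depth and the ~100 serial brain operations per second are the rough assumptions above, not measured values:

```python
# Back-of-envelope comparison of serial depth to subjective human time.
# Assumptions (rough, from the comment above): GPT-3-style depth of 96 layers,
# and a human brain performing on the order of 100 serial operations per second.

model_layers = 96                  # serial steps in one forward pass
brain_serial_ops_per_sec = 100     # very rough order-of-magnitude guess

subjective_seconds_per_pass = model_layers / brain_serial_ops_per_sec
print(f"~{subjective_seconds_per_pass:.1f} s of subjective time per forward pass")
# -> roughly one second: far less "thinking time" than a human would get
#    when asked to predict what a piece of code outputs.
```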

Rohin Shah (1mo, +4): I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code. (Idk if you were trying to argue something else with the comparison, but I don't think it's clear that this is a reasonable comparison; there are tons of objections you could bring up. For example, humans have to work from pixels whereas the language model gets tokens, making its job much easier.) I didn't check the numbers, but that seems pretty reasonable. I think there's a question of whether it actually saves time in the current format -- it might be faster to simply write the program than to write down a clear natural language description of what you want along with test cases.
Forecasting Thread: AI Timelines

It's been a year, what do my timelines look like now?

My median has shifted to the left a bit; it's now 2030. However, I think I have somewhat less probability in the 2020-2025 range, because I've become more aware of the difficulties in scaling up compute. You can't just spend more money: you have to do lots of software engineering, and for 4+ OOMs you literally need to build more chip fabs to produce more chips. (Also because 2020 has passed without TAI/AGI/etc., so obviously I won't put as much mass there...)

So if I were to draw a distribution it would look pretty similar, just with a slightly more extreme spike, and the tip of the spike might be a bit to the right.

Thoughts on gradient hacking

Even in the simple case no. 1, I don't yet quite see why Evan isn't right.

It's true that deterministically failing will create a sort of wall in the landscape that the ball will bounce off of and then roll right back into as you said. However, wouldn't it also perhaps roll in other directions, such as perpendicular to the wall? Instead of getting stuck bouncing into the wall forever, the ball would bounce against the wall while also rolling in some other direction along it. (Maybe the analogy to balls and walls is leading me astray here?)

Richard Ngo (1mo, +3): I discuss the possibility of it going in some other direction when I say "The two most salient options to me". But the bit of Evan's post that this contradicts is:
MIRI/OP exchange about decision theory
My own answer would be the EDT answer: how much does your decision correlate with theirs? Modulated by ad-hoc updatelessness: how much does that correlation change if we forget "some" relevant information? (It usually increases a lot.)

I found this part particularly interesting and would love to see a fleshed-out example of this reasoning so I can understand it better.

The Codex Skeptic FAQ

That's reasonable. OTOH, if Codex is as useful as some people say it is, it won't just be 10% of active users buying subscriptions; subscriptions might cost more than $15/mo, and/or people who aren't active on GitHub might also buy them.

Adam Shimi (2mo, +4): Agreed. Part of the difficulty here is that you want to find who will buy a subscription and keep it. I expect a lot of people to try it, and most of them to drop it (either because they don't like it or because it doesn't help them enough for their taste), but I have no idea how to Fermi-estimate that number.
The Codex Skeptic FAQ

For context, GitHub has 60,000,000 users. If 10% of them buy a $15/mo subscription, that's about a billion dollars in annual revenue. A billion dollars is about a thousand times more than the cost to create Codex. (The cost to train the model was negligible, since it's only the 12B-parameter version of GPT-3 fine-tuned. The main cost would be the salaries of the engineers involved, I imagine.)
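As a minimal sketch of that Fermi estimate (the subscriber fraction and price are the illustrative assumptions above, not OpenAI or GitHub figures):

```python
# Rough Codex revenue Fermi estimate, using the figures assumed above.

github_users = 60_000_000
subscriber_fraction = 0.10      # assumed share who would pay; Adam's reply below
                                # suggests active users may be ~10x fewer
price_per_month_usd = 15

annual_revenue = github_users * subscriber_fraction * price_per_month_usd * 12
print(f"~${annual_revenue:,.0f} per year")   # ~ $1.08 billion
```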

Adam Shimi (2mo, +3): Maybe I'm wrong, but my first reaction to your initial number is that users doesn't mean active users. I would expect a difference of an order of magnitude, which keeps your conclusion but just with a hundred times more instead of a thousand times more.
The Codex Skeptic FAQ

I'm extremely keen to hear from people who have used Codex a decent amount (or tried to) and decided it isn't worth it. Specifically, people who wouldn't pay $15/mo for a subscription to it. Anyone?

Daniel Kokotajlo (2mo, +3): For context, GitHub has 60,000,000 users. If 10% of them buy a $15/mo subscription, that's about a billion dollars in annual revenue. A billion dollars is about a thousand times more than the cost to create Codex. (The cost to train the model was negligible, since it's only the 12B-parameter version of GPT-3 fine-tuned. The main cost would be the salaries of the engineers involved, I imagine.)
Analogies and General Priors on Intelligence

Thanks for doing this! I feel like this project is going to turn into a sort of wiki-like thing, very useful for people trying to learn more about AI risk and situate themselves within it. I think AI Impacts had ambitions to do something like this at one point.

Approaches to gradient hacking

I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!

Adam Shimi (2mo, +1): Sure. I started studying Bostrom's paper today; I'll send you a message for a call when I've read and thought enough to have something interesting to share and debate.
Approaches to gradient hacking

Simpler: Ask the LW team to make various posts (this one, any others that seem iffy) non-scrapable, or not-easily-scrapable. I think there's a button they can press for that.

Approaches to gradient hacking

Great, now GPT-4 and beyond have an instruction manual in their training data! :D

...haha kidding not kidding...

I guess the risk is pretty minimal, since probably any given AI will be either dumb enough to not understand this post or smart enough to not need to read it.

3Adam Shimi2moA "simple" solution I just thought about:just convince people training AGIs to not scrap the AF or Alignment literature.
Adam Shimi (2mo, +3): I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;) My initial take is that this post is fine because every scheme proposed is really hard and I'm pointing out the difficulty. Two clear risks though:
  • An AGI using that thinking to make these approaches work
  • The AGI not making mistakes in gradient hacking because it knows to not use these strategies (which assumes a better one exists)
(Note that I also don't expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)
Automating Auditing: An ambitious concrete technical research proposal

I regret saying anything! But since I'm in this hole now, might as well try to dig my way out:

IDK, "any methodology that wasn't reference-class forecasting" wasn't how I interpreted the original texts, but *shrugs.* But at any rate what you are doing here seems importantly different than the experiments and stuff in the original texts; it's not like I can point to those experiments with the planning fallacy or the biased pundits and be like "See! Evan's inside-view reason is less trustworthy than the more general thoughts he lists below; Evan should downwe... (read more)

Persuasion Tools: AI takeover without AGI or agency?

Thanks! The post was successful then. Your point about stickiness is a good one; perhaps I was wrong to emphasize the change in number of ideologies.

The "AI takeover without AGI or agency" bit was a mistake in retrospect. I don't remember why I wrote it, but I think it was a reference to this post which argues that what we really care about is AI-PONR, and AI takeover is just a prominent special case. It also might have been due to the fact that a world in which an ideology uses AI tools to cement itself and take over the world, can be thought of as a case... (read more)

Automating Auditing: An ambitious concrete technical research proposal

Nitpick: Your use of the term "inside view" here is consistent with how people have often come to use the term in our community -- but it has little to do with the original meaning of the term, as set out in Kahneman, Tetlock, Mellers, etc. I suggest you just use the term "real reason." I think people will know what you mean; in fact I think it's less ambiguous / prone to misinterpretation. Idk. It interests me that this use of "inside view" slipped past my net earlier, and didn't make it into my taxonomy.

Obviously your words are your own, you are free to ... (read more)

Evan Hubinger (2mo, +4): Is it inconsistent with the original meaning of the term? I thought that the original meaning of inside view was just any methodology that wasn't reference-class forecasting—and I don't use the term “outside view” at all. Also, I'm not saying that “inside view” means “real reason,” but that my real reason in this case is my inside view.
Automating Auditing: An ambitious concrete technical research proposal

Cool stuff! Possibly naive question: Aren't there going to be a lot of false positives? Judge says: "Complete text the way an average human would." Attacker does something clever having to do with not knowing what transformers are. Auditor looks at GPT-3 and goes "Aha! It doesn't understand that a paperclip is bigger than a staple! Gotcha!" ... or any of millions of things like that: false positives, ways in which the model doesn't quite meet the specification but for reasons independent of the attacker's action.

Evan Hubinger (2mo, +8): Yeah, that's a great question—I should have talked more about this. I think there are three ways to handle this sort of problem—and ideally we should do some combination of all three:
  1. Putting the onus on the attacker. Probably the simplest way to handle this problem is just to have the attacker produce larger specification breaks than anything that exists in the model independently of the attacker. If it's the case that whenever the attacker does something subtle like you're describing the auditor just finds some other random problem, then the attacker should interpret that as a sign that they can get away with less subtle attacks and try larger modifications instead.
  2. Putting the onus on the judge. Another way to address this sort of problem is just to have a better, more rigorous specification. For example, the specification could include a baseline model that the auditor is supposed to compare to but which they can only sample from, not view the internals of (that way modifying an individual neuron in the baseline model isn't trivial for the auditor to detect).
  3. Putting the onus on the auditor. The last way to handle this sort of problem is just to force the auditor to do better. There is a right answer to what behavior exists in the model that's not natural for language models to usually learn when trained on general webtext, and a valid way to handle this problem is to just keep iterating on the auditor until they figure out how to find that right answer. For example: if all your attacks are fine-tuning attacks on some dataset, there is a correct answer about what sort of dataset was fine-tuned on, and you can just force the auditor to try to find that correct answer.
Extrapolating GPT-N performance

Now that Codex/Copilot is out, wanna do the same analysis using that new data? We have another nice set of lines on a graph to extrapolate:

rohinmshah's Shortform

Perhaps I shouldn't have mentioned any of this. I also don't think you are doing anything epistemically unvirtuous. I think we are just bouncing off each other for some reason, despite seemingly being in broad agreement about things. I regret wasting your time.

That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments.
I don't really see "number of facts" as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs
... (read more)
4Rohin Shah2mo"Truthful counterarguments" is probably not the best phrase; I meant something more like "epistemically virtuous counterarguments". Like, responding to "what if there are long-term harms from COVID vaccines" with "that's possible but not very likely, and it is much worse to get COVID, so getting the vaccine is overall safer" rather than "there is no evidence of long-term harms".
rohinmshah's Shortform

I was being confusing, sorry -- what I meant was: technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is... loaded? Designed to make them seem implausible? Idk, something like that, in a way that made me wonder if you had a different story in mind. Going through them one by one:

  • People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
    • This is already happening in 2021 and previous, in my story it
... (read more)
Rohin Shah (2mo, +4): I don't think it's designed to make them seem implausible? Maybe the first one? Idk, I could say that your story is designed to make them seem plausible (e.g. by not explicitly mentioning them as assumptions). I think it's fair to say it's "loaded", in the sense that I am trying to push towards questioning those assumptions, but I don't think I'm doing anything epistemically unvirtuous. This does not seem obvious to me (but I also don't pay much attention to this sort of stuff so I could be missing evidence that makes it very obvious). That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments. I don't really see "number of facts" as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now. (I just tried to find the best argument that GMOs aren't going to cause long-term harms, and found nothing. We do at least have several arguments that COVID vaccines won't cause long-term harms. I armchair-conclude that a thing has to get to the scale of COVID vaccine hesitancy before people bother trying to address the arguments from the other side.)
rohinmshah's Shortform
I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be "raise awareness", it should be "figure out whether the assumptions are justified".

That's all I'm trying to do at this point, to be clear. Perhaps "raise awareness" was the wrong choice of phrase.

Re: the object-level points: For how I see this going, see my vignette, and my reply to steve. The bullet points you put here make it seem like you have a different story in m... (read more)

Rohin Shah (2mo, +4): Excellent :) (Link is broken, but I found the comment.) After reading that reply I still feel like it involves the assumptions I mentioned above. Maybe your point is that your story involves "silos" of Internet-space within which particular ideologies / propaganda reign supreme. I don't really see that as changing my object-level points very much but perhaps I'm missing something.
rohinmshah's Shortform
I should note that there's a big difference between "recommender systems cause polarization as a side effect of optimizing for engagement" and "we might design tools that explicitly aim at persuasion / propaganda". I'm confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be.

Oh, then maybe we don't actually disagree that much! I am not at all confident that optimizing for engagement has the side effect of increasing polarization. It seems plausible but it's also ... (read more)

Rohin Shah (2mo, +4): I think that story involves lots of assumptions I don't immediately believe (but don't disbelieve either):
  • People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
  • Such people will quickly realize that AI will be very useful for this
  • They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
  • The resulting AI system will in fact be very good at persuasion / propaganda
  • AI that fights persuasion / propaganda either won't be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can't keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won't be true with AI)
And probably there are a bunch of other assumptions I haven't even thought to question. I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be "raise awareness", it should be "figure out whether the assumptions are justified".
Seeking Power is Convergently Instrumental in a Broad Class of Environments

Nice!

In a couple places you say things that seem like they assume the probability distribution over utility functions is uniform. Is this right, or am I misinterpreting you? See e.g.

Choose any atom in the universe. Uniformly randomly select another atom in the universe. It's about 10^117 times more likely that these atoms are the same, than that a utility function incentivizes "dying" instead of flipping pixel 2 at t=1.
...
Optimal policies for u-AOH will tend to look like random twitching. For example, if you generate a u-AOH by uniformly r
... (read more)
Alex Turner (2mo, +2): The results are not centrally about the uniform distribution [https://www.lesswrong.com/posts/b6jJddSvWMdZHJHh3/environmental-structure-can-cause-instrumental-convergence]. The uniform distribution result is more of a consequence of the (central) orbit result / scaling law for instrumental convergence. I gesture at the uniform distribution to highlight the extreme strength of the statistical incentives.
Rohin Shah (2mo, +6): Typically the theorems say things like "for every reward function there exists N other reward functions that <seek power, avoid dying, etc>"; the theorems aren't themselves talking about probability distributions. The first example you give does only make sense given a uniform distribution, I believe. In the second example, when Alex uses the phrase "tends to", it means something like "there are N times as many reward functions that <do the thing> as reward functions that <don't do the thing>". But yes, if you want to interpret this as a probabilistic statement, you would use a uniform distribution.

EDIT: This next paragraph is wrong; I'm leaving it in for the record. I still think it is likely that similar results would hold in practice though the case is not as strong as I thought (and the effect is probably also not as strong as I thought).

They would change quantitatively, but the upshot would probably be similar. For example, for the Kolmogorov prior, you could prove theorems like "for every reward function that <doesn't do the thing>, there are N reward functions that <do the thing> that each have at most a small constant more complexity" (since you can construct them by taking the original reward function and then apply the relevant permutation / move through the orbit, and that second step has constant K-complexity). Alex sketches out a similar argument in this post [https://www.lesswrong.com/posts/b6jJddSvWMdZHJHh3/environmental-structure-can-cause-instrumental-convergence].
rohinmshah's Shortform

Thanks for this Rohin. I've been trying to raise awareness about the potential dangers of persuasion/propaganda tools, but you are totally right that I haven't actually done anything close to a rigorous analysis. I agree with what you say here that a lot of the typical claims being thrown around seem based more on armchair reasoning than hard data. I'd love to see someone really lay out the arguments and analyze them... My current take is that (some of) the armchair theories seem pretty plausible to me, such that I'd believe them unless the data contradicts them. But I'm extremely uncertain about this.

Rohin Shah (2mo, +4): I should note that there's a big difference between "recommender systems cause polarization as a side effect of optimizing for engagement" and "we might design tools that explicitly aim at persuasion / propaganda". I'm confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be. Usually, for any sufficiently complicated question (which automatically includes questions about the impact of technologies used by billions of people, since people are so diverse), I think an armchair theory is only slightly better [https://www.lesswrong.com/posts/TmHRACaxXrLbXb5tS/rohinmshah-s-shortform?commentId=xiwm2vXSpK8Kovgp8] than a monkey throwing darts, so I'm more in the position of "yup, sounds plausible, but that doesn't constrain my beliefs about what the data will show and medium quality data will trump the theory no matter how it comes out".
Research agenda update

Thanks for the update! This helps me figure out and keep up with what you are doing.

Re the 1st-person problem: I vaguely recall seeing some paper or experiment where they used interpretability tools to identify some neurons corresponding to a learned model's concept of something, and then they... did some surgery to replace it with a different concept? And it mostly worked? I don't remember. Anyhow, it seems relevant. Maybe it's not too hard to identify the self-concept.

gwern (2mo, +3): https://arxiv.org/abs/2104.08696 ?
What 2026 looks like (Daniel's Median Future)

Strong-upvoted because this was exactly the sort of thing I was hoping to inspire with this post! Also because I found many of your suggestions helpful.

I think model size (and therefore model ability) probably won't be scaled up as fast as you predict, but maybe. I think getting models to understand video will be easier than you say it is. I also think that in the short term all this AI stuff will probably create more programming jobs than it destroys. Again, I'm not confident in any of this.

What 2026 looks like (Daniel's Median Future)

I suggest putting a sentence in about the point of the post / the methodology, e.g.: "This is part I of an attempt to write a detailed plausible future trajectory in chronological order, i.e. incrementally adding years to the story rather than beginning with the end in mind. The hope is to produce a nice complement to the more abstract discussions about timelines and takeoff that usually occur." If space is a concern then I'd prefer having this rather than the two sentences you wrote, since it doesn't seem as important to mention that it's my median or that it's qualitative.

What 2026 looks like (Daniel's Median Future)

I edited to add some stuff about GWP and training compute for the most expensive model.

I agree that this focuses on qualitative stuff, but that's only due to a lack of good ideas for quantitative metrics worth tracking. I agree GWP and training compute are worth tracking; thank you for reminding me. I've edited to be more explicit.

Rohin Shah (2mo, +6): I am not entirely sure why I didn't think of the number of parameters as a high-level metric. Idk, maybe because it was weaved into the prose I didn't notice it? My bad. (To be clear, this wasn't meant to be a critique, just a statement of what kind of forecast it was. I think it's great to have forecasts of this form too.) New planned summary:
What 2026 looks like (Daniel's Median Future)

Thanks--damn, I intended for it to be more quantitative, maybe I should go edit it.

In particular, I should clarify that nothing interesting is happening with world GDP in this story, and also when I say things like "the models are trillions of parameters now" I mean that to imply things about the training compute for the most expensive model... I'll go edit.

Are there any other quantitative metrics you'd like me to track? I'd be more than happy to go add them in!

Daniel Kokotajlo (2mo, +4): I edited to add some stuff about GWP and training compute for the most expensive model. I agree that this focuses on qualitative stuff, but that's only due to a lack of good ideas for quantitative metrics worth tracking. I agree GWP and training compute are worth tracking; thank you for reminding me. I've edited to be more explicit.
What 2026 looks like (Daniel's Median Future)

Thanks for the critique!

Propaganda usually isn't false, at least not false in a nonpartisan-verifiable way. It's more about what facts you choose to emphasize and how you present them. So yeah, each ideology/faction will be training "anti-propaganda AIs" that will filter out the propaganda and the "propaganda" produced by other ideologies/factions.

In my vignette so far, nothing interesting has happened to GDP growth yet.

I think stereotypes can develop quickly. I'm not saying it's super widespread and culturally significant, just that it blunts the hype a ... (read more)

What 2026 looks like (Daniel's Median Future)

Acknowledgments: There are a LOT of people to credit here: Everyone who came to Vignettes Workshop, the people at AI Impacts, the people at Center on Long-Term Risk, a few random other people who I talked to about these ideas, a few random other people who read my gdoc draft at various stages of completion... I'll mention Jonathan Uesato, Rick Korzekwa, Nix Goldowsky-Dill, Carl Shulman, and Carlos Ramirez in particular, but there are probably other people who influenced my thinking even more who I'm forgetting. I'm sorry.

Footnotes:

  1. The first half was writte
... (read more)
[AN #159]: Building agents that know how to experiment, by training on procedurally generated games

Thanks! I'm not sure I understand your argument, but I think that's my fault rather than yours, since tbh I don't fully understand the Mingard et al paper itself, only its conclusion.

[AN #159]: Building agents that know how to experiment, by training on procedurally generated games

I wonder if grokking is evidence for, or against, the Mingard et al view that SGD on big neural nets is basically a faster approximation of rejection sampling. Here's an argument that it's evidence against:

--Either the "grokked algorithm circuit" is simpler, or not simpler, than the "memorization circuit."

--If it's simpler, then rejection sampling would reach the grokked algorithm circuit prior to reaching the memorization circuit, which is not what we see.

--If it's not simpler, then rejection sampling would briefly stumble across the grokked algorithm cir... (read more)

Rohin Shah (2mo, +4): I feel like everyone is taking the SGD = rejection sampling view way too seriously. From the Mingard et al paper: The first order effect is what lets you conclude that when you ask GPT-3 a novel question like "how many bonks are in a quoit", that it has never been trained on, you can expect that it won't just start stringing characters together in a random way, but will probably respond with English words. The second order effects could be what tells you whether or not it is going to respond with "there are three bonks in a quoit" or "that's a nonsense question". (Or maybe not! Maybe random sampling has a specific strong posterior there, and SGD does too! But it seems hard to know one way or the other.) Most alignment-relevant properties seem like they are in this class. Grokking occurs in a weird special case where it seems there's ~one answer that generalizes well and has much higher prior, and everything else is orders of magnitude less likely. I don't really see why you should expect that results on MNIST should generalize to this situation.
How much compute was used to train DeepMind's generally capable agents?

I did, sorry -- I guesstimated FLOP/step and then figured the parameter count is probably a bit less than 1 OOM below that. But since this is recurrent maybe it's even less? IDK. My guesstimate is shitty and I'd love to see someone do a better one!

How much compute was used to train DeepMind's generally capable agents?

Michael Dennis tells me that population-based training typically sees strong diminishing returns to population size, such that he doubts that there were more than one or two dozen agents in each population/generation. This is consistent with AlphaStar I believe, where the number of agents was something like that IIRC...

Anyhow, suppose 30 agents per generation. Then that's a cost of $5,000/mo x 1.3 months x 30 agents = $195,000 to train the fifth generation of agents. The previous two generations were probably quicker and cheaper. In total the price is prob... (read more)
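To make the cost arithmetic explicit, here is a rough sketch; the per-agent TPU price, training time, and population size are the assumptions stated above, not official figures:

```python
# Rough training-cost guesstimate for the final generation, using the
# assumed figures above (not official DeepMind numbers).

cost_per_agent_per_month_usd = 5_000   # ~8 TPUv3s, quick Google estimate
training_months = 1.3
agents_per_generation = 30             # assumed population size

generation_cost = cost_per_agent_per_month_usd * training_months * agents_per_generation
print(f"Fifth generation: ~${generation_cost:,.0f}")   # ~$195,000
```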

gwern (3mo, +4): Makes sense given the spinning-top [https://arxiv.org/abs/2004.09468] topology of games. These tasks are probably not complex enough to need a lot of distinct agents/populations to traverse the wide part to reach the top where you then need little diversity to converge on value-equivalent models. One observation: you can't run SC2 environments on a TPU, and when you can pack the environment and agents together onto a TPU and batch everything with no copying, you use the hardware closer to its full potential [https://www.gwern.net/notes/Faster#gwern-notes-sparsity], see the Podracer [https://arxiv.org/abs/2104.06272#deepmind] numbers.
How much compute was used to train DeepMind's generally capable agents?

Also for comparison, I think this means these models were about twice as big as AlphaStar. That's interesting.

How much compute was used to train DeepMind's generally capable agents?

I have a guesstimate for number of parameters, but not for overall compute or dollar cost:

Each agent was trained on 8 TPUv3s, which cost about $5,000/mo according to a quick google, and which seem to produce 90 TOPS, or about 10^14 operations per second. They say each agent does about 50,000 steps per second, so that means about 2 billion operations per step. Each little game they play lasts 900 steps if I recall correctly, which is about 2 minutes of subjective time, they say (I imagine they extrapolated from what happens if you run the game at a speed su... (read more)
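A minimal sketch of the operations-per-step arithmetic (the TPU throughput and step rate are the rough figures quoted above, not measured values):

```python
# Operations-per-step guesstimate from the assumed hardware figures above.

tpu_ops_per_sec = 1e14       # ~90 TOPS from 8 TPUv3s, rounded up
env_steps_per_sec = 50_000   # reported steps per second per agent

ops_per_step = tpu_ops_per_sec / env_steps_per_sec
print(f"~{ops_per_step:.0e} operations per environment step")   # ~2e9
```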

Jaime Sevilla (3mo, +3): Do you mind sharing your guesstimate on number of parameters? Also, do you have per chance guesstimates on number of parameters / compute of other systems?
Daniel Kokotajlo (3mo, +2): Michael Dennis tells me that population-based training typically sees strong diminishing returns to population size, such that he doubts that there were more than one or two dozen agents in each population/generation. This is consistent with AlphaStar I believe, where the number of agents was something like that IIRC... Anyhow, suppose 30 agents per generation. Then that's a cost of $5,000/mo x 1.3 months x 30 agents = $195,000 to train the fifth generation of agents. The previous two generations were probably quicker and cheaper. In total the price is probably, therefore, something like half a million dollars of compute? This seems surprisingly low to me. About one order of magnitude less than I expected. What's going on? Maybe it really was that cheap. If so, why? Has the price dropped since AlphaStar? Probably... It's also possible this just used less compute than AlphaStar did...
Daniel Kokotajlo (3mo, +2): Also for comparison, I think this means these models were about twice as big as AlphaStar. That's interesting.
Did they or didn't they learn tool use?

Also, what does the "1G 38G 152G" mean in the image? I can't tell. I would have thought it means number of games trained on, or something, except that at the top it says 0-Shot.

DeepMind: Generally capable agents emerge from open-ended play

Thanks! This is exactly the sort of thoughtful commentary I was hoping to get when I made this linkpost.

--I don't see what the big deal is about laws of physics. Humans and all their ancestors evolved in a world with the same laws of physics; we didn't have to generalize to different worlds with different laws. Also, I don't think "be superhuman at figuring out the true laws of physics" is on the shortest path to AIs being dangerous. Also, I don't think AIs need to control robots or whatnot in the real world to be dangerous, so they don't even need to be ... (read more)

I don't see what the big deal is about laws of physics. Humans and all their ancestors evolved in a world with the same laws of physics; we didn't have to generalize to different worlds with different laws. Also, I don't think "be superhuman at figuring out the true laws of physics" is on the shortest path to AIs being dangerous. Also, I don't think AIs need to control robots or whatnot in the real world to be dangerous, so they don't even need to be able to understand the true laws of physics, even on a basic level.

The entire novelty of this work revol... (read more)

[AN #156]: The scaling hypothesis: a plan for building AGI

OK cool, sorry for the confusion. Yeah, I think ESRogs' interpretation of you was making a somewhat stronger claim than you actually were.

[AN #156]: The scaling hypothesis: a plan for building AGI

Huh, it seems I misunderstood you, then! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made; in fact, that argument seems to support the opposite. The argument you made was:

There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there's a corresponding model that the resulting GPT-N would "try" to <do the thing i
... (read more)
Rohin Shah (3mo, +4): ?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words? EDIT: Ah, you mean the fourth bullet point in ESRogs response. I was thinking of that as one example of how such reasoning could go wrong, as opposed to the only case. So in that case the model_1 predicts a treacherous turn confidently, but this is the wrong epistemic state to be in because it is also plausible that it just "fills in words" instead. Isn't that effectively what I said? (I was trying to be more precise since "achieve its training objective" is ambiguous, but given what I understand you to mean by that phrase, I think it's what I said?) This seems reasonable to me (and seems compatible with what I said)
Open Problems with Myopia

Also: I think making sure our agents are DDT is probably going to be approximately as difficult as making them aligned. Related: Your handle for anthropic uncertainty is:

never reason about anthropic uncertainty. DDT agents always think they know who they are.

"Always think they know who they are" doesn't cut it; you can think you know you're in a simulation. I think a more accurate version would be something like "Always think that you are on an original planet, i.e. one in which life appeared 'naturally,' rather than a planet in the midst of some larger in... (read more)
