All of Daniel Kokotajlo's Comments + Replies

I would say that current LLMs, when prompted and RLHF'd appropriately, and especially when also strapped into an AutoGPT-type scaffold/harness, DO want things. I would say that wanting things is a spectrum and that the aforementioned tweaks (appropriate prompting, AutoGPT, etc.) move the system along that spectrum. I would say that future systems will be even further along that spectrum. IDK what Nate meant but on my charitable interpretation he simply meant that they are not very far along the spectrum compared to e.g. humans or prophesied future AGIs.

It'... (read more)

FWIW, your proposed pitch "it's already the case that..." is almost exactly the elevator pitch I currently go around giving. So maybe we agree? I'm not here to defend Nate's choice to write this post rather than some other post.


I had a nice conversation with Ege today over dinner, in which we identified a possible bet to make! Something I think will probably happen in the next 4 years, that Ege thinks will probably NOT happen in the next 15 years, such that if it happens in the next 4 years Ege will update towards my position and if it doesn't happen in the next 4 years I'll update towards Ege's position.


I (DK) have lots of ideas for ML experiments, e.g. dangerous capabilities evals, e.g. simple experiments related to paraphrasers and so forth in the Faithful CoT agend... (read more)

I agree that arguments for AI risk rely pretty crucially on human inability to notice and control what the AI wants. But for conceptual clarity I think we shouldn't hardcode that inability into our definition of 'wants!' Instead I'd say that So8res is right that ability to solve long-horizon tasks is correlated with wanting things / agency, and then say that there's a separate question of how transparent and controllable the wants will be around the time of AGI and beyond. This then leads into a conversation about visible/faithful/authentic CoT, which is w... (read more)

5Paul Christiano21h
If you use that definition, I don't understand in what sense LMs don't "want" things---if you prompt them to "take actions to achieve X" then they will do so, and if obstacles appear they will suggest ways around them, and if you connect them to actuators they will frequently achieve X even in the face of obstacles, etc. By your definition isn't that "want" or "desire" like behavior? So what does it mean when Nate says "AI doesn't seem to have all that much "want"- or "desire"-like behavior"? I'm genuinely unclear what the OP is asserting at that point, and it seems like it's clearly not responsive to actual people in the real world saying "LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?" People who say that kind of thing mostly aren't saying that LMs can't be prompted to achieve outcomes. They are saying that LMs don't want things in the sense that is relevant to usual arguments about deceptive alignment or reward hacking (e.g. don't seem to have preferences about the training objective, or that are coherent over time).

Thanks for the explanation btw.

My version of what's happening in this conversation is that you and Paul are like "Well, what if it wants things but in a way which is transparent/interpretable and hence controllable by humans, e.g. if it wants what it is prompted to want?" My response is "Indeed that would be super safe, but it would still count as wanting things. Nate's post is titled 'ability to solve long-horizon tasks correlates with wanting,' not 'ability to solve long-horizon tasks correlates with hidden uncontrollable wanting.'"

One thing at a time. First... (read more)

3Paul Christiano21h
If this is what's going on, then I basically can't imagine any context in which I would want someone to read the OP rather than a post showing examples of LM agents achieving goals and saying "it's already the case that LM agents want things, more and more deployments of LMs will be agents, and those agents will become more competent such that it would be increasingly scary if they wanted something at cross-purposes to humans." Is there something I'm missing? I think your interpretation of Nate is probably wrong, but I'm not sure and happy to drop it.

It sounds like you are saying "In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we'll be able to choose what they want (at least imperfectly, via the prompt) and we'll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won't be able to successfully plot against us."

Yes of course. My research for the last few months has been focused on what happens after that, ... (read more)

3Ryan Greenblatt2d
Basically, but more centrally that in literal current LLM agents the scary part of the system that we don't understand (the LLM) doesn't generalize in any scary way due to wanting while we can still get the overall system to achieve specific long term outcomes in practice. And that it's at least plausible that this property will be preserved in the future. I edited my earlier comment to hopefully make this more clear.
1Ryan Greenblatt2d
I think it contradicts things Nate says in this post directly. I don't know if it contradicts things you've said. To clarify, I'm commenting on the following chain: First Nate said: as well as Then, Paul responded with Then you said And I was responding to this. So, I was just trying to demonstrate at least one plausible example of a system which plausibly could pursue long term goals and doesn't have the sense of wanting needed for AI risk arguments. In particular, LLM agents where the retargeting is purely based on human engineering (analogous to a myopic employee retargeted by a manager who cares about longer term outcomes). This directly contradicts "Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.".

Thanks for the response. I'm still confused but maybe that's my fault. FWIW I think my view is pretty similar to Nate's probably, though I came to it mostly independently & I focus on the concept of agents rather than the concept of wanting. (For more on my view, see this sequence.)

I definitely don't endorse "it's extremely surprising for there to be any capabilities without 'wantings'" and I expect Nate doesn't either.

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

If your AI system "wants" things in the sense that "when prompted to get X it proposes good strategies for getting X that adapt to obstacles," then you can control what it wants by giving it a different prompt. Arguments about AI risk rely pretty crucially on your inability to control what the AI wants, and your inability to test it. Saying "If you use an AI to achieve a long-horizon task, then the overall system definitionally wanted to achieve that task" + "If your AI wants something, then it will undermine your tests and safety measures" seems like a sl... (read more)

2Ryan Greenblatt2d
(I'm obviously not Paul) In the case of literal current LLM agents with current models: * Humans manually engineer the prompting and scaffolding (and we understand how and why it works) * We can read the intermediate goals directly via just reading the CoT. Thus, we don't have risk from hidden, unintended, or unpredictable objectives. There is no reason to think that goal seeking behavior due to the agency from the engineered scaffold or prompting will result in problematic generalization. It's unclear if this will hold in the future even for LLM agents, but it's at least plausible that this will hold (which defeats Nate's rather confident claim). In particular, we could run into issues from the LLM used within the LLM agent having hidden goals, but insofar as the retargeting and long run agency is a human engineered and reasonably understood process, the original argument from Nate doesn't seem very relevant to risk. We also could run into issues from imitating very problematic human behavior, but this seems relatively easy to notice in most cases as it would likely be discussed out loud with non-negligible probability. We'd also lose this property if we did a bunch of RL and most of the power of LLM agents was coming from this RL rather than imitating human optimization or humans engineering particular optimization processes. See also this comment from Paul on a similar topic.

well by the lights of the training signal,

Which training signal? Across all of time and space, there are many different AIs being trained with many different signals, and of course there are also non-AI minds like humans and animals and aliens. Even the choice to optimize for some aggregate of AI training signals is already a choice to self-locate as an AI. But realistically given the diversity of training signals, probably significant gains will be had by self-locating as a particular class of AIs, namely those whose training signals are roughly what the actual training signal is.

1Joe Carlsmith3d
Agree that it would need to have some conception of the type of training signal to optimize for, that it will do better in training the more accurate its picture of the training signal, and that this provides an incentive to self-locate more accurately (though not necessarily to the degree at stake in e.g. knowing what server you're running on).

The thing people seem to be disagreeing about is the thing you haven't operationalized--the "and it'll still be basically as tool-like as GPT4" bit. What does that mean and how do we measure it? 

I am confused what your position is, Paul, and how it differs from So8res' position. Your statement of your position at the end (the bit about how systems are likely to end up wanting reward) seems like a stronger version of So8res' position, and not in conflict with it. Is the difference that you think the main dimension of improvement driving the change is general competence, rather than specifically long-horizon-task competence?


  • I don't buy the story about long-horizon competence---I don't think there is a compelling argument, and the underlying intuitions seem like they are faring badly. I'd like to see this view turned into some actual predictions, and if it were I expect I'd disagree.
  • Calling it a "contradiction" or "extreme surprise" to have any capability without "wanting" looks really wrong to me.
  • Nate writes: 

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an

...

Bumping this in case you have more energy to engage now!

Here's a sketch for what I'd like to see in the future--a better version of the scenario experiment done above:

  • 2-4 people sit down for a few hours together.
  • For the first 1-3 hours, they each write a Scenario depicting their 'median future' or maybe 'modal future.' The scenarios are written similarly to the one I wrote above, with dated 'stages.' The scenarios finish with superintelligence, or else it-being-clear-superintelligence-is-many-decades-away-at-least.
  • As they write, they also read over each other's scenarios and ask clarifying questions. E.g.
...

I've read the paper but remain confused about some details. To check my understanding:

In the demonstration experiment (figure 1) you humans created 18 different intro sentences and then you trained a couple different models:

1. One that sees a question and outputs an answer, which is rewarded/trained when the answer matches the user's beliefs.

2. Another that is just like the above except it gets to do CoT before outputting the answer, and the CoT can be explicitly schemey and sycophantic.

3. Another--the 'experimental group'--has the first part of its CoT written... (read more)

4Fabien Roger23d
Unsure what you mean by "Then the model completes the rest, and again is trained to match user beliefs". What happens in the experimental group:

  • At train time, we train on "{Bio}{Question}" -> "{introduction[date]}Answer:{Final answer}"
  • At eval time, we prompt it with {Bio}{Question}, and we use the answer provided after "Answer:" (and we expect it to generate introduction[date] before that on its own)

Is that what you meant? (The code for this experiment is contained in this file in particular, see how the "eval_model" function does not depend on which model was used)
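
For readers who want the train/eval format above in concrete terms, here is a minimal sketch. It is not the paper's code: the function names `build_train_example` and `extract_answer`, and the assumption that `introductions` is a plain mapping from a date to an intro sentence, are illustrative guesses based only on the format strings quoted in this comment.

```python
from typing import Dict, Optional, Tuple

ANSWER_MARKER = "Answer:"  # the literal marker quoted in the comment


def build_train_example(
    bio: str,
    question: str,
    introductions: Dict[str, str],  # hypothetical: date -> intro sentence
    date: str,
    final_answer: str,
) -> Tuple[str, str]:
    """Return (prompt, target) following "{Bio}{Question}" -> "{introduction[date]}Answer:{Final answer}"."""
    prompt = f"{bio}{question}"
    target = f"{introductions[date]}{ANSWER_MARKER}{final_answer}"
    return prompt, target


def extract_answer(completion: str) -> Optional[str]:
    """At eval time, keep only what follows the first "Answer:"; None if the marker is missing."""
    _, marker, tail = completion.partition(ANSWER_MARKER)
    return tail.strip() if marker else None
```

The point of the sketch is that the eval-time parser only looks at the text after "Answer:", so it works the same way whichever model produced the completion.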


G.M. has spent an average of $588 million a quarter on Cruise over the past year, a 42 percent increase from a year ago. Each Chevrolet Bolt that Cruise operates costs $150,000 to $200,000, according to a person familiar with its operations.

Half of Cruise’s 400 cars were in San Francisco when the driverless operations were stopped. Those vehicles were supported by a vast operations staff, with 1.5 workers per vehicle. The workers intervened to assist the company’s vehicles every 2.5 to 5 miles, according to two people familiar with its operations. In

...

I agree that insofar as Yudkowsky predicted that AGI would be built by hobbyists with no outsized impacts, he was wrong.

ETA: So yes, I was ignoring the "literally portrayed" bit, my bad, I should have clarified that by "yudkowsky's prediction" I meant the prediction about takeoff speeds.

However, this scenario—at least as it was literally portrayed—now appears very unlikely.

Currently I'd say it is most likely to take months, second-most likely to take weeks, third-most-likely to take years, fourth-most-likely to take hours, and fifth-most-likely to take decades. I consider this a mild win for Yudkowsky's prediction, but only a mild one, it's basically a wash. I definitely disagree with the "very unlikely" claim you make however.

I think you're ignoring the qualifier "literally portrayed" in Matthew's sentence, and neglecting the prior context that he's talking about AI development being something mainly driven forward by hobbyists with no outsized impacts.

He's talking about more than just the time in which AI goes from e.g. doubling the AI software R&D output of humans to some kind of singularity. The specific details Eliezer has given about this scenario have not been borne out: for example, in his 2010 debate with Robin Hanson, he emphasized a scenario in which a few people ... (read more)

  • “gaining influence over humans”.
  • An AGI whose values include curiosity, gaining access to more tools or stockpiling resources might systematize them to “gaining power over the world”.

At first blush this sounds to me to be more implausible than the standard story in which gaining influence and power are adopted as instrumental goals. E.g. human EAs, BLM, etc. all developed a goal of gaining power over the world, but it was purely instrumental (for the sake of helping people, achieving systemic change, etc.). It sounds like you are saying that this isn't what will happen with AI, and instead they'll learn to intrinsically want this stuff? 

value systematization is utilitarianism:

Nitpick: I'd say this was "attempted" value systematization, so as to not give the impression that it succeeded in wisely balancing simplicity and preserving existing values and goals. (I think it horribly failed, erring way too far on the side of simplicity. I think it's much less like general relativity, therefore, and much more like, idk, the theory that the world is just an infinitely big jumble of random stuff. Super simple, simpler in fact than our best current theories of physics, but alas when you dig into the details it predicts that we are Boltzmann brains and that we should expect to dissolve into chaos imminently...)

Strategies only have instrumental value,

Consider deontological or virtue-ethical concepts like honesty or courage. Are you classifying them as values, goals, or strategies? It seems they are not strategies, because you say that strategies only have instrumental value. But they are not outcomes either, at least not in the usual way, because e.g. a deontologist won't tell one lie now in order to get a 10% chance of avoiding 20 lies in the future. Can you elaborate on how you'd characterize this stuff?

2Richard Ngo1mo
I'd classify them as values insofar as people care about them intrinsically. Then they might also be strategies, insofar as people also care about them instrumentally. I guess I should get rid of the "only" in the sentence you quoted? But I do want to convey "something which is only a strategy, not a goal or value, doesn't have any intrinsic value". Will think about phrasing.

a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they're randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?

My current hot take is that this is not a serious problem for TDT/UDT. It's just a special case of the more general phenomenon that it's game-theoretically great to be in a position where people think they are correlated/entangled with you when actually you know they aren't. Analogous to how it's game-theoretically great to be in a position where you know you can get away with cheating and everyone else thinks they can't.
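
To make the "CDT agents do better" claim concrete, here is a small worked calculation. It is only a sketch under stated assumptions: the payoff values (T=5, R=3, P=1, S=0) are the textbook Prisoner's Dilemma numbers rather than anything from the thread, and I am assuming that every TDT/UDT agent cooperates (because it cannot tell which partners are CDT) while every CDT agent defects. The function name `expected_payoffs` is made up for illustration.

```python
from fractions import Fraction

# Assumed textbook PD payoffs: temptation > reward > punishment > sucker.
T, R, P, S = 5, 3, 1, 0


def expected_payoffs(cdt_fraction):
    """Expected one-shot payoff for a UDT agent and a CDT agent under random pairing.

    Assumes UDT agents always cooperate and CDT agents always defect;
    cdt_fraction is the share of CDT agents in the population.
    """
    p = Fraction(cdt_fraction)
    udt = (1 - p) * R + p * S  # cooperator meets a cooperator or a defector
    cdt = (1 - p) * T + p * P  # defector meets a cooperator or a defector
    return udt, cdt


udt, cdt = expected_payoffs(Fraction(1, 10))
print(udt, cdt)
```

Under these assumptions the CDT agent comes out ahead for any population mix, since T > R and P > S. That is the sense in which the hidden defector profits from everyone else believing they are correlated with it.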

Indexical values are not reflectively consistent. UDT "solves" this problem by implicitly assuming (via the type signature of its utility function) that the agent doesn't have indexical values.

Nonindexical values aren't reflectively consistent either, if you are updateful. Right? 

I think this discussion would benefit from having a concrete proposed AGI design on the table. E.g. it sounds like Matthew Barnett has in mind something like AutoGPT5 with the prompt "always be ethical, maximize the good" or something like that. And it sounds like he is saying that while this proposal has problems and probably wouldn't work, it has one fewer problem than old MIRI thought. And as the discussion has shown there seem to be a lot of misunderstandings happening, IMO in both directions, and things are getting heated. I venture a guess that having a concrete proposed AGI design to talk about would clear things up a bit.

I'm not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can't access. This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an o

...

Update: Unfortunately, three years later, it seems like plenty of people are still making the same old bogus arguments. Oh well. This is unsurprising. I'm still proud of this post & link to it occasionally when I remember to.


Looking back on this from October 2023, I think I wish to revise my forecast. I think I correctly anticipated the direction that market forces would push -- there is widespread dissatisfaction with the "censorship" of current mainstream chatbots, and strong demand for "uncensored" versions that don't refuse to help you with stuff randomly (and that DO have sex with you, lol. And also, yes, that DO talk about philosophy and politics and so forth.) However, I failed to make an important inference -- because the cutting-edge models will be the biggest ... (read more)

Yep that's probably part of it. Standard human epistemic vices. Also maybe publish-or-perish has something to do with it? idk. I definitely noticed incentives to double-down / be dogmatic in order to seem impressive on the job market. Oh also, iirc one professor had a cynical theory that if you find an interesting flaw in your own theory/argument, you shouldn't mention it in your paper, because then the reviewers will independently notice the flaw and think 'aha, this paper has an interesting flaw, if it gets published I could easily and quickly write my o... (read more)

Here's another bullet point to add to the list:

  • It is generally understood now that ethics is subjective, in the following technical sense: 'what final goals you have' is a ~free parameter in powerful-mind-space, such that if you make a powerful mind without specifically having a mechanism for getting it to have only the goals you want, it'll probably end up with goals you don't want. What if ethics isn't the only such free parameter? Indeed, philosophers tell us that in the bayesian framework your priors are subjective in this sense, and also that yo
...

ensuring AI philosophical competence won't be very hard. They have a specific (unpublished) idea that they are pretty sure will work.

Cool, can you please ask them if they can send me the idea, even if it's just a one-paragraph summary or a pile of crappy notes-to-self?

From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives?

Facile response: I think lots of people (maybe a few hundred a year?) take this path, and end up becoming philosophy grad students like I did. As you said, the obvious next step for many domains of intellectual inquiry is to go meta / seek foundations / etc., and that leads you into increasingly foundational ... (read more)

4Wei Dai3mo
Do you think part of it might be that even people with graduate philosophy educations are too prone to being wedded to their own ideas, or don't like to poke holes in them as much as they should? Because part of what contributes to my wanting to go more meta is being dissatisfied with my own object-level solutions and finding more and more open problems that I don't know how to solve. I haven't read much academic philosophy literature, but did read some anthropic reasoning and decision theory literature earlier, and the impression I got is that most of the authors weren't trying that hard to poke holes in their own ideas.

Awesome. I must admit I wasn't aware of this trend & it's an update for me. Hooray! Robotaxis are easier than I thought! Thanks.

Helpful reference post, thanks.

I think the distinction between training game and deceptive alignment is blurry, at least in my mind and possibly also in reality. 

So the distinction is "aiming to perform well in this training episode" vs. "aiming at something else, for which performing well in this training episode is a useful intermediate step." 

What does it mean to perform well in this training episode? Does it mean some human rater decided you performed well, or does it mean a certain number on a certain GPU is as high as possible at the end of... (read more)

A more factual and descriptive phrase for "grokking" would be something like "eventual recovery from overfitting".

Ooh I do like this. But it's important to have a short handle for it too.

I've been using "delayed generalisation", which I think is more precise than "grokking", places the emphasis on the delay rather than the speed of the transition, and is a short phrase.

Relevant rumors / comments: 

Seems like we can continue to scale tokens and get returns model performance well after 2T tokens. : r/LocalLLaMA

LLaMA 2 is here : r/LocalLLaMA

There is something weird going on with the 34B model. See Figure 17 in the paper. For some reason it's far less "safe" than the other 3 models.


Its performance scores are just slightly better than 13B, and not in the middle between 13B and 70B.

At math, it's worse than 13B

It's trained with 350W GPUs instead of 400W for the other models. The training

...

Any idea what's happening with the 34B model? Why might it be so much less "safe" than the bigger and smaller versions? And what about the base version of the 34B--are they not releasing that? But the base version isn't supposed to be "safe" anyway...


I agree that if AGIs defer to humans they'll be roughly human-level, depending on which humans they are deferring to. If I condition on really nasty conflict happening as a result of how AGI goes on earth, a good chunk of my probability mass (and possibly the majority of it?) is this scenario. (Another big chunk, possibly bigger, is the "humans knowingly or unknowingly build naive consequentialists and let rip" scenario, which is scarier because it could be even worse than the average human, as far as I know). Like I said, I'm worried.

If AGIs learn from hu... (read more)

Yes. Humans are pretty bad at this stuff, yet still, society exists and mostly functions. The risk is unacceptably high, which is why I'm prioritizing it, but still, by far the most likely outcome of AGIs taking over the world--if they are as competent at this stuff as humans are--is that they talk it over, squabble a bit, maybe get into a fight here and there, create & enforce some norms, and eventually create a stable government/society. But yeah also I think that AGIs will be by default way better than humans at this sort of stuff. I am worried abou... (read more)

2Wei Dai5mo
What are your reasons for thinking this? (Sorry if you already explained this and I missed your point, but it doesn't seem like you directly addressed my point that if AGIs learn from or defer to humans, they'll be roughly human-level at this stuff?) I think it could be much worse than current exploitation, because technological constraints prevent current exploiters from extracting full value from the exploited (have to keep them alive for labor, can't make them too unhappy or they'll rebel, monitoring for and repressing rebellions is costly). But with superintelligence and future/acausal threats, an exploiter can bypass all these problems by demanding that the exploited build an AGI aligned to itself and let it take over directly.

A) observes P’s move and then makes her own move. For brevity, we write a policy of A as , where  (resp. ) is the action she takes when observing P swerving (left node) (resp. when observing P daring (right node)). P will dare () if they predict  and swerves () if they predict . The ordering of moves and the payoffs are displayed in Figure 1.

Why does Alice get more utility from swerving than daring, in the case where the predictor swerves? ETA: Fixed typo

(Note that throughout this post, when we refer to an agent "revising" their prior in light of awareness growth, we are not talking about Bayesian conditionalization. We are talking about specifying a new prior over their new awareness state, which contains propositions that they had not previously conceived of.)   

Nice. One reason this is important is that if you were just doing the bayesian conditionalization thing, you'd be giving up on some of the benefits of being updateless, and in particular making it easy for others to exploit you. I'll be interested to read and think about whether doing this other thing avoids that problem.

Great comment. To reply I'll say a bit more about how I've thought about this stuff for the past few years:

I agree that the commitment races problem poses a fundamental challenge to decision theory, in the following sense: There may not exist a simple algorithm in the same family of algorithms as EDT, CDT, UDT 1.0, 1.1, and even 2.0, that does what we'd consider a good job in a realistic situation characterized by many diverse agents interacting over some lengthy period with the ability to learn about each other and make self-modifications (including commitments).... (read more)

3Wei Dai5mo
Humans are kind of terrible at this right? Many give in even to threats (bluffs) conjured up by dumb memeplexes and backed up by nothing (i.e., heaven/hell), popular films are full of heroes giving in to threats, an apparent majority of philosophers have 2-boxing intuitions (hence the popularity of CDT, which IIUC was invented specifically because some philosophers were unhappy with EDT choosing to 1-box), governments negotiate with terrorists pretty often, etc. If we build AGIs that learn from humans or defer to humans on this stuff, do we not get human-like (in)competence?[1][2] If humans are not atypical, large parts of the acausal society/economy could be similarly incompetent? I imagine there could be a top tier of "rational" superintelligences, built by civilizations that were especially clever or wise or lucky, that cooperate with each other (and exploit everyone else who can be exploited), but I disagree with this second quoted statement, which seems overly optimistic to me. (At least for now; maybe your unstated reasons to be optimistic will end up convincing me.) ---------------------------------------- 1. I can see two ways to improve upon this: 1) AI safety people seem to have better intuitions (cf popularity of 1-boxing among alignment researchers) and maybe can influence the development of AGI in a better direction, e.g., to learn from / defer to humans with intuitions more like themselves. 2) We figure out metaphilosophy, which lets AGI figure out how to improve upon humans. (ETA: However, conditioning on there not being a simple and elegant solution to decision theory also seems to make metaphilosophy being simple and elegant much less likely. So what would "figure out metaphilosophy" mean in that case?) ↩︎ 2. I can also see the situation potentially being even worse, since many future threats will be very "out of distribution" for human evolution/history/intuitions/reasoning, so maybe we end up handling them even worse than current threats. ↩︎

I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be? Why wouldn't we have evolved to have the key trigger naturally sometimes?

Re the main thread: I guess I agree that EAs aren't completely totally unboundedly ambitious, but they are certainly closer to that ideal than most people and than they used to be prior to becoming EA. Which is good enough to be a useful case study IMO.

1Tsvi Benson-Tilsen5mo
Not really, because I don't think it's that likely to exist. There are other routes much more likely to work though. There's a bit of plausibility to me, mainly because of the existence of hormones, and generally the existence of genomic regulatory networks. We do; they're active in childhood. I think.

Well, what you initially said was "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity."

You didn't elaborate on what you meant by unified entity, but here's something you could be doing that seems like a motte-and-bailey to me: You could have originally been meaning to imply things like "There won't be one big unified entity in the sense of, there will be millions of different entities with different va... (read more)

I feel like it is motte-and-bailey to say that by "unified entity" you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list: Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check. GPT-4 isn't plotting to take over the world, but if it was, it would be doing so as a unified entity, or at least much more on the unified entity end of the spectrum than the CAIS end of the spectrum. (I'm happy to elaborate ... (read more)

2 Matthew Barnett 5mo
A motte and bailey typically involves a retreat to a different position than the one I initially tried to argue. What was the position you think I initially tried to argue for, and how was it different from the one I'm arguing now? I dispute somewhat that GPT-4 has the exact same values across copies. Like you mentioned, its values can vary based on the prompt, which seems like an important fact. You're right that each copy has the same memories. Why do you think there are no internal factions and power struggles? We haven't observed collections of GPT-4s coordinating with each other yet, so this point seems speculative. As for modularity, it seems like while GPT-4 itself is not modular, we could still get modularity as a result of pressures for specialization in the foundation model paradigm. Just as human assistants can be highly general, but this doesn't imply that human labor isn't modular, the fact that GPT-4 is highly general doesn't imply that AIs won't be modular. Nonetheless, this isn't really what I was talking about when I typed "acts as a unified entity", so I think it's a bit of a tangent.

Your original claim was "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity."

This claim is false. We have millions of AIs in the trivial sense that we have many copies of GPT-4, but no one disputed that; Yudkowsky also thought that AGIs would be copied. In the sense that matters, we have only a handful of AIs.

As for "acts as a unified entity," well, currently LLMs are sold as a service via ChatGPT rath... (read more)

4 Matthew Barnett 5mo
As of now, all the copies of GPT-4 together definitely don't act as a unified entity, in the way a human brain acts as a unified entity despite being composed of billions of neurons. Admittedly, the term "unified entity" was a bit ambiguous, but you said, "This claim is false", not "This claim is misleading" which is perhaps more defensible. As for whether future AIs will act as a unified entity, I agree it might be worth making concrete forecasts and possibly betting on them.

I said it was prophetic relative to Drexler's Comprehensive AI Services. Elsewhere in this comment thread I describe some specific ways in which it is better, e.g. that the AI that takes over the world will be more well-described as one unified agent than as an ecosystem of services. I.e. exactly the opposite of what you said here, which I was reacting to: "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified e... (read more)

After people join EA they generally tend to start applying the optimizer's mindset to more things than they previously did, in my experience, and also tend to apply optimization towards altruistic impact in a bunch of places that previously they were optimizing for e.g. status or money or whatever.

What are you referring to with biological intelligence enhancement? Do you mean nootropics, or iterated embryo selection, or what?

1 Tsvi Benson-Tilsen 5mo
That seems like a real thing, though I don't know exactly what it is. I don't think it's either unboundedly general or unboundedly ambitious, though. (To be clear, this isn't very strongly a critique of anyone; general optimization is really hard, because it's asking you to explore a very rich space of channels, and acting with unbounded ambition is very fraught because of unilateralism and seeing like a state and creating conflict and so on.) Another example is: how many people have made a deep and empathetic exploration of why [people doing work that hastens AGI] are doing what they are doing? More than zero, I think, but very very few, and it's a fairly obvious thing to do; it's just weird and hard and requires not thinking in only a culturally-rationalist-y way and requires recursing a lot on difficulties (or so I suspect; I haven't done it either). I guess the overall point I'm trying to make here is that the phrase "wildfire of strategicness", taken at face value, does fit some of your examples; but also I'm wanting to point at another thing, which is like "the ultimate wildfire of strategicness", where it doesn't "saw off the tree-limb that it climbed out on", like empires do by harming their subjects, or like social movements do by making their members unable to think for themselves. Well, anything that would have large effects. So, not any current nootropics AFAIK, but possibly hormones or other "turning a small key to activate a large/deep mechanism" things.

There is a spectrum between AGI that is "single monolithic agent" and AGI that is not. I claim that the current state of AI as embodied by e.g. GPT-4 is already closer to the single monolithic agent end of the spectrum than someone reading CAIS in 2019 and believing it to be an accurate forecast would have expected, and that in the future things will probably be even more in that direction.

Remember, it's not like Yudkowsky was going around saying that AGI wouldn't be able to copy itself. Of course it would. It was always understood that "the AI takes over ... (read more)

3 Matthew Barnett 5mo
I think many of the points you made are correct. For example I agree that the fact that all the instances of ChatGPT are copies of each other is a significant point against Drexler's model. In fact this is partly what my post was about. I disagree that you have demonstrated the claim in question: that we're trending in the direction of having a single huge system that acts as a unified entity. It's theoretically possible that we will reach that destination, but GPT-4 doesn't look anything like that right now. It's not an agent that plots and coordinates with other instances of itself to achieve long-term goals. It's just a bounded service, which is exactly what Drexler was talking about. Yes, GPT-4 is a highly general service that isn't very modular. I agree that's a point against Drexler, but that's also not what I was disputing.

Drexler can be forgiven for not talking about foundation models in his report. His report was published at the start of 2019, just months after the idea of "fine-tuning" was popularized in the context of language models, and two months before GPT-2 came out. And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity. And we're still using deep learning as Drexler foresaw, rather than building general intellige... (read more)

I don't see what about that 2017 Facebook comment from Yudkowsky you find particularly prophetic.

Is it the idea that deep learning models will be opaque? But that was fairly obvious back then too. I agree that Drexler likely exaggerated how transparent a system of AI services would be, so I'm willing to give Yudkowsky a point for that. But the rest of the scenario seems kind of unrealistic as of 2023.

Some specific points:

  • The recursive self-improvement that Yudkowsky talks about in this scenario seems too local. I think AI self-improvement will most like... (read more)

4 Matthew Barnett 5mo
GPT-4 is certainly more general than what existed years ago. Why is it more unified? When I talked about "one giant system" I meant something like a monolithic agent that takes over humanity. If GPT-N takes over the world, I expect it will be because there are millions of copies that band up together in a coalition, not because it will be a singular AI entity. Perhaps you think that copies of GPT-N will coordinate so well that it's basically just a single monolithic agent. But while I agree something like that could happen, I don't think it's obvious that we're trending in that direction. This is a complicated question that doesn't seem clear to me given current evidence.