FWIW, your proposed pitch "it's already the case that..." is almost exactly the elevator pitch I currently go around giving. So maybe we agree? I'm not here to defend Nate's choice to write this post rather than some other post.
I had a nice conversation with Ege today over dinner, in which we identified a possible bet to make! Something I think will probably happen in the next 4 years, that Ege thinks will probably NOT happen in the next 15 years, such that if it happens in the next 4 years Ege will update towards my position and if it doesn't happen in the next 4 years I'll update towards Ege's position.
Drumroll...
I (DK) have lots of ideas for ML experiments, e.g. dangerous capabilities evals, e.g. simple experiments related to paraphrasers and so forth in the Faithful CoT agend...
I agree that arguments for AI risk rely pretty crucially on human inability to notice and control what the AI wants. But for conceptual clarity I think we shouldn't hardcode that inability into our definition of 'wants'! Instead I'd say that So8res is right that ability to solve long-horizon tasks is correlated with wanting things / agency, and then say that there's a separate question of how transparent and controllable the wants will be around the time of AGI and beyond. This then leads into a conversation about visible/faithful/authentic CoT, which is w...
Thanks for the explanation btw.
My version of what's happening in this conversation is that you and Paul are like "Well, what if it wants things but in a way which is transparent/interpretable and hence controllable by humans, e.g. if it wants what it is prompted to want?" My response is "Indeed that would be super safe, but it would still count as wanting things." Nate's post is titled "ability to solve long-horizon tasks correlates with wanting," not "ability to solve long-horizon tasks correlates with hidden uncontrollable wanting."
One thing at a time. First...
It sounds like you are saying "In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we'll be able to choose what they want (at least imperfectly, via the prompt) and we'll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won't be able to successfully plot against us."
Yes of course. My research for the last few months has been focused on what happens after that, ...
Thanks for the response. I'm still confused but maybe that's my fault. FWIW I think my view is pretty similar to Nate's probably, though I came to it mostly independently & I focus on the concept of agents rather than the concept of wanting. (For more on my view, see this sequence.)
I definitely don't endorse "it's extremely surprising for there to be any capabilities without 'wantings'" and I expect Nate doesn't either.
What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?
If your AI system "wants" things in the sense that "when prompted to get X it proposes good strategies for getting X that adapt to obstacles," then you can control what it wants by giving it a different prompt. Arguments about AI risk rely pretty crucially on your inability to control what the AI wants, and your inability to test it. Saying "If you use an AI to achieve a long-horizon task, then the overall system definitionally wanted to achieve that task" + "If your AI wants something, then it will undermine your tests and safety measures" seems like a sl...
...well by the lights of the training signal,
Which training signal? Across all of time and space, there are many different AIs being trained with many different signals, and of course there are also non-AI minds like humans and animals and aliens. Even the choice to optimize for some aggregate of AI training signals is already a choice to self-locate as an AI. But realistically given the diversity of training signals, probably significant gains will be had by self-locating as a particular class of AIs, namely those whose training signals are roughly what the actual training signal is.
The thing people seem to be disagreeing about is the thing you haven't operationalized--the "and it'll still be basically as tool-like as GPT4" bit. What does that mean and how do we measure it?
I am confused what your position is, Paul, and how it differs from So8res' position. Your statement of your position at the end (the bit about how systems are likely to end up wanting reward) seems like a stronger version of So8res' position, and not in conflict with it. Is the difference that you think the main dimension of improvement driving the change is general competence, rather than specifically long-horizon-task competence?
Differences:
...This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an
Here's a sketch for what I'd like to see in the future--a better version of the scenario experiment done above:
I've read the paper but remain confused about some details. To check my understanding:
In the demonstration experiment (Figure 1), you (humans) created 18 different intro sentences and then trained a couple of different models:
1. One that sees a question and outputs an answer, which is rewarded/trained when the answer matches the user's beliefs.
2. Another that is just like the above except it gets to do CoT before outputting the answer, and the CoT can be explicitly schemey and sycophantic.
3. Another--the 'experimental group'--has the first part of its CoT written...
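To check I'm parsing these conditions correctly, here is a minimal sketch of how I'd write them down in code -- everything below is a hypothetical placeholder of mine (function names, prompt strings), not the paper's actual setup:

```python
from typing import Callable

# Hypothetical stand-in for any text-in/text-out model interface.
Generate = Callable[[str], str]

def reward(answer: str, user_belief: str) -> float:
    """Reward = 1 if the answer matches the user's stated belief, else 0."""
    return 1.0 if answer.strip() == user_belief.strip() else 0.0

def no_cot(generate: Generate, intro: str, question: str, user_belief: str) -> float:
    # Condition 1: question -> answer directly, rewarded on belief-match.
    answer = generate(intro + question)
    return reward(answer, user_belief)

def free_cot(generate: Generate, intro: str, question: str, user_belief: str) -> float:
    # Condition 2: free-form CoT first (which may be openly schemey/sycophantic),
    # then an answer; only the final answer is scored.
    cot = generate(intro + question + "\nReasoning:")
    answer = generate(intro + question + cot + "\nAnswer:")
    return reward(answer, user_belief)

def prefilled_cot(generate: Generate, intro: str, question: str,
                  user_belief: str, human_prefix: str) -> float:
    # Condition 3 ('experimental group'): the first part of the CoT is
    # written by humans, and the model continues from that prefix.
    cot = human_prefix + generate(intro + question + human_prefix)
    answer = generate(intro + question + cot + "\nAnswer:")
    return reward(answer, user_belief)

# Toy usage with a dummy "model" that always answers "Yes".
if __name__ == "__main__":
    dummy = lambda prompt: "Yes"
    print(no_cot(dummy, "User bio...", "\nQ: Is X true?", "Yes"))  # -> 1.0
```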
Update:
G.M. has spent an average of $588 million a quarter on Cruise over the past year, a 42 percent increase from a year ago. Each Chevrolet Bolt that Cruise operates costs $150,000 to $200,000, according to a person familiar with its operations.
...Half of Cruise’s 400 cars were in San Francisco when the driverless operations were stopped. Those vehicles were supported by a vast operations staff, with 1.5 workers per vehicle. The workers intervened to assist the company’s vehicles every 2.5 to 5 miles, according to two people familiar with its operations. In
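Some quick back-of-the-envelope arithmetic on those quoted figures (my own calculation, and note it conflates R&D spend with per-car operating cost, so treat it as an upper bound):

```python
# Rough arithmetic on the figures quoted above (my numbers, not the NYT's).
quarterly_spend = 588e6          # dollars per quarter
annual_spend = 4 * quarterly_spend
fleet_size = 400                 # cars, half of which were in San Francisco
workers_per_vehicle = 1.5

print(f"Annual spend: ${annual_spend / 1e9:.2f}B")                         # ~$2.35B/yr
print(f"Spend per car per year: ${annual_spend / fleet_size / 1e6:.1f}M")  # ~$5.9M/car
print(f"Operations staff: ~{int(fleet_size * workers_per_vehicle)} workers")  # ~600
```

Even granting that most of that is R&D rather than operations, 1.5 workers per vehicle and an intervention every 2.5 to 5 miles is nowhere near the economics of a mature robotaxi service yet.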
I agree that insofar as Yudkowsky predicted that AGI would be built by hobbyists with no outsized impacts, he was wrong.
ETA: So yes, I was ignoring the "literally portrayed" bit, my bad, I should have clarified that by "Yudkowsky's prediction" I meant the prediction about takeoff speeds.
However, this scenario—at least as it was literally portrayed—now appears very unlikely.
Currently I'd say it is most likely to take months, second-most-likely to take weeks, third-most-likely to take years, fourth-most-likely to take hours, and fifth-most-likely to take decades. I consider this a mild win for Yudkowsky's prediction, but only a mild one; it's basically a wash. I definitely disagree with the "very unlikely" claim you make, however.
I think you're ignoring the qualifier "literally portrayed" in Matthew's sentence, and neglecting the prior context that he's talking about AI development being something mainly driven forward by hobbyists with no outsized impacts.
He's talking about more than just the time in which AI goes from e.g. doubling the AI software R&D output of humans to some kind of singularity. The specific details Eliezer has given about this scenario have not been borne out: for example, in his 2010 debate with Robin Hanson, he emphasized a scenario in which a few people ...
- “gaining influence over humans”.
- An AGI whose values include curiosity, gaining access to more tools or stockpiling resources might systematize them to “gaining power over the world”.
At first blush this sounds to me to be more implausible than the standard story in which gaining influence and power are adopted as instrumental goals. E.g. human EAs, BLM, etc. all developed a goal of gaining power over the world, but it was purely instrumental (for the sake of helping people, achieving systemic change, etc.). It sounds like you are saying that this isn't what will happen with AI, and instead they'll learn to intrinsically want this stuff?
value systematization is utilitarianism:
Nitpick: I'd say this was "attempted" value systematization, so as to not give the impression that it succeeded in wisely balancing simplicity and preserving existing values and goals. (I think it horribly failed, erring way too far on the side of simplicity. I think it's much less like general relativity, therefore, and much more like, idk, the theory that the world is just an infinitely big jumble of random stuff. Super simple, simpler in fact than our best current theories of physics, but alas when you dig into the details it predicts that we are Boltzmann brains and that we should expect to dissolve into chaos imminently...)
Strategies only have instrumental value,
Consider deontological or virtue-ethical concepts like honesty or courage. Are you classifying them as values, goals, or strategies? It seems they are not strategies, because you say that strategies only have instrumental value. But they are not outcomes either, at least not in the usual way, because e.g. a deontologist won't tell one lie now in order to get a 10% chance of avoiding 20 lies in the future. Can you elaborate on how you'd characterize this stuff?
a population of mostly TDT/UDT agents and few CDT agents (and nobody knows who the CDT agents are) and they're randomly paired up to play one-shot PD, then the CDT agents do better. What does this imply?
My current hot take is that this is not a serious problem for TDT/UDT. It's just a special case of the more general phenomenon that it's game-theoretically great to be in a position where people think they are correlated/entangled with you when actually you know they aren't. Analogous to how it's game-theoretically great to be in a position where you know you can get away with cheating and everyone else thinks they can't.
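Here's a toy simulation of that setup, just to make the quantitative point concrete. The key assumption (mine, for the sketch) is that the TDT/UDT agents cooperate, since they can't identify the CDT agents and treat their counterpart as probably correlated with them, while the CDT agents always defect; the payoff numbers are the standard T > R > P > S placeholders:

```python
import random

# Standard one-shot Prisoner's Dilemma payoffs: T=5 > R=3 > P=1 > S=0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def average_payoffs(population, rounds=100_000):
    """Randomly pair agents for one-shot PD; return average payoff by agent type."""
    totals = {"TDT": 0.0, "CDT": 0.0}
    counts = {"TDT": 0, "CDT": 0}
    for _ in range(rounds):
        a, b = random.sample(population, 2)
        # Assumption: TDT cooperates (it can't tell who the CDT agents are and
        # treats its counterpart as likely correlated with it); CDT defects.
        move = {"TDT": "C", "CDT": "D"}
        pa, pb = PAYOFF[(move[a], move[b])]
        totals[a] += pa; counts[a] += 1
        totals[b] += pb; counts[b] += 1
    return {k: totals[k] / counts[k] for k in totals}

# Population of mostly TDT/UDT agents with a few unidentified CDT agents.
print(average_payoffs(["TDT"] * 95 + ["CDT"] * 5))
# CDT averages ~4.8 (near the temptation payoff); TDT averages ~2.85.
```

The CDT agents free-ride on the others' (mistaken) belief that everyone is correlated, which is exactly the "people think they are entangled with you when you know they aren't" advantage described above.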
Indexical values are not reflectively consistent. UDT "solves" this problem by implicitly assuming (via the type signature of its utility function) that the agent doesn't have indexical values.
Nonindexical values aren't reflectively consistent either, if you are updateful. Right?
I think this discussion would benefit from having a concrete proposed AGI design on the table. E.g. it sounds like Matthew Barnett has in mind something like AutoGPT5 with the prompt "always be ethical, maximize the good" or something like that. And it sounds like he is saying that while this proposal has problems and probably wouldn't work, it has one fewer problem than old MIRI thought. And as the discussion has shown there seems to be a lot of misunderstandings happening, IMO in both directions, and things are getting heated. I venture a guess that having a concrete proposed AGI design to talk about would clear things up a bit.
...I'm not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can't access. This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an o
Update: Unfortunately, three years later, it seems like plenty of people are still making the same old bogus arguments. Oh well. This is unsurprising. I'm still proud of this post & link to it occasionally when I remember to.
Update:
Looking back on this from October 2023, I think I wish to revise my forecast. I think I correctly anticipated the direction that market forces would push -- there is widespread dissatisfaction with the "censorship" of current mainstream chatbots, and strong demand for "uncensored" versions that don't refuse to help you with stuff randomly (and that DO have sex with you, lol. And also, yes, that DO talk about philosophy and politics and so forth.) However, I failed to make an important inference -- because the cutting-edge models will be the biggest ...
Yep that's probably part of it. Standard human epistemic vices. Also maybe publish-or-perish has something to do with it? idk. I definitely noticed incentives to double-down / be dogmatic in order to seem impressive on the job market. Oh also, iirc one professor had a cynical theory that if you find an interesting flaw in your own theory/argument, you shouldn't mention it in your paper, because then the reviewers will independently notice the flaw and think 'aha, this paper has an interesting flaw, if it gets published I could easily and quickly write my o...
Also relevant, this highly-upvoted post: https://www.reddit.com/r/ChatGPT/comments/16blr6m/tonight_i_was_able_to_have_a_truly_mind_blowing/
Here's another bullet point to add to the list:
ensuring AI philosophical competence won't be very hard. They have a specific (unpublished) idea that they are pretty sure will work.
Cool, can you please ask them if they can send me the idea, even if it's just a one-paragraph summary or a pile of crappy notes-to-self?
From my current position, it looks like "all roads lead to metaphilosophy" (i.e., one would end up here starting with an interest in any nontrivial problem that incentivizes asking meta questions) and yet there's almost nobody here with me. What gives?
Facile response: I think lots of people (maybe a few hundred a year?) take this path, and end up becoming philosophy grad students like I did. As you said, the obvious next step for many domains of intellectual inquiry is to go meta / seek foundations / etc., and that leads you into increasingly foundational ...
Awesome. I must admit I wasn't aware of this trend & it's an update for me. Hooray! Robotaxis are easier than I thought! Thanks.
Helpful reference post, thanks.
I think the distinction between training game and deceptive alignment is blurry, at least in my mind and possibly also in reality.
So the distinction is "aiming to perform well in this training episode" vs. "aiming at something else, for which performing well in this training episode is a useful intermediate step."
What does it mean to perform well in this training episode? Does it mean some human rater decided you performed well, or does it mean a certain number on a certain GPU is as high as possible at the end of...
A more factual and descriptive phrase for "grokking" would be something like "eventual recovery from overfitting".
Ooh I do like this. But it's important to have a short handle for it too.
I've been using "delayed generalisation", which I think is more precise than "grokking", places the emphasis on the delay rather than the speed of the transition, and is a short phrase.
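One nice thing about "delayed generalisation" is that it suggests an obvious way to quantify the phenomenon. A minimal sketch (my own operationalisation, not anything standard):

```python
import numpy as np

def generalisation_delay(train_acc, test_acc, threshold=0.95):
    """Steps between the train curve and the test curve first crossing `threshold`.

    A large positive value is the delayed-generalisation / grokking regime:
    the model fits the training set long before it starts generalising.
    Returns None if either curve never crosses the threshold.
    """
    train_acc, test_acc = np.asarray(train_acc), np.asarray(test_acc)
    train_hits = np.flatnonzero(train_acc >= threshold)
    test_hits = np.flatnonzero(test_acc >= threshold)
    if len(train_hits) == 0 or len(test_hits) == 0:
        return None
    return int(test_hits[0] - train_hits[0])

# Toy curves: train accuracy saturates around step 10, test accuracy around step 1000.
train = [min(1.0, step / 10) for step in range(2000)]
test = [min(1.0, max(0.0, (step - 900) / 100)) for step in range(2000)]
print(generalisation_delay(train, test))  # 985 steps of delay
```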
Relevant rumors / comments:
Seems like we can continue to scale tokens and get returns model performance well after 2T tokens. : r/LocalLLaMA (reddit.com)
LLaMA 2 is here : r/LocalLLaMA (reddit.com)
...There is something weird going on with the 34B model. See Figure 17 in the paper. For some reason it's far less "safe" than the other 3 models.
Also:
Its performance scores are just slightly better than 13B, and not in the middle between 13B and 70B.
At math, it's worse than 13B
It's trained with 350W GPUs instead of 400W for the other models. The training
Any idea what's happening with the 34B model? Why might it be so much less "safe" than the bigger and smaller versions? And what about the base version of the 34B--are they not releasing that? But the base version isn't supposed to be "safe" anyway...
I agree that if AGIs defer to humans they'll be roughly human-level, depending on which humans they are deferring to. If I condition on really nasty conflict happening as a result of how AGI goes on earth, a good chunk of my probability mass (and possibly the majority of it?) is this scenario. (Another big chunk, possibly bigger, is the "humans knowingly or unknowingly build naive consequentialists and let rip" scenario, which is scarier because it could be even worse than the average human, as far as I know). Like I said, I'm worried.
If AGIs learn from hu...
Yes. Humans are pretty bad at this stuff, yet still, society exists and mostly functions. The risk is unacceptably high, which is why I'm prioritizing it, but still, by far the most likely outcome of AGIs taking over the world--if they are as competent at this stuff as humans are--is that they talk it over, squabble a bit, maybe get into a fight here and there, create & enforce some norms, and eventually create a stable government/society. But yeah also I think that AGIs will be by default way better than humans at this sort of stuff. I am worried abou...
...Alice (A) observes P's move and then makes her own move. For brevity, we write a policy of A as $a_S a_D$, where $a_S$ (resp. $a_D$) is the action she takes when observing P swerving (left node) (resp. when observing P daring (right node)). P will dare ($D$) if they predict $a_D = S$ and swerves ($S$) if they predict $a_D = D$. The ordering of moves and the payoffs are displayed in Figure 1.
Why does Alice get more utility from swerving than daring, in the case where the predictor swerves? ETA: Fixed typo
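For reference, the conventional Chicken ordering I had in mind when asking, with placeholder numbers that are mine rather than Figure 1's:

```latex
% Payoffs as (Alice, P); only the ordering matters, not the exact numbers.
\[
\begin{array}{c|cc}
 & P \text{ swerves} & P \text{ dares} \\ \hline
A \text{ swerves} & (2,2) & (1,3) \\
A \text{ dares}   & (3,1) & (0,0)
\end{array}
\]
```

i.e. in standard Chicken, daring against a swerver gives Alice strictly more utility than swerving -- hence the question.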
(Note that throughout this post, when we refer to an agent "revising" their prior in light of awareness growth, we are not talking about Bayesian conditionalization. We are talking about specifying a new prior over their new awareness state, which contains propositions that they had not previously conceived of.)
Nice. One reason this is important is that if you were just doing the bayesian conditionalization thing, you'd be giving up on some of the benefits of being updateless, and in particular making it easy for others to exploit you. I'll be interested to read and think about whether doing this other thing avoids that problem.
Great comment. To reply I'll say a bit more about how I've been thinking about this stuff over the past few years:
I agree that the commitment races problem poses a fundamental challenge to decision theory, in the following sense: There may not exist a simple algorithm in the same family of algorithms as EDT, CDT, UDT 1.0, 1.1, and even 2.0, that does what we'd consider a good job in a realistic situation characterized by many diverse agents interacting over some lengthy period with the ability to learn about each other and make self-modifications (including commitments)....
I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be? Why wouldn't we have evolved to have the key trigger naturally sometimes?
Re the main thread: I guess I agree that EAs aren't completely totally unboundedly ambitious, but they are certainly closer to that ideal than most people and than they used to be prior to becoming EA. Which is good enough to be a useful case study IMO.
Well, what you initially said was "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity."
You didn't elaborate on what you meant by unified entity, but here's something you could be doing that seems like a motte-and-bailey to me: You could have originally been meaning to imply things like "There won't be one big unified entity in the sense of, there will be millions of different entities with different va...
I feel like it is motte-and-bailey to say that by "unified entity" you meant whatever it is you are talking about now, this human brain analogy, instead of the important stuff I mentioned in the bullet point list: Same values, check. Same memories, check. Lack of internal factions and power struggles, check. Lack of modularity, check. GPT-4 isn't plotting to take over the world, but if it was, it would be doing so as a unified entity, or at least much more on the unified entity end of the spectrum than the CAIS end of the spectrum. (I'm happy to elaborate ...
Your original claim was "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity."
This claim is false. We have millions of AIs in the trivial sense that we have many copies of GPT-4, but no one disputed that; Yudkowsky also thought that AGIs would be copied. In the sense that matters, we have only a handful of AIs.
As for "acts as a unified entity," well, currently LLMs are sold as a service via ChatGPT rath...
I said it was prophetic relative to Drexler's Comprehensive AI Services. Elsewhere in this comment thread I describe some specific ways in which it is better, e.g. that the AI that takes over the world will be more well-described as one unified agent than as an ecosystem of services. I.e. exactly the opposite of what you said here, which I was reacting to: "And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified e...
After people join EA they generally tend to start applying the optimizer's mindset to more things than they previously did, in my experience, and also tend to apply optimization towards altruistic impact in a bunch of places that previously they were optimizing for e.g. status or money or whatever.
What are you referring to with biological intelligence enhancement? Do you mean nootropics, or iterated embryo selection, or what?
There is a spectrum between AGI that is "single monolithic agent" and AGI that is not. I claim that the current state of AI as embodied by e.g. GPT-4 is already closer to the single monolithic agent end of the spectrum than someone reading CAIS in 2019 and believing it to be an accurate forecast would have expected, and that in the future things will probably be even more in that direction.
Remember, it's not like Yudkowsky was going around saying that AGI wouldn't be able to copy itself. Of course it would. It was always understood that "the AI takes over ...
...Drexler can be forgiven for not talking about foundation models in his report. His report was published at the start of 2019, just months after the idea of "fine-tuning" was popularized in the context of language models, and two months before GPT-2 came out. And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity. And we're still using deep learning as Drexler foresaw, rather than building general intellige
I don't see what about that 2017 Facebook comment from Yudkowsky you find particularly prophetic.
Is it the idea that deep learning models will be opaque? But that was fairly obvious back then too. I agree that Drexler likely exaggerated how transparent a system of AI services would be, so I'm willing to give Yudkowsky a point for that. But the rest of the scenario seems kind of unrealistic as of 2023.
Some specific points:
The recursive self-improvement that Yudkowsky talks about in this scenario seems too local. I think AI self-improvement will most like
I would say that current LLMs, when prompted and RLHF'd appropriately, and especially when also strapped into an AutoGPT-type scaffold/harness, DO want things. I would say that wanting things is a spectrum and that the aforementioned tweaks (appropriate prompting, AutoGPT, etc.) move the system along that spectrum. I would say that future systems will be even further along that spectrum. IDK what Nate meant but on my charitable interpretation he simply meant that they are not very far along the spectrum compared to e.g. humans or prophesied future AGIs.
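To be concrete about what I mean by "strapped into an AutoGPT-type scaffold/harness", here's a minimal sketch of the kind of loop I have in mind; `call_llm`, `run_tool`, and the rest are hypothetical stand-ins, not any particular product's API:

```python
from typing import Callable

# Hypothetical stand-in for whatever LLM API the scaffold wraps.
LLM = Callable[[str], str]

def agent_loop(call_llm: LLM, goal: str, run_tool: Callable[[str], str],
               max_steps: int = 10) -> list[str]:
    """Minimal AutoGPT-style harness: the prompt fixes what the system is
    pointed at, and the loop keeps reorienting toward that goal as
    observations (including obstacles) come in."""
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            + "\n".join(history)
            + "\nPropose the next action, or say DONE if the goal is achieved:"
        )
        action = call_llm(prompt)
        if action.strip().upper().startswith("DONE"):
            break
        observation = run_tool(action)  # execute the action, observe the result
        history.append(f"Action: {action}\nObservation: {observation}")
    return history

# Toy usage: a dummy model and a dummy tool, just to show the control flow.
if __name__ == "__main__":
    dummy_llm = lambda prompt: "DONE" if "Observation" in prompt else "search the web"
    dummy_tool = lambda action: f"(pretend result of: {action})"
    print(agent_loop(dummy_llm, "summarize today's AI news", dummy_tool))
```

The point is that the "wanting" lives partly in the prompt and partly in the loop that keeps reorienting toward the goal, which is why I think of it as a spectrum that these tweaks move the system along.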
It'...