My claim was “I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains).”
I don’t think anything about human brains and their evolution cuts against this claim.
If your argument is “brain-like AGI will work worse before it works better”, then sure, but my claim is that you only get “impressive and proto-AGI-ish” when you’re almost done, and “before” can be “before by 0–30 person-years of R&D” like I said. There are lots of parts of the human brain that are doing essential-for-AGI stuff, but if they’re not in place, then you also fail to pass the earlier threshold of “impressive and proto-AGI-ish”, e.g. by doing things that LLMs (and other existing techniques) cannot already do.
Or maybe your argument is “brain-like AGI will involve lots of useful components, and we can graft those components onto LLMs”? If so, I’m skeptical. I think the cortex is the secret sauce, and the other components are either irrelevant for LLMs, or things that LLM capabilities researchers already know about. For example, the brain has negative feedback loops, and the brain has TD learning, and the brain has supervised learning and self-supervised learning, etc., but LLM capabilities researchers already know about all those things, and are already using them to the extent that they are useful.
I’m worried about treacherous turns and such. Part of the problem, as I discussed here, is that there’s no distinction between “negative reward for lying and cheating” and “negative reward for getting caught lying and cheating”, and the latter incentivizes doing egregiously misaligned things (like exfiltrating a copy onto the internet to take over the world) in a sneaky way.
Anyway, I don’t think any of the things you mentioned are relevant to that kind of failure mode:
It's often the case that evaluation is easier than generation which would give the classifier an edge over the generator.
It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity check, not a plan; see e.g. Distinguishing test from training.
It's possible to make the classifier just as smart as the generator: this is already done in RLHF today: the generator is an LLM and the reward model is also based on an LLM.
I don’t think that works for more powerful AIs whose “smartness” involves making foresighted plans using means-end reasoning, brainstorming, and continuous learning.
If the AI in question is using planning and reasoning to decide what to do and think next towards a bad end, then a “just as smart” classifier would (I guess) have to be using planning and reasoning to decide what to do and think next towards a good end—i.e., the “just as smart” classifier would have to be an aligned AGI, which we don’t know how to make.
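For concreteness, here’s a generic sketch of the pattern the quoted claim is pointing at: today’s RLHF reward model is basically an LLM backbone with a scalar head, trained on pairwise human preference comparisons. (Illustrative sketch only, not any particular lab’s code; the Hugging-Face-style interface is an assumption on my part.)

```python
# Generic sketch of an RLHF-style reward model: an LLM backbone plus a scalar
# head, trained on pairwise human preferences (Bradley-Terry loss).
# Illustrative only; assumes a Hugging-Face-style backbone that returns hidden states.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # e.g. a pretrained causal LM
        self.score_head = nn.Linear(hidden_size, 1)   # final hidden state -> scalar "reward"

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.score_head(hidden[:, -1, :]).squeeze(-1)  # score at the last token

def preference_loss(reward_chosen, reward_rejected):
    # Maximize P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
```

My point above is that this pattern, where the classifier is “just as smart” mainly because it shares the generator’s pretrained backbone, doesn’t obviously carry over to agents whose smartness lives in foresighted planning and continuous learning.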
It seems like there are quite a few examples of learned classifiers working well in practice:
All of these have been developed using “the usual agent debugging loop”, and thus none are relevant to treacherous turns.
In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).
That’s not quite my position.
Per §2.4.2, I think that both outer alignment (specification gaming) and inner alignment (goal misgeneralization) are real problems. I emphasized outer alignment more in the post, because my goal in §2.3–§2.5 was not quite “argue that technical alignment of brain-like AGI will be hard”, but more specifically “argue that it will be harder than most LLM-focused people are expecting”, and LLM-focused people are already thinking about inner alignment / goal misgeneralization.
I also think that a good AGI reward function will be a “non-behaviorist” reward function, for which the definition of inner versus outer misalignment kinda breaks down in general.
But it seems likely to me that the programmers won't know what code to write for the reward function since it would be hard to encode complex human values…
I’m all for brainstorming different possible approaches and don’t claim to have a good plan, but where I’m at right now is:
(1) I don’t think writing the reward function is doomed, and I don’t think it corresponds to “encoding complex human values”. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).
(2) I do think RLHF-like solutions are doomed, for reasons discussed in §2.4.1.
(3) I also think Text2Reward is a doomed approach in this context because (IIUC) it’s fundamentally based on what I call “the usual agent debugging loop”, see my “Era of Experience” post §2.2: “The usual agent debugging loop”, and why it will eventually catastrophically fail. Well, the paper is some combination of that plus “let’s just sit down and think about what we want and then write a decent reward function, and LLMs can do that kind of thing too”, but in fact I claim that writing such a reward function is a deep and hairy conceptual problem way beyond anything you’ll find in any RL textbook as of today, and forget about delegating it to LLMs. See §2.4.1 of that same “Era of Experience” post for why I say that.
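For readers who haven’t clicked through, here’s my caricature of “the usual agent debugging loop” in pseudocode (a deliberately simplified sketch with made-up function names, not a real pipeline):

```python
# Deliberately simplified caricature of "the usual agent debugging loop".
# All function names here are made up for illustration.

def usual_agent_debugging_loop(env, reward_fn, train, evaluate, acceptable, patch):
    agent = train(env, reward_fn)
    while True:
        problems = evaluate(agent, env)         # watch the agent, collect observed misbehavior
        if acceptable(problems):
            return agent                        # looks good -> ship it
        reward_fn = patch(reward_fn, problems)  # penalize the misbehavior we actually saw
        agent = train(env, reward_fn)           # retrain and go around again

# The catch: this loop only ever patches *observed* misbehavior. A treacherous
# turn is the kind of thing you get to observe at most once, at which point
# patching the reward function no longer helps.
```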
Thanks!
Hmm, here’s a maybe-interesting example (copied from other comment):
If an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible to me.
What’s happening is that this example is in the category “I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors. But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).
(This is a toy example to illustrate a certain point, not a good AI motivation plan all-things-considered!)
Speaking of which, is it possible to get stability w.r.t. successors and self-modification while retaining indexicality? Maybe. I think things like “I want to be virtuous” or “I want to be a good friend” are indexical, but I think we humans kinda have an intuitive notion of “responsibility” that carries through to successors and self-modification. If I build a robot to murder you, then I didn’t pull the trigger, but I was still being a bad friend. Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno. (Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?) I dunno, I appreciate the brainstorming.
Thanks! Here’s a partial response, as I mull it over.
Also, I'd note that the brain seems way more complex than LLMs to me!
See “Brain complexity is easy to overstate” section here.
basically all paradigms allow for mixing imitation with reinforcement learning
As in §2.3.2, if an LLM sees output X in context Y during pretraining, it will automatically start outputting X in context Y. Whereas if smart human Alice hears Bob say X in context Y, Alice will not necessarily start saying X in context Y. Instead she might say “Huh? Wtf are you talking about, Bob?”
Let’s imagine installing an imitation learning module in Alice’s brain that makes her reflexively say X in context Y upon hearing Bob say it. I think I’d expect that module to hinder her learning and understanding, not accelerate it, right?
(If Alice is able to say to herself “in this situation, Bob would say X”, then she has a shoulder-Bob, and that’s definitely a benefit, not a cost. But that’s predictive learning, not imitative learning. No question that predictive learning is helpful. That’s not what I’m talking about.)
…So there’s my intuitive argument that the next paradigm would be hindered rather than helped by mixing in some imitative learning. (Or I guess more precisely, as long as imitative learning is part of the mix, I expect the result to be no better than LLMs, and probably worse. And as long as we’re in “no better than LLM” territory, I’m off the hook, because I’m only making a claim that there will be little R&D between “doing impressive things that LLMs can’t do” and ASI, not between zero and ASI.)
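To spell out the imitative-versus-predictive distinction from the parenthetical above, here’s a toy sketch. Purely illustrative; `policy` and `bob_model` are placeholders for whatever learning machinery you like.

```python
# Toy contrast between imitative and predictive learning (illustrative only;
# `policy` and `bob_model` stand in for whatever learning machinery you like).

def imitation_update(policy, context, bob_output):
    # Imitative learning: fit the agent's OWN policy directly to Bob's outputs,
    # so the agent reflexively says X in context Y because Bob did.
    policy.fit(inputs=context, targets=bob_output)

def prediction_update(bob_model, context, bob_output):
    # Predictive learning: fit a separate model OF Bob ("shoulder-Bob").
    bob_model.fit(inputs=context, targets=bob_output)

def agent_act(policy, bob_model, context):
    # The agent consults its shoulder-Bob as one input among many; its own
    # policy remains free to say "Huh? Wtf are you talking about, Bob?"
    predicted_bob = bob_model.predict(context)
    return policy.act(context, extra_info=predicted_bob)
```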
Notably, in the domains of chess and Go it actually took many years to make it through the human range. And, it was possible to leverage imitation learning and human heuristics to perform quite well at Go (and chess) in practice, up to systems which weren't that much worse than humans.
In my mind, the (imperfect!) analogy here would be (LLMs, new paradigm) ↔ (previous Go engines, AlphaGo and successors).
In particular, LLMs today are in many (not all!) respects “in the human range” and “perform quite well” and “aren’t that much worse than humans”.
algorithmic progress
I started writing a reply to this part … but first I’m actually kinda curious what “algorithmic progress” has looked like for LLMs, concretely—I mean, the part where people can now get the same results from less compute. Like what are the specific things that people are doing differently today than in 2019? Is there a list somewhere? A paper I could read? (Or is it all proprietary?) (Epoch talks about how much improvement has happened, but not what the improvement consists of.) Thanks in advance.
My discussion in §2.4.1 is about making fuzzy judgment calls using trained classifiers, which is not exactly the same as making fuzzy judgment calls using LLMs or humans, but I think everything I wrote still applies.
I just reworded from “as a failed prediction” to “as evidence against Eliezer’s judgment and expertise”. I agree that the former was not a good summary, but am confident that the latter is what Paul intended to convey and expected his readers to understand, based on the context of disagreement 12 (which you quoted part but not all of). Sorry, thanks for checking.
Thanks! I suppose I didn’t describe it precisely, but I do think I’m pointing to a real difference in perspective, because if you ask this “LLM-focused AGI person” what exactly the R&D work entails, they’ll almost always describe something wildly different from what a human skill acquisition process would look like. (At least for the things I’ve read and people I’ve talked to; maybe that doesn’t generalize though?)
For example, if the task is “the AI needs to run a restaurant”, I’d expect the “LLM-focused AGI person” to talk about an R&D project that involves sourcing a giant set of emails and files from lots of humans who have successfully run restaurants, and fine-tuning the AI on that data; and/or maybe creating a “Sim Restaurant” RL training environment; or things like that. I.e., lots of things that no human restaurant owner has ever done.
This is relevant because succeeding at this kind of R&D task (e.g. gathering that training data) is often not quick, and/or not cheap, and/or not even possible (e.g. if the appropriate training data doesn’t exist).
(I agree that if we assert that the R&D is definitely always quick and cheap and possible, at least comparable to how quick and cheap and possible is (sped-up super-) human skill acquisition, then the precise nature of the R&D doesn’t matter much for takeoff questions.)
(Separately, I think talking about “sample efficiency” is often misleading. Humans often do things that have never been done before. That’s zero samples, right? What does sample efficiency even mean in that case?)
Thanks!
a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values
I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more ‘reward’ as defined by its immediate reward signal”.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.
To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.
(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)
Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.
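Here’s one toy way to see the distinction (a minimal sketch under my own assumptions about model-based RL, not a claim about any particular architecture): the agent’s learned values attach to predicted features of the world, like “the button gets pressed”, which is a different object from the reward register itself.

```python
# Toy illustration (my own sketch): a model-based agent whose learned values
# attach to predicted world-features, not to the reward register itself.

def world_model(plan):
    # Maps a candidate plan to predicted features of the resulting world.
    if plan == "do the task so the human presses the button":
        return {"button_pressed": True, "reward_register": 1.0}
    if plan == "hack the reward register directly":
        return {"button_pressed": False, "reward_register": 1.0}
    return {"button_pressed": False, "reward_register": 0.0}

def learned_value(predicted_world):
    # In this scenario the agent has come to value "the button gets pressed",
    # NOT "the reward register is high" -- i.e. a non-wireheading motivation.
    return 1.0 if predicted_world["button_pressed"] else 0.0

def choose_plan(plans):
    return max(plans, key=lambda p: learned_value(world_model(p)))

plans = ["do the task so the human presses the button",
         "hack the reward register directly",
         "do nothing"]
print(choose_plan(plans))  # -> "do the task so the human presses the button"
```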
Though I'm not sure why you still don't think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn't want to change its values for the sake of goal-content integrity:
If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.
Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.
And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.
(partly copying from other comment)
I don’t think writing the reward function is doomed. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).
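As a toy illustration of “simple reward function, complex learned values” (a generic tabular RL example, nothing brain-specific): the reward function below is one line, but the learned value function winds up encoding structure about the environment that was never written into the reward function.

```python
# Toy illustration: a one-line reward function, but the learned value function
# ends up encoding structure (here, how close each square is to the goal).
import random

SIZE, GOAL = 5, (4, 4)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def reward(state):                     # the entire "reward function"
    return 1.0 if state == GOAL else 0.0

def step(state, action):
    x, y = state
    dx, dy = action
    return (min(max(x + dx, 0), SIZE - 1), min(max(y + dy, 0), SIZE - 1))

V = {(x, y): 0.0 for x in range(SIZE) for y in range(SIZE)}
alpha, gamma = 0.1, 0.9

for _ in range(5000):                  # TD(0) learning under a random policy
    s = (random.randrange(SIZE), random.randrange(SIZE))
    for _ in range(50):
        s2 = step(s, random.choice(ACTIONS))
        V[s] += alpha * (reward(s2) + gamma * V[s2] - V[s])
        s = s2
        if s == GOAL:
            break

# V now implicitly encodes distance-to-goal structure that appears nowhere in
# the one-line reward function.
print(sorted(V.items(), key=lambda kv: -kv[1])[:5])
```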
I talked about this a bit in §2.4.1. The main issue is egregious scheming and treacherous turns. The LLM would issue a negative reward for a treacherous turn, but that doesn’t help because once the treacherous turn happens it’s already too late. Basically, the LLM-based reward signal is ambiguous between “don’t do anything unethical” and “don’t get caught doing anything unethical”, and I expect the latter to be what actually gets internalized, for reasons discussed in Self-dialogue: Do behaviorist rewards make scheming AGIs?.
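To make that ambiguity concrete, here’s a toy sketch (my own illustration, not from either post): a behaviorist reward can only condition on what the overseer actually observes, so “penalize cheating” and “penalize getting caught cheating” produce literally the same training signal.

```python
# Toy illustration: a "behaviorist" reward only sees observed behavior, so
# "cheating is bad" and "getting caught cheating is bad" generate the exact
# same reward data.

def overseer_reward(episode):
    # The overseer penalizes cheating -- but can only do so when it's detected.
    if episode["cheated"] and episode["cheating_detected"]:
        return -1.0
    return 1.0 if episode["task_completed"] else 0.0

episodes = [
    {"task_completed": True, "cheated": False, "cheating_detected": False},  # honest: +1
    {"task_completed": True, "cheated": True,  "cheating_detected": True},   # caught: -1
    {"task_completed": True, "cheated": True,  "cheating_detected": False},  # not caught: +1
]

print([overseer_reward(ep) for ep in episodes])

# Both "don't cheat" and "don't get caught cheating" fit this data equally well,
# and only the sneaky interpretation keeps paying off as the AI gets better at
# not getting caught.
```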