Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See for a summary of my research and sorted list of writing. Physicist by training. Email: Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn


Intro to Brain-Like-AGI Safety



Yeah when I say things like “I expect LLMs to plateau before TAI”, I tend not to say it with the supremely high confidence and swagger that you’d hear from e.g. Yann LeCun, François Chollet, Gary Marcus, Dileep George, etc. I’d be more likely to say “I expect LLMs to plateau before TAI … but, well, who knows, I guess. Shrug.” (The last paragraph of this comment is me bringing up a scenario with a vaguely similar flavor to the thing you’re pointing at.)

I feel like “Will LLMs scale to AGI?” is right up there with “Should there be government regulation of large ML training runs?” as a black-hole-like attractor state that sucks up way too many conversations. :) I want to fight against that: this post is not about the question of whether or not LLMs will scale to AGI.

Rather, this post is conditioned on the scenario where future AGI will be an algorithm that (1) does not involve LLMs, and (2) will be invented by human AI researchers, as opposed to being invented by future LLMs (whether scaffolded, multi-modal, etc. or not). This is a scenario that I want to talk about; and if you assign an extremely low credence to that scenario, then whatever, we can agree to disagree. (If you want to argue about what credence is appropriate, you can try responding to me here or links therein, but note that I probably won’t engage, it’s generally not a topic I like to talk about for “infohazard” reasons [see footnote here if anyone reading this doesn’t know what that means].)

I find that a lot of alignment researchers don’t treat this scenario as their modal expectation, but still assign it like >10% credence, which is high enough that we should be able to agree that thinking through that scenario is a good use of time.

Yeah we already know that LLM training finds underlying patterns that are helpful for explaining / compressing / predicting the training data. Like “the vibe of Victorian poetry”. I’m not sure what you mean by “none of which are present in the training data”. Is the vibe of Victorian poetry present in the training data? I would have said “yeah” but I’m not sure what you have in mind.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling-in some module for merely predicting f(blah).

I kinda disagree that this is coincidental. My mental image is something like

  1. The earliest layers see inputs of the form f(…)
  2. Slightly later layers get into an activation state that we might describe as “the idea of the function x-176”
  3. The rest of the layers make inferences and emit outputs appropriate to that idea.

I’m claiming that before fine-tuning, everything is already in place except for the 1→2 connection. Fine-tuning just builds the 1→2 connection.

The thing you mention—that an LLM with the idea of x-176 in mind can output the tokens “x-176”—is part of step 3, and therefore (I hypothesize) comes entirely from LLM pretraining, not from this fine-tuning process.

The fact that pretraining can and does build that aspect of step 3 seems pretty much expected to me, not coincidental, as such a connection is obviously useful for predicting GitHub code and math homework and a zillion other things in the training data. It’s also something you can readily figure out by playing with an LLM: if you say “if I subtract 6 from x-170, what do I get?”, then it’s obviously able to output the tokens “x-176”.

Here’s how I’m thinking of this result right now. Recall that we start with a normal LLM, and then 32,000 times (or whatever) we gradient-update it such that its f(blah) = blah predictions are better.

The starting LLM has (in effect) a ton of information-rich latent variables / representations that comprise the things that the LLM “understands” and can talk about. For example, obviously the concept of “x-176” is a thing that the LLM can work with, answer questions about, and so on. So there has to be something in the guts of the LLM that’s able to somehow represent that concept.

Anyway “f(blah)” doesn’t start out triggering any of these latent representations in particular. (Or, well, it triggers the ones related to how f(blah) is used in internet text, e.g. as a generic math function.) But it does presumably trigger all of them to some tiny random extent. And then each of the 32,000 gradient descent update steps will strengthen the connection between “f(blah)” and the particular “concept” / latent representation / whatever of “x-176”. …Until eventually the fine-tuned LLM is strongly invoking this preexisting “concept” / representation / activation-state / whatever of “x-176”, whenever it sees “f”. And then yeah of course it can answer questions about “f” and so on—we already know that LLMs can do those kinds of things if you activate that same “x-176” concept the old-fashioned way (by writing it in the context window).
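
Here is a minimal toy sketch of that picture, under loud assumptions: the three "concepts", the softmax gate, the learning rate, and the step count are all hypothetical stand-ins I made up for illustration, and a real LLM is of course nothing like a three-way gate. The frozen `CONCEPTS` table plays the role of what pretraining already built (each concept can be both computed and named). Fine-tuning only touches the logits that route the new token "f" to a concept, and the loss only ever scores numeric predictions.

```python
import math
import random

# Frozen "pretrained" concepts (hypothetical): the base model can already
# compute each one AND name it. Fine-tuning never modifies this table.
CONCEPTS = {
    "x+3": lambda x: x + 3,
    "x-176": lambda x: x - 176,
    "2*x": lambda x: 2 * x,
}
NAMES = list(CONCEPTS)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def finetune(target_fn, steps=200, lr=0.5, seed=0):
    """Learn only the routing from the token "f" to the frozen concepts."""
    rng = random.Random(seed)
    # "f" starts out triggering every concept to a tiny random extent.
    logits = [rng.gauss(0.0, 0.01) for _ in NAMES]
    for _ in range(steps):
        x = rng.randrange(0, 100)
        y = target_fn(x)
        weights = softmax(logits)
        # Which concepts would have predicted this training example correctly?
        match = [1.0 if CONCEPTS[n](x) == y else 0.0 for n in NAMES]
        p = sum(w * m for w, m in zip(weights, match))  # P(correct prediction)
        # Gradient of the loss -log(p) with respect to logit j.
        for j in range(len(logits)):
            grad = weights[j] * (p - match[j]) / p
            logits[j] -= lr * grad
    return logits


def describe(logits):
    """Name the dominant concept. This ability was never a training target."""
    weights = softmax(logits)
    return NAMES[weights.index(max(weights))]


logits = finetune(lambda x: x - 176)
weights = dict(zip(NAMES, softmax(logits)))
print(describe(logits), weights)
```

In this sketch, the ability to say what "f" is comes for free once the routing is learned, because naming was already attached to the concept before fine-tuning started. That is the sense in which I would not call it coincidental.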

(To be clear, I don’t think the variable name “f” is an important ingredient here; in fact, I didn’t understand the discussion of why that would ever be expected in the first place. For example, in the mixture-of-functions case, the LLM would be gradually getting tweaked such that an input of the type “User: [number]” activates such-and-such concept in the guts of the LLM. Or in fact, maybe in that case the LLM is getting tweaked such that any input at all activates such-and-such concept in the guts of the LLM!)

(Also, to be clear, I don’t think it’s necessary or important that the LLM has already seen the specific function “x-176” during pretraining. Whether it has seen that or not, the fact remains that I can log in and ask GPT-4 to talk about “x-176” right now, and it can easily do so. So, like I said, there has to be something in the guts of the LLM that’s able to somehow represent that “concept”, and whatever that thing is, fine-tuning gradient descent will eventually tweak the weights such that that thing gets triggered by the input “f(blah)”.) (Indeed, it should be super obvious and uncontroversial to say that LLMs can somehow represent and manipulate “concepts” that were nowhere in the training data—e.g. “Gandalf as a Martian pirate”.)

Anyway, in my mental model spelled out above, I now think your results are unsurprising, and I also think your use of the word “reasoning” seems a bit dubious, unless we’re gonna say that LLMs are “reasoning” every time they output a token, which I personally probably wouldn’t say, but whatever. (I also have long complained about the term “in-context learning” so maybe I’m just a stick-in-the-mud on these kinds of things.)

[not really my area of expertise, sorry if I said anything stupid.]

I guess I’m concerned that there’s some kind of “conservation law for wisdom / folly / scout mindset” in the age of instruction-following AI. If people don’t already have wisdom / scout mindset, I’m concerned that “Instruct the AGI to tell you the truth” won’t create it.

For example, if you ask the AI a question for which there’s no cheap and immediate ground truth / consequences (“Which politician should I elect?”, “Will this alignment approach scale to superintelligence?”), then the AI can say what the person wants to hear, or the AI can say what’s true.

Likewise, if there’s something worth doing that might violate conventional wisdom and make you look foolish, and you ask the AI for a recommendation, the AI can recommend the easy thing that the person wants to hear, or the AI can recommend the hard annoying thing that the person doesn’t want to hear.

If people are not really deeply motivated to hear things that they don’t want to hear, I’m skeptical that instruction-following AI can change that. Here are three ways for things to go wrong:

  • During training (e.g. RLHF), presumably people will upvote the AIs for providing answers that they want to hear, even if they ask for the truth, resulting in AIs that behave that way;
  • During usage, people could just decide that they don’t trust the AI on thus-and-such type of question. I’m sure they could easily come up with a rationalization! E.g. “well it’s perfectly normal and expected for AIs to be very smart at questions for which there’s a cheap and immediate ground truth, while being lousy at questions for which there isn’t! Like, how would it even learn the latter during training? And as for ‘should’ questions involving tradeoffs, why would we even trust it on that anyway?” The AIs won’t be omniscient anyway; mistrusting them in certain matters wouldn’t be crazy.
  • In a competitive marketplace, if one company provides an AI that tells people what they want to hear in cases where there are no immediate consequences, and another company provides an AI that tells people hard truths, people may pick the former.

(To be clear, if an AI is saying things that the person wants to hear in certain cases, the AI will still say that it’s telling the truth, and in fact the AI will probably even believe that it’s telling the truth! …assuming it’s a type of AI that has “beliefs”.)

(I think certain things like debate or training-on-prediction markets might help a bit with the first bullet point, and are well worth investigating for that purpose; but they wouldn’t help with the other two bullet points.)

So anyway, my background belief here is that defending the world against out-of-control AGIs will require drastic, unpleasant, and norm-violating actions. So then the two options to get there would be: (1) people with a lot of scout mindset / wisdom etc. are the ones developing and using instruction-following AGIs, and they take those actions; or (2) make non-instruction-following AGIs, and those AGIs themselves are the ones taking those actions without asking any human’s permission. E.g. “pivotal acts” would be (1), whereas AGIs that deeply care about humans and the future would be (2). I think I’m more into (2) than you are, both because I’m (even) more skeptical about (1) than you are, and because I’m less skeptical about (2) than you are. But it’s hard to say; I have a lot of uncertainty. (We’ve talked about this before.)

Anyway, I guess I think it’s worth doing technical research towards both instruction-following-AI and AI-with-good-values in parallel.

Regardless, thanks for writing this.

My complaint about “transformative AI” is that (IIUC) its original and universal definition is not about what the algorithm can do but rather how it impacts the world, which is a different topic. For example, the very same algorithm might be TAI if it costs $1/hour but not TAI if it costs $1B/hour, or TAI if it runs at a certain speed but not TAI if it runs many OOM slower, or “not TAI because it’s illegal”. Also, two people can agree about what an algorithm can do but disagree about what its consequences would be on the world, e.g. here’s a blog post claiming that if we have cheap AIs that can do literally everything that a human can do, the result would be “a pluralistic and competitive economy that’s not too different from the one we have now”, which I view as patently absurd.

Anyway, “how an AI algorithm impacts the world” is obviously an important thing to talk about, but “what an AI algorithm can do” is also an important topic, and different, and that’s what I’m asking about, and “TAI” doesn’t seem to fit it as terminology.

I’m talking about the AI’s ability to learn / figure out a new system / idea / domain on the fly. It’s hard to point to a particular “task” that specifically tests this ability (in the way that people normally use the term “task”), because for any possible task, maybe the AI happens to already know how to do it.

You could filter the training data, but doing that in practice might be kinda tricky because “the AI already knows how to do X” is distinct from “the AI has already seen examples of X in the training data”. LLMs “already know how to do” lots of things that are not superficially in the training data, just as humans “already know how to do” lots of things that are superficially unlike anything they’ve seen before—e.g. I can ask a random human to imagine a purple colander falling out of an airplane and answer simple questions about it, and they’ll do it skillfully and instantaneously. That’s the inference algorithm, not the learning algorithm.

Well, getting an AI to invent a new scientific field would work as such a task, because it’s not in the training data by definition. But that’s such a high bar as to be unhelpful in practice. Maybe tasks that we think of as more suited to RL, like low-level robot control, or skillfully playing games that aren’t like anything in the training data?

Separately, I think there are lots of domains where “just generate synthetic data” is not a thing you can do. If an AI doesn’t fully ‘understand’ the physics concept of “superradiance” based on all existing human writing, how would it generate synthetic data to get better? If an AI is making errors in its analysis of the tax code, how would it generate synthetic data to get better? (If you or anyone has a good answer to those questions, maybe you shouldn’t publish them!! :-P )

Well I’m one of the people who says that “AGI” is the scary thing that doesn’t exist yet (e.g. FAQ or “why I want to move the goalposts on ‘AGI’”). I don’t think “AGI” is a perfect term for the scary thing that doesn’t exist yet, but my current take is that “AGI” is a less bad term compared to alternatives. (I was listing out some other options here.) In particular, I don’t think there’s any terminological option that is sufficiently widely-understood and unambiguous that I wouldn’t need to include a footnote or link explaining exactly what I mean. And if I’m going to do that anyway, doing that with “AGI” seems OK. But I’m open-minded to discussing other options if you (or anyone) have any.

Generative pre-training is AGI technology: it creates a model with mediocre competence at basically everything.

I disagree with that—as in “why I want to move the goalposts on ‘AGI’”, I think there’s an especially important category of capability that entails spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time. Mathematicians do this with abstruse mathematical objects, but also trainee accountants do this with spreadsheets, and trainee car mechanics do this with car engines and pliers, and kids do this with toys, and gymnasts do this with their own bodies, etc. I propose that LLMs cannot do things in this category at human level, as of today—e.g. AutoGPT basically doesn’t work, last I heard. And this category of capability isn’t just a random cherrypicked task, but rather central to human capabilities, I claim. (See Section 3.1 here.)

(Disclaimer: Nothing in this comment is meant to disagree with “I just think it's not plausible that we just keep scaling up [LLM] networks, run pretraining + light RLHF, and then produce a schemer.” I’m agnostic about that, maybe leaning towards agreement, although that’s related to skepticism about the capabilities that would result.)

It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal."

I agree that Bostrom was confused about RL. But I also think there are some vaguely-similar claims to the above that are sound, in particular:

  • RL approaches may involve inference-time planning / search / lookahead, and if they do, then that inference-time planning process can generally be described as “seeking to maximize a learned value function / reward model / whatever” (which need not be identical to the reward signal in the RL setup).
  • And if we compare Bostrom’s incorrect “seeking to maximize the actual reward signal” to the better “seeking at inference time to maximize a learned value function / reward model / whatever to the best of its current understanding”, then…
  • RL approaches historically have typically involved the programmer wanting to get a maximally high reward signal, and creating a training setup such that the resulting trained model does stuff that gets as high a reward signal as possible. And this continues to be a very important lens for understanding why RL algorithms work the way they work. Like, if I were teaching an RL class, and needed to explain the formulas for TD learning or PPO or whatever, I think I would struggle to explain the formulas without saying something like “let’s pretend that you the programmer are interested in producing trained models that score maximally highly according to the reward function. How would you update the model parameters in such-and-such situation…?” Right?
  • Related to the previous bullet, I think many RL approaches have a notion of “global optimum” and “training to convergence” (e.g. given infinite time in a finite episodic environment). And if a model is “trained to convergence”, then it will behaviorally “seek to maximize a reward signal”. I think that’s important to have in mind, although it might or might not be relevant in practice.
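
To make the last two bullets concrete, here is a minimal sketch of tabular Q-learning (one TD-style method, picked by me for brevity) on a made-up five-state corridor. The environment, rewards, discount, and hyperparameters are all hypothetical. The update line is the kind of formula that is hard to motivate without the "pretend you want maximal reward" framing, and because the environment is finite and episodic, training runs essentially to convergence.

```python
import random

N_STATES = 5                          # states 0..4 in a line; 0 and 4 are terminal
TERMINAL_REWARD = {0: 1.0, 4: 10.0}   # small prize on the left, big prize on the right
ACTIONS = (-1, +1)                    # step left or step right
GAMMA = 0.9                           # discount factor


def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    nxt = state + action
    reward = TERMINAL_REWARD.get(nxt, 0.0)
    done = nxt in TERMINAL_REWARD
    return nxt, reward, done


def q_learning(episodes=2000, alpha=0.5, seed=0):
    """Off-policy TD control with a uniformly random behavior policy."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state, done = 2, False        # every episode starts in the middle
        while not done:
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # TD target: immediate reward plus discounted best next value.
            bootstrap = 0.0 if done else max(q[nxt].values())
            td_error = reward + GAMMA * bootstrap - q[state][action]
            q[state][action] += alpha * td_error
            state = nxt
    return q


q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: q[s][a]) for s in (1, 2, 3)}
print(greedy)
print({s: round(q[s][+1], 3) for s in (1, 2, 3)})
```

The learned values for stepping right approach 8.1, 9, and 10 in states 1, 2, and 3, so the greedy policy walks right even from state 1, where an immediate reward of 1 is one step to the left. In that behavioral sense the converged model "seeks to maximize the reward signal", even though the behavior policy during training was pure noise.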

I bet people would care a lot less about “reward hacking” if RL’s reinforcement signal hadn’t ever been called “reward.”

In the context of model-based planning, there’s a concern that the AI will come upon a plan which from the AI’s perspective is a “brilliant out-of-the-box solution to a tricky problem”, but from the programmer’s perspective is “reward-hacking, or Goodharting the value function (a.k.a. exploiting an anomalous edge-case in the value function), or whatever”. Treacherous turns would probably be in this category.
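
A deliberately silly sketch of that concern, with every number and function name hypothetical: plans are integers, `true_value` stands in for what the programmer actually wants, and `learned_value` agrees with it everywhere except for one anomalous edge case that a narrow search never reaches.

```python
def true_value(plan):
    """What the programmer actually wants: plan 7 is the best plan."""
    return -abs(plan - 7)


def learned_value(plan):
    """The AI's learned value function, with one anomalous edge case at plan 42."""
    if plan == 42:
        return 100.0
    return true_value(plan)


def best_plan(candidates):
    """Inference-time planning: pick the candidate the learned value likes most."""
    return max(candidates, key=learned_value)


narrow = best_plan(range(0, 11))    # search stays inside the familiar box
wide = best_plan(range(0, 101))     # more capable search reaches the edge case

print(narrow, true_value(narrow))
print(wide, true_value(wide))
```

The planner code is identical in both calls; only the breadth of the search differs. From the planner's own perspective, plan 42 is simply the best thing it found.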

There’s a terminology problem where if I just say “the AI finds an out-of-the-box solution”, it conveys the positive connotation but not the negative one, and if I just say “reward-hacking” or “Goodharting the value function” it conveys the negative part without the positive.

The positive part is important. We want our AIs to find clever out-of-the-box solutions! If AIs are not finding clever out-of-the-box solutions, people will presumably keep improving AI algorithms until they do.

Ultimately, we want to be able to make AIs that think outside of some of the boxes but definitely stay inside other boxes. But that’s tricky, because the whole idea of “think outside the box” is that nobody is ever aware of which boxes they are thinking inside of.

Anyway, this is all a bit abstract and weird, but I guess I’m arguing that the words “reward hacking” are generally pointing towards a very important AGI-safety-relevant phenomenon, whatever we want to call it.

I strongly disagree with the words “we train our kids”. I think kids learn via within-lifetime RL, where the reward function is installed by evolution inside the kid’s own brain. Parents and friends are characters in the kid’s training environment, but that’s very different from the way that “we train” a neural network, and very different from RLHF.

What does “Parents and friends are characters in the kid’s training environment” mean? Here’s an example. In principle, I could hire a bunch of human Go players on MTurk (for reward-shaping purposes we’ll include some MTurkers who have never played before, all the way to experts), and make a variant of AlphaZero that has no self-play at all, it’s 100% trained on play-against-humans, but is otherwise the same as the traditional AlphaZero. Then we can say “The MTurkers are part of the AlphaZero training environment”, but it would be very misleading to say “the MTurkers trained the AlphaZero model”. The MTurkers are certainly affecting the model, but the model is not imitating the MTurkers, nor is it doing what the MTurkers want, nor is it listening to the MTurkers’ advice. Instead the model is learning to exploit weaknesses in the MTurkers’ play, including via weird out-of-the-box strategies that would have never occurred to the MTurkers themselves.

When you think “parents and friends are characters in the kid’s training environment”, I claim that this MTurk-AlphaZero mental image should be in your head just as much as the mental image of LLM-like self-supervised pretraining.

For more related discussion see my posts “Thoughts on “AI is easy to control” by Pope & Belrose” sections 3 & 4, and Heritability, Behaviorism, and Within-Lifetime RL.
