Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See for a summary of my research and sorted list of writing. Physicist by training. Email: Leave me anonymous feedback here. I’m also at: RSS feed , Twitter , Mastodon , Threads , Bluesky , GitHub , Wikipedia , Physics-StackExchange , LinkedIn


Intro to Brain-Like-AGI Safety

Wiki Contributions


My complaint about “transformative AI” is that (IIUC) its original and universal definition is not about what the algorithm can do but rather how it impacts the world, which is a different topic. For example, the very same algorithm might be TAI if it costs $1/hour but not TAI if it costs $1B/hour, or TAI if it runs at a certain speed but not TAI if it runs many OOM slower, or “not TAI because it’s illegal”. Also, two people can agree about what an algorithm can do but disagree about what its consequences would be on the world, e.g. here’s a blog post claiming that if we have cheap AIs that can do literally everything that a human can do, the result would be “a pluralistic and competitive economy that’s not too different from the one we have now”, which I view as patently absurd.

Anyway, “how an AI algorithm impacts the world” is obviously an important thing to talk about, but “what an AI algorithm can do” is also an important topic, and different, and that’s what I’m asking about, and “TAI” doesn’t seem to fit it as terminology.

I’m talking about the AI’s ability to learn / figure out a new system / idea / domain on the fly. It’s hard to point to a particular “task” that specifically tests this ability (in the way that people normally use the term “task”), because for any possible task, maybe the AI happens to already know how to do it.

You could filter the training data, but doing that in practice might be kinda tricky because “the AI already knows how to do X” is distinct from “the AI has already seen examples of X in the training data”. LLMs “already know how to do” lots of things that are not superficially in the training data, just as humans “already know how to do” lots of things that are superficially unlike anything they’ve seen before—e.g. I can ask a random human to imagine a purple colander falling out of an airplane and answer simple questions about it, and they’ll do it skillfully and instantaneously. That’s the inference algorithm, not the learning algorithm.

Well, getting an AI to invent a new scientific field would work as such a task, because it’s not in the training data by definition. But that’s such a high bar as to be unhelpful in practice. Maybe tasks that we think of as more suited to RL, like low-level robot control, or skillfully playing games that aren’t like anything in the training data?

Separately, I think there are lots of domains where “just generate synthetic data” is not a thing you can do. If an AI doesn’t fully ‘understand’ the physics concept of “superradiance” based on all existing human writing, how would it generate synthetic data to get better? If an AI is making errors in its analysis of the tax code, how would it generate synthetic data to get better? (If you or anyone has a good answer to those questions, maybe you shouldn’t publish them!! :-P )

Well I’m one of the people who says that “AGI” is the scary thing that doesn’t exist yet (e.g. FAQ  or “why I want to move the goalposts on ‘AGI’”). I don’t think “AGI” is a perfect term for the scary thing that doesn’t exist yet, but my current take is that “AGI” is a less bad term compared to alternatives. (I was listing out some other options here.) In particular, I don’t think there’s any terminological option that is sufficiently widely-understood and unambiguous that I wouldn’t need to include a footnote or link explaining exactly what I mean. And if I’m going to do that anyway, doing that with “AGI” seems OK. But I’m open-minded to discussing other options if you (or anyone) have any.

Generative pre-training is AGI technology: it creates a model with mediocre competence at basically everything.

I disagree with that—as in “why I want to move the goalposts on ‘AGI’”, I think there’s an especially important category of capability that entails spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time. Mathematicians do this with abstruse mathematical objects, but also trainee accountants do this with spreadsheets, and trainee car mechanics do this with car engines and pliers, and kids do this with toys, and gymnasts do this with their own bodies, etc. I propose that LLMs cannot do things in this category at human level, as of today—e.g. AutoGPT basically doesn’t work, last I heard. And this category of capability isn’t just a random cherrypicked task, but rather central to human capabilities, I claim. (See Section 3.1 here.)

(Disclaimer: Nothing in this comment is meant to disagree with “I just think it's not plausible that we just keep scaling up [LLM] networks, run pretraining + light RLHF, and then produce a schemer.” I’m agnostic about that, maybe leaning towards agreement, although that’s related to skepticism about the capabilities that would result.)

It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal."

I agree that Bostrom was confused about RL. But I also think there are some vaguely-similar claims to the above that are sound, in particular:

  • RL approaches may involve inference-time planning / search / lookahead, and if they do, then that inference-time planning process can generally be described as “seeking to maximize a learned value function / reward model / whatever” (which need not be identical to the reward signal in the RL setup).
  • And if we compare Bostrom’s incorrect “seeking to maximize the actual reward signal” to the better “seeking at inference time to maximize a learned value function / reward model / whatever to the best of its current understanding”, then…
  • RL approaches historically have typically involved the programmer wanting to get a maximally high reward signal, and creating a training setup such that the resulting trained model does stuff that get as high a reward signal as possible. And this continues to be a very important lens for understanding why RL algorithms work the way they work. Like, if I were teaching an RL class, and needed to explain the formulas for TD learning or PPO or whatever, I think I would struggle to explain the formulas without saying something like “let’s pretend that you the programmer are interested in producing trained models that score maximally highly according to the reward function. How would you update the model parameters in such-and-such situation…?” Right?
  • Related to the previous bullet, I think many RL approaches have a notion of “global optimum” and “training to convergence” (e.g. given infinite time in a finite episodic environment). And if a model is “trained to convergence”, then it will behaviorally “seek to maximize a reward signal”. I think that’s important to have in mind, although it might or might not be relevant in practice.

I bet people would care a lot less about “reward hacking” if RL’s reinforcement signal hadn’t ever been called “reward.”

In the context of model-based planning, there’s a concern that the AI will come upon a plan which from the AI’s perspective is a “brilliant out-of-the-box solution to a tricky problem”, but from the programmer’s perspective is “reward-hacking, or Goodharting the value function (a.k.a. exploiting an anomalous edge-case in the value function), or whatever”. Treacherous turns would probably be in this category.

There’s a terminology problem where if I just say “the AI finds an out-of-the-box solution”, it conveys the positive connotation but not the negative one, and if I just say “reward-hacking” or “Goodharting the value function” it conveys the negative part without the positive.

The positive part is important. We want our AIs to find clever out-of-the-box solutions! If AIs are not finding clever out-of-the-box solutions, people will presumably keep improving AI algorithms until they do.

Ultimately, we want to be able to make AIs that think outside of some of the boxes but definitely stay inside other boxes. But that’s tricky, because the whole idea of “think outside the box” is that nobody is ever aware of which boxes they are thinking inside of.

Anyway, this is all a bit abstract and weird, but I guess I’m arguing that I think the words “reward hacking” are generally pointing towards an very important AGI-safety-relevant phenomenon, whatever we want to call it.

I strongly disagree with the words “we train our kids”. I think kids learn via within-lifetime RL, where the reward function is installed by evolution inside the kid’s own brain. Parents and friends are characters in the kid’s training environment, but that’s very different from the way that “we train” a neural network, and very different from RLHF.

What does “Parents and friends are characters in the kid’s training environment” mean? Here’s an example. In principle, I could hire a bunch of human Go players on MTurk (for reward-shaping purposes we’ll include some MTurkers who have never played before, all the way to experts), and make a variant of AlphaZero that has no self-play at all, it’s 100% trained on play-against-humans, but is otherwise the same as the traditional AlphaZero. Then we can say “The MTurkers are part of the AlphaZero training environment”, but it would be very misleading to say “the MTurkers trained the AlphaZero model”. The MTurkers are certainly affecting the model, but the model is not imitating the MTurkers, nor is it doing what the MTurkers want, nor is it listening to the MTurkers’ advice. Instead the model is learning to exploit weaknesses in the MTurkers’ play, including via weird out-of-the-box strategies that would have never occurred to the MTurkers themselves.

When you think “parents and friends are characters in the kid’s training environment”, I claim that this MTurk-AlphaGo mental image should be in your head just as much as the mental image of LLM-like self-supervised pretraining.

For more related discussion see my posts “Thoughts on “AI is easy to control” by Pope & Belrose” sections 3 & 4, and Heritability, Behaviorism, and Within-Lifetime RL.

Thanks for the comment!

Right, so my concern is that humans evidently don’t take societal resilience seriously, e.g. gain-of-function research remains legal in every country on earth (as far as I know) even after COVID. So you can either:

  • (1) try to change that fact through conventional means (e.g. be an activist for societal resilience, either directly or via advocating for prediction markets and numeracy or something, I dunno), per Section 3.3 — I’m very strongly in favor of people working on this but don’t hold out much hope for anything more than a marginal improvement;
  • (2) hope that “AI helpers” will convince people to take societal resilience seriously — I’m pessimistic per the Section 3.2 argument that people won’t use AI helpers that tell them things they don’t want to hear, in situations where there are no immediate consequences, and I think sacrificing immediate gains for uncertain future societal resilience is one such area;
  • (3) make AIs that take societal resilience seriously and act on it, not because any human told them to but rather because their hearts are in the right place and they figured this out on their own — this is adjacent to Section 3.5.2 where we make friendly autonomous AGI, and I’m probably most optimistic / least pessimistic about that path right now;
  • (4) suggest that actually this whole thing is not that important, i.e., it would be nice if humans were better at societal resilience, but evidently we’ve been muddling along so far and maybe we’ll continue to do so — I’m pessimistic for various reasons in the post but I hope I’m wrong!

I guess you’re suggesting (3) or (4) or maybe some combination of both, I’m not sure. You can correct me if I’m wrong.

Separately, in response to your “Mr. Smiles” thing, I think all realistic options on the table can be made to sound extremely weird and dystopian. I agree with you that “AI(s) that can prevent powerful out-of-control AI from coming into existence in the first place” seems pretty dystopian, but I’m also concerned that “AI(s) that does allow out-of-control AIs to come into existence, but prevents them from doing much harm by intervening elsewhere in the world” seems pretty dystopian too, once you think it through. And so does every other option. Or at least, that’s my concern.

in this post of my moral anti-realism sequence

I read that sequence a couple months ago (in preparation for writing §2.7 here), and found it helpful, thanks.

To give some quotes from that…

I agree that we’re probably on basically the same page.

So, it seems like we don't want "perfect inner alignment,"

FYI Alex also has this post making a similar point.

Idk, the whole thing seems to me like brewing a potion in Harry Potter

I think I agree, in that I’m somewhat pessimistic about plans wherein we want the “adult AI” to have object-level goal X, and so we find a reward function and training environment where that winds up happening.

Not that such a plan would definitely fail (e.g. lots of human adults are trying to take care of their children), just that it doesn’t seem like the kind of approach that passes the higher bar of having a strong reason to expect success (e.g. lots of human adults are not trying to take care of their children). (See here for someone trying to flesh out this kind of approach.)

So anyway, my take right now is basically:

  • If we want the “adult AGI” to be trying to do a particular thing (‘make nanobots’, or ‘be helpful towards its supervisor’, or whatever), we should replace (or at least supplement) a well-chosen reward function with a more interpretability-based approach; for example, see Plan for mediocre alignment of brain-like [model-based RL] AGI (which is a simplified version of Post 14 of this series)
  • Or we can have a similar relation to AGIs that we have to the next generation of humans: We don’t know exactly at the object level what they will be trying to do and why, but they basically have “good hearts” and so we trust their judgment.

These two bullet points correspond to the “two paths forward” of Post 12 of this series.

I think CSC can gradually morph itself into CEV and that's how we solve AI Goalcraft.

That sounds lovely if it’s true, but I think it’s a much more ambitious vision of CSC than people usually have in mind. In particular, CSC (as I understand it) usually takes people’s preferences as a given, so if somebody wants something they wouldn’t want upon reflection, and maybe they’re opposed to doing that reflection because their preferences were always more about signaling etc., well then that’s not really in the traditional domain of CSC, but CEV says we ought to sort that out (and I think I agree). More discussion in the last two paragraphs of this comment of mine.

I’m still chewing on this and our other discussion thread, but just wanted to quickly clarify that when I wrote “Thanks for the pushback!” above, what I was actually thinking was “Yeah I guess maybe the original thing I wrote wasn’t exactly right! Hmm, let me think about this…”, as opposed to “I stand by the exact thing I wrote in that top comment”.

Sorry that I didn’t say so explicitly; I see how that’s confusing. I just added it in.

That’s a very helpful comment, thanks!

Yeah, Vision 1 versus Vision 2 are two caricatures, and as such, they differ along a bunch of axes at once. And I think you're emphasizing on different axes than the ones that seem most salient to me. (Which is fine!)

In particular, maybe I should have focused more on the part where I wrote: “In that case, an important conceptual distinction (as compared to Vision 1) is related to AI goals: In Vision 1, there’s a pretty straightforward answer of what the AI is supposed to be trying to do… By contrast, in Vision 2, it’s head-scratching to even say what the AI is supposed to be doing…”

Along this axis-of-variation:

  • “An AI that can invent a better solar cell, via doing the same sorts of typical human R&D stuff that a human solar cell research team would do” is pretty close to the Vision 1 end of the spectrum, despite the fact that (in a different sense) this AI has massive amounts of “autonomy”: all on its own, the AI may rent a lab space, apply for permits, order parts, run experiments using robots, etc.
  • The scenario “A bunch of religious fundamentalists build an AI, and the AI notices the error in its programmers’ beliefs, and successfully de-converts them” would be much more towards the Vision 2 end of the spectrum—despite the fact that this AI is not very “autonomous” in the going-out-and-doing-things sense. All the AI is doing is thinking, and chatting with its creators. It doesn’t have direct physical control of its off-switch, etc.

Why am I emphasizing this axis in particular?

For one thing, I think this axis has practical importance for current research; on the narrow value learning vs ambitious value learning dichotomy, “narrow” is enough to execute Vision 1, but you need “ambitious” for Vision 2.

For example, if we move from “training by human approval” to “training by human approval after the human has had extensive time to reflect, with weak-AI brainstorming help”, then that’s a step from Vision 1 towards Vision 2 (i.e. a step from narrow value learning towards ambitious value learning). But my guess is that it’s a pretty small step towards Vision 2. I don’t think it gets us all the way to the AI I mentioned above, the one that will proactively deconvert a religious fundamentalist supervisor who currently has no interest whatsoever in questioning his faith.

For another thing, I think this axis is important for strategy and scenario-planning. For example, if we do Vision 2 really well, it changes the story in regards to “solution to global wisdom and coordination” mentioned in Section 3.2 of my “what does it take” post.

In other words, I think there are a lot of people (maybe including me) who are wrong about important things, and also not very scout-mindset about those things, such that “AI helpers” wouldn’t particularly help, because the person is not asking the AI for its opinion, and would ignore the opinion anyway, or even delete that AI in favor of a more sycophantic one. This is a societal problem, and always has been. One possible view of that problem is: “well, that’s fine, we’ve always muddled through”. But if you think there are upcoming VWH-type stuff where we won’t muddle through (as I tentatively do in regards to ruthlessly-power-seeking AGI), then maybe the only option is a (possibly aggressive) shift in the balance of power towards a scout-mindset-y subpopulation (or at least, a group with more correct beliefs about the relevant topics). That subpopulation could be composed of either humans (cf. “pivotal act”), or of Vision 2 AIs.

Here’s another way to say it, maybe. I think you’re maybe imagining a dichotomy where either AI is doing what we want it to do (which is normal human stuff like scientific R&D), or the AI is plotting to take over. I’m suggesting that there’s a third murky domain where the person wants something that he maybe wouldn’t want upon reflection, but where “upon reflection” is kinda indeterminate because he could be manipulated into wanting different things depending on how they’re framed. This third domain is important because it contains decisions about politics and society and institutions and ethics and so on. I have concerns that getting an AI to “perform well” in this murky domain is not feasible via a bootstrap thing that starts from the approval of random people; rather, I think a good solution would have to look more like an AI which is internally able to do the kinds of reflection and thinking that humans do (but where the AI has the benefit of more knowledge, insight, time, etc.). And that requires that the AI have a certain kind of “autonomy” to reflect on the big picture of what it’s doing and why. I think that kind of “autonomy” is different than how you’re using the term, but if done well (a big “if”!), it would open up a lot of options.

Load More