Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See for a summary of my research and sorted list of writing. Email: Also on Twitter, Mastodon, Threads. Physicist by training.




I agree that it would be nice to get to a place where it is known (technically) how to make a kind AGI, but nobody knows how to make an unkind AGI. That's what you're saying, right? If so, yes, that would be nice, but I see it as extraordinarily unlikely. I’m optimistic that there are technical ways to make a kind AGI, but I’m unaware of any remotely plausible approach to doing so that would not be straightforwardly modifiable to turn it into a way to make an unkind AGI.

I’m not sure what you think my expectations are. I wrote “I am not crazy to hope for whole primate-brain connectomes in the 2020s and whole human-brain connectomes in the 2030s, if all goes well.” That’s not the same as saying “I expect those things”; it’s more like “those things are not completely impossible”. I’m not an expert but my current understanding is (1) you’re right that existing tech doesn’t scale well enough (absent insane investment of resources), (2) it’s not impossible that near-future tech could scale much better than current tech. I’m particularly thinking of the neuron-barcoding technique that E11 is trying to develop, which would (if I understand correctly) make registration of neurons between different slices easy and automatic and essentially perfect. Again, I’m not an expert, and you can correct me. I appreciate your comment.

I find it interesting that he says there is no such thing as AGI, yet acknowledges that machines will "eventually surpass human intelligence in all domains where humans are intelligent", since that would meet most people's definition of AGI.

The somewhat-reasonable-position-adjacent-to-what-Yann-believes would be: “I don’t like the term ‘AGI’. It gives the wrong idea. We should use a different term instead. I like ‘human-level AI’.”

I.e., it’s a purely terminological complaint. And it’s not a crazy one! Lots of reasonable people think that “AGI” was a poorly-chosen term, although I still think it’s possibly the least-bad option.

Yann’s actual rhetorical approach tends to be:

  • Step 1: (re)-define the term “AGI” in his own idiosyncratic and completely insane way;
  • Step 2: say there’s no such thing as “AGI” (as so defined), and that anyone who talks about AGI is a moron.

I talk about it in much more detail here.

I think that if you read the later Intro to Brain-Like AGI Safety series, then the only reason you might want to read this post (other than historical interest) is that the section “Dopamine category #2: RPE for ‘local’ sub-circuit rewards” is talking about a topic that was omitted from Intro to Brain-Like AGI Safety (for brevity).

For example, practically everything I said about neuroanatomy in this post is at least partly wrong and sometimes very wrong. (E.g. the “toy loop model” diagrams are pretty bad.) The “Finally, the “prediction” part of reward prediction error” section has a very strange proposal for how RPE works; I don’t even remember why I ever believed that.

The main strengths of the post are the “normative” discussions: why might supervised learning be useful? why might more than one reward signal be useful? etc. I mostly stand by those. I also stand by “learning from scratch” being a very useful concept, and elaborated on it much more later.

Follow-up: More specific example of classic goal misgeneralization in the context of process-based supervision (more details here):

Suppose the AI is currently thinking about what to propose for step 84 of an ongoing plan to found a company. Ideally, according to the philosophy of process-based supervision, the only thing the AI “cares about” right now is getting a good score in the event that this particular proposal gets immediately audited. But the AI can't read the supervisor's mind, so figuring out how to get a good score in that hypothetical audit can be kinda tricky. The AI will presumably take advantage of whatever imperfect heuristics / proxies it can find, when they help it get a higher score.

And here's one such proxy: “If the proposed step is actually carried out, will it help the project in the long-term?” This is not a perfect proxy for success-if-audited! Two reasons it's not perfect are: (1) the supervisor could be mistaken about whether it will help the project or not; (2) the supervisor presumably cares about other things unrelated to project success, like following social norms. But “it's not a perfect proxy” is different from “it is totally useless as a proxy”. It is in fact useful, and indeed probably irreplaceably useful, in the sense that ignoring that information would probably make the AI’s prediction of audit results strictly worse.
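A minimal numerical sketch of that last claim (the model, the Gaussian terms, and all numbers here are my own illustration, not anything from the post): if the hypothetical audit score is the sum of long-term helpfulness and an unrelated norm-following component, then a predictor that uses the imperfect “does this help long-term?” proxy still does strictly better than one that ignores it.

```python
import random

random.seed(1)
n = 100_000
err_ignore = err_proxy = 0.0
for _ in range(n):
    helps = random.gauss(0, 1)   # proxy: "will this step help the project long-term?"
    norms = random.gauss(0, 1)   # unrelated component the supervisor also cares about
    score = helps + norms        # hypothetical audit score
    err_ignore += score ** 2             # squared error predicting 0, ignoring the proxy
    err_proxy += (score - helps) ** 2    # squared error using the imperfect proxy
print(err_ignore / n)  # ≈ 2.0
print(err_proxy / n)   # ≈ 1.0: strictly better, though far from perfect
```

The proxy halves the prediction error even though it says nothing about the norm-following component, which is the sense in which it is “irreplaceably useful” rather than perfect.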

So we should expect the AI to not only try to guess the long-term real-world consequences of its proposals (if carried out), but also to “care about” those consequences, in the sense of allowing those guesses to shape what proposals it offers in the first place.

OK, so now we have an AI that (among other things) “cares about” stuff that will happen in the future, well after the current step. And that “caring” is actually helping the AI to get better scores. So we’re (unintentionally) training that caring into the AI, not training it out, even in the absence of any AI obfuscation.

It's still a misaligned motivation though. The idea of process-based supervision is that we don't want our AI to care about outcomes in the distant future at all. We want it to be totally indifferent. Otherwise it might try to escape onto the internet, etc.

This post doesn’t talk about classic goal misgeneralization, e.g. the CoinRun paper (see also Rob Miles explainer). If X = “get the coin” is what we want the AI to desire, and Y = “get to the right side” is something else, but Y and X are correlated in the training distribution, then we can wind up with an AI that’s trying to do Y rather than X, i.e. misalignment. (Or maybe it desires some mix of X & Y. But that counts as misalignment too.)

That problem can arise without the AI having any situational awareness, or doing any “obfuscation”. The AI doesn’t have to obfuscate! The AI wants Y, and it acts accordingly, and that leads to good performance and high reward. But it’s still misaligned, because presumably X & Y come apart out-of-distribution.
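Here is a toy tabular sketch of that dynamic (the corridor environment, the Q-learning hyperparameters, and all numbers are my own illustration, not the actual CoinRun setup): during training the coin always sits at the right end, so Y = “go right” and X = “get the coin” coincide, and the learned policy is simply “go right”. Move the coin out-of-distribution and the agent keeps going right, with no obfuscation anywhere.

```python
import random

random.seed(0)
N, START = 7, 3          # a 7-cell corridor; the agent starts in the middle
ACTIONS = (-1, +1)       # move left / move right

def greedy(q, s):
    best = max(q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if q[(s, a)] == best])

def train_episode(q, coin, eps=0.2, alpha=0.5, gamma=0.9, max_steps=20):
    s = START
    for _ in range(max_steps):
        a = random.choice(ACTIONS) if random.random() < eps else greedy(q, s)
        s2 = min(N - 1, max(0, s + a))
        r = 1.0 if s2 == coin else 0.0
        q[(s, a)] += alpha * (r + gamma * max(q[(s2, b)] for b in ACTIONS) - q[(s, a)])
        s = s2
        if r > 0:
            return   # episode ends when the coin is collected

q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
for _ in range(2000):
    train_episode(q, coin=N - 1)   # training: the coin is always at the right end

# Out-of-distribution test: the coin is now at the LEFT end (cell 0).
s, reached = START, False
for _ in range(20):
    s = min(N - 1, max(0, s + greedy(q, s)))
    if s == 0:
        reached = True
print("reached coin:", reached)   # reached coin: False
```

The trained policy got perfect reward in training, so nothing in the training signal distinguishes “wants the coin” from “wants to go right”; the misalignment only shows up when the correlation breaks.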

So anyway, it’s perfectly great for you to write about how a situationally-aware AI might deliberately obfuscate its motivations. I just want us to keep in mind that deliberate obfuscation is not the only reason that it might be hard to notice that an AI is misaligned.

(It’s still probably true that the misalignment isn’t dangerous unless the AI is situationally-aware—for example, executing a treacherous turn presumably requires situational awareness. And also, noticing misalignment via adversarial testing is presumably much easier if the AI is not situationally aware.)

(If the AI is situationally aware, an initial CoinRun-style goal misgeneralization could lead to deception, via means-end reasoning.)

Seems false in RL, for basically the reason you said (“it’s not clear how to update a model towards performing the task if it intentionally tries to avoid showing us any task-performing behavior”). In other words, if we’re doing on-policy learning, and if the policy never gets anywhere close to a reward>0 zone, then the reward>0 zone isn’t doing anything to shape the policy. (In a human analogy, I can easily avoid getting addicted to nicotine by not exposing myself to nicotine in the first place.)

I think this might be a place where people-thinking-about-gradient-descent have justifiably different intuitions from people-thinking-about-RL.

(The RL problem might be avoidable if we know how to do the task and can turn that knowledge into effective reward-shaping. Also, for a situationally-aware RL model with a wireheading-adjacent desire to get reward per se, we can get it to do arbitrary things by simply telling it what the reward function is.)
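A minimal on-policy sketch of that point (the two-armed bandit and all numbers are mine, purely for illustration): a greedy learner that starts out preferring the low-reward arm never samples the high-reward arm, so that arm's large reward never shapes the policy at all; the nicotine analogy in miniature.

```python
# On-policy, purely greedy two-armed bandit with no exploration.
q = [0.1, 0.0]          # value estimates; arm 0 slightly preferred at initialization
rewards = [0.1, 10.0]   # arm 1 is the "reward > 0 zone" the policy never enters
alpha = 0.5
pulls = [0, 0]
for _ in range(1000):
    a = 0 if q[0] >= q[1] else 1       # on-policy greedy action choice
    pulls[a] += 1
    q[a] += alpha * (rewards[a] - q[a])
print(pulls)   # [1000, 0]: arm 1 was never tried
print(q[1])    # 0.0: its reward of 10 never influenced the value estimates
```

Because updates only happen on actions the policy actually takes, the reward landscape outside the visited region is invisible to the learner, exactly the sense in which “the reward>0 zone isn’t doing anything to shape the policy”.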

investing in government money market funds earning ~5% rather than 0% interest checking accounts

It’s easier than that—there are high-interest-rate free FDIC-eligible checking accounts. has a good list, although you might need to be a member to view it. As of this moment (2023-07-20), the top of their leaderboard is: Customers Bank (5.20% APY), BankProv (5.15%), BrioDirect (5.06%), UFB Direct (5.06%).
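To put rough numbers on the comparison (the $10,000 balance is my own example figure; the APYs are the ones quoted above), here is the simple annual-interest arithmetic versus a 0% checking account:

```python
balance = 10_000   # hypothetical example balance
apys = {"Customers Bank": 0.0520, "BankProv": 0.0515,
        "BrioDirect": 0.0506, "UFB Direct": 0.0506}
annual = {name: balance * apy for name, apy in apys.items()}
for name, interest in annual.items():
    print(f"{name}: ${interest:.2f}/year")   # e.g. Customers Bank: $520.00/year
# A 0% checking account earns $0/year on the same balance.
```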

I was just trying to replace “reward” by “reinforcement”, but hit the problem that “negative reward” makes sense, but behaviorist terminology is such that “reinforcement” is always after a good thing happens, including “negative reinforcement”, which would be a kind of positive reward that entails removing something aversive. The behaviorists use the word “punishment” for “negative reward”. But “punishment” has all the same downsides as “reward”, so I assume you’re also opposed to that. Unfortunately, if I avoid both “punishment” and “reward”, then it seems I have no way to unambiguously express the concept “negative reward”.

So “negative reward” it is. ¯\_(ツ)_/¯

Nice interview, kudos to you both!

One is a bunch of very simple hardwired genomically-specified reward circuits over stuff like your sensory experiences or simple correlates of good sensory experiences. 

I just want to flag that the word “simple” is contentious in this context. The above excerpt isn’t a specific claim (how simple is “simple”, and how big is “a bunch”?) so I guess I neither agree nor disagree with it as such. But anyway, my current guess (see here) is that reward circuits might effectively comprise tens of thousands of lines of pseudocode. That’s “simple” compared to a billion-parameter ML trained model, but it’s super complicated compared to any reward function that you could find in an RL paper on arxiv.

There seems to be a spectrum of opinion about how complicated the reward circuitry is, with Jacob Cannell at one extreme and Geoffrey Miller at the opposite extreme and me somewhere in the middle. See Geoffrey Miller’s post here, and my comment on it, and also a back-and-forth between me & Jacob Cannell here.

And so in the human brain you have, what’s basically self-supervised prediction of incoming sensory signals, like predictive processing, that sort of thing, in terms of learning to predictively model what’s going to happen in your local sensory environment. And then in deep learning, we have all the self-supervised learning of pre-training in language models that’s also learning to predict a sensory environment. Of course, the sensory environment in question is text, at least at the moment. But we’ve seen how that same approach easily extends to multimodal text plus image or text plus audio or whatever other systems.

I was a bit put off by the vibe that LLMs and humans both have self-supervised learning, and both have RL, so really they’re not that different. On the contrary, I think there are numerous alignment-relevant (and also capabilities-relevant) disanalogies between the kind of model-based RL that I believe the brain uses, versus LLM+RLHF.

Probably the most alignment-relevant of these disanalogies is that in LLMs (but not humans), the main self-supervised learning system is also simultaneously an output system.

Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step, most of the behavioral inclinations are coming from pretraining IMO.) That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”) Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.

I’m not trying to make any larger point about alignment being easy or hard, I just think it’s important to keep that particular difference clear in our heads. Well, OK, actually it is alignment-relevant—I think it weakly suggests that aligning brain-like model-based RL might be harder than aligning LLMs. (As I wrote here, “for my part, if I believed that [LLMs] were sufficient for TAI—which I don’t—then I think I would feel slightly less concerned about AI x-risk than I actually do, all things considered!”) Speaking of which:

So for example, Steven Byrnes is an excellent researcher who thinks primarily about the human brain first and foremost and how to build brain-like AGI as his alignment approach.

(That’s very kind of you to say!) I think of brain-like AGI (and relatedly model-based RL AGI) as a “threat model” much more than an “alignment approach”—in the sense that future researchers might build brain-like AGI, and we need to plan for that possibility. My main professional interest is in finding alignment approaches that would be helpful for this threat model.

Separately, it’s possible that those alignment approaches might themselves be brain-like, i.e. whatever mechanisms lead to humans being (sometimes) nice to each other could presumably be a source of inspiration. That’s my main current project. But that’s not how I personally have been using the term “brain-like AGI”.

There’s no simple function of physical matter configurations that if tiled across the entire universe would fully satisfy the values. … It’s like, we tend to value lots of different stuff and we sort of asymptotically run out of caring about things when there’s lots of them, or decreasing marginal value for any particular fixed pattern.

(Sorry in advance if I’m missing your point.)

It’s worth noting that at least some humans seem pretty gung-ho about the project of tiling the universe with flourishing life and civilization (hi Eliezer!). Maybe they have other desires too, and maybe those other desires can be easily saturated, but that doesn’t seem safety-relevant. If superintelligent-Eliezer likes tiling the universe with flourishing life & civilization in the morning and playing cricket after lunch, then we still wind up with a tiled universe.

Or maybe you’re saying that if a misaligned AI wanted to tile the galaxy with something, it would be something more complicated and diverse than paperclips / tiny molecular squiggles? OK maybe, but if it isn’t sentient then I’m still unhappy about that.

(cc @Quintin Pope )
