Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. Physicist by training.





Thanks for the comment!

Right, so my concern is that humans evidently don’t take societal resilience seriously, e.g. gain-of-function research remains legal in every country on earth (as far as I know) even after COVID. So you can:

  • (1) try to change that fact through conventional means (e.g. be an activist for societal resilience, either directly or via advocating for prediction markets and numeracy or something, I dunno), per Section 3.3 — I’m very strongly in favor of people working on this but don’t hold out much hope for anything more than a marginal improvement;
  • (2) hope that “AI helpers” will convince people to take societal resilience seriously — I’m pessimistic per the Section 3.2 argument that people won’t use AI helpers that tell them things they don’t want to hear, in situations where there are no immediate consequences, and I think sacrificing immediate gains for uncertain future societal resilience is one such area;
  • (3) make AIs that take societal resilience seriously and act on it, not because any human told them to but rather because their hearts are in the right place and they figured this out on their own — this is adjacent to Section 3.5.2 where we make friendly autonomous AGI, and I’m probably most optimistic / least pessimistic about that path right now;
  • (4) suggest that actually this whole thing is not that important, i.e., it would be nice if humans were better at societal resilience, but evidently we’ve been muddling along so far and maybe we’ll continue to do so — I’m pessimistic for various reasons in the post but I hope I’m wrong!

I guess you’re suggesting (3) or (4) or maybe some combination of both, I’m not sure. You can correct me if I’m wrong.

Separately, in response to your “Mr. Smiles” thing, I think all realistic options on the table can be made to sound extremely weird and dystopian. I agree with you that “AI(s) that can prevent powerful out-of-control AI from coming into existence in the first place” seems pretty dystopian, but I’m also concerned that “AI(s) that does allow out-of-control AIs to come into existence, but prevents them from doing much harm by intervening elsewhere in the world” seems pretty dystopian too, once you think it through. And so does every other option. Or at least, that’s my concern.

in this post of my moral anti-realism sequence

I read that sequence a couple months ago (in preparation for writing §2.7 here), and found it helpful, thanks.

To give some quotes from that…

I agree that we’re probably on basically the same page.

So, it seems like we don't want "perfect inner alignment,"

FYI Alex also has this post making a similar point.

Idk, the whole thing seems to me like brewing a potion in Harry Potter

I think I agree, in that I’m somewhat pessimistic about plans wherein we want the “adult AI” to have object-level goal X, and so we find a reward function and training environment where that winds up happening.

Not that such a plan would definitely fail (e.g. lots of human adults are trying to take care of their children), just that it doesn’t seem like the kind of approach that passes the higher bar of having a strong reason to expect success (e.g. lots of human adults are not trying to take care of their children). (See here for someone trying to flesh out this kind of approach.)

So anyway, my take right now is basically:

  • If we want the “adult AGI” to be trying to do a particular thing (‘make nanobots’, or ‘be helpful towards its supervisor’, or whatever), we should replace (or at least supplement) a well-chosen reward function with a more interpretability-based approach; for example, see Plan for mediocre alignment of brain-like [model-based RL] AGI (which is a simplified version of Post 14 of this series)
  • Or we can have a similar relation to AGIs that we have to the next generation of humans: We don’t know exactly at the object level what they will be trying to do and why, but they basically have “good hearts” and so we trust their judgment.

These two bullet points correspond to the “two paths forward” of Post 12 of this series.

I think CSC can gradually morph itself into CEV and that's how we solve AI Goalcraft.

That sounds lovely if it’s true, but I think it’s a much more ambitious vision of CSC than people usually have in mind. In particular, CSC (as I understand it) usually takes people’s preferences as a given, so if somebody wants something they wouldn’t want upon reflection, and maybe they’re opposed to doing that reflection because their preferences were always more about signaling etc., well then that’s not really in the traditional domain of CSC, but CEV says we ought to sort that out (and I think I agree). More discussion in the last two paragraphs of this comment of mine.

I’m still chewing on this and our other discussion thread, but just wanted to quickly clarify that when I wrote “Thanks for the pushback!” above, what I was actually thinking was “Yeah I guess maybe the original thing I wrote wasn’t exactly right! Hmm, let me think about this…”, as opposed to “I stand by the exact thing I wrote in that top comment”.

Sorry that I didn’t say so explicitly; I see how that’s confusing. I just added it in.

That’s a very helpful comment, thanks!

Yeah, Vision 1 versus Vision 2 are two caricatures, and as such, they differ along a bunch of axes at once. And I think you're emphasizing different axes than the ones that seem most salient to me. (Which is fine!)

In particular, maybe I should have focused more on the part where I wrote: “In that case, an important conceptual distinction (as compared to Vision 1) is related to AI goals: In Vision 1, there’s a pretty straightforward answer of what the AI is supposed to be trying to do… By contrast, in Vision 2, it’s head-scratching to even say what the AI is supposed to be doing…”

Along this axis-of-variation:

  • “An AI that can invent a better solar cell, via doing the same sorts of typical human R&D stuff that a human solar cell research team would do” is pretty close to the Vision 1 end of the spectrum, despite the fact that (in a different sense) this AI has massive amounts of “autonomy”: all on its own, the AI may rent a lab space, apply for permits, order parts, run experiments using robots, etc.
  • The scenario “A bunch of religious fundamentalists build an AI, and the AI notices the error in its programmers’ beliefs, and successfully de-converts them” would be much more towards the Vision 2 end of the spectrum—despite the fact that this AI is not very “autonomous” in the going-out-and-doing-things sense. All the AI is doing is thinking, and chatting with its creators. It doesn’t have direct physical control of its off-switch, etc.

Why am I emphasizing this axis in particular?

For one thing, I think this axis has practical importance for current research; on the narrow value learning vs ambitious value learning dichotomy, “narrow” is enough to execute Vision 1, but you need “ambitious” for Vision 2.

For example, if we move from “training by human approval” to “training by human approval after the human has had extensive time to reflect, with weak-AI brainstorming help”, then that’s a step from Vision 1 towards Vision 2 (i.e. a step from narrow value learning towards ambitious value learning). But my guess is that it’s a pretty small step towards Vision 2. I don’t think it gets us all the way to the AI I mentioned above, the one that will proactively deconvert a religious fundamentalist supervisor who currently has no interest whatsoever in questioning his faith.

For another thing, I think this axis is important for strategy and scenario-planning. For example, if we do Vision 2 really well, it changes the story in regards to “solution to global wisdom and coordination” mentioned in Section 3.2 of my “what does it take” post.

In other words, I think there are a lot of people (maybe including me) who are wrong about important things, and also not very scout-mindset about those things, such that “AI helpers” wouldn’t particularly help, because the person is not asking the AI for its opinion, and would ignore the opinion anyway, or even delete that AI in favor of a more sycophantic one. This is a societal problem, and always has been. One possible view of that problem is: “well, that’s fine, we’ve always muddled through”. But if you think there are upcoming VWH-type stuff where we won’t muddle through (as I tentatively do in regards to ruthlessly-power-seeking AGI), then maybe the only option is a (possibly aggressive) shift in the balance of power towards a scout-mindset-y subpopulation (or at least, a group with more correct beliefs about the relevant topics). That subpopulation could be composed of either humans (cf. “pivotal act”), or of Vision 2 AIs.

Here’s another way to say it, maybe. I think you’re maybe imagining a dichotomy where either AI is doing what we want it to do (which is normal human stuff like scientific R&D), or the AI is plotting to take over. I’m suggesting that there’s a third murky domain where the person wants something that he maybe wouldn’t want upon reflection, but where “upon reflection” is kinda indeterminate because he could be manipulated into wanting different things depending on how they’re framed. This third domain is important because it contains decisions about politics and society and institutions and ethics and so on. I have concerns that getting an AI to “perform well” in this murky domain is not feasible via a bootstrap thing that starts from the approval of random people; rather, I think a good solution would have to look more like an AI which is internally able to do the kinds of reflection and thinking that humans do (but where the AI has the benefit of more knowledge, insight, time, etc.). And that requires that the AI have a certain kind of “autonomy” to reflect on the big picture of what it’s doing and why. I think that kind of “autonomy” is different than how you’re using the term, but if done well (a big “if”!), it would open up a lot of options.

Vision 1 style models can be turned into Vision 2 autonomous models very easily

Sure, Vision 1 models can be turned into dangerous Vision 2 models, but they can’t be turned into good Vision 2 models that we want to have around, unless you solve the different set of problems associated with full-fledged Vision 2. For example, in the narrow value learning vs ambitious value learning dichotomy, “narrow” is sufficient for Vision 1 to go well, but you need “ambitious” for Vision 2 to go well. Right?

For me, Vision 3 shouldn't depend on biological neurons. I think it's more like "brain-like AGI that is so brain-like that it is basically an accurate whole brain emulation, and thus you can trust it as much as you can trust a human (which isn't necessarily all that much)."

I think you’re more focused on “why do I trust the AI (insofar as I trust it)” (e.g. my “two paths” here), whereas in this post I’m ultimately focused on “what should I be working on (or funding, or whatever) and why”.

Thus, I think “System X does, or does not, involve actual squishy biological neurons” is not only a nice bright line, but it’s also a bright line with great practical importance for what research projects to work on, and what the eventual results will look like, and how the scenarios play out from there. I have lots of reasons for thinking that. E.g. super-ambitious moonshot BCI research is critical for “merging” but only slightly relevant for WBE; conversely measuring human brain connectomes is critical for WBE but only slightly relevant for “merging”. Another example: simbox testing is useful for WBEs but not “merging”. Also, a WBE would be an extraordinarily powerful system because it can be sped up 100-fold, duplicated, tweaked, and so on, in a way that any system involving actual squishy biological neurons basically can’t (I would argue). And that’s highly relevant to how it fits into longer-term scenarios.

Thanks. I changed the wording to “moody 7-year-old” and “office or high-tech factory” which puts me on firmer ground I think.  :)

I think there have been general increases in productivity across the economy associated with industrialization, automation, complex precise machines, and so on, and those things provide a separate reason (besides legal & social norms as you mentioned) that 7yos are far less employable today than in the 18th century. E.g. I can easily imagine a moody 7yo being net useful in a mom & pop artisanal candy shop, but it’s much harder to imagine a moody 7yo being net useful in a modern jelly bean factory.

I think your bringing up “$3/day” gives the wrong idea; I think we should focus on whether the sign is positive or negative. If the sign is positive at all, it’s probably >$3/day. The sign could be negative because they sometimes touch something they’re not supposed to touch, or mess up in other ways, or it could simply be that they bring in extra management overhead greater than their labor contribution. (We’ve all delegated projects where it would have been far less work to just do the project ourselves, right?) E.g. even if the cost to feed and maintain a horse were zero, I would still not expect to see horses being used in a modern construction project.

Anyway, I think I’m on firmer ground when talking about a post-AGI economy, in which case, literally anything that can be done by a human at all, can be automated.

Thanks for the pushback! [subsequent part of this paragraph was added later for clarity] …Yeah I guess maybe the original thing I wrote wasn’t exactly right! Hmm, let me think about this…

Here’s one possible story: (1) the good AI imagines the bad AI creating a plague, (2) the good AI doesn’t want that to happen, (3) the good AI convinces the human that plague-prevention is important.

That would be the kind of story that would be typical in the context of “a helpful human assistant”, right? (Put aside LLMs for now.) When I’m trying to be helpful to someone, I do that kind of thing all the time. E.g. (1) I imagine my spouse getting wet in the rain and being sad about that, and (2) I don't want that to happen, so (3) I try to convince her to bring an umbrella. Etc.

Hopefully everyone agrees that, in that story, the AI is not under human control, because a sufficiently competent AI can probably convince anyone of anything. (Or at least, it can convince lots of people of lots of things.)

OK, well, more precisely: hopefully everyone agrees that, in this story, there is no appreciable “human control” happening in step (3).

But maybe you'll say that there can be "human control" via step (2): After all, the AI can learn to anticipate (and be motivated by) what the human would want upon reflection, right?

And then my response kinda forks:

(A) If the story closely involves actual humans actually reflecting and making actual decisions (even if it’s generalized, as opposed to this particular case), then we're in Section 3.2 territory: “some people imagine that when future humans have smart AGI assistants trying to help them … no one will have stupid self-serving opinions … etc” As discussed in that section, I think that vision would be absolutely wonderful, and I hope that people figure out how to make that happen. But I am currently very pessimistic. To put the pessimism in your terms: I don’t think debate++, or whatever, will widely rid people of having stupid ideas in the various domains where stupid ideas have always been rampant, e.g. related to politics, in-group signaling, far-mode thinking, or really anything that lacks immediate feedback. I think I have good structural reasons for my pessimism here: specifically, if debate++ keeps convincing people that their cherished shibboleths are actually bad, then societal memetic immune systems will kick in and try to convince people that debate++ is actually bad and we should use “debate--” instead, which doesn't do that. See “Dark Side Epistemology”, or try imagining what happens next when it’s reported on the front page that the latest debate++ system has convinced some liberals that [insert thing that you absolutely cannot say in liberal circles], or convinced some academics that [insert thing you absolutely cannot say in academia], or convinced some Christians that there is no God, etc. I also have non-structural reasons for pessimism, which is that I’m skeptical that there is such a thing as a debate++ which is simultaneously powerful enough to really move the needle on human reasoning, and also safe—see here.

(B) If the story is more abstracted and idealized from that, e.g. CEV or ambitious value learning, then this isn't “human control” in the normal sense. Instead we're in Section 3.5.2 territory.


Alternatively, going way back to the top, if you’re thinking of LLMs, then you probably don’t like the story I mentioned (i.e., “(1) the good AI imagines the bad AI creating a plague, (2) the good AI doesn’t want that to happen, (3) the good AI convinces the human that plague-prevention is important”.) Instead it would just be “the AI is trained to anticipate what the human would say upon reflection”, or something like that, right? I don’t expect AIs of this type to be powerful enough to constitute TAI, but they could be a relevant part of the scene on which TAI appears. But regardless, if that’s what you’re imagining, then you would ignore (B) above and just read (A).

This was one of those posts that I dearly wish somebody else besides me had written, but nobody did, so here we are. I have no particular expertise. (But then again, to some extent, maybe nobody does?)

I basically stand by everything I wrote here. I remain pessimistic for reasons spelled out in this post, but I also still have a niggling concern that I haven’t thought these things through carefully enough, and I often refer to this kind of stuff as “an area where reasonable people can disagree”.

If I were rewriting this post today, three changes I’d make would be:

  • I would make it clearer that I’m arguing against a particular vision involving Paul-corrigible AGIs. In particular, as I wrote in this comment, “If we’re talking about an AGI that’s willing and able to convince its (so-called) supervisor to do actions that the (so-called) supervisor initially doesn’t want to do, because the AGI thinks they’re in the (so-called) supervisor’s long-term best interest, then we are NOT talking about a corrigible AGI under human control, rather we’re talking about a non-corrigible, out-of-control AGI. So we better hope that it’s a friendly out-of-control AGI!!” … “this is Section 3.5.2 territory”
  • I would dive much more into the question of AGI self-sufficiency, a.k.a. when is human-omnicide (or at least, human-disempowerment) strategically useful for a power-seeking AGI? I gave this topic one sub-bullet in 3.3.3, but it’s pretty important and crux-y, and I could have said much more about the range of arguments, and where I stand. That discussion entails a fun romp through everything from Drexlerian nanotech, to growing-brains-in-vats, to what it takes to make millions or billions of teleoperated robots, to how an entirely-AGI-controlled economy might differ from our current one (e.g. would they manufacture chips using e-beam lithography instead of EUV?), to compute requirements for AGI, and on and on.
  • I would elaborate much more on the “zombie dynamic” wherein the more chips that an AGI can get under its control, the more copies of that AGI will exist, and thus the better positioned they will be to get control of even more chips—either through hacking, or through actually grabbing the chip with a teleoperated robot and getting root access, using a soldering iron if necessary! This “zombie dynamic” suggests a strong reason to expect a unipolar as opposed to multipolar outcome, and seems pretty disanalogous to human affairs. This phenomenon seems very important and under-discussed. I gave it one bullet-point in 3.3.3, but I think it merits much more thought & discussion than that. Can a multipolar outcome arise nevertheless? What would that look like? Tamper-proof boxes laced with explosives if an enemy robot tries to get physical access to the chip inside? (I think the military does things like that to protect their IP, right? What’s the offense-defense balance on that?) I dunno.
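To make that feedback loop concrete, here’s a deliberately minimal toy model (my own illustrative code; every number in it is made up): each controlled chip runs one copy of the AGI, and each copy seizes roughly one additional chip per round, so holdings grow in proportion to current holdings until the pool of chips is exhausted.

```python
def chips_controlled(initial=1, total=1_000_000, gain_per_chip=1.0, steps=30):
    """Toy model of the 'zombie dynamic': each controlled chip runs one
    AGI copy, and each copy seizes about `gain_per_chip` additional chips
    per round, until the total pool of chips is exhausted."""
    c = initial
    history = [c]
    for _ in range(steps):
        c = min(total, c + gain_per_chip * c)  # growth proportional to current holdings
        history.append(c)
    return history
```

Under these made-up parameters, control doubles each round, going from 1 chip to the full million in about 20 rounds; the point is only the exponential, winner-take-all shape of the dynamic, not the specific numbers.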


I don’t think “be nice to others in proportion to how similar to yourself they are” is part of it. For example, dogs can be nice to humans, and to goats, etc. I guess your response is ‘well dogs are a bit like humans and goats’. But are they? From the dog’s perspective? They look different, sound different, smell different, etc. I don’t think dogs really know what they are in the first place, at least not in that sense. Granted, we’re talking about humans not dogs. But humans can likewise feel compassion towards animals, especially cute ones (cf. “charismatic megafauna”). Do humans like elephants because elephants are kinda like humans? I mean, I guess elephants are more like humans than microbes are. But they’re still pretty different. I don’t think similarity per se is why humans care about elephants. I think it’s something about the elephants’ cute faces, and the cute way that they move around.

More specifically, my current vague guess is that the brainstem applies some innate heuristics to sensory inputs to guess things like “that thing there is probably a person”. This includes things like heuristics for eye-contact-detection and face-detection and maybe separately cute-face-detection etc. The brainstem also has heuristics that detect the way that spiders scuttle and snakes slither (for innate phobias). I think these heuristics are pretty simple; for example, the human brainstem face detector (in the superior colliculus) has been studied a bit, and the conclusion seems to be that it mostly just detects the presence of three dark blobs of about the right size, in an inverted triangle. (The superior colliculus is pretty low resolution.)
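For intuition, here’s a toy sketch (my own illustrative code, not a model of any actual neuroscience result) of that kind of three-dark-blobs heuristic, operating on a low-resolution grayscale image with brightness values in [0, 1]:

```python
import numpy as np

def blob_darkness(img, cy, cx, r):
    """Mean darkness (1 - brightness) of a square patch centered at (cy, cx)."""
    h, w = img.shape
    y0, y1 = max(cy - r, 0), min(cy + r + 1, h)
    x0, x1 = max(cx - r, 0), min(cx + r + 1, w)
    return 1.0 - img[y0:y1, x0:x1].mean()

def face_like(img, threshold=0.6):
    """Toy 'superior colliculus' heuristic: fire iff three dark blobs of
    roughly the right size sit in an inverted triangle -- two 'eyes'
    above, one 'mouth' below."""
    h, w = img.shape
    r = max(1, min(h, w) // 8)  # blob radius scales with image size
    eyes_y, mouth_y = h // 3, 2 * h // 3
    left_x, right_x, mid_x = w // 3, 2 * w // 3, w // 2
    darks = [blob_darkness(img, eyes_y, left_x, r),
             blob_darkness(img, eyes_y, right_x, r),
             blob_darkness(img, mouth_y, mid_x, r)]
    return all(d > threshold for d in darks)
```

A detector this crude fires on any bright field with three dark patches in roughly the right arrangement, which fits the point that these innate heuristics are simple pattern-matchers rather than sophisticated perception.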

If we’re coding the AGI, we can design those sensory heuristics to trigger on whatever we want. Presumably we would just use a normal ConvNet image classifier for this. If we want the AGI to find cockroaches adorably “cute”, and kittens gross, I think that would be really straightforward to code up.

So I’m not currently worried about that exact thing. I do have a few kinda-related concerns though. For example, maybe adult social emotions can only develop after lots and lots of real-time conversations with real-world humans, and that’s a slow and expensive kind of training data for an AGI. Or maybe the development of adult social emotions is kinda a package deal, such that you can’t delete “the bad ones” (e.g. envy) from an AGI without messing everything else up.

(Part of the challenge is that false-positives, e.g. where the AGI feels compassion towards microbes or teddy bears or whatever, are a very big problem, just as false-negatives are.)
