Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See for a summary of my research and sorted list of writing. Email: Also on Twitter, Mastodon, Threads. Physicist by training.




There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.

I dunno, I wrote “invalid (or at least, open to question)”. I don’t think that’s too strong. Like, just because it’s “open to question”, doesn’t mean that, upon questioning it, we won’t decide it’s fine. I.e., it’s not that the conclusion is necessarily wrong, it’s that the original argument for it is flawed.

Of course I agree that the morning paper thing would probably be fine for humans, unless the paper somehow triggered an existential crisis, or I tried a highly-addictive substance while reading it, etc.  :)

Some relevant context is: I don’t think it’s realistic to assume that, in the future, AI models will be only slightly fine-tuned in a deployment-specific way. I think the relevant comparison is more like “can your values change over the course of years”, not “can your values change after reading the morning paper?”

Why do I think that? Well, let’s imagine a world where you could instantly clone an adult human. One might naively think that there would be no more on-the-job learning ever. Instead, (one might think), if you want a person to help with chemical manufacture, you open the catalog to find a human who already knows chemical manufacturing, and order a clone of them; and if you want a person to design widgets, you go to a different catalog page, and order a clone of a human widget-design expert; and so on.

But I think that’s wrong.

I claim there would be lots of demand to clone a generalist—a person who is generally smart and conscientious and can get things done, but not specifically an expert in metallurgy or whatever the domain is. And then, this generalist would be tasked with figuring out whatever domains and skills they didn’t already have.

Why do I think that? Because there are just too many possible specialties, and especially combinations of specialties, for a pre-screened clone-able human to already exist in each of them. Like, think about startup founders. They’re learning how to do dozens of things. Why don’t they outsource their office supply questions to an office supply expert, and their hiring questions to a hiring expert, etc.? Well, they do to some extent, but there are coordination costs, and more importantly the experts would lack all the context necessary to understand what the ultimate goals are. What are the chances that there’s a pre-screened clone-able human who knows about the specific combination of things that a particular application needs (rural Florida zoning laws AND anti-lock brakes AND hurricane preparedness AND …)?

So instead I expect that future AIs will eventually do massive amounts of figuring-things-out in a nearly infinite variety of domains, and moreover that the figuring out will never end. (Just as the startup founder never stops needing to learn new things, in order to succeed.) So I don’t like plans where the AI is tested in a standardized way, and then it’s assumed that it won’t change much in whatever one of infinitely many real-world deployment niches it winds up in.

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

I disagree about whether that distinction matters:

I think technical discussions of AI safety depend on the AI-algorithm-as-a-whole; I think “does the algorithm have such-and-such component” is not that helpful a question.

So for example, here’s a nightmare-scenario that I think about often:

  • (step 1) Someone reads a bunch of discussions about LLM x-risk
  • (step 2) They come down on the side of “LLM x-risk is low”, and therefore (they think) it would be great if TAI is an LLM as opposed to some other type of AI
  • (step 3) So then they think to themselves: Gee, how do we make LLMs more powerful? Aha, they find a clever way to build an AI that combines LLMs with open-ended real-world online reinforcement learning or whatever.

Even if (step 2) is OK (which I don’t want to argue about here), I am very opposed to (step 3), particularly the omission of the essential part where they should have said “Hey wait a minute, I had reasons for thinking that LLM x-risk is low, but do those reasons apply to this AI, which is not an LLM of the sort that I'm used to, but rather it’s a combination of LLM + open-ended real-world online reinforcement learning or whatever?” I want that person to step back and take a fresh look at every aspect of their preexisting beliefs about AI safety / control / alignment from the ground up, as soon as any aspect of the AI architecture and training approach changes, even if there’s still an LLM involved.  :)

(Sorry in advance if this whole comment is stupid, I only read a bit of the report.)

As context, I think the kind of technical plan where we reward the AI for (apparently) being helpful is at least not totally doomed to fail. Maybe I’m putting words in people’s mouths, but I think even some pretty intense doomers would agree with the weak statement “such a plan might turn out OK for all we know” (e.g. Abram Demski talking about a similar situation here, Nate Soares describing a similar-ish thing as “maybe one nine” a.k.a. a mere 90% chance that it would fail). Of course, I would rather have a technical plan for which there’s a strong reason to believe it would actually work. :-P

Anyway, if that plan had a catastrophic safety failure (assuming proper implementation etc., and also assuming a situationally-aware AI), I think I would bet on a goal misgeneralization failure mode over a “schemer” failure mode. Specifically, such an AI could plausibly (IMO) wind up feeling motivated by any combination of “the idea of getting a reward signal”, or “the idea of the human providing a reward signal”, or “the idea of the human feeling pleased and motivated to provide a reward signal”, or “the idea of my output having properties X,Y,Z (which make it similar to outputs that have been rewarded in the past)”, or whatever else. None of those possible motivations would require “scheming”, if I understand that term correctly, because in all cases the AI would generally be doing things during training that it was directly motivated to do (as opposed to only instrumentally motivated). But some of those possible motivations are really bad because they would make the AI think that escaping from the box, launching a coup, etc., would be an awesome idea, given the opportunity.

(Incidentally, I’m having trouble fitting that concern into the Fig. 1 taxonomy. E.g. an AI with a pure wireheading motivation (“all I want is for the reward signal to be high”) is intrinsically motivated to get reward each episode as an end in itself, but it’s also intrinsically motivated to grab power given an opportunity to do so. So would such an AI be a “reward-on-the-episode seeker” or a “schemer”? Or both?? Sorry if this is a stupid question, I didn’t read the whole report.)

GPT-4 is different from APTAMI. I'm not aware of any method that starts with movies of humans, or human-created internet text, or whatever, and then does some kind of ML, and winds up with a plausible human brain intrinsic cost function. If you have an idea for how that could work, then I'm skeptical, but you should tell me anyway. :)

“Extract from the brain” how? A human brain has like 100 billion neurons and 100 trillion synapses, and they’re generally very difficult to measure, right? (I do think certain neuroscience experiments would be helpful.) Or do you mean something else?

I would say “the human brain’s intrinsic-cost-like-thing is difficult to figure out”. I’m not sure what you mean by “…difficult to extract”. Extract from what?

The “similar reason as why I personally am not trying to get heroin right now” is “Example 2” here (including the footnote), or a bit more detail in Section 9.5 here. I don’t think that involves an idiosyncratic anti-heroin intrinsic cost function.

The question “What is the intrinsic cost in a human brain” is a topic in which I have a strong personal interest. See Section 2 here and links therein. “Why don’t humans have an alignment problem” is sorta painting the target around the arrow, I think? Anyway, if you radically enhanced human intelligence and let those super-humans invent every possible technology, I’m not too sure what you would get (assuming they don’t blow each other to smithereens). Maybe that’s OK though? Hard to say. Our distant ancestors would think that we have awfully weird lifestyles, and might strenuously object to them, if they could have a say.

Maybe the view of alignment pessimists is that the paradigmatic human brain’s intrinsic cost is intractably complex.

Speaking for myself, I think the human brain’s intrinsic-cost-like-thing is probably hundreds of lines of pseudocode, or maybe low thousands, certainly not millions. (And the part that’s relevant for AGI is just a fraction of that.) Unfortunately, I also think nobody knows what those lines are. I would feel better if they did. That wouldn’t be enough to make me “optimistic” overall, but it would certainly be a step in the right direction. (Other things can go wrong too.)
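To illustrate the *kind* of object being claimed here, the following is a toy sketch in Python of what an “intrinsic-cost-like-thing expressed as a few lines of pseudocode” might look like: a short function mapping innately-recognized signals to a scalar cost. Every signal name and weight below is invented purely for illustration; nobody knows the real list, which is exactly the point of the paragraph above.

```python
# Purely illustrative toy, NOT a claim about actual brain drives:
# the point is only the *shape* of the object -- a short function from
# innately-recognized signals to a scalar cost, which could plausibly fit
# in hundreds (not millions) of lines of pseudocode.

def toy_intrinsic_cost(signals: dict) -> float:
    """Map innately-recognized signals (each a float in [0, 1]) to a
    scalar cost. Lower is better; negative terms act like rewards."""
    cost = 0.0
    cost += 10.0 * signals.get("pain", 0.0)            # hypothetical innate aversion
    cost -= 2.0 * signals.get("sweet_taste", 0.0)      # hypothetical innate appetite
    cost -= 3.0 * signals.get("social_approval", 0.0)  # hypothetical innate social drive
    cost += 1.0 * signals.get("hunger", 0.0)           # hypothetical homeostatic signal
    return cost

print(toy_intrinsic_cost({"pain": 0.5, "sweet_taste": 1.0}))  # 3.0
```

The real version would presumably have far more terms and far subtler signal detectors, but the claim above is that the overall object is of roughly this scale, not astronomically larger.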

I agree that it would be nice to get to a place where it is known (technically) how to make a kind AGI, but nobody knows how to make an unkind AGI. That's what you're saying, right? If so, yes that would be nice, but I see it as extraordinarily unlikely. I’m optimistic that there are technical ways to make a kind AGI, but I’m unaware of any remotely plausible approach to doing so that would not be straightforwardly modifiable to turn it into a way to make an unkind AGI.

I’m not sure what you think my expectations are. I wrote “I am not crazy to hope for whole primate-brain connectomes in the 2020s and whole human-brain connectomes in the 2030s, if all goes well.” That’s not the same as saying “I expect those things”; it’s more like “those things are not completely impossible”. I’m not an expert but my current understanding is (1) you’re right that existing tech doesn’t scale well enough (absent insane investment of resources), (2) it’s not impossible that near-future tech could scale much better than current tech. I’m particularly thinking of the neuron-barcoding technique that E11 is trying to develop, which would (if I understand correctly) make registration of neurons between different slices easy and automatic and essentially perfect. Again, I’m not an expert, and you can correct me. I appreciate your comment.

I find it interesting that he says there is no such thing as AGI, but acknowledges that machines will "eventually surpass human intelligence in all domains where humans are intelligent", even though that would meet most people's definition of AGI.

The somewhat-reasonable-position-adjacent-to-what-Yann-believes would be: “I don’t like the term ‘AGI’. It gives the wrong idea. We should use a different term instead. I like ‘human-level AI’.”

I.e., it’s a purely terminological complaint. And it’s not a crazy one! Lots of reasonable people think that “AGI” was a poorly-chosen term, although I still think it’s possibly the least-bad option.

Yann’s actual rhetorical approach tends to be:

  • Step 1: (re)-define the term “AGI” in his own idiosyncratic and completely insane way;
  • Step 2: say there’s no such thing as “AGI” (as so defined), and that anyone who talks about AGI is a moron.

I talk about it in much more detail here.
