Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms—see Email: Twitter: @steve47285. Employer: Physicist by training.


Intro to Brain-Like-AGI Safety



I wish that everyone (including OP) would be clearer about whether or not we’re doing worst-case thinking, and why.

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”. I don’t have a strong reason to expect that to happen, and I also don’t have a strong reason to expect that to not happen. I mostly feel uncertain and confused.

So if the debate is “Are Eliezer & Nate right about ≳99% (or whatever) chance of doom?”, then I find myself on the optimistic side (at least, leaving aside the non-technical parts of the problem), whereas if the debate is “Do we have a strong reason to believe that thus-and-such plan will actually solve technical alignment?”, then I find myself on the pessimistic side.


Separately, I don’t think it’s true that reflectively-stable hard superintelligence needs to have a particular behavioral goal, for reasons here.

If you think some more specific aspect of this post is importantly wrong for reasons that are downstream of that, I’d be curious to hear more details.

In this post, I’m discussing a scenario where one AGI gets out of control and kills everyone. If the people who programmed and turned on that AGI were not omnicidal maniacs who wanted to wipe out humanity, then I call that an “accident”. If they were omnicidal maniacs then I call that a “bad actor” problem. I think that omnicidal maniacs are very rare in the human world, and therefore this scenario that I’m talking about is an “accident” scenario.

From reading the post you linked, my best guess is that

(1) You’re not thinking about this scenario in the first place,

(2) If you are, then you would say something like “When we use the word ‘accident’, it suggests ‘avoidable accident caused by stupid or reckless people’, but maybe the real problem was a race-to-the-bottom on safety, which need not be avoidable and need not involve any stupid or reckless people.”

If it’s (1), this whole long post is about why that scenario seems very difficult to avoid, from my current perspective. (If your argument is “that scenario won’t happen because something else will kill us first”, then I happen to disagree, but that’s off-topic for this post.)

If it’s (2), well I don’t see why the word “accident” has to have that connotation. It doesn’t have that connotation to me. I think it’s entirely possible for people who are neither stupid nor reckless to cause x-risk by accident. A lot of this post is about structural factors, seems to me, and Section 3.4 in particular seems to be an argument which is structural in nature, where I note that by default, more and more people are going to train more and more powerful AGIs, and thus somebody is going to make one that is motivated to cause mass destruction sooner or later—even if there isn’t a race-to-the-bottom on safety / alignment, and certainly if there is. That’s a structural argument, I think.

So anyway, I don’t think that my using the word “accident” has rendered me incapable of thinking about structural factors, right?

Sorry if I’m misunderstanding. Anyway, thanks for your comment.

I was an independent AGI safety researcher because I didn't want to move to a different city and (at the time, it might or might not have changed in the past couple years) few if any orgs that might hire me were willing to hire remote workers.

Hmm. I’m not sure it’s that important what is or isn’t “behaviorism”, and anyway I’m not an expert on that (I haven’t read original behaviorist writing, so maybe my understanding of “behaviorism” is a caricature by its critics). But anyway, I thought Scott & Eliezer were both interested in the question of what happens when the kid grows up and the parents are no longer around.

My comment above was a bit sloppy. Let me try again. Here are two stories:

“RL with continuous learning” story: The person has an internal reward function in their head, and over time they’ll settle into the patterns of thought & behavior that best tickle their internal reward function. If they spend a lot of time in the presence of their parents, they’ll gradually learn patterns of thought & behavior that best tickle their internal reward function in the presence of their parents. If they spend a lot of time hanging out with friends, they’ll gradually learn patterns of thought & behavior that best tickle their internal reward function when they’re hanging out with friends. As adults in society, they’ll gradually learn patterns of thought & behavior that best tickle their internal reward function as adults in society.

“RL learn-then-get-stuck” story: As Scott wrote in OP, “a child does something socially proscribed (eg steal). Their parents punish them. They learn some combination of "don't steal" and "don't get caught stealing". A few people (eg sociopaths) learn only "don't get caught stealing", but most of the rest of us get at least some genuine aversion to stealing that eventually generalizes into a real sense of ethics.” (And that “real sense of ethics” persists through adulthood.)

I think lots of evidence favors the first story over the second story, at least in humans (I don’t know much about non-human animals). Particularly: (1) heritability studies, (2) cultural shifts, (3) people’s ability to have kinda different personalities in different social contexts, like reverting to childhood roles / personalities when they visit family for the holidays. I don’t want to say that the second story never happens, but it seems to me to be an unusual edge case, like childhood phobias / trauma that persists into adulthood, whereas the first story is central.

That’s one topic, maybe the main one at issue here. Then a second topic is: even leaving aside what happens after the kid grows up, let’s zoom in on childhood. I wrote “If they spend a lot of time in the presence of their parents, they’ll gradually learn patterns of thought & behavior that best tickle their internal reward function in the presence of their parents.” In that context, my comment above was bringing up the fact that IMO parental control over rewards is pretty minimal, such that the “patterns of thought & behavior that best tickle the kid’s internal reward function in the presence of their parents” can be quite different from “the thoughts & behaviors that the parent wishes the kid would have”. I think this has a lot to do with the fact that the parent can’t see inside the kid’s head and issue positive rewards when the kid thinks docile & obedient thoughts, and negative rewards when the kid thinks defiant thoughts. If defiant thoughts are their own reward in the kid’s internal reward function, then the kid is getting a continuous laser-targeted stream of rewards for thinking defiant thoughts, potentially hundreds or thousands of times per day, whereas a parent’s opportunities to ground their kid or withhold dessert or whatever are comparatively rare and poorly-targeted.

Hmm, maybe. I talk about training compute in Section 4 of this post (upshot: I’m confused…). See also Section 3.1 of this other post. If training is super-expensive, then run-compute would nevertheless be important if (1) we assume that the code / weights / whatever will get leaked in short order, (2) the motivations are changeable from "safe" to "unsafe" via fine-tuning or decompiling or online-learning or whatever. (I happen to strongly expect powerful AGI to necessarily use online learning, including online updating the RL value function which is related to motivations / goals. Hope I’m wrong! Not many people seem to agree with me on that.)


We also know that in many cases the brain and some ANN are actually computing basically the same thing in the same way (LLMs and linguistic cortex), and it's now obvious and uncontroversial that the brain is using the sparser but larger version of the same circuit, whereas the LLM ANN is using the dense version which is more compact but less energy/compute efficient (as it uses/accesses all params all the time).

I disagree with “uncontroversial”. Just off the top of my head, people who I’m pretty sure would disagree with your “uncontroversial” claim include Randy O’Reilly, Josh Tenenbaum, Jeff Hawkins, Dileep George, these people, maybe some of the Friston / FEP people, probably most of the “evolved modularity” people like Steven Pinker, and I think Kurzweil (he thought the cortex was built around hierarchical hidden Markov models, last I heard, which I don’t think are equivalent to ANNs?). And me! You’re welcome to argue that you’re right and we’re wrong (and most of that list are certainly wrong, insofar as they’re also disagreeing with each other!), but it’s not “uncontroversial”, right?

The true fundamental information capacity of the brain is probably much smaller than 1e14 bytes, but that has nothing to do with the size of an actually *efficient* circuit, because efficient circuits (efficient for runtime compute, energy etc) are never also efficient in terms of information compression.

In the OP (Section 3.3.1) I talk about why I don’t buy that—I don’t think it’s the case that the brain gets dramatically more “bang for its buck” / “thinking per FLOP” than GPT-3. In fact, it seems to me to be the other way around.

Then “my model of you” would reply that GPT-3 is much smaller / simpler than the brain, and that this difference is the very important secret sauce of human intelligence, and the “thinking per FLOP” comparison should not be brain-vs-GPT-3 but brain-vs-super-scaled-up-GPT-N, and in that case the brain would crush it. And I would disagree about the scale being the secret sauce. But we might not be able to resolve that—guess we’ll see what happens! See also footnote 16 and surrounding discussion.

Bit of a nitpick, but I think you’re misdescribing AIXI. I think AIXI is defined to have a reward input channel, and its collection-of-all-possible-generative-world-models are tasked with predicting both sensory inputs and reward inputs, and Bayesian-updated accordingly, and then the generative models are issuing reward predictions which in turn are used to choose maximal-reward actions. (And by the way it doesn’t really work—it under-explores and thus can be permanently confused about counterfactual / off-policy rewards, IIUC.) So AIXI has no utility function.

That doesn’t detract from your post, it’s just that I maybe wouldn’t have used the term “AIXI-like” for the AIs that you’re describing, I think.

(There’s a decent chance that I’m confused and this whole comment is wrong.)

Ooh interesting! Can you say how you're figuring that it's "gigabytes of information"?

I’ve spent thousands of hours reading neuroscience papers, I know how synapses work, jeez :-P

Similarly we never have to bother with a "minicolumn".  We only care about what works best.  Notice how human aerospace engineers never developed flapping wings for passenger aircraft, because they do not work all that well.  

We probably will find something way better than a minicolumn.  Some argue that's what a transformer is.

I’m sorta confused that you wrote all these paragraphs with (as I understand it) the message that if we want future AGI algorithms to do the same things that a brain can do, then it needs to do MAC operations in the same way that (you claim) brain synapses do, and it needs to have 68 TB of weight storage just as (you claim) the brain does. …But then here at the end you seem to do a 180° flip and talk about flapping wings and transformers and “We probably will find something way better”. OK, if “we probably will find something way better”, do you think that the “way better” thing will also definitely need 68 TB of memory, and definitely not orders of magnitude less than 68 TB? If you think it definitely needs 68 TB of memory, no way around it, then what’s your basis for believing that? And how do you reconcile that belief with the fact that we can build deep learning models of various types that do all kinds of neat things like language modeling and motor control and speech synthesis and image recognition etc. but require ≈100-100,000× less than 68 TB of memory? How are you thinking about that? (Maybe you have a “scale-is-all-you-need” perspective, and you note that we don’t have AGI yet, and therefore the explanation must be “insufficient scale”? Or something else?)
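For concreteness, the arithmetic behind that range can be spelled out in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1e12 bytes) and takes the claimed 68 TB figure and the ≈100-100,000× range at face value. The variable names are made up for illustration.

```python
BRAIN_CLAIM_BYTES = 68e12  # the claimed 68 TB, assuming 1 TB = 1e12 bytes

# The two ends of the "100-100,000x less memory" range from the text.
for factor in (100, 100_000):
    model_bytes = BRAIN_CLAIM_BYTES / factor
    print(f"{factor:>7,}x smaller -> {model_bytes / 1e9:,.2f} GB")
```

Under those assumptions the range runs from roughly 680 GB at the large end down to roughly 680 MB at the small end.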

There's a MAC in there.

OK, imagine for the sake of argument that we live in the following world (a caricatured version of this model):

  • Dendrites have lots of clusters of 10 nearby synapses
  • Iff all 10 synapses within one cluster get triggered simultaneously, then it triggers a dendritic spike on the downstream neuron.
  • Different clusters on the same dendritic tree can each be treated independently
    • As background, the whole dendrite doesn’t have a single voltage (let alone the whole dendritic tree). Dendrites have different voltages in different places. If there are multiple synaptic firings that are very close in both time and space, then the voltages can add up and get past the spike threshold; but if multiple synapses that are very far apart from each other fire simultaneously, they don’t add up, they each affect the voltage in their own little area, and it doesn’t create a dendritic spike.
  • The upstream neurons are all firing on a regular clock cycle, such that the synapse firing is either “simultaneous” or “so far apart in time that we can treat each timestep independently”.

In this imaginary world, you would use AND (within each cluster of 10 synapses) and OR (between clusters) to calculate whether dendritic spikes happen or not. Agree?

Using MACs in this imaginary world is both too complicated and too simple. It’s too complicated because it’s a very wasteful way to calculate AND. It’s too simple because it’s wrong to MAC together spatially-distant synapses, when in fact spatially-distant synapses can’t collaboratively create a spike.

If you’re with me so far, that’s what I mean when I say that this model has “no MAC operations”.
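Here is a minimal Python sketch of the imaginary world described above, just to make the contrast concrete. The function names (`dendritic_spike`, `naive_mac`), the unit weights, the threshold, and the example inputs are all made up for illustration. The only thing taken from the caricature is the rule itself: AND within a cluster of 10 synapses, OR between clusters.

```python
from typing import List

CLUSTER_SIZE = 10  # synapses per cluster, as in the caricature

def dendritic_spike(clusters: List[List[bool]]) -> bool:
    """AND within each cluster, OR between clusters."""
    return any(all(cluster) for cluster in clusters)

def naive_mac(clusters: List[List[bool]], threshold: float = CLUSTER_SIZE) -> bool:
    """Hypothetical MAC-style unit: unit weights, one sum over every synapse."""
    total = sum(1.0 * float(s) for cluster in clusters for s in cluster)
    return total >= threshold

# Ten active synapses, all in the same cluster.
one_full = [[True] * 10, [False] * 10]
# Ten active synapses, split 5 + 5 across two spatially distant clusters.
spread = [[True] * 5 + [False] * 5, [True] * 5 + [False] * 5]

print(dendritic_spike(one_full), naive_mac(one_full))
print(dendritic_spike(spread), naive_mac(spread))
```

In the `spread` case the two rules disagree: the MAC-style sum crosses its threshold even though no single cluster is fully active, which is the "too simple" failure. The multiply-and-add over 20 synapses to answer a yes/no question per cluster is the "too complicated" part.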

And by the way, I think we could reformulate this same algorithm to have a very different low-level implementation (but the same input and output), by replacing “groups of neurons that form clusters together” with “serial numbers”. Then there would be no MACs and there would be no multi-synapse ANDs, but rather there would be various hash tables or something, I dunno. And the memory requirements would be different, as would the number of required operations, presumably.
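A rough sketch of what that reformulation might look like, with heavy caveats: everything here is hypothetical, including the `Unit` class and the specific serial numbers. It assumes each distinct group of co-active upstream neurons has already been assigned one integer ID, and that a downstream unit simply stores the IDs it responds to.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class Unit:
    """Hypothetical downstream unit: one serial number per former cluster."""
    triggers: Set[int] = field(default_factory=set)

    def spikes(self, active_patterns: Set[int]) -> bool:
        # A set-membership check stands in for the per-cluster AND
        # and the OR across clusters.
        return not self.triggers.isdisjoint(active_patterns)

unit = Unit(triggers={17, 42, 99})  # three "clusters", stored as three integers

print(unit.spikes({42}))    # a stored pattern is active this timestep
print(unit.spikes({5, 6}))  # no stored pattern is active
```

In this toy version the unit stores three integers instead of thirty synapse entries, and a lookup replaces the multi-synapse AND. Whether a real implementation would look anything like this is an open question.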

At this point maybe you’re going to reply “OK but that’s an imaginary world, whereas I want to talk about the real world.” Certainly the bullet points above are erasing real-world complexities. But it’s very difficult to judge which real-world complexities are actually playing an important role in brain algorithms and which aren’t. For example, should we treat (certain classes of) cortical synapses as having binary strength rather than smoothly-varying strength? That’s a longstanding controversy! Do neurons really form discrete and completely-noninteracting clusters on dendrites? I doubt it…but maybe the brain would work better if they did!! What about all the other things going on in the cortex? That’s a hard question. There are definitely other things going on unrelated to this particular model, but it’s controversial exactly what they are.

Thanks for your comment! I am not a GPU expert, if you didn’t notice. :) 

I might note that you could have tried to fill in the "cartoon switch" for human synapses.  They are likely a MAC for each incoming axon…

This is the part I disagree with. For example, in the OP I cited this paper which has no MAC operations, just AND & OR. More importantly, you’re implicitly assuming that whatever neocortical neurons are doing, the best way to do that same thing on a chip is to have a superficial 1-to-1 mapping between neurons-in-the-brain and virtual-neurons-on-the-chip. I find that unlikely. Back to that paper just above, things happening in the brain are (supposedly) encoded as random sparse subsets of active neurons drawn from a giant pool of neurons. We could do that on the chip, if we wanted to, but we don’t have to! We could assign them serial numbers instead! We can do whatever we want! Also, cortical neurons are arranged into six layers vertically, and in the other direction, 100 neurons are tied into a closely-interconnected cortical minicolumn, and 100 minicolumns in turn form a cortical column. There’s a lot of structure there! Nobody really knows, but my best guess from what I’ve seen is that a future programmer might have one functional unit in the learning algorithm called a “minicolumn” and it’s doing, umm, whatever it is that minicolumns do, but we don’t need to implement that minicolumn in our code by building it out of 100 different interconnected virtual neurons. Yes the brain builds it that way, but the brain has lots of constraints that we won’t have when we’re writing our own code—for example, a GPU instruction set can do way more things than biological neurons can (partly because biological neurons are so insanely slow that any operation that requires more than a couple serial steps is a nonstarter).
