My AGI safety research—2022 review, ’23 plans

Steven Byrnes

“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –DL Moody (allegedly)

The short version: In this post I’m briefly summarizing how I spent my work-time in 2022, and what I’m planning for 2023.

The first half of 2022 was writing the “Intro to Brain-Like-AGI Safety” blog post series.
The second half of 2022 was split maybe 45%-45%-10% between my main research project (on reverse-engineering human social instincts), miscellaneous other research and correspondence, and outreach mostly targeted towards neuroscientists.

I expect to carry on with a similar time allocation into 2023.

If you think there are other things I should be doing instead or differently, please don’t be shy: the comment section is below, or you can DM me, email me, etc.

The long version:

1. First half of 2022: Writing “Intro to Brain-Like-AGI Safety”

So, I was writing some technical post in late 2021, and realized that the thing I was talking about was a detail sitting on top of a giant pile of idiosyncratic beliefs and terminology that nobody else would understand. So I started writing a background section to that post. That background section grew and grew and grew, and eventually turned into a book-length series of 15 blog posts entitled “Intro to Brain-Like AGI Safety”, which reorganized and re-explained almost everything I had written and thought about up to that point, since I started in the field around 2019. (My palimpsest!) Writing that series took up pretty much 100% of my work time until May 2022.

Then I spent much of the late spring and summer catching up on lots of miscellaneous to-do-list stuff that I had put off while writing the series, and everyone in my family caught COVID^[1], and we took a family vacation, and I attended two conferences, and I switched jobs when Jed McCaleb generously offered me a home at Astera Institute, and various other things. So I didn’t get much research done during the late spring and summer.

Moving on to the rest of the year, my substantive work time has been divided, I dunno, something like 45%-45%-10% between “my main research project”, “other research”, and “outreach”. Let’s take those one at a time in the next three sections.

2. Second half of 2022 (1/3): My main research project

2.1 What’s the project?

I’m working on the open neuroscience problem that I described in the post Symbol Grounding and Human Social Instincts, and motivated in the post Two paths forward: “Controlled AGI” and “Social-instinct AGI”. I’ll give an abbreviated version here.

As discussed in “Intro to Brain-Like-AGI Safety”, I hold the following opinions:

We should think of within-lifetime learning in the human brain as a kind of model-based reinforcement learning (RL) system;
We should think of that model-based RL system as potentially similar to how future AGIs will work;
We should (to a first approximation) think of the “reward function” of that RL system as encoding “innate drives”, like pain being bad and sweet tastes being good;
These “innate drives” correspond to specific genetically-hardwired circuitry primarily in the hypothalamus and brainstem;
A subset of that circuitry underlies human social and moral instincts;
…And the project I’m working on is an attempt to figure out what those circuits are and how they work.

2.2 Why do I think success on this project would be helpful for AGI safety?

I have two arguments:

The modest argument is: At some point, I hope, we will have a science that can produce predictions of the form:

(“Innate drives” a.k.a. “Reward function” X)
+ (“Life experience” a.k.a. “Training environment” Y)
→ (“Adult” AGI that’s trying to do Z)

If we knew exactly what innate drives are in humans (particularly related to sociality, morality, etc.), then we would have actual examples of X+Y→Z to ground this future science.

Even with the benefit of actual examples, building a science that can predict Z from X+Y seems very hard, don’t get me wrong. Still, I think we’ll be in a better place if we have actual examples of X+Y→Z, than if we don’t.
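To make the X+Y→Z notation a bit more concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than anything from the research itself: the five-cell corridor, the event names `food` and `friend`, the `train` function, and the use of plain tabular Q-learning (which is model-free, and far simpler than anything brain-like). All it shows is that the same learning code yields a different learned disposition Z when either the reward function X or the environment Y changes.

```python
import random

def train(reward_fn, env, episodes=500, seed=0):
    """(Reward function X) + (environment Y) -> what the trained agent heads for (Z).

    env is a list of cells; each cell holds an event name or None.
    Both end cells must hold an event, so every episode terminates.
    """
    rng = random.Random(seed)
    start = len(env) // 2
    q = {(s, a): 0.0 for s in range(len(env)) for a in (-1, +1)}
    for _ in range(episodes):
        s = start
        while env[s] is None:                 # an episode ends at any event cell
            a = rng.choice((-1, +1))          # uniform random exploration
            s2 = s + a
            r = reward_fn.get(env[s2], 0.0)   # X decides which events are rewarding
            done = env[s2] is not None
            future = 0.0 if done else max(q[(s2, -1)], q[(s2, +1)])
            q[(s, a)] += 0.5 * (r + 0.9 * future - q[(s, a)])
            s = s2
    # Z: the direction the trained agent prefers from the start cell
    return "left" if q[(start, -1)] > q[(start, +1)] else "right"

Y = ["food", None, None, None, "friend"]
Y_swapped = ["friend", None, None, None, "food"]

print(train({"food": 1.0}, Y))            # left
print(train({"friend": 1.0}, Y))          # right  (different X, same Y)
print(train({"friend": 1.0}, Y_swapped))  # left   (same X, different Y)
```

The hard part of the real problem is, of course, that Z is nothing like a single direction in a corridor, and the map from X and Y to Z cannot be read off by running a few hundred short episodes.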

The bolder argument is: Maybe we can just steal ideas from human social instincts for AGI.

I need to elaborate here. I do not think it’s a good idea to slavishly and unthinkingly copy human social instincts into an AGI. Why is that a bad idea?

For one thing, human social instincts leave something to be desired! For example, I don’t want AGIs with teenage angst, or zero-sum status drive, or bloodlust, etc.
For another thing, in the X+Y→Z calculus above, an AGI with human-like innate drives X will not necessarily grow up to have human-like goals and desires Z, unless it also has a human-like training environment Y. And I think it’s very likely that AGIs will have different training environments than human children, at least in some ways (e.g. lack of human body, more capacity to self-modify).

On the other hand, if we first understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients, that seems like quite possibly a great idea.

Again, see Two paths forward: “Controlled AGI” and “Social-instinct AGI” for further discussion.

2.3 Why is this such a priority that it warrants a large fraction of my time?

Impact—see above.
Tractability—It’s in principle tractable, in the sense that there’s a concrete algorithmic problem, and there’s a specific solution to that problem implemented in the brain, and I am “merely” trying to figure out what that solution is. And while there’s a decent chance that I’ll hit roadblocks at some point, I feel a sense of steady progress so far (see below).
Neglectedness—This problem seems importantly neglected, in the “bus factor” sense. I’m happily not the only person on Earth trying to bring neuroscience knowledge to bear on AGI safety / alignment questions in general,^[2] but I do unfortunately seem to be the only one working on this particular neuroscience puzzle.^[3] For example, I’m quite sure that I’m the only person in AGI safety who cares a whit about the algorithmic role of neuropeptide receptors in the lateral septum.

2.4 Recent progress on that project

2.4.1 Shoring up foundations

I spent quite a bit of time in the summer and fall getting up to speed on the hypothalamus (see my book review on that topic) and other relevant parts of the brain (basal forebrain, amygdala, NAc, etc.—this book was especially helpful!).

I have also made a lot of progress towards cleaning up some of the sketchy bits and loose ends of my big-picture understanding of model-based RL in the brain. It seems that some aspects of my neuroscience discussion in the first half of “Intro to Brain-Like-AGI Safety” will be different in the next iteration! But generally (1) none of those mistakes has any important downstream implications for how one should think about AGI safety, and (2) those mistakes were pretty much all in areas that I had explicitly flagged as especially speculative. I mostly feel proud of myself for continuing to make progress, rather than annoyed at myself for having written things that were wrong; if you think that’s the incorrect takeaway, we can discuss in the comments.

2.4.2 OK, great, but how do social instincts work?

I still don’t know. The shoring-up-foundations work above is giving me a progressively better sense of what I’m looking for and where. But I’d better keep working!

Philosophically, my general big-picture plan / workflow for solving the problem is:

(A) Come up with plausible theories / pseudocode for how human social instincts might work;
(B) Read the literature on socially-relevant bits of the hypothalamus & brainstem, including how they interface with the striatum etc.;
(C) Try to match up (A)+(B)—and iterate;
(D) If more experiments are needed, e.g. because there’s more than one plausible theory, try to figure out which experiments, and somehow make them happen.

In the second half of 2022 I’ve been almost entirely focused on (B), but I’m finally getting to the point where it’s beneficial for me to spend more time on (A) and (C). I’m not really thinking about (D) yet, and have a looming suspicion that (D) will be intractable, especially if I wind up thinking that human social instincts are importantly different from rat social instincts, because I suspect that the kinds of experiments that we need are not possible in humans. I hope I’m wrong! But even if (D) were to fail, I think what I’m working on would still be good—I think having several plausible theories of human social instincts would still be a significant improvement over having zero, from the perspective of Safe & Beneficial AGI.

3. Second half of 2022 (2/3): Miscellaneous other research

I do a lot of things that are not “my main research project”. Much of it is kinda scattered—email correspondence, lesswrong comments, something random that I want to get off my chest, etc. I think that’s fine.

One of the larger projects that I started was my idea to do a brain-dump-post on the AGI deployment problem, basically as a way of forcing myself to think about that topic more carefully. I’ve been publishing it in pieces—so far, this one on AGI consciousness, and this one on offense-defense balance. Hopefully there will be more. For example, I need to think more about training environments. If we raise an AGI in a VR environment for a while, and then give it access to the real world, will the AGI wind up feeling like the VR environment is “real” and the real world isn’t? (Cf. surveys about the “Experience Machine”.) If so, what can we do about that? Alternatively, if we decide to raise an AGI in a literal robot body, how on earth would that be practical and competitive? Or is there a third option? Beats me.

I’m also hoping to write a follow-up on that offense-defense balance post mentioned above, discussing how I updated from the comments / correspondence afterwards.

4. Second half of 2022 (3/3): Outreach, field-building, etc.

Outreach, field-building, etc. are time-consuming, stressful for me, and not particularly my comparative advantage, I think. So I don’t do it much. Sorry everyone! One exception is outreach towards the neuroscience community in particular, which in some cases I’m somewhat-uniquely positioned to do well, I think. The “Intro to Brain-Like-AGI Safety” series itself is (in part) beginner-friendly pedagogical outreach material of that type, and later in the year I did this podcast appearance and this post. I will endeavor to continue doing things like that from time to time into 2023.

Also, I recently made a 1-hour talk (UPDATE: I also now have a 30-minute version) based on the “Intro to brain-like AGI” series. If you have previously invited me to give a talk, and I said “Sorry but I don’t have any talk to give”, then you can try asking me again. As long as I don’t have to travel.

5. On to 2023!

Looking back, I think I’m pretty happy with how I’ve been allocating time, and plan to just keep moving forward as I have since the summer. If you think that’s bad or suboptimal, let’s chat in the comments section!

I’d like to give my thanks to my family, to my old funder Beth Barnes / EA Funds Donor Lottery Program, to my new employer Astera, to my colleagues and coworkers, to my biweekly-productivity-status-check-in-friendly-volunteer-person, to the people who write interesting things for me to read, to the people who write helpful replies to and criticisms of my blog posts and comments, to Lightcone Infrastructure for running this site, and to all of you for reading this far. To a happy, healthy, and apocalypse-free 2023!

^[1] Nobody got a bad case of COVID, but there was much time-consuming annoyance, particularly from lost childcare.
^[2] For example (in reverse alphabetical order) (I think) Eli Sennesh, Adam Safron, Beren Millidge, Linda Linsefors, Seth Herd, Nathan Helm-Burger, Jon Garcia, Patrick Butlin, plus the AIntelope people and maybe some of the shard theory people, plus various other people to whom I apologize for omitting.
^[3] I hope I’m not insulting the AIntelope people here. They’re interested in the same general problem, but are using very different methods from me, methods which will hopefully ultimately be complementary to what I’m trying to do.

Will you publish all the progress you make on decoding social instincts, or would that result in an unacceptable increase in s-risks and/or socially-capable-AI?

I expect the results of my main research project (reverse-engineering human social instincts) to be publishable:

I don’t expect that publishing would increase socially-capable-AI. High-functioning sociopaths, who (to oversimplify somewhat) lack normal social instincts, are nevertheless very socially capable—in some ways they can be more socially capable than neurotypical people. If you think about it, an intelligent agent can form a good model of a car engine, and then use that model to skillfully manipulate the engine; well, by the same token, an intelligent agent can form a good model of a human, and then use that model to skillfully manipulate the human. You don’t need social instincts for that. Social instincts are mainly about motivations, not capabilities.
I expect that publishing would net decrease s-risks, not increase them. However, this is a long story that involves various hard-to-quantify considerations in both directions, and I think reasonable people can disagree about how they balance out. I have written down some sketchy notes trying to work through all the considerations; email me if you’re interested.
You didn’t bring this up, but I think there’s a small but nonzero chance that the story of social instincts will wind up involving aspects that I don’t want to publish because of concerns about speeding timelines-to-AGI, in which case I would probably endeavor to publish as much of the story as I could without saying anything problematic.

I expect that publishing would net decrease s-risks, not increase them. However

Yeah, I'd be interested in this, and will email you. That said, I'll just lay out my concerns here for posterity. What generated my question in the first place was thinking "what could possibly go wrong with publishing a reward function for social instincts?" My brain helpfully suggested that someone would use it to cognitively-shape their AI in a half-assed manner because they thought that the reward function is all they would need. Next thing you know, we're all living in super-hell^[1].

You didn’t bring this up, but I think there’s a small but nonzero chance that the story of social instincts will wind up involving aspects that I don’t want to publish because of concerns about speeding timelines-to-AGI

You mind giving some hypothetical examples? This sounds plausible, but I'm struggling to think of concrete examples beyond vague thoughts like "maybe explaining social instincts involves describing a mechanism for sample efficient learning".

^[1] Yes, that is an exaggeration, but I like the sentence.

You mind giving some hypothetical examples?

If we think of brain within-lifetime learning as roughly a model-based RL algorithm, then

questions like “how exactly does this model-based RL algorithm work? what’s the model? how is it updated? what’s the neural architecture? how does the value function work? etc.” are all highly capabilities-relevant, and
the question “what is the reward function?” is mostly not capabilities-relevant.

There are exceptions—e.g. curiosity is part of the reward function but probably helpful for capabilities—but I don’t think social instincts are one of those exceptions. If social instincts are in versus out of the reward function, I think you get a powerful AGI either way—note that high-functioning sociopaths are generally intelligent and competent. More thorough discussion of this topic here.

So that’s basically why I’m optimistic that social instincts won’t be capabilities-relevant.

However, social instincts are probably not as simple as “a term in a reward function”, they’re probably somewhat more complicated than that, and it’s at least possible that there are aspects of how social instincts work that cannot be properly explained except in the context of a nuts-and-bolts understanding of the gory details of the model-based RL algorithm. I still think that’s unlikely, but it’s possible.
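A toy sketch of that separation, with the caveat above that real social instincts are probably more complicated than one reward term. The two-agent coin world, the `work` and `steal` actions, and the names `plan`, `model`, `selfish`, and `prosocial` are all hypothetical, made up for illustration. The planner (standing in for the capabilities side) is identical in both calls; only the reward function changes.

```python
from itertools import product

ACTIONS = ("work", "steal")

def model(state, action):
    """Toy world model: state is (my_coins, your_coins)."""
    mine, yours = state
    if action == "work":
        return (mine + 1, yours)      # earn one coin
    return (mine + 2, yours - 2)      # "steal": take two coins from the other agent

def plan(model, reward_fn, state, actions, depth):
    """Brute-force lookahead: pick the action sequence whose end state scores highest."""
    def outcome(seq):
        s = state
        for a in seq:
            s = model(s, a)
        return reward_fn(s)
    return max(product(actions, repeat=depth), key=outcome)

def selfish(state):
    return state[0]                   # only my own coins count

def prosocial(state):
    return state[0] + state[1]        # the other agent's coins count too

start = (0, 5)
print(plan(model, selfish, start, ACTIONS, 2))    # ('steal', 'steal')
print(plan(model, prosocial, start, ACTIONS, 2))  # ('work', 'work')
```

Making `plan` smarter (deeper search, a better `model`) would be a capabilities change; editing `prosocial` is a motivation change, and the two can be varied independently in this toy setup.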

"what could possibly go wrong with publishing a reward function for social instincts?" My brain helpfully suggested that someone would use it to cognitively-shape their AI in a half-assed manner because they thought that the reward function is all they would need. Next thing you know, we're all living in super-hell

A big question is: If I don’t reverse-engineer human social instincts, and nobody else does either, then what AGI motivations should we expect? Something totally random like a paperclip maximizer? Well, lots of reasonable people expect that, but I mostly don’t; I think there are pretty obvious things that future programmers can and will do that will get them into the realm of “the AGI’s motivations have some vague distorted relationship to humans and human values”, rather than “the AGI’s motivations are totally random” (e.g. see here). And if the AGI’s motivations are going to be at least vaguely related to humans and human values whether we like it or not, then by and large I think I’d rather empower future programmers with tools that give them more control and understanding, from an s-risk perspective.

This is drifting a bit far afield from the neurobio aspect of this research, but do you have an opinion about the likelihood that a randomly sampled human, if endowed with truly superhuman powers, would utilize those powers in a way that we'd be pleased to see from an AGI?

It seems to me like we have many salient examples of power corrupting, and absolute power corrupting to a great degree. Understanding that there's a distribution of outcomes, do you have an opinion about the likelihood of benevolent use of great power, among humans?

This is not to say that this understanding can't still be usefully employed, but somehow it seems like a relevant question. E.g. if it turns out that most of what keeps humans acting pro-socially is the fear that anti-social behavior will trigger their punishment by others, that's likely not as juicy a mechanism since it may be hard to convince a comparatively omniscient and omnipotent being that it will somehow suffer if it does anti-social things.

(lightly edited from my old comment here)

For what it’s worth, Eliezer in 2018 said that he’d be pretty happy with endowing some specific humans with superhuman powers:

If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.

I’ve shown the above quote to a lot of people who say “yes that’s perfectly obvious”, and I’ve also shown this quote to a lot of people who say “Eliezer is being insufficiently cynical; absolute power corrupts absolutely”. For my part, I don’t have a strong opinion, but on my models, if we know how to make virtual humans, then we probably know how to make virtual humans without envy and without status drive and without teenage angst etc., which should help somewhat. More discussion here.

Thanks for the thoughtful reply. I read the fuller discussion you linked to and came away with one big question which I didn't find addressed anywhere (though it's possible I just missed it!)

Looking at the human social instinct, we see that it indeed steers us towards not wanting to harm other humans, but it weakens when extended to other creatures, somewhat in proportion to their difference from humans. We (generally) have lots of empathy for other humans, less so for apes, less so for other mammals (which we factory farm by the billions without most people particularly minding it), probably less so for octopi (which are bright but quite different), and almost none for the zillion microorganisms, some of which we allegedly evolved from. I would guess that even canonical Good Person Paul Christiano probably doesn't lose much sleep over his impact on microorganisms.

This raises the question of whether the social instinct we have, even if fully reverse engineered, can be deployed separately from the identity of the entity to which it is attached. In other words, if the social instinct circuitry humans have is "be nice to others in proportion to how similar to yourself they are", which seems to match the data, then we would need more than just the ability to place that circuitry in AGIs (which would presumably make the AGIs want to be nice to other similar AGIs). We would in fact need to be able to tease apart the object of empathy and replace it with something very different from how humans operate. No human is nice to microorganisms, so I see no evidence that the existing social instincts ever make a person be nice to something very different from, and much weaker than, themselves, and I would expect it to work similarly in an AGI.

This is speculative, but it seems reasonably likely to me to turn out to be an actual problem. Curious if you have thoughts on it.

Thanks!

I don’t think “be nice to others in proportion to how similar to yourself they are” is part of it. For example, dogs can be nice to humans, and to goats, etc. I guess your response is ‘well dogs are a bit like humans and goats’. But are they? From the dog’s perspective? They look different, sound different, smell different, etc. I don’t think dogs really know what they are in the first place, at least not in that sense. Granted, we’re talking about humans not dogs. But humans can likewise feel compassion towards animals, especially cute ones (cf. “charismatic megafauna”). Do humans like elephants because elephants are kinda like humans? I mean, I guess elephants are more like humans than microbes are. But they’re still pretty different. I don’t think similarity per se is why humans care about elephants. I think it’s something about the elephants’ cute faces, and the cute way that they move around.

More specifically, my current vague guess is that the brainstem applies some innate heuristics to sensory inputs to guess things like “that thing there is probably a person”. This includes things like heuristics for eye-contact-detection and face-detection and maybe separately cute-face-detection etc. The brainstem also has heuristics that detect the way that spiders scuttle and snakes slither (for innate phobias). I think these heuristics are pretty simple; for example, the human brainstem face detector (in the superior colliculus) has been studied a bit, and the conclusion seems to be that it mostly just detects the presence of three dark blobs of about the right size, in an inverted triangle. (The superior colliculus is pretty low resolution.)
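To give a feel for how simple a "three dark blobs in an inverted triangle" heuristic can be, here is a rough Python sketch. It is not a model of the superior colliculus: the grid representation, the darkness threshold, the one-row tolerance for "eyes level", and the function names are all assumptions made up for illustration, and the "about the right size" check is left out entirely.

```python
def dark_blobs(img, threshold=0.3):
    """Return centroids (row, col) of 4-connected regions darker than threshold."""
    h, w = len(img), len(img[0])
    seen, centroids = set(), []
    for r in range(h):
        for c in range(w):
            if img[r][c] >= threshold or (r, c) in seen:
                continue
            stack, cells = [(r, c)], []
            seen.add((r, c))
            while stack:                      # flood-fill one dark region
                y, x = stack.pop()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and (ny, nx) not in seen and img[ny][nx] < threshold:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            centroids.append((sum(y for y, _ in cells) / len(cells),
                              sum(x for _, x in cells) / len(cells)))
    return centroids

def looks_like_face(img, threshold=0.3):
    """True if exactly three dark blobs sit in an inverted triangle:
    two roughly level blobs above, one blob below and between them."""
    blobs = sorted(dark_blobs(img, threshold))    # sorted by row, then column
    if len(blobs) != 3:
        return False
    (r1, c1), (r2, c2), (r3, c3) = blobs
    eyes_level = abs(r1 - r2) <= 1
    mouth_below = r3 > max(r1, r2)
    mouth_between = min(c1, c2) < c3 < max(c1, c2)
    return eyes_level and mouth_below and mouth_between

# 1 = bright, 0 = dark; a deliberately low-resolution "image"
FACE = [[1, 1, 1, 1, 1],
        [1, 0, 1, 0, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 0, 1, 1],
        [1, 1, 1, 1, 1]]

print(looks_like_face(FACE))        # True
print(looks_like_face(FACE[::-1]))  # False (upright triangle)
```

A detector this crude fires on plenty of things that are not faces, which is fine for a heuristic whose job is only to produce a quick first guess.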

If we’re coding the AGI, we can design those sensory heuristics to trigger on whatever we want. Presumably we would just use a normal ConvNet image classifier for this. If we want the AGI to find cockroaches adorably “cute”, and kittens gross, I think that would be really straightforward to code up.
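As a minimal sketch of the "trigger on whatever we want" point: the `innate_reaction` function, the valence table, and the label names below are hypothetical. The dictionary of label probabilities stands in for the output of an ordinary image classifier, which is not implemented here.

```python
def innate_reaction(label_probs, valence):
    """Probability-weighted innate valence; labels missing from the table count as neutral."""
    return sum(p * valence.get(label, 0.0) for label, p in label_probs.items())

# Designer's choice: cockroaches are "cute" (+1), kittens are "gross" (-1).
VALENCE = {"cockroach": +1.0, "kitten": -1.0}

# Stand-in for classifier output on a photo that is probably a kitten.
probs = {"kitten": 0.75, "cockroach": 0.25}
print(innate_reaction(probs, VALENCE))  # -0.5
```

The classifier does the perceptual work; which labels count as cute or gross is just a table that the programmer fills in.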

So I’m not currently worried about that exact thing. I do have a few kinda-related concerns though. For example, maybe adult social emotions can only develop after lots and lots of real-time conversations with real-world humans, and that’s a slow and expensive kind of training data for an AGI. Or maybe the development of adult social emotions is kinda a package deal, such that you can’t delete “the bad ones” (e.g. envy) from an AGI without messing everything else up.

(Part of the challenge is that false-positives, e.g. where the AGI feels compassion towards microbes or teddy bears or whatever, are a very big problem, just as false-negatives are.)

Also, I recently made a 1-hour talk based on the “Intro to brain-like AGI” series.

Is there a recording available? Or slides?

Wasn’t recorded. I’ll email you the powerpoint.