“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –DL Moody (allegedly)
The short version: In this post I’m briefly summarizing how I spent my work-time in 2022, and what I’m planning for 2023.
I expect to carry on with a similar time allocation into 2023.
If you think there are other things I should be doing instead or differently, please don’t be shy: the comment section is below, or you can DM me, email me, etc.
The long version:
So, I was writing some technical post in late 2021, and realized that the thing I was talking about was a detail sitting on top of a giant pile of idiosyncratic beliefs and terminology that nobody else would understand. So I started writing a background section to that post. That background section grew and grew and grew, and eventually turned into a book-length series of 15 blog posts entitled “Intro to Brain-Like-AGI Safety”, which reorganized and re-explained almost everything I had written and thought about up to that point, since I started in the field around 2019. (My palimpsest!) Writing that series took up pretty much 100% of my work time until May 2022.
Then I spent much of the late spring and summer catching up on lots of miscellaneous to-do-list stuff that I had put off while writing the series, and everyone in my family caught COVID, and we took a family vacation, and I attended two conferences, and I switched jobs when Jed McCaleb generously offered me a home at Astera Institute, and various other things. So I didn’t get much research done during the late spring and summer.
Moving on to the rest of the year, my substantive work time has been divided, I dunno, something like 45%-45%-10% between “my main research project”, “other research”, and “outreach”. Let’s take those one at a time in the next three sections.
I’m working on the open neuroscience problem that I described in the post Symbol Grounding and Human Social Instincts, and motivated in the post Two paths forward: “Controlled AGI” and “Social-instinct AGI”. I’ll give an abbreviated version here.
As discussed in “Intro to Brain-Like-AGI Safety”, I hold the following opinions:
I have two arguments:
The modest argument is: At some point, I hope, we will have a science that can produce predictions of the form:
(“Innate drives” a.k.a. “Reward function” X) + (“Life experience” a.k.a. “Training environment” Y) → (“Adult” AGI that’s trying to do Z)
If we knew exactly what innate drives are in humans (particularly related to sociality, morality, etc.), then we would have actual examples of X+Y→Z to ground this future science.
Even with the benefit of actual examples, building a science that can predict Z from X+Y seems very hard, don’t get me wrong. Still, I think we’ll be in a better place if we have actual examples of X+Y→Z, than if we don’t.
The bolder argument is: Maybe we can just steal ideas from human social instincts for AGI.
I need to elaborate here. I do not think it’s a good idea to slavishly and unthinkingly copy human social instincts into an AGI. Why is that a bad idea?
On the other hand, if we first understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients, that seems like quite possibly a great idea.
Again, see Two paths forward: “Controlled AGI” and “Social-instinct AGI” for further discussion.
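To make the shape of the X + Y → Z prediction from the modest argument concrete, here is a toy sketch in Python. Everything in it is a made-up assumption for illustration: the action names `hoard` and `share`, the payoffs, the two reward functions, and the `train` helper. It is not a model of real human social instincts, and real cases would be vastly harder to predict. It only shows that the same learning rule can end up "trying to do" different things depending on the reward function X and the environment Y.

```python
from typing import Callable, Dict, Tuple

# Hypothetical types, for illustration only.
Outcome = Tuple[float, float]          # (payoff to self, payoff to someone else)
RewardFn = Callable[[Outcome], float]  # X: "innate drives" a.k.a. reward function
Environment = Dict[str, Outcome]       # Y: "life experience", mapping action -> outcome


def selfish(outcome: Outcome) -> float:
    """Reward depends only on the agent's own payoff."""
    return outcome[0]


def prosocial(outcome: Outcome) -> float:
    """Reward also counts the other party's payoff (a crude stand-in for a social drive)."""
    return outcome[0] + outcome[1]


def train(reward_fn: RewardFn, env: Environment, steps: int = 20, lr: float = 0.5) -> str:
    """Learn a value estimate per action, then return Z: the action the agent ends up preferring."""
    values = {action: 0.0 for action in env}
    actions = sorted(env)
    for step in range(steps):
        action = actions[step % len(actions)]  # try each action in turn (no randomness)
        reward = reward_fn(env[action])
        values[action] += lr * (reward - values[action])  # move estimate toward observed reward
    return max(values, key=values.get)


env_y: Environment = {"hoard": (3.0, 0.0), "share": (2.0, 2.0)}

print(train(selfish, env_y))    # hoard
print(train(prosocial, env_y))  # share
```

Changing Y while holding X fixed can also change Z: with `{"hoard": (3.0, 0.0), "share": (1.0, 1.0)}`, the `prosocial` agent ends up preferring `hoard`, because sharing is too costly in that environment.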
I spent quite a bit of time in the summer and fall getting up to speed on the hypothalamus (see my book review on that topic) and other relevant parts of the brain (basal forebrain, amygdala, NAc, etc.—this book was especially helpful!).
I have also made a lot of progress towards cleaning up some of the sketchy bits and loose ends of my big-picture understanding of model-based RL in the brain. It seems that some aspects of my neuroscience discussion in the first half of “Intro to Brain-Like-AGI Safety” will be different in the next iteration! But generally (1) none of those mistakes has any important downstream implications for how one should think about AGI safety, and (2) those mistakes were pretty much all in areas that I had explicitly flagged as especially speculative. I mostly feel proud of myself for continuing to make progress, rather than annoyed at myself for having written things that were wrong; if you think that’s the incorrect takeaway, we can discuss in the comments.
I still don’t know. The shoring-up-foundations work above is giving me a progressively better sense of what I’m looking for and where. But I’d better keep working!
Philosophically, my general big-picture plan / workflow for solving the problem is:
In the second half of 2022 I’ve been almost entirely focused on (B), but I’m finally getting to the point where it’s beneficial for me to spend more time on (A) and (C). I’m not really thinking about (D) yet, and have a looming suspicion that (D) will be intractable, especially if I wind up thinking that human social instincts are importantly different from rat social instincts, because I suspect that the kinds of experiments that we need are not possible in humans. I hope I’m wrong! But even if (D) were to fail, I think what I’m working on would still be good—I think having several plausible theories of human social instincts would still be a significant improvement over having zero, from the perspective of Safe & Beneficial AGI.
I do a lot of things that are not “my main research project”. Much of it is kinda scattered: email correspondence, LessWrong comments, something random that I want to get off my chest, etc. I think that’s fine.
One of the larger projects that I started was my idea to do a brain-dump-post on the AGI deployment problem, basically as a way of forcing myself to think about that topic more carefully. I’ve been publishing it in pieces—so far, this one on AGI consciousness, and this one on offense-defense balance. Hopefully there will be more. For example, I need to think more about training environments. If we raise an AGI in a VR environment for a while, and then give it access to the real world, will the AGI wind up feeling like the VR environment is “real” and the real world isn’t? (Cf. surveys about the “Experience Machine”.) If so, what can we do about that? Alternatively, if we decide to raise an AGI in a literal robot body, how on earth would that be practical and competitive? Or is there a third option? Beats me.
I’m also hoping to write a follow-up on that offense-defense balance post mentioned above, discussing how I updated from the comments / correspondence afterwards.
Outreach, field-building, etc. are time-consuming, stressful for me, and not particularly my comparative advantage, I think. So I don’t do it much. Sorry everyone! One exception is outreach towards the neuroscience community in particular, which in some cases I’m somewhat-uniquely positioned to do well, I think. The “Intro to Brain-Like-AGI Safety” series itself is (in part) beginner-friendly pedagogical outreach material of that type, and later in the year I did this podcast appearance and this post. I will endeavor to continue doing things like that from time to time into 2023.
Also, I recently made a 1-hour talk (UPDATE: I also now have a 30-minute version) based on the “Intro to Brain-Like-AGI Safety” series. If you have previously invited me to give a talk, and I said “Sorry but I don’t have any talk to give”, then you can try asking me again. As long as I don’t have to travel.
Looking back, I think I’m pretty happy with how I’ve been allocating time, and plan to just keep moving forward as I have since the summer. If you think that’s bad or suboptimal, let’s chat in the comments section!
I’d like to give my thanks to my family, to my old funder Beth Barnes / EA Funds Donor Lottery Program, to my new employer Astera, to my colleagues and coworkers, to my biweekly-productivity-status-check-in-friendly-volunteer-person, to the people who write interesting things for me to read, to the people who write helpful replies to and criticisms of my blog posts and comments, to Lightcone Infrastructure for running this site, and to all of you for reading this far. To a happy, healthy, and apocalypse-free 2023!
Nobody got a bad case of COVID, but there was much time-consuming annoyance, particularly from lost childcare.
For example (in reverse alphabetical order) (I think) Eli Sennesh, Adam Safron, Beren Millidge, Linda Linsefors, Seth Herd, Nathan Helm-Burger, Jon Garcia, Patrick Butlin, plus the AIntelope people and maybe some of the shard theory people, plus various other people to whom I apologize for omitting.
I hope I’m not insulting the AIntelope people here. They’re interested in the same general problem, but are using very different methods from me, methods which will hopefully ultimately be complementary to what I’m trying to do.
Will you publish all the progress you make on decoding social instincts, or would that result in an unacceptable increase in s-risks and/or socially-capable-AI?
I expect the results of my main research project (reverse-engineering human social instincts) to be publishable:
I expect that publishing would net decrease s-risks, not increase them. However
Yeah, I'd be interested in this, and will email you. That said, I'll just lay out my concerns here for posterity. What generated my question in the first place was thinking "what could possibly go wrong with publishing a reward function for social instincts?" My brain helpfully suggested that someone would use it to cognitively-shape their AI in a half-assed manner because they thought that the reward function is all they would need. Next thing you know, we're all living in super-hell.
You didn’t bring this up, but I think there’s a small but nonzero chance that the story of social instincts will wind up involving aspects that I don’t want to publish because of concerns about speeding timelines-to-AGI
You mind giving some hypothetical examples? This sounds plausible, but I'm struggling to think of concrete examples beyond vague thoughts like "maybe explaining social instincts involves describing a mechanism for sample efficient learning".
Yes, that is an exaggeration, but I like the sentence.
You mind giving some hypothetical examples?
If we think of brain within-lifetime learning as roughly a model-based RL algorithm, then
There are exceptions (e.g. curiosity is part of the reward function but probably helpful for capabilities), but I don’t think social instincts are one of those exceptions. Whether social instincts are in or out of the reward function, I think you get a powerful AGI either way; note that high-functioning sociopaths are generally intelligent and competent. More thorough discussion of this topic here.
So that’s basically why I’m optimistic that social instincts won’t be capabilities-relevant.
However, social instincts are probably not as simple as “a term in a reward function”; they’re probably somewhat more complicated than that, and it’s at least possible that there are aspects of how social instincts work that cannot be properly explained except in the context of a nuts-and-bolts understanding of the gory details of the model-based RL algorithm. I still think that’s unlikely, but it’s possible.
"what could possibly go wrong with publishing a reward function for social instincts?" My brain helpfully suggested that someone would use it to cognitively-shape their AI in a half-assed manner because they thought that the reward function is all they would need. Next thing you know, we're all living in super-hell
A big question is: If I don’t reverse-engineer human social instincts, and nobody else does either, then what AGI motivations should we expect? Something totally random like a paperclip maximizer? Well, lots of reasonable people expect that, but I mostly don’t; I think there are pretty obvious things that future programmers can and will do that will get them into the realm of “the AGI’s motivations have some vague distorted relationship to humans and human values”, rather than “the AGI’s motivations are totally random” (e.g. see here). And if the AGI’s motivations are going to be at least vaguely related to humans and human values whether we like it or not, then by and large I think I’d rather empower future programmers with tools that give them more control and understanding, from an s-risk perspective.
Also, I recently made a 1-hour talk based on the “Intro to brain-like AGI” series.
Is there a recording available? Or slides?
Wasn’t recorded. I’ll email you the PowerPoint.