Oh, it’s definitely controversial—as I always say, there is never a neuroscience consensus. My sense is that a lot of the controversy is about how broadly to define “reinforcement learning”.
If you use a narrow definition like “RL is exactly those algorithms that are on arxiv cs.AI right now with an RL label”, then the brain is not RL.
If you use a broad definition like “RL is anything with properties like Thorndike's law of effect”, then, well, remember that “reinforcement learning” was a psychology term long before it was an AI term!
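To make that broad definition concrete, here's a minimal toy sketch of a learner with law-of-effect-type properties (the action names, learning rate, and choice rule are all invented for illustration): responses that get followed by reward are strengthened and become more likely to be emitted again, with no world model and no planning.

```python
import random

# Toy law-of-effect learner: each candidate response has a "strength", and
# whatever response was just emitted gets nudged toward the reward that
# followed it. (Action names and numbers are invented for illustration.)
strengths = {"press_lever": 0.0, "groom": 0.0, "wander": 0.0}
LEARNING_RATE = 0.1

def choose_action() -> str:
    # Mostly emit the strongest response; occasionally act at random (exploration).
    if random.random() < 0.1:
        return random.choice(list(strengths))
    return max(strengths, key=strengths.get)

def reinforce(action: str, reward: float) -> None:
    # Thorndike-style update: strengthen (or weaken) the just-emitted action
    # in proportion to the reward signal that followed it.
    strengths[action] += LEARNING_RATE * (reward - strengths[action])

# One "trial": suppose pressing the lever yields a food pellet (reward 1.0).
action = choose_action()
reinforce(action, reward=1.0 if action == "press_lever" else 0.0)
```

Nothing in that sketch looks like the RL algorithms on arxiv, but on the broad definition it's still "reinforcement learning".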
If it helps, I was arguing about this with a neuroscientist friend (Eli Sennesh) earlier this year, and wrote the following summary (not necessarily endorsed by Eli) afterwards in my notes:
- Eli doesn’t like the term “RL” in a brain context because of (1) its implication that “reward” is stuff in the environment, as opposed to an internal “reward function” built from brain-internal signals, and (2) its implication that we’re specifically maximizing an exponentially-discounted sum of future rewards (in the standard textbook sense; see the equations just after this list).
- …Whereas I like the term “RL” because (1) If brain-like algorithms showed up on GitHub, then everyone in AI would call it an “RL algorithm”, put it in “RL textbooks”, and use it to solve “RL problems”, (2) This follows the historical usage (there’s reinforcement, and there’s learning, per Thorndike’s Law of Effect etc.).
- When I want to talk about “the brain’s model-based RL system”, I should translate that to “the brain’s Bellman-solving system” when I’m talking to Eli, and then we’ll be more-or-less on the same page I think?
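For concreteness, the two textbook ingredients at issue there are the exponentially-discounted return and the Bellman equation that the corresponding value function satisfies (standard definitions, nothing brain-specific):

$$G_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}, \qquad V^\pi(s) = \mathbb{E}_\pi\!\left[\, r_{t+1} + \gamma\, V^\pi(s_{t+1}) \,\middle|\, s_t = s \,\right], \qquad 0 \le \gamma < 1.$$

Eli’s complaint is about treating the first as the thing being maximized; my “Bellman-solving” phrasing is about algorithms organized around (approximately) satisfying the second.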
…But Eli is just one guy, I think there are probably dozens of other schools-of-thought with their own sets of complaints or takes on “RL”.
> Personally, my stance is something more like, “It seems very feasible to create sophisticated AI architectures that don’t act as scary maximizers.” To me it seems like this is what we’re doing now, and I see some strong reasons to expect this to continue. (I realize this isn’t guaranteed, but I do think it’s pretty likely)
We probably mostly disagree because you’re expecting LLMs forever and I’m not. For example, AlphaZero does act as a scary maximizer. Indeed, nobody knows any way to make an AI that’s superhuman at Go, except by techniques that produce scary maximizers. Is there a way to make an AI that’s superhuman at founding and running innovative companies, but isn’t a scary maximizer? That’s beyond present AI capabilities, so the jury is still out.
The issue is basically “where do you get your capabilities from?” One place to get capabilities is by imitating humans. That’s the LLM route, but (I claim) it can’t go far beyond the hull of existing human knowledge. Another place to get capabilities is specific human design (e.g. the heuristics that humans put into Deep Blue), but that has the same limitation. That leaves consequentialism as a third source of capabilities, and it definitely works in principle, but it produces scary maximizers.
> While the human analogies are interesting, I assume they might appeal more to the “consequentialist AIs are still coming” crowd than people like myself. Humans were evolved for some pretty wacky reasons, and have a large number of serious failure modes…
Yup, my expectation is that ASI will be even scarier than humans, by far. But we are in agreement that humans with power are much-more-than-zero scary.
> I’d flag that in a competent and complex AI architecture, I’d expect that many subcomponents would have strong biases towards corrigibility and friendliness. This seems highly analogous to human minds, where it’s really specific sub-routines and similar that have these more altruistic motivations.
I’m not sure what you mean by “subcomponents”. Are you talking about subcomponents at the learning algorithm level, or subcomponents at the trained model level? For the former, I think both LLMs and human brains are mostly big simple-ish learning algorithms, without much in the way of subcomponents. For the latter (where I would maybe say “circuits” instead of “subcomponents”?), I would also disagree but for different reasons, maybe see §2 of this post.
> I personally find histories of engineering complex systems in predictable and controllable ways to be much more informative, for these challenges.
To explain my disagreement, I’ll start with an excerpt from my post here:
> Question: Do you expect almost all companies to eventually be founded and run by AGIs rather than humans? …
> 3.2.4 Possible Answer 4: “No, because if someone wants to start a business, they would prefer to remain in charge themselves, and ask an AGI for advice when needed, rather than ‘pressing go’ on an autonomous entrepreneurial AGI.”
> That’s a beautiful vision for the future. It really is. I wish I believed it. But even if lots of people do in fact take this approach, and they create lots of great businesses, it just takes one person to say “Hmm, why should I create one great business, when I can instead create 100,000 great businesses simultaneously?”
> …And then let’s imagine that this one person starts “Everything, Inc.”, a conglomerate company running millions of AGIs that in turn are autonomously scouting out new business opportunities and then founding, running, and staffing tens of thousands of independent business ventures.
> Under the giant legal umbrella of “Everything, Inc.”, perhaps one AGI has started a business venture involving robots building solar cells in the desert; another AGI is leading an effort to use robots to run wet-lab biology experiments and patent any new ideas; another AGI is designing and prototyping a new kind of robot that’s specialized to repair other robots; another AGI is buying land and getting permits to eventually build a new gas station in Hoboken; various AGIs are training narrow AIs or writing other special-purpose software; and of course there are AGIs making more competent and efficient next-generation AGIs, and so on.
> Obviously, “Everything, Inc.” would earn wildly-unprecedented, eye-watering amounts of money, and reinvest that money to buy or build chips for even more AGIs that can found and grow even more companies in turn, and so on forever, as this person becomes the world’s first trillionaire, then the world’s first quadrillionaire, etc.
> That’s a caricatured example—the story could of course be far more gradual and distributed than one guy starting “Everything, Inc.”—but the point remains: there will be an extraordinarily strong economic incentive to use AGIs in increasingly autonomous ways, rather than as assistants to human decision-makers. And in general, when things are both technologically possible and supported by extraordinarily strong economic incentives, those things are definitely gonna happen sooner or later, in the absence of countervailing forces. …
So that’s one piece of where I’m coming from.
Meanwhile, as it happens, I have worked on “engineering complex systems in predictable and controllable ways”, in a past job at an engineering firm that made guidance systems for nuclear weapons and so on. The techniques we used involved understanding the engineered system incredibly well, understanding the environment / situations that the system would be in incredibly well, knowing exactly what the engineered system should do in any of those situations, and thus developing strong confidence and controls to ensure that the system would in fact do those things.
If I imagine applying those engineering techniques, or anything remotely like them, to “Everything, Inc.”, I just can’t. They seem obviously totally inapplicable. I know extraordinarily little about what any of these millions of AGIs is doing, or where they are, or what they should be doing.
See what I mean?
> …human capabilities are not largely explained by consequentialist planning…
I think I disagree with this. I would instead say something like: “Humans are the least intelligent species capable of building a technological civilization; but to the extent that humans have capabilities relevant to that, those capabilities CAN generally be explained by consequentialist planning; the role of Approval Reward is more about what people want than how capable they are of getting it.”
Note that, in this post I’m mostly focusing on the median human, who I claim spends a great deal of their life in simulacrum level 3. I’m not centrally talking about humans who are nerds, or unusually “agential”, etc., a category that includes most successful scientists, company founders, etc. If everyone was doing simulacrum 3 all the time, I don’t think humans would have invented science and technology. Maybe related: discussion of “sapient paradox” here.
I don’t know! IIRC they talk about related things a bit in this podcast but I wound up not really knowing what to make of it. (But I listened to it a year ago, and I think I’ve learned new things since then, perhaps I should try listening to it again.)
I feel like someone should be arguing the other side, and no one else has stepped up, so I guess I’ll have a go. :-P This comment will be like 75% my honest opinions and 25% devil’s advocate. Note that I wasn’t around at the time, sorry for any misunderstandings.
I think your OP conflates (1) “Eliezer was trying to build FAI” with (2) “Eliezer was loudly raising the salience of ASI risk (and thus, incidentally, the salience of ASI in general and how big a deal ASI is), along with related community-building etc.”. But these are two somewhat separate decisions that Eliezer made.
For example, you summarize an article as claiming “Shane Legg was introduced to the idea of AGI through a 2000 talk by Eliezer, and then co-founded DM in 2010 (following an introduction by Eliezer to investor Peter Thiel…)”. Those seem to be (2), not (1), right? Well, I guess the 2000 talk is neither (1) nor (2) (Eliezer didn’t yet buy AI risk in 2000), but more generally, MIRI could have directly tried to build FAI without Eliezer giving talks and introducing people, and conversely Eliezer could have given talks and introduced people without MIRI directly trying to build FAI.
So I’m skeptical that (1) (per se) contributed nontrivially to accelerating the race to ASI. For example, I’d be surprised if Demis founded DeepMind partly because he expected MIRI to successfully build ASI, and wanted to beat them to it. My guess is the opposite: Demis expected MIRI to fail to build powerful AI at all, and saw it as a safety outfit not doing anything relevant from a capabilities perspective. After all, DeepMind pursued a very different technical research direction.
On the one hand, I think there’s at least a strong prima facie case that (2) shortened timelines, which is bad. On the other hand, (2) helped build the field of alignment, which is good. So overall, how do we feel about (2)? I dunno. You yourself seemed to be endorsing (2) in 2004 (“…putting more resources into highlighting the dangers of unsafe AI…”). For my part, I have mixed feelings, but by default I tend to be in favor of (2) for kinda deontological reasons (if people’s lives are at risk, it’s by default good to tell them). But (2) is off-topic anyway; the thing you’re re-litigating is (1), right?
OK, next let’s talk about intelligence augmentation (IA), per the proposal in your other comment: “Given that there are known ways to significantly increase the number of geniuses (i.e., von Neumann level, or IQ 180 and greater), by cloning or embryo selection, an obvious alternative Singularity strategy is to invest directly or indirectly in these technologies, and to try to mitigate existential risks (for example by attempting to delay all significant AI efforts) until they mature and bear fruit (in the form of adult genius-level FAI researchers).”
There are geniuses today, and they mostly don’t work on FAI. Indeed, I think existing geniuses have done more to advance UFAI than FAI. I think the obvious zeroth-order model is that a world with more geniuses would just have all aspects of intellectual progress advance more rapidly, including both capabilities and alignment. So we’d wind up in the same place (i.e. probably doom), just sooner.
What would be some refinements on that zeroth-order model that make IA seem good?
One possible argument: “Maybe there’s a kind of ‘uncanny valley’ of ‘smart enough to advance UFAI but not smart enough to realize that it’s a bad idea’. And IA gets us a bunch of people who are all the way across the valley”. But uncanny-valley-theory doesn’t seem to fit the empirical data, from my perspective. When I look around, “raw intelligence” vs “awareness of AI risk and tendency to leverage that understanding into good decisions” seem somewhat orthogonal to me, as much as I want to flatter myself by thinking otherwise.
Another possible argument: “Maybe it’s not about the tippy-top of the intelligence distribution doing research, but rather the middle of the distribution, e.g. executives and other decisionmakers making terrible decisions”. But realistically we’re not going to be creating tens of millions of geniuses before ASI, enough to really shift the overall population distribution. Note that there are already millions of people smarter than, say, Donald Trump, but they’re not in charge of the USA, and he is. Ditto Sam Altman, etc. There are structural reasons for that, and those reasons won’t go away when thousands of super-geniuses appear on the scene.
Another possible argument: “If awareness of x-risk, good decision-making, etc., relies partly on something besides pure intelligence, e.g. personality … well OK fine, we can do embryo-selection etc. on both intelligence and (that aspect of) personality.” I’m a bit more sympathetic to this, but the science to do that doesn’t exist yet (details). (I might work on it at some point.)
So that’s the IA possibility, which I don’t think changes the overall picture much.

And now I’ll circle back to your five-point list. I already addressed the fifth. I claim that the other four are really bad things about our situation that we have basically no hope of avoiding. On my models, ASI doesn’t require much compute, just ideas, and people are already making progress developing those ideas. On the margin we can and should try to delay the inevitable, but ultimately someone is going to build it (and then probably everyone dies).

If it gets built in a more democratic and bureaucratic way, like by some kind of CERN for AI, then there are some nice things to say about that from the perspective of ethical procedure, but I don’t expect a better actual outcome than MIRI-of-2010 building it. Probably much worse. The project will still be rolling its own metaethics (at best!), the project will still be ignoring illegible safety problems, the project will almost definitely still involve key personnel winding up in a position to grab world-altering power, and the project will probably still be subjecting the whole world to dire risk by doing something that most of the world doesn’t want them to do. (Or if they pause to wait for global consensus, then someone else will build it in the meantime.) We still have all those problems, because those problems are unavoidable, alas.
> Do you think sociopaths are sociopaths because their approval reward is very weak?
Basically yes (+ also sympathy reward); see Approval Reward post §4.1, including the footnote.
> And if so, why do they often still seek dominance/prestige?
My current take is that prestige-seeking comes mainly from Approval Reward, and is very weak in (a certain central type of) sociopath, whereas dominance-seeking comes mainly from a different social drive, one that I discussed in “Neuroscience of human social instincts: a sketch” §7.1 but mostly haven’t thought about too much, and which may be strong in some sociopathic people (and weak in others).
I guess it’s also possible to prestige-seek not because prestige seems intrinsically desirable, but rather as a means to an end.
Yeah, I tried to catalog the ways that major prosocial drives may fail to trigger in the human world, in my Sympathy Reward post §4.1.1 and (relatedly) Approval Reward post §6.2. In brief, my list is:
In Scott Alexander’s breakdown, I think “outgroup” is basically #3 while “fargroup” is a lot of #1 and #4.
People often have callous indifference to fargroup welfare, but their attitude towards outgroups is even worse than indifference: they usually actively want the outgroup to suffer (cf. my discussion of “Schadenfreude Reward” and “Provocation Reward” towards enemies).
I definitely have strong concerns that Approval Reward won’t work on AGI. (But I don’t have an airtight no-go theorem either. I just don’t know; I plan to think about it more.) See especially footnote 7 of this post, and §6 of the Approval Reward post, for some of my concerns, which overlap with yours.
(I hope I wasn’t insinuating that I think AGI with Approval Reward is definitely a great plan that will solve AGI technical alignment. I’m open to wording changes if you can think of any.)
I don’t view this as particularly relevant to understanding human brains, intelligence, or AGI, but since you asked, if we define RL in the broad (psych-literature) sense, then here’s a relevant book excerpt: