Steve Byrnes
Boston, MA, USA

I'm an AGI safety researcher with a particular focus on brain algorithms. See Email: Twitter: @steve47285


Intro to Brain-Like-AGI Safety

Wiki Contributions


[Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts


For example, humans…

Just to be clear, I was speculating in that section about filial imprinting in geese, not familial bonding in humans. I presume that those two things are different in lots of important ways. In fact, for all I know, they might have nothing whatsoever in common. ¯\_(ツ)_/¯

If the learned representations change over time as the agent learns, the thought assessors have to keep up and do the same, otherwise their accuracy will slowly degrade over time.

Yeah, that seems possible (although I also consider it possible that it’s not a problem; by analogy, catastrophic forgetting is famously more of an issue for ANNs than for brains).

If the learned representations do in fact change a lot over time, I’m slightly skeptical that it would be possible to solve that problem directly, thanks to the lack of an independent ground truth. For example, I can imagine a system that says “If I’m >95% confident that this is MOMMY, then update such that I’m 100% confident that this is MOMMY.” Maybe that system would work to keep pointing at the real mommy, even as learned representations drift. But also, maybe that system would cause the Thought Assessor to gradually go off the rails and trigger off weird patterns in noise. Not sure. Did you have something like that in mind? Or something different?

An alternative might be that, if the specific filial-imprinting mechanism gradually stops working over time, it deactivates at some point and the (now-adolescent) goose switches to some other mechanism(s), like “desire to be with fellow geese that are extremely familiar to me” a la Section 13.4.

Reminder that I know very little about goose behavior and this is all casual speculation. :)

[Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA


how do we reverse-engineering human social instincts?

I don't know! Getting a better idea is high on my to-do list. :)

I guess broadly, the four things are (1) “armchair theorizing” (as I was doing in Post #13), (2) reading / evaluating existing theories, (3) reading / evaluating existing experimental data (I expect mainly neuroscience data, but perhaps also psychology etc.), (4) doing new experiments to gather new data.

As an example of (3) & (4), I can imagine something like “the connectomics and microstructure of the something-or-other nucleus of the hypothalamus” providing a helpful hint about what's going on; this information might or might not already be in the literature.

Neuroscience experiments are presumably best done by academic groups. I hope that neuroscience PhDs are not necessary for the other things, because I don’t have one myself :-P

AFAICT, in a neuroscience PhD, you might learn lots of facts about the hypothalamus and brainstem, but those facts almost definitely won’t be incorporated into a theoretical framework involving (A) calculating reward functions for RL (as in Section, (B) the symbol grounding problem (as in Post #13). I really like that theoretical framework, but it seems uncommon in the literature.

FYI, here on lesswrong, “Gunnar_Zarncke” & “jpyykko” have been trying to compile a list of possible instincts, or something like that, Gunnar emailed me but I haven’t had time to look closely and have an opinion; just wanted to mention that.

[Intro to brain-like-AGI safety] 14. Controlled AGI

consider the fusion power generator scenario

It's possible that I misunderstood what you were getting at in that post. I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story? You could have equally well written the post as “Suppose, a few years from now, I set about trying to design a cheap, simple fusion power generator - something I could build in my garage and use to power my house. After years of effort, I succeed….” Is that correct?

If so, I think that’s a problem that can be mitigated in mundane ways (e.g. mandatory inventor training courses spreading best-practices for brainstorming unanticipated consequences, including red-teams, structured interviews, etc.), but can't be completely solved by humans. But it also can’t be completely solved by any possible AI, because AIs aren’t and will never be omniscient, and hence may make mistakes or overlook things, just as humans can.

Maybe you're thinking that we can make AIs that are less prone to human foibles like wishful thinking and intellectual laziness etc.? But I’m optimistic that we can make “social instinct” brain-like AGIs that are also unusually good at avoiding those things (after all, some humans are significantly better than others at avoiding those things, while still having normal-ish social instincts and moral intuitions).

[Intro to brain-like-AGI safety] 14. Controlled AGI

I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.

So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.” From that perspective, I would be concerned that if the (so-called) subroutine never wanted to do anything bad or stupid, then the outer AI is redundant, and if the (so-called) subroutine did want to do something bad or stupid, then the outer AI may not be able to recognize and stop it.

Separately, shouldn't “doing something catastrophically stupid” become progressively less of an issue as the AGI gets “smarter”? And insofar as caution / risk-aversion / etc. is a personality type, presumably we could put a healthy dose of it into our AGIs.

[Intro to brain-like-AGI safety] 14. Controlled AGI

should be conceptually straightforward to model how humans would reason about those concepts or value them

Let’s say that the concept of an Em had never occurred to me before, and now you knock on my door and tell me that there’s a thing called Ems, and you know how to make them but you need my permission, and now I have to decide whether or not I care about the well-being of Ems. What do I do? I dunno, I would think about the question in different ways, I would try to draw analogies to things I already knew about, maybe I would read some philosophy papers, and most of all I would be implicitly probing my own innate "caring" reaction(s) and seeing exactly what kinds of thoughts do or don't trigger it.

Can we make an AGI that does all that? I say yes: we can build an AGI with human-like “innate drives” such that it has human-like moral intuitions, and then it applies those human-like intuitions in a human-like way when faced with new out-of-distribution situations. That’s what I call the “Social-Instinct AGI” research path, see Post #12.

But if we can do that, we’ve already arguably solved the whole AGI safety problem. I suspect you have something different in mind?

[Intro to brain-like-AGI safety] 14. Controlled AGI

This approach is probably particularly characteristic of my approach.

Yeah, you were one of the “couple other people” I alluded to. The other was ‪Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).

my approach … ontological lock …

I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.

[Intro to brain-like-AGI safety] 14. Controlled AGI

Thanks! One of my current sources of mild skepticism right now (which again you might talk me out of) is:

  • For capabilities reasons, the AGI will probably need to be able to add things to its world-model / ontology, including human-illegible things, and including things that don't exist in the world but which the AGI imagines (and could potentially create).
  • If the AGI is entertaining a plan of changing the world in important ways (e.g. inventing and deploying mind-upload technology, editing its own code, etc.), it seems likely that the only good way of evaluating whether it's a good plan would involve having opinions about features of the future world that the plan would bring about—as opposed to basing the evaluation purely on current-world-features of the plan, like the process by which it was made.
  • …And in that case, it's not sufficient to have rigorous concepts / things that apply in our world, but rather we need to be able to pick those concepts / things out of any possible future world that the AGI might bring about.
  • I'm mildly skeptical that we can find / define such concepts / things, especially for things that we care about like “corrigibility”.
  • …And thus the story needs something along the lines of out-of-distribution edge-case detection and handling systems like Section 14.4.
[Intro to brain-like-AGI safety] 14. Controlled AGI

Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?

[Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts

little glimpse of empathy has some literature under the term mirror neurons

Sorta, but unfortunately the "mirror neuron" literature seems to be a giant dumpster fire. I suggest & endorse the book The Myth Of Mirror Neurons by Hickok.

Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

I would say “Humanity's current state is so spectacularly incompetent that even the obvious problems with obvious solutions might not be solved”.

If humanity were not spectacularly incompetent, then maybe we wouldn't have to worry about the obvious problems with obvious solutions. But we would still need to worry about the obvious problems with extremely difficult and non-obvious solutions.

Load More