“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –DL Moody (allegedly)
(Related: Solving the whole AGI control problem, version 0.0001. Also related: all my posts.)
I’m five months into my new job! This here is a forward-looking post where I write down what I’m planning to work on in the near future and why. Please comment (or otherwise reach out) if you think I’m prioritizing wrong or going in a sub-optimal direction for Safe & Beneficial AGI. That’s the whole reason this post exists!
I’ll work backwards: first the intermediate “problems to solve” that I think would help for the AGI control problem, then the immediate things I’m doing to work towards those goals.
1. Intermediate problems to solve
1.1 “The 1st-Person Problem”
It’s easy (or at least tractable) to find or generate tons of labeled data with safety-critical concepts and situations. For example, we can easily find a YouTube video where Alice is deceiving Bob, and label it "deception".
But is that useful? The thing we want to find and/or reinforce in our AGI’s world-model is not quite that. Instead it’s concepts related to the AGI’s own actions. We want the AGI to think “‘I am deceiving Bob’ is a very bad thing”. We don't want the AGI to think “‘Alice is deceiving Bob’ is a bad thing”, or that the abstract concept of deception is a bad thing.
In other words, getting 3rd-person labeled data is easy but unhelpful. What we really want is 1st-person labeled data.
OK, so then we can say, forget about the YouTube videos. Instead we'll do RL, or something like it. We'll have the AGI do things, and we'll label the things it does.
That is indeed a solution. But it’s a really dangerous one! We would have to catch the AGI in the act of deception before we could label deception as bad.
The real golden ticket would be if we could somehow magically transmute 3rd-person labeled data into 1st-person labeled data (or at least something equivalent to 1st-person labeled data).
How do we do that? I don’t know. I call this the “1st-Person Problem”.
Incidentally, in the diagram above, I put an arrow indicating that understanding social emotions might be helpful for solving the 1st-Person Problem. Why would I think that? Well, what I have in mind here is that, for example, if a 3-year-old sees a 7-year-old doing something, then often the 3-year-old immediately wants to do the exact same thing. So I figure, it’s at least possible that somewhere in the algorithms of human social instincts is buried a solution to the 1st-Person Problem. Maybe, maybe not.
1.2 The meta-problem of consciousness
I am strongly averse to becoming really competent to talk about the philosophy of consciousness. Well, I mean, it would be really fun, but it seems like it would also be really time-consuming, with all the terminology and theories and arguments and counter-arguments and whatnot, and I have other priorities right now.
But there’s a different question. There’s a set of observable psychological phenomena, where people declare themselves conscious, muse on the ineffable nature of consciousness, write papers about the hard problem of consciousness, and so on. Therefore there has to be some explanation, in terms of brain algorithms, for the chain of events that eventually leads to people talking about how they’re conscious. It seems to me that unraveling that chain of events is purely a question of neuroscience and psychology, not philosophy.
The “meta-problem of consciousness”—a.k.a. “Why do people believe that there’s a hard problem of consciousness”—is about unraveling this chain of events.
What's the point? Well, if I truthfully declare “I am wearing a watch”, and we walk through the chain of events that led to me saying those words, we’d find that one of the links in the chain is an actual physical watch that I’m actually physically wearing. So by the same token, it seems to me that a complete solution to the meta-problem of consciousness should directly lead to a solution to the hard problem of consciousness. (Leaving aside some fine print—see Yudkowsky on zombies.)
In terms of AGI, it seems to me that knowing whether or not AGI is conscious is an important thing to know, at least for the AGI’s sake. (Yeah I know—as if we don’t already have our hands full thinking about the impacts of AGI on humans!)
So working on the meta-problem of consciousness seems like one thing worth doing.
That said, I’m tentatively feeling OK about what I wrote here—basically my proto-theory of the meta-problem of consciousness is some idiosyncratic mishmash of Michael Graziano's "Attention Schema Theory", and Stanislas Dehaene's "Global Workspace Theory", and Keith Frankish's "Illusionism". I'll stay on the lookout for reasons to think that what I wrote there was wrong, but otherwise I think this topic is not high on my immediate priority list.
1.3 The meta-problem of suffering
The motivation here is exactly like the previous section: People talk about suffering. The fact that they are speaking those words is an observable psychological fact, and must have an explanation involving neuroscience and brain algorithms. And whatever that explanation is, it would presumably get us most of the way towards understanding suffering, and thus towards figuring out which AI algorithms are or aren’t suffering.
Unlike the previous section, I do think I’m pretty likely to do some work here, even in the short term, partly because I’m still pretty generally confused and curious about this. More importantly, it’s very closely tied to other things in AGI safety that I need to be working on anyway. In particular, it seems pretty intimately tied to decision-making and motivation—for example, we are motivated to not suffer, and we make decisions that lead to not-suffering. I’ve already written a lot about decision-making and motivation and plan to write more, because understanding and sculpting AGI motivation is a huge part of AGI safety. So with luck I'll have something useful to say about the meta-problem of suffering at some point.
1.4 What’s the API of the telencephalon?
I think the telencephalon (neocortex, hippocampus, amygdala, part of the basal ganglia) is running a learning algorithm (well, several learning algorithms) that starts from scratch at birth (“random weights” or something equivalent) and learns a big complicated world-model etc. within the animal’s lifetime. See “learning-from-scratch-ism” discussion here. And the brainstem and hypothalamus are mostly hardcoded algorithms that try to “steer” that learning algorithm towards doing useful things.
What are the input channels through which the brainstem can "steer" the telencephalon learning algorithms? I originally thought that there was a one-dimensional signal called “reward”, and that's it. But when I looked into it, I found a bunch more things! And all those things were at least plausibly promising ideas for AGI safety!
Two things in particular:
(1) a big reason that I wrote Reward Is Not Enough was as an excuse to promote the idea of supplying different reward signals for different subsystems, especially subsystems that report back to the supervisor. And where did I get that idea? From the telencephalon API of course!
(2) I’m pretty excited by the idea of interpretability and steering via supervised learning (see the “More Dakka” comment in the figure caption here), especially if we can solve the “1st-Person Problem” above. And where did I get that idea? From the telencephalon API of course!
So this seems like fertile soil to keep digging. There are certainly other things in the telencephalon API too. I would like to understand them!
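To make idea (1) a bit more concrete, here is a toy sketch of what "different reward signals for different subsystems" could look like. Everything in it is an assumption made up for illustration: the `Subsystem` class, the `supervisor_reward` function, and the subsystem names `"planner"` and `"reporter"` are hypothetical, and nothing here is a claim about how the brain or any real AGI design works. The only point is that a hardcoded supervisor hands each learner its own reward channel instead of one shared scalar.

```python
import random


class Subsystem:
    """Toy learner: keeps a running value estimate per action,
    updated only from its own reward channel."""

    def __init__(self, n_actions, lr=0.1):
        self.values = [0.0] * n_actions
        self.lr = lr

    def update(self, action, reward):
        # Nudge the estimate for `action` toward the reward received.
        self.values[action] += self.lr * (reward - self.values[action])

    def best_action(self):
        return max(range(len(self.values)), key=lambda a: self.values[a])


def supervisor_reward(name, action):
    """Hypothetical hardcoded supervisor. Each subsystem is scored by a
    different rule, so there is no single shared reward signal."""
    if name == "planner":
        return 1.0 if action == 2 else 0.0
    if name == "reporter":
        return 1.0 if action == 0 else 0.0
    raise ValueError(f"unknown subsystem: {name}")


def train(steps=500, seed=0):
    rng = random.Random(seed)
    subsystems = {name: Subsystem(n_actions=3) for name in ("planner", "reporter")}
    for _ in range(steps):
        for name, sub in subsystems.items():
            action = rng.randrange(3)  # pure random exploration, for simplicity
            sub.update(action, supervisor_reward(name, action))
    return subsystems


if __name__ == "__main__":
    for name, sub in train().items():
        print(name, "prefers action", sub.best_action())
```

With one shared scalar, both learners would be pushed toward the same behavior. With per-subsystem channels, each one settles on whatever its own channel rewards.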
1.5 Understand human social instincts well enough to implement them in code
1.6 Learned world-models with hardcoded pieces
My default presumption is that our AGIs will learn a world-model from scratch—loosely speaking, they’ll find patterns in sensory inputs and motor outputs, and then patterns in the patterns, and then patterns in the patterns in the patterns … etc.
But it might be nice instead to hard-code “how the AGI will think about certain aspects of the world”. The obvious use-case here is that the programmer declares that there's a thing called "humans", and they're basically cartesian, and they are imperfectly rational according to such-and-such model of imperfect rationality, etc. Then these hard-coded items can be referents when we define the AGI’s motivations, rewards, etc.
For example, lots of discussion of IRL and value learning seems to presuppose that we’re writing code that tells the AGI specifically how to model a human. To pick a random example, in Vanessa Kosoy's 2018 research agenda, the "demonstration" and "learning by teaching" ideas seem to rely on being able to do that. I don't see how we could possibly do those things if the whole world-model is a bunch of unlabeled patterns in patterns in patterns in sensory input etc.
So there's the problem. How do we build a learned-from-scratch world model but shove little hardcoded pieces into it? How do we ensure that the hardcoded pieces wind up in the right place? How do we avoid the AGI ignoring the human model that we supplied it and instead building a parallel independent human model from scratch?
2. Things to do right now
2.1 “Consciousness studies”
I think there’s a subfield of neuroscience called “consciousness studies”, where they talk a lot about how people formulate thoughts about themselves, etc. The obvious application is understanding consciousness, but I’m personally much more interested in whether it could help me think about The 1st-Person Problem. So I'm planning to dive into that sometime soon.
2.2 Autonomic nervous system, dopamine supervised learning
I have this theory (see the “Plan assessors” at A model of decision-making in the brain (the short version)) that parts of the telencephalon are learning algorithms that are trained by supervised learning, with dopamine as the supervisory signal. I think this theory (if correct) is just super duper important for everything I’m interested in, and I’m especially eager to take that idea and build on it, particularly for social instincts and motivation and suffering and so on.
But I’m concerned that I would wind up building idiosyncratic speculative theories on top of idiosyncratic speculative theories on top of ….
So I’m swallowing my impatience, and trying to really nail down dopamine supervised learning—keep talking to experts, keep filling in the gaps, keep searching for relevant evidence. And then I can feel better about building on it.
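For readers who want a picture of what "trained by supervised learning, with a supervisory signal" means mechanically, here is a minimal sketch using a plain delta rule on a linear predictor. This is a generic textbook technique that I'm swapping in for illustration, not the model from the linked post. The names `train_assessor` and `predict` and the toy data are hypothetical, and the `error` variable is only a stand-in for a supervisory signal; whether dopamine plays that role is exactly the hypothesis that still needs nailing down.

```python
def predict(weights, features):
    """Linear prediction: weighted sum of the input features."""
    return sum(w * x for w, x in zip(weights, features))


def train_assessor(examples, n_features, lr=0.1, epochs=200):
    """Delta-rule supervised learning.

    `examples` is a list of (features, target) pairs, where `target` is
    the ground truth that arrives after the prediction was made.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, target in examples:
            # Stand-in for the supervisory signal: ground truth minus prediction.
            error = target - predict(weights, features)
            for i, x in enumerate(features):
                weights[i] += lr * error * x
    return weights


if __name__ == "__main__":
    # Toy data (made up): feature 0 predicts the outcome, feature 1 is irrelevant.
    data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
    w = train_assessor(data, n_features=2)
    print("learned weights:", w)
    print("prediction for [1, 0]:", predict(w, [1.0, 0.0]))
```

The thing to notice is that the learner never gets a reward. It gets told, after the fact, what the right answer was, and adjusts toward it.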
It turns out that these dopamine-supervised-learning areas (e.g. anterior cingulate, medial prefrontal cortex, amygdala, ventral striatum) are all intimately involved in the autonomic nervous system, so I’m hoping that reading more about that will help resolve some of my confusions about those parts of the brain (see Section 1 here for what exactly my confusions are).
I also think that I kinda wound up pretty close to the somatic marker hypothesis (albeit via a roundabout route), so I want to understand the literature on that, incorporate any good ideas, and relate it to what I’m talking about.
Since the autonomic nervous system is related to “feelings”, learning about the autonomic nervous system could also help me understand suffering and consciousness and motivation and social instincts.
2.3 Brainstem → telencephalon control signals
As mentioned above (see “What’s the API of the telencephalon?”), there are a bunch of signals going from the brainstem to the telencephalon. I only understand a few of them so far, and I’m eager to dive into the others. I’ve written a bit about acetylcholine, but I have more to say: I have a hunch that it plays a role in cortical specialization, with implications for transparency. I know pretty much nothing about serotonin, norepinephrine, opioids, etc. I mentioned here that I’m confused about how the cortex learns motor control from scratch (or whether that’s even possible), and about the role of various signals going back and forth between the midbrain and cortex. So those are all things I'm eager to dive into.