(Related to Stuart Armstrong’s post last summer on Model Splintering.)
Let’s say I have a human personal assistant. Call him Ahmed. Ahmed is generally trying to help me. Sometimes a situation comes up where Ahmed feels surprised and confused. Other times, a situation comes up where Ahmed feels torn and conflicted about what to do next. In both those types of situations, I would want Ahmed to act conservatively (only do things which he considers clearly good in all respects)—or better yet, stop and ask me for help. Maybe Ahmed would say something to me like "On the one hand ... On the other hand...". That would be great. I would be very happy for him to do that.
So, we should try to instill a similar behavior in AGIs!
I think something like this is likely to be feasible, at least in the neocortex-like AGIs that I’ve been thinking about.
More specifically: OK sure, maybe we need new breakthroughs in interpretability tools in order to understand what an AGI is confused or conflicted about, and why. But it seems quite tractable to develop methods to recognize that the AGI is confused or conflicted, and then to run some special-purpose code whenever that happens. Or more simply, just have it not take any action about which it feels confused or conflicted.
To start with, I’ll elaborate on what I think it would look like to have a circuit that detects when an AGI is confused or conflicted. Start with confusion.
Confusion is pretty straightforward in my picture. You have a strongly-held generative model—one that has made lots of correct predictions in the past—and it predicts X under some circumstance. You have another strongly-held generative model which predicts NOT-X under the same circumstance. The message-passing algorithm flails around, trying in vain to slot these two contradictory expectations into the same composite model, or to deactivate one of them.
(I guess in logical induction, confusion would look like two very successful traders with two very different assessments of the value of an asset, or something like that, maybe. I don't know much about logical induction.)
Either way, confusion seems like it should be straightforwardly detectable by monitoring the algorithm.
In broad outline, the brain has a world-model, and it's stored mainly in the neocortex. The basal ganglia has a dense web of connections all across the frontal lobe of the neocortex. (The frontal lobe is the home of plans and actions, but not sensory processing.) The basal ganglia's job is to store a database with a reward prediction (well, technically a reward prediction probability distribution) for each different pattern of activity in the frontal lobe of the neocortex. This database is updated using the TD learning algorithm, using the reward-prediction-errors that are famously signaled by dopamine. The reward itself is calculated elsewhere.
There's no point in building a reward prediction database if you're not going to use it, so that's the basal ganglia’s other job: use the database to bias neocortical activity in favor of activity associated with higher reward predictions.
Now, our thoughts are generally composed of lots of different little generative models snapped together on the fly. Like, “I will put on a sock” involves hand motions and foot motions and a model of how a sock stretches and the idea that your foot will be less cold in the future and the idea of finding your sock in the drawer, etc. etc. etc.
The reward prediction entries in the basal ganglia database do not look like "one single entry for the snapped-together model". Instead they look like lots of entries, one for each individual piece of the snapped-together model. It just has to be that way. How do I know? Because you’ve seen each of the individual pieces many times before, whereas for every snapped-together model, it’s the very first time in your life that that exact model has ever been assembled. After all, you’ve never before put on this particular sock in this particular location with this particular song stuck in your head, etc. etc. You can't have a reward prediction already stored in the database for a whole assembled model, if this is the first time you've ever seen it.
So a key question is: how do you combine a bunch of reward predictions (or reward prediction probability distributions), one from each of the component pieces, to get the aggregate reward prediction for the snapped-together model? Because at the end of the day, you do need a single aggregate reward prediction to drive behavior.
So this immediately suggests an idea for what it looks like to feel conflicted: you are entertaining a plan / generative model which combines at least one very-high-reward-predicting piece with at least one very-negative-reward-predicting piece. Semantically and logically, these two pieces snap together perfectly well into a composite model. But from a reward perspective, it’s self-inconsistent.
For example, in the trolley problem, the "I will save lives" model carries a very high reward prediction, but the "I will take an action that directly causes someone to die" model carries a very negative reward prediction. The same plan activates both of those mental models. Hence we feel conflicted. (More details in the appendix—I think it's a bit more complicated than I'm letting on here.)
As above, it seems to me that it should be straightforward to build an AGI component that watches the models being formed, and watches the queries to the reward prediction database, and thus can flag when the AGI feels conflicted.
OK. So we build an AGI with a circuit that detects when it feels confused. (I'll do conflict in the next section.) So far so good. But now what?
For one thing, when the circuit activates, that's a possible warning that the AGI is having an ontological crisis. Or more generally, it could flag what Stuart Armstrong calls Model Splintering. It could also be something benign, like the AGI has some narrow misunderstanding about protein dynamics.
So OK, the detector circuit just activated. The AGI is very confused about something. Honestly, I'm not really sure what the next step is.
Hmm, maybe instead of looking for confusion in general, you find and flag some safety-critical entries in the world model—"corrections from my supervisor", "my own programming", etc.—and if the AGI has major confusions involving those entries, then you should be concerned, because that could be about to precipitate an ontological crisis that will undermine our control systems. So you automatically pause AGI execution in those cases, and try to figure out what's going on using interpretability tools, or suppress that thought so that the AGI goes and thinks about something else. Or something.
Well, anyway, I think the "AGI is conflicted" part is more important than the "AGI is confused" part. So I'll switch to that.
My broad idea is: you (somehow) start in a good state, with the AGI having a suite of intuitions and ideas that’s compatible with what you were going for. I’m imagining something like the psychology of a helpful human—there’s a big cloud of values and habits and norms that tend to work together under normal circumstances.
Then the AGI thinks of something weird—“Oh, hey, what if I tile the universe with Hedonium?”. Presumably this is a great idea according to some of its values / habits / norms, and a terrible idea according to other of its values / habits / norms. So the “AGI is conflicted” detector circuit activates.
A simple answer is: We set up the reward system such that this kind of idea is not appealing. Like, we get to decide how the reward prediction aggregation works, right? So we can just say: every little piece of the model gets a veto, so that if the AGI has any intuition suggesting that this idea is bad, the idea as a whole just sounds unappealing—as opposed to whatever more even weighting function we humans use (see Appendix). I'll call that idea conservatism.
In other words, you can imagine a continuum from:
RP of the snapped-together model = min(RP of each component piece)
RP of the snapped-together model = mean(RP of each component piece)
RP of the snapped-together model = max(RP of each component piece)
(RP = Reward Prediction.) And that continuum would go smoothly from most conservative (it's a good thought if all of your intuitions support it) to most aggressive (it's a good thought if any of your intuitions support it).
(I think you also need to somehow set up the system so that "do nothing" is the automatically-acceptable default operation when every possibility is unpalatable.)
(By the way, I think the reward prediction aggregation algorithm is actually more complicated than this, at least in humans, although it could presumably work exactly like this in an AGI. So maybe this is oversimplified, although I still think the main point is valid. See appendix for further discussion.)
So that's one possible answer. It doesn't actually require the conflict-detection circuit I mentioned above, and doesn't require any manual intervention. This kind of conservatism is just built into the AGI's thinking / decision rule.
Another possible answer is: When we detect internal conflict, we pause the AGI's execution. First we suppress one side of the conflict and see (using whatever interpretability methods we have) the thought that comes into the AGI's mind. Then we rewind, and suppress the other side of the conflict, and see the thought that comes into the AGI's mind. Then whichever one we like better, we imprint that choice as a strong update to the reward prediction database.
Maybe there are other possible answers too.
Well, this is obviously quite vague, and just one piece of a very incomplete story. You'll notice I've been mentioning that throughout!
Another thing is: there's a need to set a threshold for the detector circuits, or a threshold for the "conservative" thought-selection rule. Too low and the AGI stops being able to do anything, too high and you're missing important things. How do you get it right?
Another thing is: How close can these mechanisms get to our intuitive notions of ontological crises, conservatism, internal conflict, etc.? Any divergence presents a risk.
Another thing is: We want the AGI to like having these mechanisms, so that it doesn't try to subvert them. But that's nothing new; it's equally true of every safety feature of a self-aware AGI. I imagine doing something like finding "the idea of having this mechanism" in the AGI's world-model (somehow), and making sure the corresponding reward-prediction entry in the database stays positive. Or something. I dunno.
Try to imagine a "falling stationary rock". You can't. The "falling" model and the "stationary" model make incompatible predictions about the same Boolean variables. It just doesn't snap together into a composite model. I'll say that those two models are "semantically incompatible."
In the same way, my best guess right now is that your brain treats it as equally impossible to snap together a high-reward-predicting model with a low-reward-predicting model. They are, let's say, "reward incompatible". The brain demands consensus on the reward prediction from all the pieces that get snapped together, I think.
And moreover, the very fact that a particular high-reward-predicting model is semantically compatible with a low-reward predicting model is automatically counted as evidence that they are, in fact, reward compatible. So just by thinking about the snapped-together composite model, the high-reward-predicting model might lose points, and the low-reward-predicting model might gain points. Try it. Imagine Hitler being very kind to his dog Blondi. Just hold the image in your head for a minute. (I don't know if it's true.) You notice yourself intuitively thinking a little less ill of Hitler, and/or thinking a little less well of dog-lovers in general.
Hence we get the halo effect, and Scott Alexander's "ethnic tension" graphical model ("flow diagram thing") algorithm, etc. When I think about the trolley problem, I think I actually oscillate back and forth between the "killing is bad" thought and the "saving lives is good" thought. I don't think I can hold both aspects in my head at exactly the same time.
Now, on the one hand, this seems utterly insane. Why can't you have two model pieces that are semantically-compatible but reward-incompatible? Plans can have both good and bad aspects, right? So, even if humans work this way, why on earth would we make our AGIs that way???
On the other hand, if we do have an architecture with a single scalar reward, then each piece of the composite model is either predicting that reward correctly, or it's not. So in that sense, we really should be updating the reward predictions of different parts of a snapped-together model to bring them closer to consensus.
(Also, whenever I say to myself, "well that's a just-plain-stupid aspect of the human brain algorithm", I tend to eventually wind up coming around to feeling that it was a necessary design feature after all.)
An additional complication is that the reward is underdetermined by the situation—the reward is a function of your thoughts, not your situation. You can think about how you killed that person with your trolley and thus experience a negative reward, or you can think about how you saved lives and thus experience a positive reward. So you can go back and forth, and never get consensus. You just have two ways to think about it, and each is correctly predicting the future rewards, in a self-fulfilling, self-consistent kind of way.
Anyway, I don't think any of this undermines the main part of this blog post. I think what I said about catching conflict and being conservative should still work, even if the AGI has a "reward-prediction-consensus" architecture rather than a "reward-prediction-aggregation" architecture. But I'm not 100% sure. Also, everything in this appendix is half-baked speculation. As is the rest of the post. :-P