Consider trying to program a self-driving car to drive from San Francisco to Los Angeles -- with no sensors that allow it to gather information as it is driving. This is possible in principle. If you can predict the exact weather conditions, the exact movement of all of the other cars on the road, the exact amount of friction along every part of the road surface, the exact impact of (the equivalents of) pressing the gas or turning the steering wheel, and so on, then you could compute ahead of time how exactly to control the car such that it gets from SF to LA. Nevertheless, it seems unlikely that we will ever be able to accomplish such a feat, even with powerful AI systems.
No, in practice there is going to be some uncertainty about how the world is going to evolve; such that any plan computed ahead of time will have some errors that will compound over the course of the plan. The solution is to use sensors to gather information while executing the plan, so that we can notice any errors or deviations from the plan, and take corrective action. It is much easier to build a controller that keeps you pointed in the general direction, than to build a plan that will get you there perfectly without any adaptation.
Control theory studies these sorts of systems, and you can see the general power of feedback controllers in the theorems that can be proven. Especially for motion tasks, you can build feedback controllers that are guaranteed to safely achieve the goal, even in the presence of adversarial environmental forces (that are bounded in size, so you can’t have arbitrarily strong wind). In the presence of an adversary, in most environments it becomes impossible even in principle to make such a guarantee if you do not have any sensors or feedback and must compute a plan in advance. Typically, for every such plan, there is some environmental force that would cause it to fail.
With ambitious value learning, we’re hoping that we can learn a utility function that tells us the optimal thing to do into the future. You need to be able to encode exactly how to behave in all possible environments, no matter what new things happen in the future, even if it’s something we humans never considered a possibility so far.
This is analogous to the problem of trying to program a self-driving car. Just as in that case, we might hope that we can solve the problem by introducing sensors and feedback. In this case, the “feedback” would be human data that informs our AI system what we want it to do, that is, data that can be used to learn values. The evolution of human values and preferences in new environments with new technologies is analogous to the unpredictable environmental disturbances that control theory assumes.
This does not mean that an AI system must be architected in such a way that human data is explicitly used to “control” the AI every few timesteps in order to keep it on track. It does mean that any AI alignment proposal should have some method of incorporating information about what humans want in radically different circumstances. I have found this an important frame with which to view AI alignment proposals. For example, with indirect normativity or idealized humans it’s important that the idealized or simulated humans are going through similar experiences that real humans go through, so that they provide good feedback.
Of course, while the control theory perspective does not require the feedback controller to be explicit, one good way to ensure that there is feedback would be to make it explicit. This would mean that we create an AI system that explicitly collects fresh data about what humans want in order to inform what it should do. This is basically calling for an AI system that is constantly using tools from narrow value learning to figure out what to do. In practice, this will require interaction between the AI and the human. However, there are still issues to think about:
Convergent instrumental subgoals: A simple way of implementing human-AI interaction would be to have an estimate of a reward function that is continually updated using narrow value learning. Whenever the AI needs to choose an action, it uses the current reward estimate to choose.
With this sort of setup, we still have the problem that we are maximizing a reward function which leads to convergent instrumental subgoals. In particular, the plan “disable the narrow value learning system” is likely very good according to the current estimate of the reward function, because it prevents the reward from changing causing all future actions to continue to optimize the current reward estimate.
Another way of seeing that this setup is a bit weird is that it has inconsistent preferences over time -- at any given point in time, it treats the expected change in its reward as an obstacle that should be undone if possible.
That said, it is worth noting that in this setup, the goal-directedness is coming from the human. In fact, any approach where goal-directedness comes from the human requires some form of human-AI interaction. We might hope that some system of this form allows us to have a human-AI system that is overall goal-directed (in order to achieve economic efficiency), while the AI system itself is not goal-directed, and so the overall system pursues the human’s instrumental subgoals. The next post will talk about reward uncertainty as a potential approach to get this behavior.
Humans are unable to give feedback: As our AI systems become more and more powerful, we might worry that they are able to vastly outthink us, such that they would need our feedback on scenarios that are too hard for us to comprehend.
On the one hand, if we’re actually in this scenario I feel quite optimistic: if the questions are so difficult that we can’t answer them, we’ve probably already solved all the simple parts of the reward, which means we’ve probably stopped x-risk.
But even if it is imperative that we answer these questions accurately, I’m still optimistic: as our AI systems become more powerful, we can have better AI-enabled tools that help us understand the questions on which we are supposed to give feedback. This could be AI systems that do cognitive work on our behalf, as in recursive reward modeling, or it could be AI-created technologies that make us more capable, such as brain enhancement or the ability to be uploaded and have bigger “brains” that can understand larger things.
Humans don’t know the goal: An important disanalogy between the control theory/self-driving car example and the AI alignment problem is that in control theory it is assumed that the general path to the destination is known, and we simply need to stay on it; whereas in AI alignment even the human does not know the goal (i.e. the “true human reward”). As a result, we cannot rely on humans to always provide adequate feedback; we also need to manage the process by which humans learn what they want. Concerns about human safety problems and manipulation fall into this bucket.
If I want an AI system that acts autonomously over a long period of time, but it isn't doing ambitious value learning (only narrow value learning), then we necessarily require a feedback mechanism that keeps the AI system "on track" (since my instrumental values will change over that period of time).
While the feedback mechanism need not be explicit (and could arise simply because it is an effective way to actually help me), we could consider AI designs that have an explicit feedback mechanism. There are still many problems with such a design, most notably that the obvious design has the problem that at any given point the AI system looks like it could be goal-directed with a long-term reward function, which is the sort of system that we are most worried about.