Epistemic Status: quick write-up, in reaction to a serendipitous encounter with an idea. I see the main value of this post as decently presenting a potentially interesting take on a concept in AI safety to the community.
While skimming my copy of Reinforcement Learning: An Introduction for the part on AlphaGo Zero, I found a section called Habitual and Goal-directed Behavior. That caught my attention, because one idea I keep coming back to is goal-directed behavior from the Value Learning Sequence. When I studied the sequence, the idea intrigued me, but the lack of formalization left me uncertain about where I stand on its central argument: that not all useful (and possibly superintelligent) agents have to be goal-directed.
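Going back to my serendipitous discovery, the section is part of the Psychology chapter: it compares model-based and model-free RL to goal-directed and habitual behavior in psychology. To clarify the RL terms used here: model-free RL learns values or a policy directly from experience, without an explicit model of the environment, while model-based RL uses a model of how the environment responds to actions (learned or given) in order to plan.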
As for the different behaviors, let's quote the book itself:
Goal-directed behavior, according to how psychologists use the phrase, is purposeful in the sense that it is controlled by knowledge of the value of goals and the relationship between actions and their consequences.
Habits are behavior patterns triggered by appropriate stimuli and then performed more-or-less automatically.
There is also a summary:
Habits are sometimes said to be controlled by antecedent stimuli, whereas goal-directed behavior is said to be controlled by its consequences
Now, is this version of goal-directed behavior linked to the version that Rohin Shah wrote about? I think so. They are probably not the same, but examining the similarities and differences might clarify the part relevant to AI safety.
In Intuitions about goal-directed behavior, the first example of goal-directed vs non-goal-directed behavior concerns policies for agents playing TicTacToe:
Consider two possible agents for playing some game, let’s say TicTacToe. The first agent looks at the state and the rules of the game, and uses the minimax algorithm to find the optimal move to play. The second agent has a giant lookup table that tells it what move to play given any state. Intuitively, the first one is more “agentic” or “goal-driven”, while the second one is not. But both of these agents play the game in exactly the same way!
This feels very model-based RL vs model-free RL to me! It's almost the same example as in Figure 14.9 from the section: a rat tries to navigate a maze towards different rewards, and can either learn pure action-values from experience (habitual behavior/model-free RL) or learn a model of the maze and fill it with action-values (goal-directed behavior/model-based RL).
(Notice that here, I assume that the lookup table contains actions learned on the environment, or at least adapted to the environment. I consider the case where the lookup table is just hard-coded random values in the next subsection.)
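To make the two TicTacToe agents concrete, here is a minimal Python sketch. It is my own illustration, not code from Rohin's post or from the book, and every name in it (`value`, `minimax_move`, `build_lookup_table`, `lookup_move`) is hypothetical. One assumption to flag loudly: the table is filled by tabulating the minimax agent's moves rather than learned from experience, which is just the simplest way to get a table adapted to the game.

```python
from functools import lru_cache

# Boards are 9-character strings over "X", "O" and " " (row-major order).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def player_to_move(board):
    """X moves first, so X is to move whenever the counts are equal."""
    return "X" if board.count("X") == board.count("O") else "O"

def play(board, i, player):
    """Return the board obtained when `player` marks empty cell i."""
    return board[:i] + player + board[i + 1:]

@lru_cache(maxsize=None)
def value(board):
    """Minimax value from X's point of view: +1 X wins, -1 O wins, 0 draw."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    p = player_to_move(board)
    vals = [value(play(board, i, p)) for i in range(9) if board[i] == " "]
    return max(vals) if p == "X" else min(vals)

def minimax_move(board):
    """First agent: uses the rules of the game (a model) to search for a move."""
    p = player_to_move(board)
    moves = [i for i in range(9) if board[i] == " "]
    score = lambda i: value(play(board, i, p))
    return max(moves, key=score) if p == "X" else min(moves, key=score)

def build_lookup_table():
    """Assumption: the table simply caches minimax's move for every
    reachable non-terminal state (it is not learned by RL here)."""
    table, stack = {}, [" " * 9]
    while stack:
        board = stack.pop()
        if board in table or winner(board) or " " not in board:
            continue
        table[board] = minimax_move(board)
        p = player_to_move(board)
        stack.extend(play(board, i, p) for i in range(9) if board[i] == " ")
    return table

TABLE = build_lookup_table()

def lookup_move(board):
    """Second agent: no rules, no search, just a state -> move table."""
    return TABLE[board]
```

On every state in `TABLE`, `lookup_move` and `minimax_move` return the same move, so the two agents cannot be told apart by watching them play. The difference only shows up in how the move is produced: one consults the rules and searches, the other reads a cached answer.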
There is also a parallel between the advantages of goal-directed behavior given by Rohin:
This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal.
and the intuition behind goal-directed behavior from the section:
Goal-directed control has the advantage that it can rapidly change an animal’s behavior when the environment changes its way of reacting to the animal’s actions.
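Here is a toy Python sketch of that advantage, under assumptions of my own: a hypothetical two-arm maze where going left leads to cheese and going right leads to water, with made-up reward numbers. The habitual agent acts on cached action-values, while the goal-directed agent looks one step ahead through a model of the maze. None of the names or numbers come from the book.

```python
# Hypothetical maze model: where each action at the start leads.
TRANSITIONS = {("start", "left"): "cheese", ("start", "right"): "water"}
ACTIONS = ["left", "right"]

def learn_action_values(rewards, q=None, sweeps=50, alpha=0.5):
    """Model-free learning: nudge each cached action-value toward the
    reward actually experienced after taking that action."""
    q = dict(q) if q is not None else {a: 0.0 for a in ACTIONS}
    for _ in range(sweeps):
        for a in ACTIONS:
            r = rewards[TRANSITIONS[("start", a)]]
            q[a] += alpha * (r - q[a])
    return q

def habitual_choice(q):
    """Habitual control: pick the action with the highest cached value."""
    return max(ACTIONS, key=lambda a: q[a])

def goal_directed_choice(rewards):
    """Goal-directed control: use the model to predict each action's
    outcome, then pick the action whose outcome is currently most valued."""
    return max(ACTIONS, key=lambda a: rewards[TRANSITIONS[("start", a)]])

# Made-up outcome values: cheese is initially better than water.
rewards = {"cheese": 4.0, "water": 1.0}
q = learn_action_values(rewards)

# The environment changes: cheese is no longer rewarding.
rewards["cheese"] = 0.0

stale_habit = habitual_choice(q)          # still driven by the old values
planned = goal_directed_choice(rewards)   # reflects the change immediately

# The habit only catches up after re-experiencing the new outcomes.
q_relearned = learn_action_values(rewards, q=q, sweeps=3)
updated_habit = habitual_choice(q_relearned)
```

In this sketch the planner switches to the right arm as soon as the value of cheese changes, whereas the cached action-values keep recommending the left arm until a few more trials have overwritten them.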
Going back to the TicTacToe example, we can interpret the lookup-table version as being hard-coded. If that's the case, then it is not really analogous to model-free RL.
In the same vein, Rohin gives more examples of behavior he considers to not be goal-directed in another post:
I can think of ways to explain these as habitual behavior, but it feels a bit forced to me. As I understand it, habitual behavior is still adaptive, just on a potentially longer timescale and through different mechanisms. On the other hand, the examples above are about "habits" that are not even suited to the original environment.
These two versions of goal-directed behavior seem linked to me. Whether they are actually the same, or whether the connection will prove useful for safety research, is still unclear.