Beyond past confusions
Over the last year, I wrote and thought many confused and confusing ideas about the relationship between goal-directedness and behavior. In the linked post for example, I defended a deconfusion of goal-directedness solely in terms of behavior; in doing so, I might pass for a behaviorist (someone who thinks that mental constructs are not needed and so don't exist), or seem to imply that we should never use internal knowledge of our models to determine goal-directedness. And that's without even mentioning the factual errors.
So here is my attempt at a short and clear explanation of the link I see between goal-directedness and behavior. If you're confused by this take, or believe me to be confused, I would really appreciate a comment. My goal isn't to prove that I'm obviously right, just to get less confused and hopefully help lift the fog of confusion for everyone.
Thanks to Jack Koch for a recent discussion that reminded me of this issue, and to Richard Ngo for giving me food for thought on this subject with his comments.
Behavior in all its glory
What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety. We might for example think that goal-directed systems have convergent subgoals, which tells us how they could lack corrigibility and cause catastrophic outcomes.
My entire point is that for deconfusing goal-directedness, we want a better understanding of this range of behaviors. At the moment, when thinking about a given behavior, I don't know whether it's the sort of thing a goal-directed system would do. That seems problematic both for understanding the risks of goal-directed systems and for detecting them.
Note that even a purely structural definition of goal-directedness would constrain the structure such that the system behaves in a certain way. So even if we want a structural definition, clarifying the range of behaviors sounds like progress.
What I'm not saying
- We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
- That's not even consistent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.
- Nothing but the behavior is useful to check goal-directedness.
- Even in my original confused post, I point out that structural knowledge about the system is probably necessary to check goal-directedness, as it's probably the only tractable way of finding out what the system will do.
- I hadn't thought about it last year, but I increasingly see the value of thinking about the justified beliefs that the system might have, given its training data, learning algorithm, and inductive biases. (This is an idea of Paul's, with ties to universality.)
How I could be wrong
The main crux I see about this take on behavior is whether it's even possible or tractable to deconfuse and formalize the range of behaviors of goal-directed systems. No matter how useful a formalization would be, if we can't get it, we should turn to other approaches.
That being said, I haven't seen any convincing argument that it's impossible, and the more I dig, the more I find, so I am quite convinced that some progress is possible.