Half-researcher, half-distiller (see https://distill.pub/2017/research-debt/), both in AI Safety. Funded, and also PhD in theoretical computer science (distributed computing).
Yep, we seem to agree.
It might not be clear from the lit review, but I personally don't agree with all the intuitions, or not completely. And I definitely believe that a definition that throw some part of the intuitions but applies to AI risks argument is totally fine. It's more that I believe the gist of these intuitions is pointing in the right direction, and so I want to keep them in mind.
Good to know that my internal model of you is correct at least on this point.
For Daniel, given his comment on this post, I think we actually agree, but that he puts more explicit emphasis on the that-which-makes-AI-risk-arguments-work, as you wrote.
Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven't observed yet given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given its previous behaviour.
I actually don't think we should make this distinction. It's true that Dennett's intentional stance falls in the first category for example, but that's not the reason why I'm interested about it. Explainability seems to me like a way to find a definition of goal-directedness that we can check through interpretability and verification, and which tells us something about the behavior of the system with regards to AI risk. Yet that doesn't mean it only applies to the observed behavior of systems.
The biggest difference between your definition and the intuitions is that you focus on how goal-directedness appears through training. I agree that this is a fundamental problem; I just think that this is something we can only solve after having a definition of goal-directedness that we can check concretely in a system and that allows the prediction of behavior.
Firstly, we don't have any AGIs to study, and so when we ask the question of how likely it is that AGIs will be goal-directed, we need to talk about the way in which that trait might emerge.
As mentioned above, I think a definition of goal-directedness should allow us to predict what an AGI will broadly do based on its level of goal-directedness. Training for me is only relevant in understanding which level of goal-directedness are possible/probable. That seems like the crux of the disagreement here.
Secondly, because of the possibility of deceptive alignment, it doesn't seem like focusing on observed behaviour is sufficient for analysing goal-directedness.
I agree, but I definitely don't think the intuitions are limiting themselves to the observed behavior. With a definition you can check through interpretability and verification, you might be able to steer clear of deception during training. That's a use of (low) goal-directedness similar to the one Evan has in mind for myopia.
Thirdly, suppose that we build a system that's goal-directed in a dangerous way. What do we do then? Well, we need to know why that goal-directedness emerges, and how to change the training regime so that it doesn't happen again.
For that one, understanding how goal-directedness emerges is definitely crucial.
Glad my comment clarified some things.
About the methodology, I just published a post clarifying my thinking about it.
Thanks for the proposed idea!
Yet I find myself lost when trying to find more information about this concept of care. It is mentioned in both the chapter on Heidegger in The History of Philosophy and the section on care in the SEP article on Heidegger, but I don't get a single thing written there. I think the ideas of "thrownness" and "disposedness" are related?
Do you have specific pointers to deeper discussions of this concept? Specifically, I'm interested in new intuitions for how a goal is revealed by actions.
Glad they helped! That's the first time I use this feature, and we debated whether to add more or remove them completely, so thanks for the feedback. :)
I think depending on what position you take, there are difference in how much one thinks there's "room for a lot of work in this sphere." The more you treat goal-directedness as important because it's a useful category in our map for predicting certain systems, the less important it is to be precise about it. On the other hand if you want to treat goal-directedness in a human-independent way or otherwise care about it "for its own sake" for some reason, then it's a different story.
If I get you correctly, you're arguing that there's less work on goal-directedness if we try to use it concretely (for discussing AI risk), compared to if we study it for it's own sake? I think I agree with that, but I still believe that we need a pretty concrete definition to use goal-directedness in practice, and that we're far from there. There is less pressure to deal ith all the philosophical nitpicks, but we should at least get the big intuitions (of the type mentioned in this lit review) right, or explain why they're wrong.
Thanks for the feedback!
My only critique so far is that I'm not really on board yet with your methodology of making desiderata by looking at what people seem to be saying in the literature. I'd prefer a methodology like "We are looking for a definition of goal-directedness such that the standard arguments about AI risk that invoke goal-directedness make sense. If there is no such definition, great! Those arguments are wrong then."
I agree with you that the endgoal of this research is to make sense of the arguments about AI risk invoking goal-directedness, and of the proposed alternatives. The thing is, even if it's true, proving that there is no property making these arguments work looks extremely hard. I have very little hope that it is possible to show one way or the other heads-on.
On the other hand, when people invoke goal-directedness, they seem to reference a cluster of similar concepts. And if we manage to formalize this cluster in a satisfying manner for most people, then we can look whether these (now formal) concepts make the arguments for AI risk work. If they do, then problem solved. If the arguments now fail with this definition, I still believe that this is a strong evidence for the arguments not working in general. You can say that I'm taking the bet that "The behavior of AI risk arguments with inputs in the cluster of intuitions from the literature is representative of the behavior of AI risks arguments with any definition of goal-directedness". Rohin for one seems less convinced by this bet (for example with regard to the importance of explainability)
My personal prediction is that the arguments for AI risks do work for a definition of goal-directedness close to this cluster of concepts. My big uncertainty is what constitute a non-goal-directed (or less goal-directed) system, and whether they're viable against goal-directed ones.
(Note that I'm not saying that all the intuitions in the lit review should be part of a definition of goal-directedness. Just that they probably need to be addressed, and that most of them capture an important detail of the cluster)
I also have a suggestion or naive question: Why isn't the obvious/naive definition discussed here? The obvious/naive definition, at least to me, is something like:"The paradigmatic goal-directed system has within it some explicit representation of a way the world could be in the future -- the goal -- and then the system's behavior results from following some plan, which itself resulted from some internal reasoning process in which a range of plans are proposed and considered on the basis of how effective they seemed to be at achieving the goal. When we say a system is goal-directed, we mean it is relevantly similar to the paradigmatic goal-directed system."I feel like this is how I (and probably everyone else?) thought about goal-directedness before attempting to theorize about it. Moreover I feel like it's a pretty good way to begin one's theorizing, on independent grounds: It puts the emphasis on relevantly similar and thus raises the question "Why do we care? For what purpose are we asking whether X is goal-directed?"
I also have a suggestion or naive question: Why isn't the obvious/naive definition discussed here? The obvious/naive definition, at least to me, is something like:
"The paradigmatic goal-directed system has within it some explicit representation of a way the world could be in the future -- the goal -- and then the system's behavior results from following some plan, which itself resulted from some internal reasoning process in which a range of plans are proposed and considered on the basis of how effective they seemed to be at achieving the goal. When we say a system is goal-directed, we mean it is relevantly similar to the paradigmatic goal-directed system."
I feel like this is how I (and probably everyone else?) thought about goal-directedness before attempting to theorize about it. Moreover I feel like it's a pretty good way to begin one's theorizing, on independent grounds: It puts the emphasis on relevantly similar and thus raises the question "Why do we care? For what purpose are we asking whether X is goal-directed?"
Your definition looks like Dennett's intentional stance to me. In the intentional stance, the "paradigmatic goal-directed system" is the purely rational system that tries to achieve its desires based on its beliefs, and being an intentional system/goal-directed depend on similarity in terms of prediction with this system.
On the other hand, for most internal structure based definitions (like Richard's or the mesa-optimizers), a goal-directed system is exactly a paradigmatic goal-directed system.
But I might have misunderstood your naive definition.
Likewise, thanks for taking the time to write such a long comment! And hoping that's a typo in the second sentence :)
You're welcome. And yes, this was as typo that I corrected. ^^
Wrt the community though, I’d be especially curious to get more feedback on Motivation #2. Do people not agree that transparency is *necessary* for AI Safety? And if they do agree, then why aren’t more people working on it?
My take is that a lot of people around here agree that transparency is at least useful, and maybe necessary. And the main reason why people are not working on it is a mix of personal fit, and the fact that without research in AI Alignment proper, transparency doesn't seem that useful (if we don't know what to look for).
I agree, but think that transparency is doing most of the work there (i.e. what you say sounds more to me like an application of transparency than scaling up the way that verification is used in current models.) But this is just semantics.
Well, transparency is doing some work, but it's totally unable to prove anything. That's a big part of the approach I'm proposing. That being said, I agree that this doesn't look like scaling the current way.
Hm, I want to disagree, but this may just come down to a difference in what we mean by deployment. In the paragraph that you quoted, I was imagining the usual train/deploy split from ML where deployment means that we’ve frozen the weights of our AI and prohibit further learning from taking place. In that case, I’d like to emphasize that there’s a difference between intelligence as a meta-ability to acquire new capabilities and a system’s actual capabilities at a given time. Even if an AI is superintelligent, i.e. able to write new information into its weights extremely efficiently, once those weights are fixed, it can only reason and plan using whatever object-level knowledge was encoded in them up to that point. So if there was nothing about bio weapons in the weights when we froze them, then we wouldn't expect the paperclip-maximizer to spontaneously make plans involving bio weapons when deployed.
You're right that I was thinking of a more online system that could update it's weights during deployment. Yet even with frozen weights, I definitely expect the model to make plans involving things that were not involved. For example, it might not have a bio-weapon feature, but the relevant subfeature to build some by quite local rules that don't look like a plan to build a bio-weapon.
Suppose an AI system was trained on a dataset of existing transparency papers to come up with new project ideas in transparency. Then its first outputs would probably use words like neurons and weights instead of some totally incomprehensible concepts, since those would be the very same concepts that would let it efficiently make sense of its training set. And new ideas about neurons and weights would then be things that we could independently reason about even if they’re very clever ideas that we didn’t think of ourselves, just like you and I can have a conversation about circuits even if we didn’t come up with it.
That seems reasonable.
To check if I understand correctly, you're arguing that the selection pressure to use argument in order to win requires the ability to be swayed by arguments, and the latter already requires explicit reasoning?
That seems convincing as a counter-argument to "explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments.", but I'm not knowledgeable enough about the work quoted to check if they don't have a more subtle position.