I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.
My life is a relentless sequence of exercises in importance sampling and counterfactual analysis
This made me laugh out loud :P
If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?
Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.
The requirement for its behavior being "reliably predictable" by the intentional strategy doesn't necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system's behavior to generalize OOD. Obviously, to build such a model that generalizes well, you'll want it to mirror the actual causal dynamics producing the agent's behavior as closely as possible, so you need to make further assumptions about the agent's cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer the question of why an intentional stance model will generalize OOD, not replacing the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we're worried that AI systems will fail catastrophically in ways that look agentic and goal-directed... to us.
You are correct that having only the intentional stance is insufficient to make the case for AI risk from "goal-directed" prosaic systems, but having it as the foundation of what we mean by "agent" clarifies what more is needed to make the sufficient case—what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?
What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.
- (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
- That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.
I agree pretty strongly with all of this, fwiw. I think Dennett/the intentional stance really gets at the core of what it means for a system to "be an agent"; essentially, a system is one to the extent it makes sense to model it as such, i.e. as having beliefs and preferences, and acting on those beliefs to achieve those preferences, etc. The very reason why we usually consider ourselves and other humans to be "agents" is exactly because that's the model over sensory data that the mind finds most reasonable to use, most of the time. In doing so, we actually are ascribing cognition to these systems, and in practice, of course we'll need to understand how such behavior will actually be implemented in our AIs. (And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)
To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?
Probably something like the last one, although I think "even in principle" is probably doing something suspicious in that statement. Like, sure, "in principle," you can pretty much construct any demarcation you could possibly imagine, including the Cartesian one, but what I'm trying to say is something like, "all demarcations, by their very nature, exist only in the map, not the territory." Carving reality is an operation that could only make sense within the context of a map, as reality simply is. Your concept of "agent" is defined in terms of other representations that similarly exist only within your world-model; other humans have a similar concept of "agent" because they have a similar representation built from correspondingly similar parts. If an AI is to understand the human notion of "agency," it will need to also understand plenty of other "things" which are also only abstractions or latent variables within our world models, as well as what those variables "point to" (at least, what variables in the AI's own world model they 'point to,' as by now I hope you're seeing the problem with trying to talk about "things they point to" in external/'objective' reality!).
(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")
All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.
I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions like "what am 'I' optimizing for?," and then try to figure out exactly what the demarcation is between "you" and "everything else" in order to answer that question, you're gonna have a real tough time finding anything close to a satisfactory answer.
Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.
I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, because "agents" and "environments" can only exist in a map, not the territory. The idea of trying to e.g. separate "your atoms" or whatever from those of "your environment," so that you can drop them into those of "another environment," is only a useful fiction, as in reality they're entangled with everything else. I'm not aware of a formal proof of this point that I'm trying to make; it's just a pretty strongly held intuition. Isn't this also kind of one of the key motivations for thinking about embedded agency?
I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p
You might find Joscha Bach's view interesting...
I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.
This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been exploring recently.
However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.
The funny thing is I am actually very unsatisfied with a purely behavioral notion of a model's objective, since a deceptive model would obviously externally appear to be a non-deceptive model in training. I just don't think there will be one part of the network we can point to and clearly interpret as being some objective function that the rest of the system's activity is optimizing. Even though I am partial to the generalization-focused approach (in part because it kind of widens the goal posts with the "acceptability" vs. "give the model exactly the correct goal" thing), I still would like to have a more cognitive understanding of a system's "goals" because that seems like one of the best ways to make good predictions about how the system will generalize under distributional shift. I'm not against assuming some kind of explicit representation of goal content within a system (for sufficiently powerful systems); I'm just against assuming that that content will look like a mesa-objective as originally defined.
I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.
For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.
Is this related to your post An Orthodox Case Against Utility Functions? It's been on my to-read list for a while; I'll be sure to give it a look now.
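For concreteness, here is a rough toy sketch of what "approximately coherent expectations" in the quoted passage might look like, under my own assumptions. It uses a simple temporal-difference-style update as a stand-in, which is my choice and not necessarily the formalism the quoted author has in mind. The function names (`make_coherent`, `incoherence`), the chain-of-events setup, and the learning rate are all hypothetical. The point it illustrates is that no utility function is stored anywhere: each event just has an expected value, the last one gets corrected by experience, and earlier ones get pulled toward the expectations of the events they lead to.

```python
def incoherence(values, experienced_value):
    """Largest disagreement between an event's expected value and the
    expected value of whatever follows it (experience follows the last event)."""
    targets = values[1:] + [experienced_value]
    return max(abs(v - t) for v, t in zip(values, targets))


def make_coherent(expectations, experienced_value, learning_rate=0.5, sweeps=200):
    """Assumed TD-style rule: repeatedly nudge each expectation toward its successor."""
    values = list(expectations)
    for _ in range(sweeps):
        # Updating based on experience: only the final event is corrected directly.
        values[-1] += learning_rate * (experienced_value - values[-1])
        # Propagating expectations of more distant events to nearer ones.
        for i in range(len(values) - 2, -1, -1):
            values[i] += learning_rate * (values[i + 1] - values[i])
    return values


# Hypothetical starting expectations for a chain of four events, nearest first.
initial = [0.0, 5.0, -2.0, 1.0]
final = make_coherent(initial, experienced_value=10.0)

print("incoherence before:", incoherence(initial, 10.0))
print("incoherence after: ", incoherence(final, 10.0))
```

In this toy setup the expectations end up agreeing with each other and with experience, but at no point does the code compute the expectation of a centrally represented utility function; whether that captures the intended distinction is exactly the kind of thing I'd want to check against the post.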
To me, "thinking about the proposition while simultaneously scratching my chin" sounds like a separate "thing" (complex representation formed in the GNW) than either "think about proposition" or "scratch my chin"... and you experienced this thing after the other ones, right? Like, from the way you described it, it sounds to me like there was actually 1) the proposition 2) the itch 3) the output of a 'summarizer' that effectively says "just now, I was considering this proposition and scratching my chin". [I guess, in this sense, I would say you are ordinarily doing some "weird self-deceptive dance" that prevents you from noticing this, because most people seem to ordinarily experience "themselves" as the locus of/basis for experience, instead of there being a stream of moments of consciousness, some of which apparently refer to an 'I'.]
Also, I have this sense that you're chunking your experience into "things" based on what your metacognitive summarizer-of-mental-activity is outputting back to the workspace, but there are at least 10 representations streaming through the workspace each second, and many of these will be far more primitive than any of the "things" we've already mentioned here (or than would ordinarily be noticed by the summarizer without specific training for it, e.g. in meditation). Like, in your example, there were visual sensations from the reading, mental analyses about its content, the original raw sensation of the itch, the labeling of it as "itchy," the intention to scratch the itch, (definitely lots more...), and, eventually, the thought "I remember thinking about this proposition and scratching my chin 'at the same time'."