Jack Koch

Technical AI Safety Research. Currently doing a research fellowship with the Center on Long-Term Risk.

Wiki Contributions


Redwood Research’s current project

I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.

My life is a relentless sequence of exercises in importance sampling and counterfactual analysis

This made me laugh out loud :P

Grokking the Intentional Stance

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.

The requirement for its behavior being "reliably predictable" by the intentional strategy doesn't necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system's behavior to generalize OOD. Obviously, to build such a model that generalizes well, you'll want it to mirror the actual causal dynamics producing the agent's behavior as closely as possible, so you need to make further assumptions about the agent's cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer question of why an intentional stance model will generalize OOD, not replacing the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we're worried that AI systems will fail catastrophically in ways that look agentic and goal-directed... to us.

You are correct that having only the intentional stance is insufficient to make the case for AI risk from "goal-directed" prosaic systems, but having it as the foundation of what we mean by "agent" clarifies what more is needed to make the sufficient case—what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?

Goal-Directedness and Behavior, Redux

What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.

  • (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
  • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.

I agree pretty strongly with all of this, fwiw. I think Dennett/the intentional stance really gets at the core of what it means for a system to "be an agent"; essentially, a system is one to the extent it makes sense to model it as such, i.e. as having beliefs and preferences, and acting on those beliefs to achieve those preferences, etc. The very reason why we usually consider our selves and other humans to be "agents" is exactly because that's the model over sensory data that the mind finds most reasonable to use, most of the time. In doing so, we actually are ascribing cognition to these systems, and in practice, of course we'll need to understand how such behavior will actually be implemented in our AIs. (And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)

Re-Define Intent Alignment?

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Probably something like the last one, although I think "even in principle" is doing some probably doing something suspicious in that statement. Like, sure, "in principle," you can pretty much construct any demarcation you could possibly imagine, including the Cartesian one, but what I'm trying to say is something like, "all demarcations, by their very nature, exist only in the map, not the territory." Carving reality is an operation that could only make sense within the context of a map, as reality simply is. Your concept of "agent" is defined in terms of other representations that similarly exist only within your world-model; other humans have a similar concept of "agent" because they have a similar representation built from correspondingly similar parts. If an AI is to understand the human notion of "agency," it will need to also understand plenty of other "things" which are also only abstractions or latent variables within our world models, as well as what those variables "point to" (at least, what variables in the AI's own world model they 'point to,' as by now I hope you're seeing the problem with trying to talk about "things they point to" in external/'objective' reality!).

Re-Define Intent Alignment?

(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions like "what am 'I' optimizing for?," and then try to figure out exactly what the demarcation is between "you" and "everything else" in order to answer that question, you're gonna have a real tough time finding anything close to a satisfactory answer.

Re-Define Intent Alignment?

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, because "agents" and "environments" can only exist in a map, not the territory. The idea of trying to e.g. separate "your atoms" or whatever from those of "your environment," so that you can drop them into those of "another environment," is only a useful fiction, as in reality they're entangled with everything else. I'm not aware of formal proof of this point that I'm trying to make; it's just a pretty strongly held intuition. Isn't this also kind of one of the key motivations for thinking about embedded agency?

An Orthodox Case Against Utility Functions

I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p

You might find Joscha Bach's view interesting...

Refactoring Alignment (attempt #2)

I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.

This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been exploring recently.

However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.

The funny thing is I am actually very unsatisfied with a purely behavioral notion of a model's objective, since a deceptive model would obviously externally appear to be a non-deceptive model in training. I just don't think there will be one part of the network we can point to and clearly interpret as being some objective function that the rest of the system's activity is optimizing. Even though I am partial to the generalization focused approach (in part because it kind of widens the goal posts with the "acceptability" vs. "give the model exactly the correct goal" thing), I still would like to have a more cognitive understanding of a system's "goals" because that seems like one of the best ways to make good predictions about how the system will generalize under distributional shift. I'm not against assuming some kind of explicit representation of goal content within a system (for sufficiently powerful systems); I'm just against assuming that that content will look like a mesa-objective as originally defined.

Re-Define Intent Alignment?

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function.

Is this related to your post An Orthodox Case Against Utility Functions? It's been on my to-read list for a while; I'll be sure to give it a look now.

Refactoring Alignment (attempt #2)

I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense.

Same, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.?

These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for "all situations" that avoids the no-free-lunch theorem, though I suppose you could get a probabilistic definition based on the simplicity prior).

Yup, if you want "clean," I agree that you'll have to either assume a distribution over possible inputs, or identify a perturbation set over possible test environments to avoid NFL.

Load More