All of jbkjr's Comments + Replies

Redwood Research’s current project

I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.

My life is a relentless sequence of exercises in importance sampling and counterfactual analysis

This made me laugh out loud :P

1Buck Shlegeris2moThanks, glad to hear you appreciate us posting updates as we go.
Grokking the Intentional Stance

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an

... (read more)
3Rohin Shah2moYeah, I agree with all of that.
Goal-Directedness and Behavior, Redux

What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.

  • (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
  • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavi
... (read more)
2Adam Shimi4moI'm glad, you're one of the handful of people I wrote this post for. ;) Definitely. I have tended to neglect this angle, but I'm trying to correct that mistake.
Re-Define Intent Alignment?

To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Probably something like the last one, although I think "even in principle" is doing so... (read more)

1Edouard Harris4moI'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories. But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory. I'm not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it wasn't pointless (because prosaic systems will just learn the representations they need anyway). All I'm saying is that the moment we started playing the game of definitions, we'd already started playing the game of maps. So using an arbitrary demarcation to construct our definitions might be bad for any number of legitimate reasons, but it can't be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that. (I'm not 100% sure if I've interpreted your objection correctly, so please let me know if I haven't.)
Re-Define Intent Alignment?

(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")

All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions li... (read more)

1Edouard Harris4moYeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe. To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?
Re-Define Intent Alignment?

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, bec... (read more)

1Edouard Harris4moAh I see! Thanks for clarifying. Yes, the point about the Cartesian boundary is important. And it's completely true that any agent / environment boundary we draw will always be arbitrary. But that doesn't mean one can't usefully draw such a boundary in the real world — and unless one does, it's hard to imagine how one could ever generate a working definition of something like a mesa-objective. (Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?") Of course the right question will always be [https://www.alignmentforum.org/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-1] : "what is the whole universe optimizing for?" But it's hard to answer that! So in practice, we look at bits of the whole universe that we pretend are isolated. All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about. (i.e. I agree with you that duality is a useful fiction, just saying that we can still use it to construct useful definitions.)
An Orthodox Case Against Utility Functions

I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p

You might find Joscha Bach's view interesting...

Refactoring Alignment (attempt #2)

I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.

This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been e... (read more)

2Abram Demski4moSeems fair. I'm similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.
Re-Define Intent Alignment?

I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.

For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and

... (read more)
2Abram Demski4moRight, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)
Refactoring Alignment (attempt #2)

I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense.

Same, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.?

These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do wha

... (read more)
2Rohin Shah4moThat seems great, e.g. I think by far the best thing you can do is to make sure that you finetune using a reward function / labeling process that reflects what you actually want (i.e. what people typically call "outer alignment"). I probably should have mentioned that too, I was taking it as a given but I really shouldn't have. For inductive biases + environments, I do think controlling those appropriately would be useful and I would view that as an example of (1) in my previous comment.
Refactoring Alignment (attempt #2)

Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)

This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective.

I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless... (read more)

2Abram Demski4moI too am a fan of broadening this a bit, but I am not sure how to. I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers. I agree with your point about using "does this definition include humans" as a filter, and I think it would be easy to mess that up (and I wasn't thinking about it explicitly until you raised the point). However, I think possibly you want a very behavioral definition of mesa-objective. If that's true, I wonder if you should just identify with the generalization-focused path instead. After all, one of the main differences between the two paths is that the generalization-focused path uses behavioral definitions, while the objective-focused path assumes some kind of explicit representation of goal content within a system.
Re-Define Intent Alignment?

The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.

The key here is that the set of allowed m

... (read more)

which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment

 

I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agen... (read more)

3Abram Demski4moThis makes some sense, but I don't generally trust some "perturbation set" to in fact capture the distributional shift which will be important in the real world. There has to at least be some statement that the perturbation set is actually quite broad. But I get the feeling that if we could make the right statement there, we would understand the problem in enough detail that we might have a very different framing. So, I'm not sure what to do here.
Re-Define Intent Alignment?

However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans".

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI peop... (read more)

2Abram Demski4moFor myself, my reaction is "behavioral objectives also assume a system is well-described as EU maximizers". In either case, you're assuming that you can summarize a policy by a function it optimizes; the difference is whether you think the system itself thinks explicitly in those terms. I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems. For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and a large part of cognition is devoted to making these expectations as coherent as possible (updating them based on experience, propagating expectations of more distant events to nearer, etc). This is in contrast to keeping some centrally represented utility function, and devoting cognition to computing expectations for this utility function. In this picture, there is no clear distinction between terminal values and instrumental values. Something is "more terminal" if you treat it as more fixed (you resolve contradictions by updating the other values), and "more instrumental" if its value is more changeable based on other things. (Possibly you should consider my "approximately coherent expectations" idea)
Refactoring Alignment (attempt #2)

So, for example, this claims that either intent alignment + objective robustness or outer alignment + robustness would be sufficient for impact alignment.

Shouldn’t this be “intent alignment + capability robustness or outer alignment + robustness”?

Btw, I plan to post more detailed comments in response here and to your other post, just wanted to note this so hopefully there’s no confusion in interpreting your diagram.

2Abram Demski4moYep, fixed.