Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I think and write independently about value learning.


Reducing Goodhart



My take on higher-order game theory

I have a question about this entirely divorced from practical considerations. Can we play silly ordinal games here?

If you assume that the other agent will take the infinite-order policy, but then naively maximize your expected value rather than unrolling the whole game-playing procedure, this is sort of like . So I guess my question is, if you take this kind of dumb agent (that still has to compute the infinite agent) as your baseline and then re-build an infinite tower of agents (playing other agents of the same level) on top of it, does it reconverge to  or does it converge to some weird ?

Corrigibility Can Be VNM-Incoherent

So we have a switch with two positions, "R" and "L."

When the switch is "R," the agent is supposed to want to go to the right end of the hallway, and vice versa for "L" and left. It's not that you want this agent to be uncertain about the "correct" value of the switch and so it's learning more about the world as you send it signals - you just want the agent to want to go to the left when the switch is "L," and to the right when the switch is "R."

If you start with the agent going to the right along this hallway, and you change the switch to "L," and then a minute later change your mind and switch back to "R," it will have turned around and passed through the same spot in the hallway multiple times.

The point is that if you try to define a utility as a function of the state for this agent, you run into an issue with cycles: if every move has to strictly improve the utility, you can never get back to where you were before.
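The cycle argument above can be checked mechanically in a toy setting. This is a minimal sketch with made-up names: the corrigible trajectory walks the agent right, then left, then right again between the same two hallway positions, so a utility over positions would need both U(1) > U(0) and U(0) > U(1).

```python
# Each move is (from_pos, to_pos). The switch flips mid-run, so the
# agent crosses the same two positions in both directions.
moves = [(0, 1), (1, 0), (0, 1)]

def consistent_with_some_utility(moves, candidate_utilities):
    """True if any candidate utility strictly increases on every move."""
    return any(
        all(U[b] > U[a] for a, b in moves) for U in candidate_utilities
    )

# With only two positions there are only two strict orderings to try:
candidates = [{0: 0, 1: 1}, {0: 1, 1: 0}]
print(consistent_with_some_utility(moves, candidates))  # False
```

No assignment of utilities to positions makes every step of the turn-around trajectory "uphill", which is the cycle obstruction in miniature.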

Corrigibility Can Be VNM-Incoherent

I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules. But I don't know how to formalize this or if anyone else has.

Corrigibility Can Be VNM-Incoherent

Someone at the coffee hour (Viktoriya? Apologies if I've forgotten a name) gave a short explanation of this using cycles. If you imagine an agent moving either to the left or the right along a hallway, you can change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.

This basically rules out expected utility maximization (with utility as a discounted sum of state utilities) as the source of this behavior. But you can still imagine selecting a policy such that it takes the right actions in response to you sending it signals. I think a sensible way to do this is like in tailcalled's recent post, with causal counterfactuals for sending one signal or another.
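To illustrate the policy-selection view, here's a hedged sketch (names and setup are illustrative, not from the post): rather than scoring states, we directly pick a policy that conditions on the switch signal, and the turn-around behavior falls out trivially.

```python
def corrigible_policy(position, switch):
    """Step toward whichever end of the hallway the switch currently names."""
    return +1 if switch == "R" else -1

# Simulate flipping the switch mid-run: the agent turns around and
# revisits positions -- behavior no fixed utility-over-positions
# reproduces, but a two-line policy does.
pos, trace = 0, []
for switch in ["R", "R", "L", "L", "R"]:
    pos += corrigible_policy(pos, switch)
    trace.append(pos)
print(trace)  # [1, 2, 1, 0, 1]
```

The point of the contrast: the object being selected is a mapping from signals to actions, so cycles through the same position pose no consistency problem.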

Goodhart: Endgame
  • One thing we can do to help is set up our AI to avoid taking us into weird out-of-distribution situations where my preferences are ill-defined.
  • Another thing we can do to help is have meta-preferences about how to deal with situations where my preferences are ill-defined, and have the AI learn those meta-preferences.

And in fact, "We don't actually want to go to the sort of extreme value that you can coax a model of us into outputting in weird out-of-distribution situations" is itself a meta-preference, and so we might expect something that does a good job of learning about my meta-preferences to either learn about that, or to find it consistent with its starting meta-preferences.

Another thing is, we implicitly trust our own future preferences in weird out-of-distribution situations, because what else can we do? So we can build an AI that we trust for a similar reason: either (A) it's transparent, and we train it to do human-like things for human-like reasons, or (B) it's trained to imitate human cognition.

This is the only bit I'd disagree with.

I wouldn't trust my own evaluations in weird out-of-distribution situations, certainly not if "weird" means "chosen specifically so that Charlie's evaluations are really weird." If we build an AI that we trust, I'm going to trust it to take a look at those weird OOD situations and then not go there.

If it's supervised by humans, humans need to notice that it's e.g. trying to change the environment in ways that break down the concept of human agency and stop it. If it's imitating human reasoning, it needs to imitate the same sort of reasoning I've used just now.

I'd also be interested in a compare/contrast with, say, this Stuart Armstrong post.

This is super similar to a lot of Stuart Armstrong's stuff. Human preferences are under-defined, there's a "non-obvious" part of what we think of as Goodhart's law that's related to this under-definition, but it's okay, we can just pick something that seems good to us - these are all Stuart Armstrong ideas more than Charlie Steiner ideas.

The biggest contrast is pointed to by the fact that I didn't use the word "utility" all sequence long (iirc). In general, I think I'm less interested than him in trying to jump into constructing imperfect models of humans with the tools at hand, and more interested in (or at least more focused on) new technologies and insights that would enable learning the entire structure of those models. I think we also have different ideas about how to do more a priori thinking to get better at evaluating proposals for value learning, but it's hard to articulate.

Models Modeling Models

This was a whole 2 weeks ago, so all I can say for sure is that I was at least unclear about your point.

But I feel like I kind of gave a reply anyway - I don't think the parallel with subagents is very deep. But there's a very strong parallel (or maybe not even a parallel, maybe this is just the thing I'm talking about) with generative modeling.

Ngo and Yudkowsky on AI capability gains

Parts of this remind me of flaming my team in a cooperative game.

A key rule to remember about team chat in videogames is that chat actions are moves in the game. It might feel satisfying to verbally dunk on my teammate for a̶s̶k̶i̶n̶g̶ ̶b̶i̶a̶s̶e̶d̶ ̶̶q̶u̶e̶s̶t̶i̶o̶n̶s̶ not ganking my lane, and I definitely do it sometimes, but I do it less if I occasionally think "what chat actions can help me win the game from this state?"

This is less than maximally helpful advice in a conversation where you're not sure what "winning" looks like. And some of the more obvious implications might look like the dreaded social obeisance.

Ngo and Yudkowsky on alignment difficulty

Ngo is very patient and understanding.

Perhaps... too patient and understanding. Richard! Blink twice if you're being held against your will!


(I too would like you to write more about agency :P)

"Summarizing Books with Human Feedback" (recursive GPT-3)

Ah, yeah. I guess this connection makes perfect sense if we're imagining supervising black-box-esque AIs that are passing around natural language plans.

Although that supervision problem is more like... summarizing Alice in Wonderland if all the pages had gotten ripped out and put back in random order. Or something. But sure, baby steps.

"Summarizing Books with Human Feedback" (recursive GPT-3)

I'd heard about this before, but not the alignment spin on it. This is more interesting to me from a capabilities standpoint than an alignment standpoint, so I had assumed that this was motivated by the normal incentives for capabilities research. I'd be interested if I'm in fact wrong, or if it seems more alignment-y to other people.
