Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.


Reducing Goodhart



Since it was evidently A Thing, I have caved to peer pressure :P

"Shard theory doesn't need more work" (in sense 2) could be true as a matter of fact, without me knowing it's true with high confidence. If you're saying "for us to become highly confident that alignment is going to work this way, we need more info", I agree. 

But I read you as saying "for this to work as a matter of fact, we need X Y Z additional research":

Yeah, this is a good point. I do indeed think that just plowing ahead wouldn't work as a matter of fact, even if shard theory alignment is easy-in-the-way-I-think-is-plausible, and I was vague about this.

This is because the way in which I think it's plausible for it to be easy is some case (3) that's even more restricted than (1) or (2).  Like 3: If we could read the textbook from the future and use its ontology, maybe it would be easy / robust to build an RL agent that's aligned because of the shard theory alignment story.

To back up: in nontrivial cases, robustness doesn't exist in a vacuum - you have to be robust to some distribution of perturbations. For shard theory alignment to be easy, it has to be robust to the choices we have to make about building AI, and specifically to the space of different ways we might make those choices. This space of different ways we could make choices depends on the ontology we're using to think about the problem - a good ontology / way of thinking about the problem makes the right degrees of freedom "obvious," and makes it hard to do things totally wrong.

I think in real life, if we think "maybe this doesn't need more work and we just don't know it yet," what's actually going to happen is that for some of the degrees of freedom we need to set, we're going to be using an ontology that allows for perturbations where the thing's not robust, depressing the chances of success exponentially.
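The "exponentially" above can be made concrete with a toy calculation (my own illustration, not from the original comment): if alignment must survive n independent design choices, and a mismatched ontology makes each choice land in the robust region only with some probability p, the overall chance of success is p^n.

```python
# Toy model (illustrative assumption, not a claim from the comment):
# n independent design choices, each robust with probability p.

def success_probability(p_per_choice: float, n_choices: int) -> float:
    """Chance that every one of n independent design choices lands in
    the robust region, assuming the choices are independent."""
    return p_per_choice ** n_choices

# Even at 90% robustness per choice, 20 choices make success unlikely:
print(round(success_probability(0.9, 20), 3))  # -> 0.122
```

The point of the sketch is just that per-choice robustness has to be very close to 1, or the number of unguided choices very small, for "plow ahead and hope" to have decent odds.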

I think I disagree with lots of things in this post, sometimes in ways that partly cancel each other out.

  • Parts of generalizing correctly involve outer alignment. I.e. building objective functions that have "something to say" about how humans want the AI to generalize.
  • Relatedly, outer alignment research is not done, and RLHF/P is not the be-all-end-all.
  • I think we should be aiming to build AI CEOs (or more generally, working on safety technology with an eye towards how it could be used in AGI that skillfully navigates the real world). Yes, the reality of the game we're playing with gung-ho orgs is more complicated, but sometimes, if you don't, someone else really will.
  • Getting AI systems to perform simpler behaviors safely also looks like capabilities research. When you say "this will likely require improving sample efficiency," a bright light should flash. This isn't a fatal problem - some amount of advancing capabilities is just a cost of doing business. There exists safety research that doesn't advance capabilities, but that subset has a lot of restrictions on it (little connection to ML being the big one). Rather than avoiding ever advancing AI capabilities, we should acknowledge that fact in advance and try to make plans that account for it.

The memo trap reminds me of the recent work from Anthropic on superposition, memorization, and double descent - it's plausible that there's U-shaped scaling in there somewhere for similar reasons. But because of the exponential scaling of how good superposition is for memorization, maybe the paper actually implies the opposite? Hm.

I'll have to eat the downvote for now - I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."

I think it's quite plausible that you don't need much more work for shard theory alignment, because value formation really is that easy / robust.

But how do we learn that fact?

If extremely-confident-you says "the diamond-alignment post would literally work" and I say "what about these magical steps where you make choices without knowing how to build confidence in them beforehand" and extremely-confident-you says "don't worry, most choices work fine because value formation is robust," how did they learn that value formation is robust in that sense?

I think it is unlikely but plausible that shard theory alignment could turn out to be easy, if only we had the textbook from the future. But I don't think it's plausible that getting that textbook is easy. Yes, we have arguments about human values that are suggestive, but I don't see a way to go from "suggestive" to "I am actually confident" that doesn't involve de-mystifying the magic.

Here's a place where I want one of those disagree buttons separate from the downvote button :P

Given a world model that contains a bunch of different ways of modeling the same microphysical state (splitting up the same world into different parts, with different saliency connections to each other, like the discussion of job vs. ethnicity and even more so), there can be multiple copies that coarsely match some human-intuitive criteria for a concept, given different weights by the AI. There will also be ways of modeling the world that don't get represented much at all, and which ways get left out can depend on how you're training this AI (and a bit more subtly, how you're interpreting its parameters as a world model).

Especially because of that second part, finding good goals in an AI's world model isn't satisfactory if you're just training a fixed, arbitrary AI. Your process for finding good goals needs to interact with how the AI learns its model of the world in the first place. In which case, world-model interpretability is not all we need.

Nice! Thinking about "outer alignment maximalism" as one of these framings reveals that it's based on treating outer alignment as something like "a training process that's a genuinely best effort at getting the AI to learn to do good things and not bad things" (and so of course the pull-request AI fails because it's teaching the AI about pull requests, not about right and wrong).

Introspecting, this choice of definition seems to correspond with feeling a lack of confidence that we can get the pull-request AI to behave well - I'm sure it's a solvable technical problem, but in this mindset it seems like an extra-hard alignment problem because you basically have to teach the AI human values incidentally, rather than because it's your primary intent.

Which gets me thinking about what other framings correspond to what other intuitions about what parts of the problem are hard / what the future will look like.

This was an important and worthy post.

I'm more pessimistic than Ajeya; I foresee thorny meta-ethical challenges with building AI that does good things and not bad things, challenges not captured by sandwiching on e.g. medical advice. We don't really have much internal disagreement about the standards by which we should judge medical advice, or the ontology in which medical advice should live. But there are lots of important challenges that are captured by sandwiching problems - sandwiching requires advances in how we interpret human feedback, and how we try to elicit desired behaviors from large ML models. This post was a step forward in laying out near-term alignment research that's appealing to researchers used to working on capabilities.

Of course, this is all just my opinion. But while I would love to point to the sandwiching work done in the last 20 months that bears on the truth of the matter, there hasn't been all that much. Meanwhile a small cottage industry has sprung up of people using RLHF to fine-tune language models to do various things, so it's not like people aren't interested in eliciting specific behaviors from models - it just seems like we're in a research environment where handicapping your model's capabilities in pursuit of deeper understanding is less memetically fit.

The big follow-up I'm aware of is Sam Bowman's group currently working on artificial sandwiching. That work avoids some of the human-interaction challenges, but is still potentially interesting research on controlling language models with language models. I've also myself used this post as a template for suggesting prosaic alignment research on more than one occasion, all of it still in the works. So maybe the research inspired by this post needs a bit longer to bear fruit.

Family's coming over, so I'm going to leave off writing this comment even though there are some obvious hooks in it that I'd love to come back to later.

  • If the AI can't practically distinguish mechanisms for good vs. bad behavior even in principle, why can the human distinguish them? If the human can't distinguish them, why do we think the human is asking for a coherent thing? If the human isn't asking for a coherent thing, we don't have to smash our heads against that brick wall, we can implement "what to do when the human asks for an incoherent thing" contingency plans.
    • (e.g. have multiple ways of connecting abstract models that have this incoherent thing as a basic labeled cog to other more grounded ways of modeling the world, and do some conservative averaging over those models to get a concept that tends to locally behave like humans expect the incoherent thing to behave on everyday cases.)
    • Our big advantage over philosophy is getting to give up when we've asked for something impossible, and ask for something possible. That said, it's premature to declare any of the problems mentioned here impossible - but it means that case analysis type reasoning where we go "but what if it's impossible?" seems like it should be followed with "here's what we might try to do instead of the impossible thing."
  • It seems likely that there are concepts that aren't guaranteed to be learned by arbitrary AIs, even arbitrary AIs from the distribution of AI designs humans would consider absent alignment concerns, but still can be taught to an AI that you get to design and build from the ground up.
    • So checking whether an arbitrary AI "knows a thing" is impressive, but to me seems impressive in a kind of tying-one-hand-behind-your-back way.
    • Is getting to build the AI really more powerful? It doesn't seem to literally work with worst-case reasoning. Maybe I should try to find a formalization of this in terms of non-worst case reasoning.
      • The guarantees I'm used to rely on there being some edge that the thing we want to teach has. But as in this post, maybe the edge is inefficient to find. Also, maybe there is no edge on a predictive objective, and we need to lay out some kind of feedback scheme from humans, which has problems that are more like philosophy than learning theory.
  • The "assume the AI knows what's going on" test seems shaky in the real world. If we ask an AI to learn an incoherent concept, it will likely still learn some concept based on the training data that helps it get better test scores.
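One sub-bullet above mentions "conservative averaging" over multiple ways of modeling the world as a contingency plan for incoherent concepts. Here's a minimal toy sketch of what "conservative" could mean; the function name, the scalar-score framing, and the abstain-on-disagreement rule are all my own invented simplifications:

```python
from statistics import mean

def conservative_concept(scores: list[float], tolerance: float = 0.1):
    """Aggregate per-model scores for a concept, but abstain (return None)
    when the underlying models disagree by more than `tolerance` -
    i.e. only act where the concept behaves the same across models."""
    if max(scores) - min(scores) > tolerance:
        return None  # models disagree: trigger the contingency plan instead
    return mean(scores)

print(conservative_concept([0.80, 0.82, 0.79]))  # agreement: averaged score
print(conservative_concept([0.10, 0.90]))        # disagreement: None
```

On everyday cases where the different groundings of the concept agree, this behaves like the concept humans expect; off-distribution, where the incoherence shows up as inter-model disagreement, it refuses rather than extrapolating.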

Did you ever end up reading Reducing Goodhart? I enjoyed reading these thought experiments, but I think rather than focusing on "the right direction" (of wisdom), or "the right person," we should mostly be thinking about "good processes" - processes for evolving humans' values that humans themselves think are good, in the ordinary way we think ordinary good things are good.
