Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.


Reducing Goodhart

Wiki Contributions


How would one use this to inform decomposition?

What I want are some human-meaningful features that can get combined in human-meaningful ways.

E.g. you take a photo of a duck, you take a feature that means "this photo was taken on a sunny day," and then you do some operation to smush these together and you get a photo of a duck taken on a sunny day.

If features are vectors of fixed direction with size drawn from a distribution, which is my takeaway from the superposition paper, then the smushing-together operation is addition (maybe conditional on the dot product of the current image with the feature being above some threshold).

If the on-distribution data points get mapped to regions of the activation space with lots of large polytopes, how does this help us extract some commonality from a bunch of photos of sunny days, and then smush that commonality together with a photo of a duck to get a photo of a duck on a sunny day?

Not a rhetorical question, I'm trying to think through it. It's just hard.

Maybe you'd think of the commonality between the sunny-day pictures in terms of their codes? You'd toss out the linear part and just say that what the sunny-day pictures have in common is that they have some subset of the nonlinearities that all tend to be in the same state. And so you could make the duck picture more sunny by flipping that subset of neurons to be closer to the sunny-day mask.

It strikes me that the kind of self-supervision you describe is a suspiciously similar to trying to incorporate meta-preferences in the outer objective by self-modeling. When the model understands humans differently, it changes its notion of what it is to be misaligned or deceptive, which gets used to form a loss term against that kind of behavior, which then impacts how it understands humans.

This analogy:

  • makes me significantly more optimistic about interpretability inside the training loop.
  • makes me slightly more pessimistic about meta-preferences incorporated as fine-tuning based on self-modeling.
  • clarifies preference-related arguments for why getting the former to work means fulfilling more desiderata than current basic schemes do.

I think this really incentivizes things like network dissection over "interpret this neuron" approaches.

pg 6 "there exist" -> "there exists"

pg 13 maybe specify that you mean a linear functional that cannot be written as an integral (I quickly jumped ahead after thinking of one where you don't need to take any integrals to evaluate it)

I'd be interested :) I think my two core concerns are that our rules/norms are meant for humans, and that even then, actors often have bad impacts that would only be avoided with a pretty broad perspective about their responsibilities. So an AI that follows rules/norms well can't just understand them on the object level, it has to have a really good understanding of what it's like to be a human navigating these rules/norms, and use that understanding to make things go well from a pretty broad perspective.

That first one means that not only do I not want the AI to think about what rules mean "in a vacuum," I don't even want it to merely use human knowledge to refine its object-level understanding of the rules. Because the rules are meant for humans, with our habits and morals and limitations, and our explicit understanding of them only works because they operate in an ecosystem full of other humans. I think our rules/norms would fail to work if we tried to port them to a society of octopuses, even if those octopuses were to observe humans to try to improve their understanding of the object-level impact of the rules.

An example (maybe not great because it only looks at one dimension of the problem) is that our norms may implicitly assume a certain balance between memetic offense and defense that AIs would upset. E.g. around governmental lobbying (those are also maybe a bad example because they're kinda insufficient already).


While listening to the latest Inside View podcast, it occurred to me that this perspective on AI safety has some natural advantages when translating into regulation that present governments might be able to implement to prepare for the future. If AI governance people aren't already thinking about this, maybe bother some / convince people in this comment section to bother some?

Upvoted, and even agree with everything about enlightened compliance, but I think this framing of the problem is bad because everything short of enlightened compliance is so awful. The active ingredients in making things go well are not the norms, which if interpreted literally by AI will just result in blind rules-lawyering in a way that alienates humans - the active ingredients are precisely the things that separate enlightened compliance from everything else. You can call it learning human values or learning to follow the spirit of the law, it's basically the same computations, with basically the same research avenues and social/political potential pitfalls

I also responded to Capybasilisk below, but I want to chime in here and use your own post against you, contra point 2 :P

It's not so easy to get "latent knowledge" out of a simulator - it's the simulands who have the knowledge, and they have to be somehow specified before you can step forward the simulation of them. When you get a text model to output a cure for Alzheimer's in one step, without playing out the text of some chain of thought, it's still simulating something to produce that output, and that something might be an optimization process that is going to find lots of unexpected and dangerous solutions to questions you might ask it.

Figuring out the alignment properties of simulated entities running in the "text laws of physics" seems like a challenge. Not an insurmountable challenge, maybe, and I'm curious about your current and future thoughts, but the sort of thing I want to see progress in before I put too much trust in attempts to use simulators to do superhuman abstraction-building.

Ah, the good old days post-GPT-2 when "GPT-3" was the future example :P

I think back then I still thoroughly understimated how useful natural-language "simulation" of human reasoning would be. I agree with janus that we have plenty of information telling us that yes, you can ride this same training procedure to very general problem solving (though I think including more modalities, active leaning, etc. will be incorporated before anyone really pushes brute force "GPT-N go brrr" to the extreme).

This is somewhat of a concern for alignment. I more or less stand by that comment you linked and its children; in particular, I said

The search thing is a little subtle. It's not that search or optimization is automatically dangerous - it's that I think the danger is that search can turn up adversarial examples / surprising solutions.

I mentioned how I think the particular kind of idiot-proofness that natural language processing might have is "won't tell an idiot a plan to blow up the world if they ask for something else." Well, I think that as soon as the AI is doing a deep search through outcomes to figure out how to make Alzheimer's go away, you lose a lot of that protection and I think the AI is back in the category of Oracles that might tell an idiot a plan to blow up the world.

Simulating a reasoner who quickly finds a cure for Alzheimer's is not by default safe (even though simulating a human writing in their diary is safe). Optimization processes that quickly find cures for Alzheimer's are not humans, they must be doing some inhuman reasoning, and they're capable of having lots of clever ideas with tight coupling to the real world.

I want to have confidence in the alignment properties of any powerful optimizers we unleash, and I imagine we can gain that confidence by knowing how they're constructed, and trying them out in toy problems while inspecting their inner workings, and having them ask humans for feedback about how they should weigh moral options, etc. These are all things it's hard to do for emergent simulands inside predictive simulators. I'm not saying it's impossible for things to go well, I'm about evenly split on how much I think this is actually harder, versus how much I think this is just a new paradigm for thinking about alignment that doesn't have much work in it yet.

Could you clarify what you mean by values not being "hack after evolutionary hack"?

What this sounds like, but I think you don't mean: "Human values are all emergent from a simple and highly general bit of our genetic blueprint, which was simple for evolution to find and has therefore been unchanged more or less since the invention of within-lifetime learning. Evolution never developed a lot of elaborate machinery to influence our values."

What I think you do mean: "Human values are emergent from a simple and general bit of our genetic blueprint (our general learning algorithm), plus a bunch of evolutionary nudges (maybe slightly hackish) to guide this learning algorithm towards things like friendship, eating when hungry, avoiding disgusting things, etc. Some of these nudges generalize so well they've basically persisted across mammalian evolution, while some of them humans only share with social primates, but the point is that even though we have really different values from chimpanzees, that's more because our learning algorithm is scaled up and our environment is different, the nudges on the learning algorithm have barely had to change at all."

What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."

This is outstanding. I'll have other comments later, but first I wanted to praise how this is acting as a synthesis of lots of previous ideas that weren't ever at the front of my mind.

Load More