Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.


Reducing Goodhart



A nice exposition.

For myself I'd prefer the same material much more condensed and to-the-point, but I recognize that there are publication venues that prefer more flowing text.

E.g. compare

We turn next to the laggard. Compared to the fixed roles model, the laggard’s decision problem in the variable roles model is more complex primarily in that it must now consider the expected utility of attacking as opposed to defending or pursuing other goals. When it comes to the expected utility of defending or pursuing other goals, we can simply copy the formulas from Section 7. To calculate the laggard’s expected utility of attacking, however, we must make two changes to the formula that applies to the leader. First, we must consider the probability that choosing to attack rather than defend will result in the laggard being left defenseless if the leader executes an attack. Second, as we saw, the victory condition for the laggard’s attack requires that AT + LT < DT. Formally, we have:


Unlike in the fixed roles model, the laggard now has the same decisions available as the leader. However, the laggard must consider that attacking may leave them defenseless if the leader attacks. Also, of course, the victory conditions for attack and defense have the lag time on the other side.

Two suggestions for things to explore:

First, people often care about the Nash equilibria of games. For the simple game with perfect information this might be trivial, but it's at least a little interesting with imperfect information.

Second, what about bargaining? Attacking and defending are costly, and AIs might be able to make agreements that they literally cannot break, essentially turning a multipolar scenario into a unipolar scenario where the effective goal is achieving a Pareto optimum of the original goals. Which Pareto optimum exactly will depend on things like the available alternatives, i.e. the power differential. I'm not super familiar with the bargaining literature, so I can't point you at great academic references, just blog posts.
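To illustrate the Nash equilibrium suggestion, here's a minimal sketch of finding pure-strategy equilibria in a 2x2 leader/laggard game by checking mutual best responses. The payoff numbers and the `pure_nash` helper are purely illustrative assumptions of mine, not taken from the paper:

```python
import itertools

# Hypothetical payoffs: rows = leader's action, cols = laggard's action,
# entries = (leader utility, laggard utility).
payoffs = {
    ("attack", "attack"): (2, 0),
    ("attack", "defend"): (1, 1),
    ("defend", "attack"): (0, 2),
    ("defend", "defend"): (3, 3),
}

def pure_nash(payoffs, actions=("attack", "defend")):
    """Return all pure-strategy Nash equilibria of a 2-player game."""
    equilibria = []
    for a, b in itertools.product(actions, repeat=2):
        u1, u2 = payoffs[(a, b)]
        # (a, b) is an equilibrium iff neither player can gain
        # by unilaterally deviating.
        if all(payoffs[(a2, b)][0] <= u1 for a2 in actions) and \
           all(payoffs[(a, b2)][1] <= u2 for b2 in actions):
            equilibria.append((a, b))
    return equilibria
```

With these made-up payoffs, mutual defense is the unique pure equilibrium; the interesting versions of the question involve imperfect information, where you'd look at mixed strategies instead.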

My thoughts on the strategy are that this is overly optimistic. The picture where you have ten AGIs and exactly one of them is friendly is unlikely, due to the logistic success curve. And if the heterogeneity of the AGIs is due to heterogeneity of humans (maybe Facebook builds one AI and Google builds another, or maybe good open-source AI tools let lots of individuals build AGIs around the same time) rather than stochasticity of outcomes given humanity's best AGI designs, why would the lab building the unfriendly AGI also use your safeguard interventions?

I also expect that more realistic models will increasingly favor the leader, since they can bring information and resources to bear in a way that doesn't just look like atomic "Attack" or "Defend" actions. This isn't necessarily bad, but it definitely makes it more important to get things right on the first try.

It might be worth going into the problem of fully updated deference. I don't think it's necessarily always a problem, but also it does stop utility aggregation and uncertainty from being a panacea, and the associated issues are probably worth a bit of discussion. And as you likely know, there isn't a great journal citation for this, so you could really cash in when people want to talk about it in a few years :P

Yes, this is fine to do, and it prevents single-shot problems if you have a particular picture of the distribution over outcomes, where most disastrous risk comes from edge cases that get a 99.99%ile score but are actually bad, and all we need are actions that are 99th percentile.

This is fine if you want your AI to stack blocks on top of other blocks or something.

But unfortunately when you want to use a quantilizer to do something outside the normal human distribution, like cure cancer or supervise the training of a superhuman AI, you're no longer just shooting for a 99%ile policy. You want the AI to do something unlikely, and so of course, you can't simultaneously restrict it to only do likely things.

Now, you can construct a policy where each individual action is exactly 99%ile on some distribution of human actions. The results quickly go off-distribution and end up somewhere new - chaining together 100 99%ile actions is not the same as picking a 99%ile policy. This means you don't get the straightforward quantilizer-style guarantee, but maybe it's not all bad - after all, we wanted to go off distribution to cure cancer, and maybe by taking actions with some kind of moderation we can get some other kind of benefits we don't quite understand yet. I'm reminded of Peli Grietzer's recent post on virtue ethics.
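For concreteness, the basic quantilizer construction under discussion (draw from a base distribution, but act only from the top q fraction by utility) can be sketched roughly like this; the `quantilize` helper and its parameters are my own illustration, not a standard API:

```python
import random

def quantilize(base_sampler, utility, q=0.01, n=10_000):
    """Sample an action uniformly from the top q fraction
    (by utility) of n draws from the base distribution."""
    actions = [base_sampler() for _ in range(n)]
    actions.sort(key=utility, reverse=True)
    top = actions[:max(1, int(q * n))]
    return random.choice(top)
```

The guarantee comes from the base distribution: any single draw is at most 1/q times likelier under the quantilizer than under the base. Calling this once per timestep, as described above, compounds that factor across steps, which is exactly why chaining 100 per-step 99%ile actions is not the same as sampling a 99%ile policy.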

I feel like the "obvious" thing to do is to ask how rare (in bits) the post-optimization EV is according to the pre-optimization distribution. Like, suppose that pre-optimization my probability distribution over utilities I'd get is normally distributed, and after optimizing my EV is +1 standard deviation. The probability of doing that well or better is about 0.159, which is 2.65 bits.

Seems invariant under affine transformations of the utility function, adding irrelevant states, splitting/merging states, etc. What are some bad things about this method?
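The arithmetic in that proposal can be checked directly from the standard normal tail probability (the function name here is mine):

```python
import math

def bits_of_optimization(z):
    """Rarity, in bits, of achieving at least z standard deviations
    above the mean under a standard normal prior over utility."""
    # Tail probability: P(Z >= z) = 0.5 * erfc(z / sqrt(2))
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return -math.log2(p)

# +1 SD: tail probability ~0.159, i.e. ~2.66 bits of optimization
```

Because the measure only depends on tail probabilities of the pre-optimization distribution, any monotone (in particular affine) transformation of utility leaves it unchanged, which is where the invariance properties come from.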

I'm definitely guilty of getting a disproportionate amount of information from the AI safety community.

I don't really have a good cure for it, but I do think having a specific question helps. It's simply not practical to keep up with the entire literature, and I don't have a good filtering mechanism for what to keep up with in general, but if I'm interested in a specific question I can usually crawl the search engine + citation network well enough to get a good representation of the literature.

You might be interested in Reducing Goodhart. I'm a fan of "detecting and avoiding internal Goodhart," and I claim that's a reflective version of the value learning problem.

Partial solutions which likely do not work in the limit 

Taking meta-preferences into account

Naive attempts just move the problem up a meta level. Instead of conflicting preferences, there is now conflict between (preferences + meta-preferences) equilibria. Intuitively, at least for humans, there are multiple or many fixed points, possibly infinitely many.

As a fan of accounting for meta-preferences, I've made my peace with multiple fixed points, to the extent that it now seems wild to expect otherwise.

Like, of course there are multiple ways to model humans as having preferences, and of course this can lead to meta-preference conflicts with multiple stable outcomes. Those are just the facts, and any process that says it has a unique fixed point will have some place where it puts its thumb on the scales.

Plenty of the fixed points are good. There's not "one right fixed point," which makes all other fixed points "not the right fixed point" by contrast. We just have to build a reasoning process that's trustworthy by our own standards, and we'll go somewhere fine.

Nice. My main issue is that just because humans have values a certain way, doesn't mean we want to build an AI that way, and so I'd draw pretty different implications for alignment. I'm pessimistic about anything that even resembles "make an AI that's like a human child," and more interested in "use a model of a human child to help an inhuman AI understand humans in the way we want."

The world model is learnt mostly by unsupervised predictive learning, and so is somewhat orthogonal to the specific goal. Of course, in practice, in a continual learning setting, what you do and pay attention to (which is affected by your goal) will affect the data input to the unsupervised learning process.

afaict, a big fraction of evolution's instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but because having a detailed model of other humans will be valuable later. Young children being curious about animals is a human universal. Etc.

Patterns of behavior (some of which I'd include in my goals) encoded in my model can act in a way that's somewhere between unconscious and too obvious to question - you might end up doing things not because you have visceral feelings about the different options, but simply because your model is so much better at some of the options that the other options never even get considered.
