Paul Christiano


Iterated Amplification


Avoiding the instrumental policy by hiding information about humans

Unpacking "mutual information," it seems like these designs basically take the form of an adversarial game:

  • The model computes some intermediate states.
  • An adversary tries to extract facts about the "unknowable" X.
  • The model is trained so that the adversary can't succeed.

But this rests on the adversary not already knowing about X (otherwise we couldn't measure whether the adversary succeeds).

In the case of mutual information, this is achieved formally by having a random variable that the adversary does not observe directly. If we are talking about "what humans are like" then we can't take the naive approach of mutual information (since we can't deploy the entire training process many times in different worlds where humans are different). So what do we do instead?

The obvious approach is to just train the adversary to answer questions about humans, but then we somehow need to prevent the adversary from simply learning the facts themselves. If instead we don't give the adversary much time to learn, or much compute to work with, then we need to worry about cases where the model learns about X but is able to easily obscure that information from the adversary.

(Mostly I'm dissuaded from this approach by other considerations, but I am still interested in whether we could make anything along these lines actually work.)

A naive alignment strategy and optimism about generalization

I agree you have to do something clever to make the intended policy plausibly optimal.

The first part of my proposal in section 3 here was to avoid using "imitate humans," and to instead learn a function "Answer A is unambiguously worse than answer B." Then we update against policies only when they give unambiguously worse answers.

(I think this still has a lot of problems; it's not obvious to me whether the problem is soluble.)

Teaching ML to answer questions honestly instead of predicting human answers

I think they need to be exactly equal. I think this is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy (since it will do the right thing when the simple labels are wrong).

Teaching ML to answer questions honestly instead of predicting human answers

I think "deferring" was a bad word for me to use. I mostly imagine the complex labeling process will just independently label data, and then only include datapoints when there is agreement. That is, you'd just always return the (simple, complex) pair, and is-correct basically just tests whether they are equal.

I said "defer" because one of the data that the complex labeling process uses may be "what a human who was in the room said," and this may sometimes be a really important source of evidence. But that really depends on how you set things up, if you have enough other signals then you would basically always just ignore that one. 

(That said, I think probably amplification is the most important difference between the simple and complex labeling processes, because that's the only scalable way to inject meaningful amounts of extra complexity into the complex labeling process---since the ML system can't predict itself very well, it forces it to basically try to win a multiplayer game with copies of itself, and we hope that's more complicated. And if that's the case then the simple labeling process may as well use all of the data sources, and the difference is just how complex a judgment we are making using those inputs.)

Teaching ML to answer questions honestly instead of predicting human answers

I don't think anyone has a precise general definition of "answer questions honestly" (though I often consider simple examples in which the meaning is clear). But we do all understand how "imitate what a human would say" is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards "imitate what a human would say" is clearly a problem to be solved even if other concepts are philosophically ambiguous.

Sometimes a model might say something like "No one entered the datacenter" when what they really mean is "Someone entered the datacenter, got control of the hard drives with surveillance logs, and modified them to show no trace of their presence." In this case I'd say the answer is "wrong;" when such wrong answers appear as a critical part of a story about catastrophic failure, I'm tempted to look at why they were wrong to try to find a root cause of failure, and to try to look for algorithms that avoid the failure by not being "wrong" in the same intuitive sense. The mechanism in this post is one way that you can get this kind of wrong answer, namely by imitating human answers, and so that's something we can try to fix.

On my perspective, the only things that are really fundamental are:

  • Algorithms to train ML systems. These are programs you can run.
  • Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).

Everything else is just a heuristic to help us understand why an algorithm might work or where we might look for a possible failure story.

I think this is one of the upsides of my research methodology---although it requires people to get on the same page about algorithms and about predictions (of the form "X could happen"), we don't need to start on the same page about all the other vague concepts. Instead we can develop shared senses of those concepts over time by grounding them out in concrete algorithms and failure stories. I think this is how shared concepts are developed in most functional fields (e.g. in mathematics you start with a shared sense of what constitutes a valid proof, and then build shared mathematical intuitions on top of that by seeing what successfully predicts your ability to write a proof).

Teaching ML to answer questions honestly instead of predicting human answers

Also, I don't see what this objective has to do with learning a world model.

The idea is to address a particular reason that your learned model would "copy a human" rather than "try to answer the question well." Namely, the model already contains human-predictors, so building extra machinery to answer questions (basically translating between the world model and natural language) would be more inefficient than just using the existing human predictor. The hope is that this alternative loss allows you to use the translation machinery to compress the humans, so that it's not disfavored by the prior.

I don't think it's intrinsically related to learning a world model, it's just an attempt to fix a particular problem.

To the extent that there is a problem with the proposed approach---either a reason that this isn't a real problem in the standard approach, or a reason that this proposed approach couldn't address the problem (or would inevitably introduce some other problem)---then I'm interested in that.

Isn't the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂=argmax L + prior?

Why would it be maximized there? Isn't it at least better to make ?

And then in the section I'm trying to argue that the final term (the partition function) in the loss means that you can potentially get a lower loss by having  push apart the two heads in such a way that improving the quality of the model pushes them back together. I'm interested in anything that seems wrong in that argument.

(I don't particularly believe this particular formulation is going to work, e.g. because the L2 regularizer pushes θ₁ to adjust each parameter halfway, while the intuitive argument kind of relies on it being arbitrary what you put into θ₁ or θ₂, as it would be under something more like an L1 regularizer. But I'm pretty interested in this general approach.)

Two caveats were: (i) this isn't going to actually end up actually making any alternative models lower loss, it's just going to level the playing field such that a bunch of potential models have similar loss (rather than an inductive bias in favor of the bad models), (ii) in order for that to be plausible you need to have a stop grad on one of the heads in the computation of C, I maybe shouldn't have push that detail so late.

Response to "What does the universal prior actually look like?"

By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?

I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?

If you want to compare most easily to models like that, then instead of using (1)+(2)+(3) you should compare to (6') = "Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influence prior."

Then the comparison is between specifying "important predictor to influence" and whatever the easiest-to-specify pattern that can be engineered by a consequentialist. It feels extremely likely to me that the second category is easier, indeed it's kind of hard for me to see any version of (6) that doesn't have an obviously simpler analog that could be engineered by a sophisticated civilization.

With respect to (4)+(5), I guess you are saying that your point estimate is that only 1/million of consequentialists decide to try to influence the universal prior. I find that surprisingly low but not totally indefensible, and it depends on exactly how expensive this kind of influence is. I also don't really see why you are splitting them apart, shouldn't we just combine them into "wants to influence predictors"? If you're doing that presumably you'd both use the anthropic prior and then the treacherous turn.

But it's also worth noting that (6') gets to largely skip (4') if it can search for some feature that is mostly brought about deliberately by consequentialists (who are trying to create a beacon recognizable by some program that scans across possible worlds looking for it, doing the same thing that "predictor that influences the future" is doing in (6)).

Response to "What does the universal prior actually look like?"

Here's my current understanding of your position:

  1. The easiest way to specify an important prediction problem (in the sense of a prediction that would be valuable for someone to influence) is likely to be by saying "Run the following Turing machine, then pick an important decision from within it." Let's say the complexity of that specification is N bits.
  2. You think that if consequentialists dedicate some fraction of their resources to doing something that's easy for the universal prior to output, it will still likely take more than N bits or not much less.
  3. [Probably] You think the differences may be small enough that they can be influenced by factors of 1/1000 or 1/billion (i.e. 10-30 bits) of improbability of consequentialists spending significant resources in this task.
  4. [Probably] You think the TM-definition update (where the manipulators get to focus on inductors who put high probability on their own universe) or the philosophical sophistication update (where manipulators use the "right" prior over possible worlds rather than choosing some programming language) are small relative to these other considerations.

I think the biggest disagreement is about 1+2. It feels implausible to me that "sample a data stream that is being used by someone to make predictions that would be valuable to manipulate" is simpler than any of the other extraction procedures that consequentialists could manipulate (like sample the sequence that appears the most times, sample the highest energy experiments, sample the weirdest thing on some other axis...)

But suppose they picked only one string to try to manipulate. The cost would go way down, but then it probably wouldn’t be us that they hit.

I think we're probably on the same page now, but I'd say: the consequentialists can also sample from the "important predictions" prior (i.e. the same thing as that fragment of the universal prior). If "sample output channel controlled by consequentialists" has higher probability than "Sample an important prediction," then the consequentialists control every important prediction. If on the other hand "Sample an important prediction" has higher probability than the consequentialists, I guess maybe they could take over a few predictions, but unless they were super close it would be a tiny fraction and I agree we wouldn't care.

Decoupling deliberation from competition

I agree that biological human deliberation is slow enough that it would need to happen late.

By "millennia" I mostly meant that traveling is slow (+ the social costs of delay are low, I'm estimating like 1/billionth of value per year of delay). I agree that you can start sending fast-enough-to-be-relevant ships around the singularity rather than decades later. I'd guess the main reason speed matters initially is for grabbing resources from nearby stars under whoever-gets-their-first property rights (but that we probably will move away from that regime before colonizing).

I do expect to have strong global coordination prior to space colonization. I don't actually know if you would pause long enough for deliberation amongst biological humans to be relevant. So on reflection I'm not sure how much time you really have as biological humans. In the OP I'm imagining 10+ years (maybe going up to a generation) but that might just not be realistic.

Probably my single best guess is that some (many?) people would straggle out over years or decades (in the sense that relevant deliberation for controlling what happens with their endowment would take place with biological humans living on earth), but that before that there would be agreements (reached at high speed) to avoid them taking a huge competitive hit by moving slowly.

But my single best guess is not that likely and it seems much more likely that something else will happen (and even that I would conclude that some particular other thing is much more likely if I thought about it more).

Decoupling deliberation from competition

I think I'm basically optimistic about every option you list.

  • I think space colonization is extremely slow relative to deliberation (at technological maturity I think you probably have something like million-fold speedup over flesh and blood humans, and colonization takes place over decades and millennia rather than years). Deliberation may not be "finished" until the end of the universe, but I think we will e.g. have deliberated enough to make clear agreements about space colonization / to totally obsolete existing thinking / likely to have reached a "grand compromise" from which further deliberation can be easily decentralized.
  • I think it's very easy for someone to purchase a slice of every ship or otherwise ensure representation, and have a delegate they trust (perhaps the same one they would have used for deliberating locally, e.g. just a copy of their favorite souped-up emulation) on every ship. The technology for that seems to come way before tech for maximally fast space colonization (and you don't really leave the solar system until you have extremely mature space colonization, since you'll get very easily overtaken later). That could involve people having influence over each of the colonization projects, or could involve delegates whose only real is to help inform someone who actually has power in the project / to participate in acausal trade.
  • I think it's fairly likely that space ships will travel slowly enough that you can beam information to them and do the kind of scheme you outline where you deliberate at home and then beam instructions out. I think this is pretty unlikely, but if everything else fails it would probably be reasonably painless. I think the main obstruction would be leaving your descendants abroad vulnerable if your descendants at home get compromised. (It's also a problem if descendants at home go off the rails, but getting compromised is more concerning because it can happen to either descendants abroad or at home).

(Also, all of this assumes that defensive capabilities are a lot stronger than offensive capabilities in space. If offense is comparably strong, than we also have the problem that the cosmic commons might be burned in wars if we don't pause or reach some other agreement before space colonisation.)

This seems like maybe the most likely single reason you need to sort everything out in advance, though the general consideration in favor of option value (and waiting a year or two being no big deal) seems even more important. I do expect to have plenty of time to do that.

I haven't thought about any of these details much because it seems like such an absurdly long subjective time before we leave the solar system, and so there will be huge amounts of time for our descendants to make bargains before them. I am much more concerned about destructive technologies that require strong coordination long before we leave. (Or about option value lost by increasing the computational complexity of your simulation and so becoming increasingly uncorrelated with some simulators.)

One reason you might have to figure these things out in advance is if you try to decouple competition from deliberation by doing something like secure space rights (i.e. binding commitments to respect property rights, have no wars ever, and divide up the cosmos in an agreeable way). It's a bit hard to see how we could understand the situation well enough to reach an agreeable compromise directly (rather than defining a mutually-agreeable deliberative process to which we will defer and which has enough flexibility to respond to unknown unknowns about colonization dynamics) but if it was a realistic possibility then it might require figuring a lot of stuff out sooner rather than later.

Load More