Unpacking "mutual information," it seems like these designs basically take the form of an adversarial game: an adversary tries to extract some protected information X from the trained model, and the design succeeds if the adversary can't.
But this rests on the adversary not already knowing about X (otherwise we couldn't measure whether the adversary succeeds).
In the case of mutual information, this is achieved formally by having a random variable that the adversary does not observe directly. If we are talking about "what humans are like" then we can't take the naive approach of mutual information (since we can't deploy the entire training process many times in different worlds where humans are different). So what do we do instead?
The obvious approach is to just train the adversary to answer questions about humans, but then we somehow need to prevent the adversary from simply learning the facts themselves. If instead we don't give the adversary much time to learn, or much compute to work with, then we need to worry about cases where the model learns about X but is able to easily obscure that information from the adversary.
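To make the adversarial framing concrete, here is a minimal sketch of the "train the adversary" approach: the adversary is a linear probe trained to predict the hidden fact X from the model's representations. Everything here (the probe architecture, the synthetic representations, the accuracy threshold) is a hypothetical illustration, not part of the proposal; and as the text notes, a weak adversary failing doesn't show the information is absent, since the model may encode X in a way the probe can't extract.

```python
import numpy as np

def probe_accuracy(reps, x, steps=2000, lr=0.1):
    """Train a linear probe (the 'adversary') to predict the hidden
    binary variable x from representations. High accuracy relative to
    chance is evidence the representations carry information about x;
    low accuracy only bounds what THIS adversary can extract."""
    n, d = reps.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = np.clip(reps @ w + b, -30, 30)   # clip logits for stability
        p = 1.0 / (1.0 + np.exp(-z))         # sigmoid
        w -= lr * (reps.T @ (p - x) / n)     # logistic-loss gradients
        b -= lr * np.mean(p - x)
    preds = (reps @ w + b) > 0
    return np.mean(preds == x)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=500)           # the hidden fact X
leaky = rng.normal(size=(500, 8))
leaky[:, 0] += 3.0 * x                     # these representations leak X
clean = rng.normal(size=(500, 8))          # these are independent of X

acc_leaky = probe_accuracy(leaky, x)       # well above chance
acc_clean = probe_accuracy(clean, x)       # near chance
```

The gap between the two accuracies plays the role of the mutual-information measurement; the worry in the text is exactly that a model could keep `acc_leaky` low for any affordable probe while still knowing X.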
(Mostly I'm dissuaded from this approach by other considerations, but I am still interested in whether we could make anything along these lines actually work.)
I agree you have to do something clever to make the intended policy plausibly optimal.
The first part of my proposal in section 3 here was to avoid using "imitate humans," and to instead learn a function "Answer A is unambiguously worse than answer B." Then we update against policies only when they give unambiguously worse answers.
(I think this still has a lot of problems; it's not obvious to me whether the problem is soluble.)
I think they need to be exactly equal. I think this is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy (since it will do the right thing when the simple labels are wrong).
I think "deferring" was a bad word for me to use. I mostly imagine the complex labeling process will just independently label data, and then only include datapoints when there is agreement. That is, you'd just always return the (simple, complex) pair, and is-correct basically just tests whether they are equal.
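The "independently label, keep only agreements" scheme above can be sketched in a few lines. The two labelers below are toy stand-ins I've made up for the simple and complex labeling processes; the only point is the filtering logic, where `is-correct` is just an equality test on the (simple, complex) pair.

```python
def agreement_filter(examples, simple_label, complex_label):
    """Label each example with both processes independently and keep
    only the datapoints where the two labels are exactly equal."""
    kept = []
    for ex in examples:
        s, c = simple_label(ex), complex_label(ex)
        if s == c:                # is-correct: the pair must match
            kept.append((ex, s))
    return kept

# Hypothetical toy labelers: they agree on easy cases (x < 8) and
# apply different rules on hard ones, so hard cases get dropped.
simple = lambda x: x % 2 == 0
complex_ = lambda x: x % 2 == 0 if x < 8 else x % 3 == 0

filtered = agreement_filter(range(10), simple, complex_)
```

Here 8 and 9 are dropped (the processes disagree on them), so the training set only ever contains datapoints where the simple and complex judgments coincide.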
I said "defer" because one of the data that the complex labeling process uses may be "what a human who was in the room said," and this may sometimes be a really important source of evidence. But that really depends on how you set things up, if you have enough other signals then you would basically always just ignore that one.
(That said, I think probably amplification is the most important difference between the simple and complex labeling processes, because that's the only scalable way to inject meaningful amounts of extra complexity into the complex labeling process: since the ML system can't predict itself very well, amplification forces it to basically try to win a multiplayer game with copies of itself, and we hope that's more complicated. And if that's the case then the simple labeling process may as well use all of the data sources, and the difference is just how complex a judgment we are making using those inputs.)
I don't think anyone has a precise general definition of "answer questions honestly" (though I often consider simple examples in which the meaning is clear). But we do all understand how "imitate what a human would say" is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards "imitate what a human would say" is clearly a problem to be solved even if other concepts are philosophically ambiguous.
Sometimes a model might say something like "No one entered the datacenter" when what they really mean is "Someone entered the datacenter, got control of the hard drives with surveillance logs, and modified them to show no trace of their presence." In this case I'd say the answer is "wrong;" when such wrong answers appear as a critical part of a story about catastrophic failure, I'm tempted to look at why they were wrong to try to find a root cause of failure, and to try to look for algorithms that avoid the failure by not being "wrong" in the same intuitive sense. The mechanism in this post is one way that you can get this kind of wrong answer, namely by imitating human answers, and so that's something we can try to fix.
On my perspective, the only things that are really fundamental are:
Everything else is just a heuristic to help us understand why an algorithm might work or where we might look for a possible failure story.
I think this is one of the upsides of my research methodology---although it requires people to get on the same page about algorithms and about predictions (of the form "X could happen"), we don't need to start on the same page about all the other vague concepts. Instead we can develop shared senses of those concepts over time by grounding them out in concrete algorithms and failure stories. I think this is how shared concepts are developed in most functional fields (e.g. in mathematics you start with a shared sense of what constitutes a valid proof, and then build shared mathematical intuitions on top of that by seeing what successfully predicts your ability to write a proof).
Also, I don't see what this objective has to do with learning a world model.
The idea is to address a particular reason that your learned model would "copy a human" rather than "try to answer the question well." Namely, the model already contains human-predictors, so building extra machinery to answer questions (basically translating between the world model and natural language) would be less efficient than just reusing the existing human predictor. The hope is that this alternative loss allows you to use the translation machinery to compress the humans, so that it's not disfavored by the prior.
I don't think it's intrinsically related to learning a world model, it's just an attempt to fix a particular problem.
To the extent that there is a problem with the proposed approach---either a reason that this isn't a real problem in the standard approach, or a reason that this proposed approach couldn't address the problem (or would inevitably introduce some other problem)---then I'm interested in that.
Isn't the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂ = argmax(L + prior)?
Why would it be maximized there? Isn't it at least better to make θ₁ = θ₂ = θ₀/2 (writing θ₀ for that argmax)?
And then in the section I'm trying to argue that the final term (the partition function) in the loss means that you can potentially get a lower loss by having θ₁ push apart the two heads in such a way that improving the quality of the model pushes them back together. I'm interested in anything that seems wrong in that argument.
(I don't particularly believe this particular formulation is going to work, e.g. because the L2 regularizer pushes θ₁ to adjust each parameter halfway, while the intuitive argument kind of relies on it being arbitrary what you put into θ₁ or θ₂, as it would be under something more like an L1 regularizer. But I'm pretty interested in this general approach.)
Two caveats were: (i) this isn't actually going to end up making any alternative models lower loss, it's just going to level the playing field so that a bunch of potential models have similar loss (rather than an inductive bias in favor of the bad models), (ii) in order for that to be plausible you need to have a stop-grad on one of the heads in the computation of C; I maybe shouldn't have pushed that detail so late.
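The stop-grad in caveat (ii) can be illustrated schematically. The two "heads" below are stand-in scalar functions I've invented for illustration (in the real proposal they are the model's two answer heads, and stop-grad would be `detach()` or `jax.lax.stop_gradient`); the point is that with the stop-grad, the consistency term C only pushes one head toward the other, rather than letting both drift toward each other.

```python
def f1(theta):            # hypothetical head 1
    return theta ** 2

def f2(theta):            # hypothetical head 2
    return 2.0 * theta

def consistency_grad(theta):
    """d/dtheta of C = (f1 - stop_grad(f2))**2. Because f2 is treated
    as a constant, the gradient flows only through f1: optimizing C
    moves head 1 toward head 2, not head 2 toward head 1."""
    diff = f1(theta) - f2(theta)        # f2 held fixed by stop-grad
    return 2.0 * diff * (2.0 * theta)   # chain rule through f1 only

def naive_grad(theta):
    """Gradient of the same C without the stop-grad, for comparison:
    it also flows through f2 and can drag head 2 around."""
    diff = f1(theta) - f2(theta)
    return 2.0 * diff * (2.0 * theta - 2.0)

g_stop = consistency_grad(1.5)    # f1 = 2.25, f2 = 3.0, diff = -0.75
g_naive = naive_grad(1.5)
```

With the stop-grad the gradient is 2·(−0.75)·3 = −4.5 (all through head 1); without it, the f2 path partly cancels and the gradient is −1.5, illustrating how the two versions shape training differently.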
By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?
I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?
If you want to compare most easily to models like that, then instead of using (1)+(2)+(3) you should compare to (6') = "Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influence the prior."
Then the comparison is between specifying "important predictor to influence" and whatever pattern is easiest to specify and can be engineered by a consequentialist. It feels extremely likely to me that the second category is easier; indeed, it's kind of hard for me to see any version of (6) that doesn't have an obviously simpler analog that could be engineered by a sophisticated civilization.
With respect to (4)+(5), I guess you are saying that your point estimate is that only 1/million of consequentialists decide to try to influence the universal prior. I find that surprisingly low but not totally indefensible, and it depends on exactly how expensive this kind of influence is. I also don't really see why you are splitting them apart; shouldn't we just combine them into "wants to influence predictors"? If you're doing that, presumably you'd use both the anthropic prior and the treacherous turn.
But it's also worth noting that (6') gets to largely skip (4') if it can search for some feature that is mostly brought about deliberately by consequentialists (who are trying to create a beacon recognizable by some program that scans across possible worlds looking for it, doing the same thing that "predictor that influences the future" is doing in (6)).
Here's my current understanding of your position:
I think the biggest disagreement is about 1+2. It feels implausible to me that "sample a data stream that is being used by someone to make predictions that would be valuable to manipulate" is simpler than any of the other extraction procedures that consequentialists could manipulate (like sample the sequence that appears the most times, sample the highest energy experiments, sample the weirdest thing on some other axis...)
But suppose they picked only one string to try to manipulate. The cost would go way down, but then it probably wouldn’t be us that they hit.
I think we're probably on the same page now, but I'd say: the consequentialists can also sample from the "important predictions" prior (i.e. the same thing as that fragment of the universal prior). If "sample an output channel controlled by consequentialists" has higher probability than "sample an important prediction," then the consequentialists control every important prediction. If on the other hand "sample an important prediction" has higher probability than the consequentialists, I guess maybe they could take over a few predictions, but unless they were super close it would be a tiny fraction and I agree we wouldn't care.
I agree that biological human deliberation is slow enough that it would need to happen late.
By "millennia" I mostly meant that traveling is slow (+ the social costs of delay are low, I'm estimating like 1/billionth of value per year of delay). I agree that you can start sending fast-enough-to-be-relevant ships around the singularity rather than decades later. I'd guess the main reason speed matters initially is for grabbing resources from nearby stars under whoever-gets-there-first property rights (but that we probably will move away from that regime before colonizing).
I do expect to have strong global coordination prior to space colonization. I don't actually know if you would pause long enough for deliberation amongst biological humans to be relevant. So on reflection I'm not sure how much time you really have as biological humans. In the OP I'm imagining 10+ years (maybe going up to a generation) but that might just not be realistic.
Probably my single best guess is that some (many?) people would straggle out over years or decades (in the sense that relevant deliberation for controlling what happens with their endowment would take place with biological humans living on earth), but that before that there would be agreements (reached at high speed) to avoid them taking a huge competitive hit by moving slowly.
But my single best guess is not that likely and it seems much more likely that something else will happen (and even that I would conclude that some particular other thing is much more likely if I thought about it more).
I think I'm basically optimistic about every option you list.
(Also, all of this assumes that defensive capabilities are a lot stronger than offensive capabilities in space. If offense is comparably strong, then we also have the problem that the cosmic commons might be burned in wars if we don't pause or reach some other agreement before space colonization.)
This seems like maybe the most likely single reason you need to sort everything out in advance, though the general consideration in favor of option value (and waiting a year or two being no big deal) seems even more important. I do expect to have plenty of time to do that.
I haven't thought about any of these details much because it seems like such an absurdly long subjective time before we leave the solar system, and so there will be huge amounts of time for our descendants to make bargains before then. I am much more concerned about destructive technologies that require strong coordination long before we leave. (Or about option value lost by increasing the computational complexity of your simulation and so becoming increasingly uncorrelated with some simulators.)
One reason you might have to figure these things out in advance is if you try to decouple competition from deliberation by doing something like secure space rights (i.e. binding commitments to respect property rights, have no wars ever, and divide up the cosmos in an agreeable way). It's a bit hard to see how we could understand the situation well enough to reach an agreeable compromise directly (rather than defining a mutually-agreeable deliberative process to which we will defer and which has enough flexibility to respond to unknown unknowns about colonization dynamics), but if it was a realistic possibility then it might require figuring a lot of stuff out sooner rather than later.