There are a bunch of considerations and models mixed together in this post. Here's a way I'm factoring some of them, which other people may also find useful.
I'd consider counterfactuality the main top-level node; things which would have been done anyway have radically different considerations from things which wouldn't. E.g. doing an eval which (carefully, a little bit at a time) mimics what chaosGPT does, in a controlled environment prior to release, seems straightforwardly good so long as people were going to build chaosGPT soon anyway. It's a direct improvement over something which would have happened quickly anyway in the absence of the eval. That argument still holds even if a bunch of the other stuff in the post is totally wrong or totally the wrong way of thinking about things (e.g. I largely agree with habryka's comment about comprehensibility of future LM-based agents).
On the other hand, building a better version of chaosGPT which users would not otherwise have tried, or building it much sooner, is at least not obviously an improvement. I would say that's probably a bad idea, but that's where the rest of the models in the post start to be relevant to the discussion.
Alas, we don't actually know ahead of time which things people will/won't counterfactually try anyway, so there's some grey zone. But at least this frame makes it clear that "what would people counterfactually try anyway?" is a key subquestion.
(Side note: also remember that counterfactuality gets trickier in multiplayer scenarios where players are making decisions based on their expectations of other players. We don't want a situation where all the major labs build chaosGPT because they expect all the others to do so anyway. But in the case of chaosGPT, multiplayer considerations aren't really relevant, because somebody was going to build the thing regardless of whether they expected OpenAI/Deepmind/Anthropic to build the thing. And I expect that's the prototypical case; the major labs don't actually have enough of a moat for small-game multiplayer dynamics to be a very good model here.)
I'm not really sure what you mean by "oversight, but add an epicycle" or how to determine if this is a good summary.
Something like: the OP is proposing oversight of the overseer, and it seems like the obvious next move would be to add an overseer of the oversight-overseer. And then an overseer of the oversight-oversight-overseer. Etc.
And the implicit criticism is something like: sure, this would probably marginally improve oversight, but it's a kind of marginal improvement which does not really move us closer in idea-space to whatever the next better paradigm will be which replaces oversight (and is therefore not really marginal progress in the sense which matters more). In the same way that adding epicycles to a model of circular orbits does make the model match real orbits marginally better (for a little while, in a way which does not generalize to longer timespans), but doesn't really move closer in idea space to the better model which eventually replaces circular orbits (and the epicycles are therefore not really marginal progress in the sense which matters more).
We don't believe that all knowledge and computation in a trained neural network emerges in phase transitions, but our working hypothesis is that enough emerges this way to make phase transitions a valid organizing principle for interpretability.
I think this undersells the case for focusing on phase transitions.
Hand-wavy version of a stronger case: within a phase (i.e. when there's not a phase change), things change continuously/slowly. Anyone watching from outside can see what's going on, and have plenty of heads-up, plenty of opportunity to extrapolate where behavior is headed. That makes safety problems a lot easier. Phase transitions are exactly the points where that breaks down - changes are sudden, extrapolation fails rapidly. So, phase transitions are exactly the points which are strategically crucial to detect, for safety purposes.
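To gesture at what "detect the points where extrapolation breaks down" could look like mechanically, here's a minimal sketch (my own illustration, with placeholder window/threshold choices, not anything from the paper): track some scalar during training, fit a short local linear trend, and flag steps where the observed value jumps far outside what that trend predicts.

```python
import numpy as np

def flag_phase_transitions(metric_history, window=50, threshold=5.0):
    """Flag steps where a tracked training metric deviates sharply from a
    local linear extrapolation -- a crude proxy for "extrapolation breaks down here".

    metric_history: 1D array of a scalar logged during training (e.g. loss).
    window: number of recent steps used to fit the local trend.
    threshold: how many residual standard deviations count as "sudden".
    """
    metric_history = np.asarray(metric_history, dtype=float)
    flagged = []
    for t in range(window, len(metric_history)):
        xs = np.arange(t - window, t)
        ys = metric_history[t - window:t]
        slope, intercept = np.polyfit(xs, ys, 1)            # local linear trend
        predicted = slope * t + intercept                    # extrapolate one step ahead
        residual_std = np.std(ys - (slope * xs + intercept)) + 1e-12
        if abs(metric_history[t] - predicted) > threshold * residual_std:
            flagged.append(t)                                # candidate phase transition
    return flagged
```

Within a phase, the local trend predicts the next step well and nothing gets flagged; the flagged steps are exactly the ones where a watcher relying on extrapolation would have been surprised.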
This was an outstanding post! The concept of a "conflationary alliance" seems high-value and novel to me. The anthropological study mostly confirms what I already believed, but provides very legible evidence.
Wait... doesn't the caprice rule just directly modify its preferences toward completion over time? Like, every time a decision comes up where it lacks a preference, a new preference (and any preferences implied by it) gets added to its preferences.
Intuitively: of course the caprice rule would be indifferent to completing its preferences up-front via contract/commitment, because it expects to complete its preferences over time anyway; it's just lazy about the process (in the "lazy data structure" sense).
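For concreteness, here's a toy sketch of that lazy-completion intuition (my own illustration, not how the caprice rule is formally defined in the post): an agent with a partial strict-preference relation which, whenever it hits an incomparable pair, picks arbitrarily and then records the pick (plus its transitive consequences) as a new preference.

```python
import random

class LazilyCompletingAgent:
    """Toy model of the "fill in preferences as you go" intuition.

    (a, b) in self.prefs means the agent strictly prefers a to b.
    When asked to choose between incomparable options, it picks arbitrarily,
    then records that pick (and its transitive consequences) as a new preference.
    """

    def __init__(self, strict_prefs):
        self.prefs = set(strict_prefs)  # set of (better, worse) pairs

    def _prefers(self, a, b):
        return (a, b) in self.prefs

    def _add_pref(self, a, b):
        self.prefs.add((a, b))
        # Take the transitive closure so implied preferences are recorded too.
        changed = True
        while changed:
            changed = False
            for (x, y) in list(self.prefs):
                for (y2, z) in list(self.prefs):
                    if y == y2 and x != z and (x, z) not in self.prefs:
                        self.prefs.add((x, z))
                        changed = True

    def choose(self, a, b):
        if self._prefers(a, b):
            return a
        if self._prefers(b, a):
            return b
        # Incomparable: pick arbitrarily ("caprice"), then remember the pick.
        pick, other = random.choice([(a, b), (b, a)])
        self._add_pref(pick, other)
        return pick
```

Every capricious choice strictly shrinks the set of incomparable pairs, which is the sense in which the rule "expects to complete its preferences over time anyway"; this elides the details of sequential choice and money pumps, it's just meant to illustrate the intuition.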
I would order these differently.
Within the first section (prompting/RLHF/Constitutional):
The core reasoning here is that human feedback directly selects for deception. Furthermore, deception induced by human feedback does not require strategic awareness - e.g. the hand which looks like it's grabbing a ball but isn't. So human-feedback-induced deception is more likely to occur, and to occur earlier in development, than deception from strategic awareness. Among the three options, "Constitutional" AI applies the most optimization pressure toward deceiving humans (IIUC), RLHF the next most, whereas prompting alone provides zero direct selection pressure for deception; it is by far the safest option of the three. (Worlds Where Iterative Design Fails talks more broadly about the views behind this.)
Next up, I'd put "Experiments with Potentially Catastrophic Systems to Understand Misalignment" as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn't notice when it's being tested and deceptively behave itself, and doesn't otherwise generalize in ways the testing fails to reveal), then we don't really need oversight tools in the first place. Just test the thing and see if it misbehaves.
The oversight stuff would be the next three hardest worlds (5th-7th). As written I think they're correctly ordered, though I'd flag that "AI research assistance" as a standalone seems far safer than using AI for oversight. The last three seem correctly ordered to me.
I'd also add that all of these seem very laser-focused on intentional deception as the failure mode, which is a reasonable choice for limiting scope, but sure does leave out an awful lot.
I think this scenario is still strategically isomorphic to "advantages mainly come from overwhelmingly great intelligence". It's intelligence at the level of a collective, rather than the individual level, but the conclusion is the same. For instance, scalable oversight of a group of AIs which is collectively far smarter than any group of humans is hard in basically the same ways as oversight of one highly-intelligent AI. Boxing the group of AIs is hard for the same reasons as boxing one. Etc.
Here's a meme I've been paying attention to lately, which I think is both just-barely fit enough to spread right now and very high-value to spread.
Meme part 1: a major problem with RLHF is that it directly selects for failure modes which humans find difficult to recognize: hiding problems, deception, etc. This problem generalizes to any sort of direct optimization against human feedback (e.g. just fine-tuning on feedback), optimization against feedback from something emulating a human (a la Constitutional AI or RLAIF), etc.
Many people will then respond: "Ok, but how on earth is one supposed to get an AI to do what one wants without optimizing against human feedback? Seems like we just have to bite that bullet and figure out how to deal with it." ... which brings us to meme part 2.
Meme part 2: We already have multiple methods to get AI to do what we want without any direct optimization against human feedback. The first and simplest is to just prompt a generative model trained solely for predictive accuracy, but that has limited power in practice. More recently, we've seen a much more powerful method: activation steering. Figure out which internal activation-patterns encode for the thing we want (via some kind of interpretability method), then directly edit those patterns.
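For concreteness, here's a rough sketch of the kind of thing "directly edit those patterns" cashes out to in practice: add a precomputed steering vector to a transformer block's output at inference time via a forward hook. The layer path in the usage comment and the way the steering vector gets computed (e.g. averaged activation differences on contrasting prompts) are illustrative assumptions on my part, not any particular library's API.

```python
import torch

def add_steering_hook(block: torch.nn.Module,
                      steering_vector: torch.Tensor,
                      scale: float = 1.0):
    """Register a forward hook that adds a fixed steering vector to a
    transformer block's hidden-state output at every token position.

    Assumes the block's forward output is either a hidden-state tensor of
    shape (batch, seq, d_model) or a tuple whose first element is one,
    and that steering_vector has shape (d_model,).
    """
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            hidden = output[0]
            steered = hidden + scale * steering_vector.to(device=hidden.device, dtype=hidden.dtype)
            return (steered,) + output[1:]
        return output + scale * steering_vector.to(device=output.device, dtype=output.dtype)

    return block.register_forward_hook(hook)

# Hypothetical usage with a GPT-2-style model whose blocks live at
# model.transformer.h[i]; steering_vector might be the averaged difference
# between residual-stream activations on contrasting prompts.
# handle = add_steering_hook(model.transformer.h[8], steering_vector, scale=4.0)
# ...generate as usual...
# handle.remove()
```

The key property for meme purposes: nothing in this loop optimizes against a human's judgment of the output; the edit is specified directly in terms of the model's internal activations.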