Instances in a bureaucracy can be very different and play different roles or pursue different purposes. They might be defined by different prompts and behave as differently as text continuations of different prompts in GPT-3 (the prompt is "identity" of the agent instance, distinguishes the model as a simulator from agent instances as simulacra). Decision transformers with more free-form task/identity prompts illustrate this point, except a bureaucracy should have multiple agents with different task prompts in a single episode. MCTS and GAN are adversarial and could be reframed like this. One of the tentative premises of ELK is that a model trained for some purpose might additionally allow instantiating an agent that reports what the model knows, even if that activity is not clearly related to the model's main purpose. Colluding instances inside a bureaucracy make it less effective in achieving its goals of producing a better dataset (accurately evaluating outcomes of episodes).
So I think useful arrangements of diverse/competing/flawed systems is a hope in many contexts. It often doesn't work, so looks neglected, but not for want of trying. The concern with collusion in AI risk seems to be more about deceptive alignment, observed behavior of a system becoming so well-optimized that it ceases to be informative about how it would behave in different situations. Very capable AIs can lack any tells even from the perspective of extremely capable observers, their behavior can change in a completely unexpected way with changing circumstances. Hence the focus on interpretability, it's not just useful for human operators, a diverse system of many AIs also needs it to notice misalignment or collusion in its parts. It might even be enough for corrigibility.
HCH-like amplification seems related, multiple unreliable agent instances assembled into a bureaucracy that as a whole improves on some of their qualities (perhaps trustworthiness) and allows generating a better dataset for retraining the unreliable agents. So this problem is not specific to interaction of somewhat independently developed and instantiated AIs acting in the world, as it appears in a popular training story of a single AI. There is also the ELK problem that might admit solutions along these lines. This makes it less obviously neglected, even with little meaningful progress.
The usual story of acausal coordination involves agent P modeling agent Q, and Q modeling P. Put differently, both P and Q model the joint system (P+Q) that has both P and Q in it. But it doesn't have to be (P+Q), it could be some much simpler R instead. I think a more central example of acausal coordination is to simply follow shared ideas.
The unusual character of acausal coordination is caused by cases where R is itself an agent, an adjudicator between P and Q. As a shared idea, it would have instances in minds of both P and Q, and command some influence allowed by P and Q over their actions. It would need to use some sort of functional decision theory to make sense of its situation where it controls the world through causally unrelated actions of P and Q that only have R's algorithm in common behind them.
The adjudicator R doesn't need to be anywhere as complicated as P or Q, in particular it doesn't need to know P or Q in detail. Which makes it much easier for P and Q to know R than to know each other. It's just neither P nor Q that's doing the acausal coordination, it's R instead.
The issue with you-in-all-detail vs. your-decision-algorithm is that a decision algorithm can have different levels of updatelessness, it's unclear what the decision algorithm already knows vs. what a policy it chooses takes as input. So we pick some intermediate level that is updateless enough to allow acausal coordination among relevant entities (agents/predictors), and updateful enough to make a decision without running out of time/memory while being implemented in its instances. But that level/scope is different for different collections of entities being coordinated.
So I think a boundary shouldn't be drawn around "a decision algorithm", but around whatever common knowledge of each other the entities being acausally coordinated happen to have (where they don't need to have common knowledge of everything). When packaged as a decision algorithm, the common knowledge becomes an adjudicator, which these entities can allow influence over their actions. To the extent the influence they allow an adjudicator is common knowledge among them, it also becomes knowledge of the adjudicator, available for its decision making reasoning.
Importantly for the reframing, an adjudicator is not a decision algorithm belonging to either agent individually, it's instead a shared decision algorithm. It's a single decision algorithm purposefully built out of the agents' common knowledge of each other, rather than a collection of their decision algorithms that luckily happen to have common knowledge of each other. It's much easier for there to be some common knowledge than for there to be common knowledge of individually predefined decision algorithms that each agent follows.
The usual framing is to align agent policies in a way that also ensures that they don't have an out-of-distribution phase change. With deceptive alignment, sharp left turn, and goodharting, capability to consider complicated plans or mere optimization-in-actuality prompts situations that are too out-of-distribution. The behavior on-distribution stops being a good indication of what's going to happen there, regardless of difficulties of modern ML with robustness. In this framing, the training distribution also needs to confer alignment, so it anchors to more usual situations that can currently be captured in practice, with more capable agents or over-optimizing agents discovering more unusual situations than that. This anchoring of the character of the training distribution might be a problem.
It seems useful to deconfuse agents that don't have that sort of phase change in their behavior, without being distracted by the separate problem of their alignment. When not focusing on aligned agents, it becomes more natural to consider arbitrary agents in arbitrary situations, including situations that can't be obtained from the world as training data, so won't be useful for ensuring alignment. And to treat such situations as training data for ML purposes.
The simulator framing is an example of this point, except it still pays too much attention to currently observable reality (even as simulators themselves are bad at that task). A simulator is a map of counterfactuals and not an agent, and it's not about describing particular agents, instead it describes all sorts of agents and their interaction in all sorts of situations simultaneously. Can a simulator be trained to be particularly good at describing agents that make coherent decisions across a wide variety of situations, including situations that can't be collected as training data from the world, or won't occur in actuality, and need to instead be generated, with little hope or intent of remaining predictive of the real world? Or maybe such coherence is more naturally found in something smaller than agents, a collective of intents/purposes/concepts/norms that each shapes a scope of situations where it's relevant, competing/bargaining with others of its kind to find an equilibrium that doesn't have the more jarring sorts of behavioral phase changes, especially capability-related ones.
The simulator frame clashes with a lot of this. A simulator is a map that depicts possible agents in possible situations, enacting possible decisions according to possible intents, leading to possible outcomes. One of these agents seen on the map might be the one wielding the simulator as its map, in possible situations that happen to be actual. But the model determines the whole map, not just the controlling agent in actual situations, and some alignment issues (such as robustness) concern the map, not specifically the agent.
These dreams may be of sufficiently high fidelity
These dreams may be of sufficiently high fidelity
One thing conspicuously missing in the post is a way of improving fidelity of simulation without changing external training data, or relationship between the model and the external training data, which I think follows from self-supervised learning on summaries of dreams. There are many concepts of evaluation/summarization of text, so given a text it's possible to formulate tuples (text, summary1, summary2, ...) and do self-supervised learning on that, not just on text (evaluations/summaries are also texts, not just one-dimensional metrics). For proofs, summaries could judge their validity and relevance to some question or method, for games the fact of winning and of following certain rules (which is essentially enough to win games, but also play at a given level of skill, if that is in the summary). More generally, for informal text we could try to evaluate clarity of argument, correctness, honesty, being fictional, identities/descriptions of simulacra/objects in the dream, etc. Which GPT-3 has enough structure to ask for informally.
Learning on such evaluated/summarized dreams should improve ability to dream in a way that admits a given asked-for summary, ideally without changing the relationship between the model and the external training data. The improvement is from gaining experience with dreams of certain kind, from the model more closely anticipating the summaries of dreams of that kind, not from changing the way a simulator dreams in a systematic direction. But if the summaries are about a level of optimality of a dream in some respect, then learning on augmentation of dreams with such summaries can be used for optimization, by conditioning on the summaries. (This post describes something along these lines.)
And a simulacrum of a human being with sufficient fidelity goes most of the way to AGI alignment.
I think talking of "loss minimizing" is conflating two different things here. Minimizing training loss is alignment of the model with the alignment target given by the training dataset. But the Alzheimer's example is not about that, it's about some sort of reflective equilibrium loss, harmony between the model and hypothetical queries it could in principle encounter but didn't on the trainings dataset. The latter is also a measure of robustness.
Prompt-conditioned behaviors of a model (in particular, behaviors conditioned by presence of a word, or name of a character) could themselves be thought of as models, represented in the outer unconditioned model. These specialized models (trying to channel particular concepts) are not necessarily adequately trained, especially if they specialize in phenomena that were not explored in the episodes of the training dataset. The implied loss for an individual concept (specialized prompt-conditioned model) compares the episodes generated in its scope by all the other concepts of the outer model, to the sensibilities of the concept. Reflection reduces this internal alignment loss by rectifying the episodes (bargaining with the other concepts), changing the concept to anticipate the episodes' persisting deformities, or by shifting the concept's scope to pay attention to different episodes. With enough reflection, a concept is only invoked in contexts to which it's robust, where its intuitive model-channeled guidance is coherent across the episodes of its reflectively settled scope, providing acausal coordination among these episodes in its role as an adjudicator, expressing its preferences.
So this makes a distinction between search and reflection in responding to a novel query, where reflection might involve some sort of search (as part of amplification), but its results won't be robustly aligned before reflective equilibrium for the relevant concepts is established.
Deceptive alignment feels like a failure of reflection, a model not being in equilibrium with its episodes. If a behavior is not expressed in episodes sampled in the model's scope, it's not there at all. If it is expressed and is contrary to the alignment target, then either the alignment target is not looking at the episodes correctly to adjust them (when they are within the scope of the target), or the episodes were constructed incorrectly to extend the scope of the model, beyond where it's aligned.
I think a model must never be knowingly exercised off-distribution (risking robustness), that should be a core alignment principle of a system that uses models. By itself, learning only interpolates, doesn't extrapolate. Language models are failing at this, missing this critical safety feature. They don't know what they don't know, and happily generate nonsense in response to off-distribution inputs (though empirically it seems easy to fix in some form, that's not the level of care this issue deserves). Extending the scope (beyond what's expected by the model of the scope) should be deliberate, with specifically crafted training data, whose alignment is conferred by alignment of the systems that generate it.
I worry that using OD as the space of behaviors misses something important about the intuitive idea of robustness, making any conclusions about F or OD or behavior manifolds harder to apply. A more natural space (to illustrate my point, not as something helpful for this post) would be ORn, with a metric that cares about how outputs differ on inputs that fall within a particular base distribution γ, something like
The issue with OD is that models in a behavior manifold only need to agree on the training inputs, and always include all models with arbitrarily crazy behaviors at all inputs outside the dataset, even if we are talking about inputs very close to those in the dataset (which is what γ above is supposed to prevent). So the behavior manifolds are more like cylinders than balls, ignoring crucial dimensions. Since generalization does work (so learning tends to find very unusual points of them), it's generally unclear how a behavior manifold as a whole is going to be relevant to what's actually going on.