Introduction

Many AI risk failure modes imagine strong coherence/goal directedness^[1] (e.g. [expected] utility maximisers).

Such strong coherence is not exhibited by humans (or any other animal), seems unlikely to emerge from deep learning, and may be "anti-natural" to general intelligence in our universe^[2]^[3].

I suspect the focus on strongly coherent systems was a mistake that set the field back a bit, and it's not yet fully recovered from that error^[4].

I think most of the AI safety work for strongly coherent agents (e.g. decision theory) will end up inapplicable/useless for aligning powerful systems, because powerful systems in the real world are "of an importantly different type".

Ontological Error?

I don't think it nails everything, but on a purely ontological level, @Quintin Pope and @TurnTrout's shard theory feels a lot more right to me than e.g. HRAD (highly reliable agent designs). HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects.

The shard theory account of value formation (while lacking) seems much more plausible as an account of how intelligent systems develop values (where values are "contextual influences on decision making") than the immutable terminal goals in strong coherence ontologies. I currently believe that (immutable) terminal goals are just the wrong frame for reasoning about generally intelligent systems in our world (e.g. humans, animals and future powerful AI systems)^[2].

Theoretical Justification and Empirical Investigation Needed

I'd be interested in more investigation into what environments/objective functions select for coherence and to what degree said selection occurs.

I'd also like to see empirical demonstrations of systems that actually become more coherent as they are trained for longer, "scaled up" or otherwise amplified.

I want advocates of strong coherence to explain why agents operating in rich environments (e.g. animals, humans) or sophisticated ML systems (e.g. foundation models^[5]) aren't strongly coherent.

Mechanistic interpretability analysis of sophisticated RL agents (e.g. AlphaStar, OpenAI Five [or replications thereof]) would also help to investigate their degree of coherence.

Conclusions

Currently, I think strong coherence is unlikely (plausibly "anti-natural"^[3]^[2]) and am unenthusiastic about research agendas and threat models predicated on strong coherence.

Disclaimer

The above is all low confidence speculation, and I may well be speaking out of my ass^[6].

^[1]
By "strong coherence/goal directedness" I mean something like:
Informally: a system has immutable terminal goals.
Semi-formally: a system's decision making is well described as (an approximation of) an argmax over actions (or higher-level mappings thereof) that maximises the expected value of a single fixed utility function over states.
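The semi-formal definition above can be written out as a single expression. This is only a sketch: the symbols \pi (the system's policy), \mathcal{A} (its action set), s' (the state that results from taking action a in state s) and U (the single fixed utility function over states) are notation assumed here for illustration, not taken from any particular formalism.

```latex
\[
  \pi(s) \;\approx\; \operatorname*{arg\,max}_{a \in \mathcal{A}} \;
  \mathbb{E}\!\left[\, U(s') \mid s, a \,\right]
\]
```

The point of "strong" coherence is that U is the same function in every context and does not change over the system's lifetime.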
^[2]
You cannot predict the behaviour/revealed preferences of humans or other animals well by assuming that they have immutable terminal goals or are expected utility maximisers.
The ontology that intelligent systems in the real world instead have "values" (contextual influences on decision making) seems to explain their observed behaviour (and purported "incoherencies") better.
Many observed values in humans and other mammals^[7] (e.g. fear, play/boredom, friendship/altruism, love, etc.) seem to be values that were instrumental for increasing inclusive genetic fitness (promoting survival, exploration, cooperation and sexual reproduction/survival of progeny respectively). Yet humans and other mammals seem to value these terminally, and not because of their instrumental value for inclusive genetic fitness.
That the instrumentally convergent goals of evolution's fitness criterion manifested as "terminal" values in mammals is IMO strong empirical evidence against the goals ontology and significant evidence in support of shard theory's basic account of value formation in response to selection pressure.

This is not to say that I think all coherence arguments are necessarily dead on arrival, but rather that, in practice, coherent behaviour (not executing strictly dominated strategies) acts upon our malleable values to determine our decisions. We do not replace said values with argmax over a preference ordering.
As @TurnTrout says:
I have some take (which may or may not be related) like
utility is a measuring stick which is pragmatically useful in certain situations, because it helps corral your shards (e.g. dogs and diamonds) into executing macro-level sensible plans (where you aren't throwing away resources which could lead to more dogs and/or diamonds) and not just activating incoherently.
but this doesn't mean I instantly gain space-time-additive preferences about dogs and diamonds such that I use one utility function in all contexts, such that the utility function is furthermore over universe-histories (funny how I seem to care across Tegmark 4?).
^[3]
E.g. if the shard theory account of value formation is at all correct, particularly the following two claims:
* Values are inherently contextual influences on decision making
* Values (shards) are strengthened (or weakened) via reinforcement events
Then strong coherence in the vein of utility maximisation just seems like an anti-natural form. See also^[2]. I think evolutionarily convergent "terminal" values provide (strong) empirical evidence against the naturalness of strong coherence.

I could perhaps state my thesis that strong coherence is anti-natural more succinctly as:
Decision making in intelligent systems is best described as "executing computations/cognition that historically correlated with higher performance on the objective function the system was selected on".
[This generalises the shard theory account of value formation from reinforcement learning to arbitrary constructive optimisation processes.]
It is of an importantly different type from the "immutable terminal goals"/"expected utility maximisation" I earlier identified with strong coherence.
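To make the claimed difference in type concrete, here is a toy sketch contrasting the two pictures. Everything in it is a hypothetical illustration: the names `eu_maximiser` and `ContextualAgent`, the contexts, the bids and the update rule are assumptions made up for this example, and it is not shard theory's actual formalism (nor a model of any real system).

```python
from typing import Dict, List, Set, Tuple


def eu_maximiser(
    actions: List[str],
    outcomes: Dict[str, Dict[str, float]],
    utility: Dict[str, float],
) -> str:
    """Strong coherence (toy): argmax over actions of the expected value
    of one fixed utility function over states. Context plays no role."""

    def expected_utility(action: str) -> float:
        return sum(p * utility[state] for state, p in outcomes[action].items())

    return max(actions, key=expected_utility)


class ContextualAgent:
    """Toy 'values as contextual influences' (an assumed illustration).

    Each value only bids on actions in the contexts that activate it,
    and reinforcement events strengthen or weaken that value."""

    def __init__(self, values: Dict[str, Tuple[Set[str], Dict[str, float]]]):
        # values: name -> (activating contexts, per-action bids)
        self.values = values
        self.strength = {name: 1.0 for name in values}

    def act(self, context: str, actions: List[str]) -> str:
        scores = {a: 0.0 for a in actions}
        for name, (contexts, bids) in self.values.items():
            if context in contexts:  # inactive values exert no influence
                for a in actions:
                    scores[a] += self.strength[name] * bids.get(a, 0.0)
        return max(actions, key=lambda a: scores[a])

    def reinforce(self, name: str, reward: float, lr: float = 0.5) -> None:
        # Hypothetical update rule: strength moves with reward, floored at 0.
        self.strength[name] = max(0.0, self.strength[name] + lr * reward)


if __name__ == "__main__":
    actions = ["explore", "hide"]
    outcomes = {"explore": {"food": 0.5, "injured": 0.5}, "hide": {"safe": 1.0}}
    utility = {"food": 10.0, "injured": -20.0, "safe": 0.0}
    print(eu_maximiser(actions, outcomes, utility))  # hide

    agent = ContextualAgent({
        "fear": ({"predator_nearby"}, {"hide": 1.0}),
        "play": ({"predator_nearby", "safe_den"}, {"explore": 0.8}),
    })
    print(agent.act("safe_den", actions))         # explore
    print(agent.act("predator_nearby", actions))  # hide
    agent.reinforce("play", reward=1.0)
    print(agent.act("predator_nearby", actions))  # explore
```

In this toy, the first agent's choice is fixed by a single utility function, whereas the second agent's choice depends on which values the context activates and on its reinforcement history, so the same context can yield a different decision after a reinforcement event.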
^[4]
I'm under the impression that the assumption of strong coherence is still implicit in some current AI safety failure modes [e.g. it underpins deceptive alignment^[8]].
^[5]
Is "mode collapse" in RLHF'ed models an example of increased coherence?
^[6]
I do think that my disagreements with e.g. deceptive alignment/expected utility maximisation are not simply a failure of understanding, but I am very much an ML noob, so there may well be things I just don't know. My opinions re: coherence of intelligent systems will probably be different in a significant way by this time next year.
^[7]
I mention mammals because I'm more familiar with them, not necessarily because only mammals display these values.
^[8]
In addition to the other prerequisites listed in the "Deceptive Alignment" post, deceptive alignment also seems to require a mesa-optimiser so coherent that it would be especially resistant to modifications to its mesa-objective. That is, it requires very strong levels of goal-content integrity.
I think that updating against strong coherence would require rethinking the staples of (traditional) alignment orthodoxy:
* Orthogonality
* Basic AI drives
* Instrumental convergence (see^[2])
This is not to say that they are necessarily no longer relevant in systems that aren't strongly coherent, but that to the extent they manifest at all, they manifest in (potentially very) different ways than originally conceived when conditioned on systems with immutable terminal goals.