Thanks! Lots of useful insights in there.
So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.
Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.
Cheers! My opinion of category theory has changed a bit because of this post; by making things fit into the category formulation, I developed insights into how general relations could be used to connect different generalised models.
I did posts on generalised models as a category and on how one can see Cartesian frames as generalised models.
I like it. I'll think about how it fits with my ways of thinking (eg model splintering).
Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.
We agree that a lack of control incentive on X does not mean that X is safe from the agent's influence, as the agent may influence X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.
I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.
(I don't think I can explain why here, though I am working on a longer explanation of what framings I like and why.)
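To gesture at the concern with a toy example (my own, purely illustrative, not something from the incentives paper): suppose the agent's action shifts a latent variable Z, which drives both the utility node U and the variable X. The agent's objective never mentions X, and fixing X counterfactually would not change its choice, yet the utility-maximising policy moves X just as reliably as if it were manipulating it directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(action, n=100_000):
    """One-step toy model: the action shifts a latent Z, which drives
    both the utility node U and the 'protected' variable X.
    The agent's objective never mentions X."""
    z = action + rng.normal(0, 1, n)   # common cause, pushed by the action
    u = z + rng.normal(0, 1, n)        # utility node
    x = z + rng.normal(0, 1, n)        # variable we hoped had no control incentive
    return u.mean(), x.mean()

# Greedy agent: pick the action with the highest estimated utility.
actions = np.linspace(-2, 2, 9)
best = max(actions, key=lambda a: rollout(a)[0])

u_mean, x_mean = rollout(best)
print(f"chosen action={best:+.1f}, E[U]={u_mean:.2f}, E[X]={x_mean:.2f}")
# The utility-maximising action also moves X, even though X appears
# nowhere in the objective: the X-U correlation does the work.
```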
Cheers, that would be very useful.
(I do think ontological shifts continue to be relevant to my description of the problem, but I've never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)
I feel that the whole AI alignment problem can be seen as a problem of ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1
Express, express away.
I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and ideally without being turned off).
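As a purely illustrative sketch of what I mean (my own toy construction; the Gaussian density score, the threshold, and the "defer" fallback are all arbitrary choices): the agent scores how typical the current input looks relative to its training data and, when the score drops below a calibrated threshold, switches to a conservative fallback rather than acting on a model it no longer trusts.

```python
import numpy as np

class CautiousAgent:
    """Toy agent that falls back to a conservative action when its
    inputs look unlike the training distribution.  The density score
    (a Gaussian fit) and the threshold are illustrative choices only."""

    def __init__(self, train_inputs, quantile=0.01):
        self.mean = train_inputs.mean(axis=0)
        cov = np.cov(train_inputs, rowvar=False) + 1e-6 * np.eye(train_inputs.shape[1])
        self.inv_cov = np.linalg.inv(cov)
        # Calibrate the OOD threshold on the training data itself.
        train_scores = np.array([self._score(x) for x in train_inputs])
        self.threshold = np.quantile(train_scores, quantile)

    def _score(self, x):
        # Negative squared Mahalanobis distance: higher = more "in-distribution".
        d = x - self.mean
        return -float(d @ self.inv_cov @ d)

    def act(self, x, policy):
        if self._score(x) < self.threshold:
            return "defer"       # conservative fallback: pause / ask for guidance
        return policy(x)         # otherwise act normally

# Usage: train on 2-D inputs near the origin, then probe a far-away input.
rng = np.random.default_rng(0)
agent = CautiousAgent(rng.normal(0, 1, size=(1000, 2)))
print(agent.act(np.array([0.1, -0.3]), policy=lambda x: "normal action"))  # normal action
print(agent.act(np.array([8.0, 8.0]), policy=lambda x: "defer?"))          # defer
```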