An independent researcher of ethics, AI safety, and AI impacts. Twitter: https://twitter.com/leventov. E-mail: leventov.ru@gmail.com (the preferred mode of communication).
You can help boost my sense of accountability, and give me a feeling that my work is valued, by becoming a paid subscriber to my Substack (though I don't post anything exclusively for paid subscribers).
My current favourite notion of agency, primarily based on Active Inference and refined after reading "Discovering Agents", is the following:
Agency is a property of a physical system from some observer’s subjective perspective. It stems from the observer’s generative model of the world (including the object in question), specifically whether the observer predicts the agent's future trajectory in the state space by assuming that the agent has its own generative model which the agent uses to act. The agent's own generative model also depends on (adapts to, is learned from, etc.) the agent's environment. This last bit comes from "Discovering Agents".
"Having own generative model" is the shakiest part. It probably means that storage, computation, and maintenance (updates, learning) of the model all happen within the agent's boundaries: if not, the agent's boundaries shall be widened, as in the example of "thermostat with its creation process" from "Discovering Agents". The storage and computational substrate of the agent's generative model is not important: it could be neuronal, digital, chemical, etc.
Now, the observer models the generative model inside the agent. This is where the Vingean veil comes from: with perfect observability of the agent's internals, the observer could believe that its model of the agent exactly matches the agent's own generative model; but usually observability is limited, so the match will be less than perfect.
However, even perfect observability doesn't guarantee safety: the generative model might be large and effectively incompressible, so (as with the halting problem) the only way to find out what it will do may be to execute it.
Theory of mind is closely related to all of the above, too.
OOD misgeneralisation is unlikely to be a direct x-risk from superintelligence
Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.
OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that systems that evolve on Earth generally climb up in complexity. Whenever there is a new invention, such as money, the internet, or (future) autonomous AI agents, civilisation as a whole becomes more complex, and the distributions of many variables change. ("Towards a Theory of Evolution as Multilevel Learning" is my primary source of intuition about this.) In the study of complex systems, there is a postulate that each component (subsystem) is ignorant of the behaviour of the system as a whole and doesn't know the full effect of its actions. This applies to any component, no matter how intelligent. Humans misgeneralise all the time (examples: lead in petrol, the creation of addictive apps such as Instagram, etc.). Superintelligence will misgeneralise, too, though perhaps in ways which are very subtle or even incomprehensible to humans.
Then, it's possible that superintelligence will misgeneralise due to causal confusion on some matter which is critical to humans' survival or flourishing, e.g. something like qualia, human consciousness, and its moral value. And although I don't think this risk is negligible (precisely because superintelligence probably won't have direct experience of, or access to, human consciousness), I feel this exact failure mode is somewhat minor compared to all the other reasons for which superintelligence might kill us. Anyway, I don't see what we can do about this, if the problem is indeed that superintelligence will not have first-hand experience of human consciousness.
Did you use the term "objective misgeneralisation" rather than "goal misgeneralisation" on purpose? "Objective" and "goal" are synonyms, but "objective misgeneralisation" is hardly ever used, whereas "goal misgeneralisation" is the standard term.
Also, I think it's worth noting that this distinction between capability and goal misgeneralisation is defined within the RL framework. In other frameworks, such as Active Inference, the two are the same thing, because there is no ontological distinction between reward and belief.
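To spell out what I mean (this is one standard decomposition of expected free energy from the Active Inference literature, added here for concreteness rather than quoted from the post): for a policy $\pi$,

$$
G(\pi) \;=\; \underbrace{-\,\mathbb{E}_{q(o,s \mid \pi)}\big[\ln q(s \mid o, \pi) - \ln q(s \mid \pi)\big]}_{\text{negative epistemic value}} \;\; \underbrace{-\,\mathbb{E}_{q(o \mid \pi)}\big[\ln p(o \mid C)\big]}_{\text{negative pragmatic value}},
$$

where everything that plays the role of "reward" enters only through the prior preferences $p(o \mid C)$ over observations, i.e. as a belief of the same kind as any other part of the generative model.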
Maybe I'm missing something obvious, but this argument looks wrong to me; or else it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, which is false for deep neural networks (though true for, say, random forests). A deep neural network can discover variables that are not present in the data but are probable confounders of several other variables, such as "something that is a confounder of shorts, sunscreen, and ice-cream".
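As a toy illustration of that point (my own sketch, not from the post; the variable names and the use of one-component PCA as a stand-in for a network's hidden units are my assumptions):

```python
# Toy illustration (mine, not from the post): a hidden "hot day" variable
# drives shorts, sunscreen, and ice-cream purchases. A learner allowed to
# posit latent factors (here: one-component PCA standing in for what a deep
# net's hidden units can do) recovers a variable that tracks the unobserved
# confounder, even though "hot day" never appears as a column in the data.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

hot_day = rng.random(n) < 0.5                          # unobserved confounder
probs = np.where(hot_day[:, None], [0.8, 0.7, 0.9], [0.2, 0.1, 0.3])
observed = (rng.random((n, 3)) < probs).astype(float)  # shorts, sunscreen, ice cream

# One-component PCA on the observed variables only.
centered = observed - observed.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
latent_estimate = centered @ vt[0]

# The recovered factor correlates strongly with the hidden confounder.
corr = np.corrcoef(latent_estimate, hot_day.astype(float))[0, 1]
print(f"|corr(recovered latent, hot_day)| = {abs(corr):.2f}")  # roughly 0.8
```

The point is only that a latent-variable learner can recover a good proxy for a confounder that never appears as a column in the data; a deep network's hidden units can do the analogous thing at scale.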
Discovering such hidden confounders doesn't give interventional capacity: Mendel discovered genetic inheritance factors, but without observing them, he couldn't intervene on them. Only the discovery of DNA and later the invention of gene editing technology allowed intervention on genetic factors.
One can say that discovering hidden confounders merely extends what should be considered the in-distribution environment. But then, what is OOD generalisation, anyway? And can't we prove that ERM (or any other training method whatsoever) will produce models that sometimes fail, simply because there is Gödel's incompleteness in the universe?
I don't understand the italicised part of this sentence. Why would P(shorts, ice cream) be a reliable guide to decision-making?
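For what it's worth, here is how I read the distinction the question hinges on (my own toy sketch, not the post's example; the variable names and numbers are made up): with a hidden confounder, the conditional read off the joint P(shorts, ice cream) differs from the interventional quantity, so the joint alone doesn't tell you what acting on "shorts" would do.

```python
# Toy sketch (mine, not the post's): with a hidden confounder (hot_day), the
# conditional P(ice_cream | shorts) read off the observational joint differs
# from the interventional P(ice_cream | do(shorts)), because in this toy world
# shorts have no causal effect on ice-cream consumption at all.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

hot_day = rng.random(n) < 0.5
shorts = rng.random(n) < np.where(hot_day, 0.8, 0.2)
ice_cream = rng.random(n) < np.where(hot_day, 0.9, 0.3)  # driven by weather only

# Observational: estimated from the joint distribution of (shorts, ice_cream).
p_obs = ice_cream[shorts].mean()

# Interventional: force shorts on everyone; ice-cream behaviour is unchanged.
p_do = ice_cream.mean()

print(f"P(ice_cream=1 | shorts=1)     ~ {p_obs:.2f}")  # about 0.78
print(f"P(ice_cream=1 | do(shorts=1)) ~ {p_do:.2f}")   # about 0.60
```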
What do these symbols in parens before the claims mean?