An independent researcher of ethics, AI safety, and AI impacts. Twitter: https://twitter.com/leventov. E-mail: firstname.lastname@example.org (the preferred mode of communication).
You can help to boost my sense of accountability and give me a feeling that my work is valued by becoming a paid subscriber of my Substack (though I don't post anything paywalled; in fact, on this blog, I just syndicate my LessWrong writing).
A Telegram group where we discuss AI x-risk/safety, theories of intelligence, agency, consciousness, and ethics, in Russian: https://t.me/agi_risk_and_ethics.
the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns
Could you please elaborate what do you mean by "alignment story for LLMs" and "shoggoth concerns" here? Do you mean the "we can use nearly value-neutral simulators as we please" story here, or refer to the fact that in a way LLMs are way more understandable to humans than more general RL agents because they use human language, or you refer to something yet different?
Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.
OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that all the systems that evolve on Earth generally climb up in complexity. Whenever there is a new invention, such as money, internet, (future) autonomous AI agents, the civilisation becomes more complex as a whole, and distribution of many variables change. ("Towards a Theory of Evolution as Multilevel Learning" is my primary source of intuition about this.) In the study of complex systems, there is a postulate that each component (subsystem) is ignorant of the behaviour of the system as a whole, and doesn't know the full effect of its actions. This applies to any components, no matter how intelligent. Humans misgeneralise all the time (examples: lead in petrol, creation of addictive apps such as Instagram, etc.) Superintelligence will misgeneralise, too, though perhaps in ways which are very subtle or even incomprehensible to humans.
Then, it's possible that superintelligence will misgeneralise due to casual confusion on some matter which is critical to humans' survival/flourishing, e. g. about something like qualia, human's consciousness and their moral value. And, although I don't feel this is a negligible risk, exactly because superintelligence probably won't have direct experience or access to human consciousness, I feel this exact failure mode is somewhat minor, compared to all other reasons for which superintelligence might kill us. Anyway, I don't see what can we do about this, if the problem is indeed that superintelligence will not have first-hand experience of human consciousness.
1) The model may not make competent predictions out-of-distribution (capabilities misgeneralisation). We discuss this further in ERM leads to causally confused models that are flawed OOD.
2) If the model is causally confused about objects related to its goals or incentives, then it might competently pursue changes in the environment that either don’t actually result in the reward function used for training being optimised (objective misgeneralisation).
Did you use the term "objective misgeneralisation" rather than "goal misgeneralisation" on purpose? "Objective" and "goal" are synonyms, but "objective misgeneralisation" is hardly used, "goal misgeneralisation" is the standard term.
Also, I think it's worth noting that this distinction between capabilities and goal misgeneralisation is defined in the RL framework. In other frameworks, such as Active Inference, this is the same thing, because there is no ontological distinction between reward and belief.
It might be suspected that OOD generalisation can be tackled in the scaling paradigm by using diverse enough training data, for example, including data sampled from every possible test environment. Here, we present a simple argument that this is not the case, loosely adapted from Remark 1 from Krueger et al. REx:
The reason data diversity isn’t enough comes down to concept shift (change in ). Such changes can be induced by changes in unobserved causal factors, Z. Returning to the ice cream () and shorts (), and sun () example, shorts are a very reliable predictor of ice cream when it is sunny, but not otherwise. Putting numbers on this, let’s say . Since the model doesn’t observe , there is not a single setting of that will work reliably across different environments with different climates (different ). Instead
depends on , which in turn depends on the climate in the locations where the data was collected. In this setting, to ensure a model trained with ERM can make good predictions in a new “target” location, you would have to ensure that that location is as sunny as the average training location so that is the same at training and test time. It is not enough to include data from the target location in the training set, even in the limit of infinite training data - including data from other locations changes the overall of the training distribution. This means that without domain/environment labels (which would allow you to have different for different environments, even if you can’t observe ), ERM can never learn a non-causally confused model.
Maybe I miss something obvious, but this argument looks wrong to me, or it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data, but this is false for deep neural networks (but true for random forests). A deep neural network can discover variables that are not present but are probable confounders of several other variables, such as "something that is a confounder of shorts, sunscreen, and ice-cream".
Discovering such hidden confounders doesn't give interventional capacity: Mendel discovered genetic inheritance factors, but without observing them, he couldn't intervene on them. Only the discovery of DNA and later the invention of gene editing technology allowed intervention on genetic factors.
One can say that discovering hidden confounders merely extends what should be considered in-distribution environment. But then, what is OOD generalisation, anyway? And can't we prove that ERM (or any other training method whatsoever) will create models which will fail sometimes simply because there is Gödel's incompleteness in the universe?
While this model might not make very good predictions, it will correctly predict that getting you to put on shorts is not an effective way of getting you to want ice cream, and thus will be a more reliable guide for decision-making (about whether to wear shorts).
I don't understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?
(a, ii =>)
What do these symbols in parens before the claims mean?
My current favourite notion of agency, primarily based on Active Inference, which I refined upon reading "Discovering Agents", is the following:
Agency is a property of a physical system from some observer’s subjective perspective. It stems from the observer’s generative model of the world (including the object in question), specifically whether the observer predicts the agent's future trajectory in the state space by assuming that the agent has its own generative model which the agent uses to act. The agent's own generative model also depends on (adapts to, is learned from, etc.) the agent's environment. This last bit comes from "Discovering Agents".
"Having own generative model" is the shakiest part. It probably means that storage, computation, and maintenance (updates, learning) of the model all happen within the agent's boundaries: if not, the agent's boundaries shall be widened, as in the example of "thermostat with its creation process" from "Discovering Agents". The storage and computational substrate of the agent's generative model is not important: it could be neuronal, digital, chemical, etc.
Now, the observer models the generative model inside the agent. Here's where this Vingean veil comes from: if the observer has perfect observability of the agent's internals, then it is possible to believe that your model of the agent exactly matches the agent's own generative model, but usually, it will be less than perfect, due to limited observability.
However, even perfect observability doesn't guarantee safety: the generative model might be large and effectively incompressible (the halting problem), so the only way to see what it will do may be to execute it.
The theory of mind is a closely related idea to all of the above, too.
I think it would be more accurate to say that the dynamics of internal states of LLMs parameterise not just the model of sequences but of the world, including token sequences as the sensory manifestation of it.
I'm sure that LLMs already possess some world models (Actually, Othello-GPT Has A Linear Emergent World Representation), the question is how only really how the structure and mechanics of LLMs' world models are different from the world models of humans.