the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns
Could you please elaborate on what you mean by "alignment story for LLMs" and "shoggoth concerns" here? Do you mean the "we can use nearly value-neutral simulators as we please" story, or the fact that, in a way, LLMs are far more understandable to humans than more general RL agents because they use human language, or something else entirely?
Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.
OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that the systems that evolve on Earth generally climb in complexity. Whenever there is a new invention, such as money, interne...
My current favourite notion of agency, primarily based on Active Inference, which I refined upon reading "Discovering Agents", is the following:
Agency is a property of a physical system from some observer’s subjective perspective. It stems from the observer’s generative model of the world (including the object in question), specifically whether the observer predicts the agent's future trajectory in the state space by assuming that the agent has its own generative model which the agent uses to act. The agent's own generative model also depends on (adapts to...
I think it would be more accurate to say that the dynamics of the internal states of LLMs parameterise a model not just of token sequences but of the world, with token sequences as its sensory manifestation.
I'm sure that LLMs already possess some world models (Actually, Othello-GPT Has A Linear Emergent World Representation), the q...