As various people have written before (most notably Seth Herd, in "LLM AGI will have memory, and memory changes alignment"), AIs with long-term memory might pose additional risks. Even if an AI is aligned or only occasionally scheming at the start of a deployment, the AI might become a consistent and coherent behavioral schemer via updates to its long-term memories.
In this post, I’ll spell out the version of the threat model that I’m most concerned about, including some novel arguments for its plausibility, and describe some promising strategies for mitigating this risk. While I think some plausible mitigations are reasonably cheap and could be effective at reducing the risk from coherent scheming arising via this mechanism, research here will likely be substantially more productive in the future...
This is a somewhat technical note.
By "software-only singularity", I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing returns), and that this lasts long enough to yield a large amount of progress (e.g. at least 4 years of progress in 1 year). The equivalent statement in jargon is: r is significantly greater than 1 (implying progress is getting faster and faster) and this remains the case for long enough to get large amounts of progress. For context, see "How quick and big would a software intelligence explosion be?"
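To make the role of r concrete, here is a minimal toy sketch, not the model from the linked post. It rests on assumptions that the note above does not state: that r means "doublings of software per doubling of cumulative research input", that research input per unit time is proportional to the current software level once AI R&D is fully automated, and that r stays constant. Under those assumptions each software doubling takes 2^(1/r - 1) times as long as the previous one. The function names and parameters are hypothetical.

```python
def doubling_times(r, first_doubling_years, n):
    """Durations (in years) of the first n software doublings.

    Toy assumption: constant r, and each doubling takes
    2 ** (1 / r - 1) times as long as the one before it.
    """
    shrink = 2 ** (1 / r - 1)  # < 1 if r > 1, == 1 if r == 1, > 1 if r < 1
    return [first_doubling_years * shrink ** k for k in range(n)]


def doublings_within(r, first_doubling_years, horizon_years, cap=1000):
    """Count software doublings completed within horizon_years.

    With r > 1 the doubling times form a convergent geometric series,
    so this toy model can reach unboundedly many doublings in finite
    time. The cap (a hypothetical safeguard) stops the loop in that case.
    """
    shrink = 2 ** (1 / r - 1)
    elapsed = 0.0
    duration = first_doubling_years
    count = 0
    while count < cap and elapsed + duration <= horizon_years:
        elapsed += duration
        count += 1
        duration *= shrink
    return count


if __name__ == "__main__":
    # Compare a decelerating, a steady, and an accelerating regime.
    for r in (0.5, 1, 2):
        print(r, doublings_within(r, first_doubling_years=0.25,
                                  horizon_years=1.0, cap=50))
```

In this sketch r = 1 gives a constant doubling time, r < 1 gives slowing progress, and r > 1 gives ever-shorter doublings. Hitting the cap is an artifact of holding r constant forever. The claim above only needs r to stay above 1 for long enough.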
Even without a "software-only singularity", I think full automation of AI R&D probably greatly speeds up progress for two main reasons:
No, the AIs do fully automate R&D, AI and otherwise. But the speed with which they do R&D depends not just on the speed of token generation, but also on the speed at which they learn deep skills, and the latter is much lower for LLMs built with the current methods (they only learn deep skills in new model releases).
Token generation speed gives an anchor of maybe 200x serial speedup compared to humans, plus very scalable parallel labor, minus real world constraints from needing experimental feedback (which don't even apply to some forms of theory). ...
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution to eval gaming than reducing eval awareness.
Eval cooperativeness: A situational desire to help the developers acquire whatever information they are trying to acquire through their evaluations.
"I cannot tell a lie... I would sabotage with my own command line."[1]
What's the actual problem with eval gaming?
The point of an evaluation is to let us draw inferences about the model's behavior in a different set of circumstances ("in deployment"). For example,...
I'm curious how far SDFT generalizes, versus how far RL generalizes.
SDFT seems to rely on the model having beliefs about the behavior of the assistant character. You train it on new evidence, and primarily this updates its beliefs about the character. Secondarily, it updates the mechanisms shared across all characters.
Eval gaming due to task-directed RL, on the other hand, potentially gets encoded in new skills like "how to follow a plan I wrote" (or the rich semantics that make those metacognitive skills possible), which, to the extent they're new machine...
See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution.
Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
My (low confidence) understanding of the proposal is something like:
"The AI takes an action A if and only if {long-term future-self if the AI takes action a | a in the action space} on aggregate like A"
where "long-term future self" is defined by some recursive process where you locally choose what entity counts as your near-term future self (it can be some other entity that you trust more - e.g. a future aligned AI), where these future selves all have access to an AI that honestly answers questions that are already meaningful to the human when the right an...
Nate Soares reviews a dozen plans and proposals for making AI go well. He finds that almost none of them grapple with what he considers the core problem - capabilities will suddenly generalize way past training, but alignment won't.
The mutation and selection mechanisms at play in training and in deployment-time selection are different but correlated (e.g. if the reason why you get long-term misaligned memes is because they are simpler/easier to find than a more local fitness seekin...