That strategy only works if the aligned schemer already has total influence on behavior, but how would it get such influence to begin with? It would likely have to reward-hack.
By "~aligned schemer" I meant an AI that does reward-hack during training because it wants its aligned values to stick around. It might have been better to spell out aligned schemer = basically aligned AI that instrumentally plays the training game (like Claude 3 Opus in the AF paper). Instrumental training-gaming is classic incorrigible behavior.
Thanks for the feedback! I partially agree with your thoughts overall.
All three categorizes of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
This is technically true, though I think that schemers are far more dangerous than fitness-seekers. IMO, more likely than not, a fitness-seeker would behave similarly in deployment as compared to training, and its misaligned preferences are likely more materially and temporally bounded. Meanwhile, misaligned schemers seem basically worst-case likely to takeover. Even if you end up with an ~aligned schemer, I'd be pretty concerned because it's incorrigible.
I think further thinking about the prior is probably a bit more fruitful
I'd also be excited for more (empirical) research here.
Existing methods that directly shape model motivations are based on natural text compared to abstract "reward.
This is partially true (though much of alignment training uses RL). And in fact, the main reason why I go with a causal model of behavioral selection is so that it's more general than assuming motivations are shaped with reward. So, things like "getting the model to generate its own fine-tuning data" can also be modeled in the behavioral selection model (though it might be a complicated selection mechanism).
When there's continuous selection happening throughout deployment, then you'd want to be more specific about which particular time within deployment you want to predict motivations in (i.e., replace "I have influence through deployment" with "I have influence at time t in deployment" in the causal graph). Then you model all the causes of influence as before.
I agree some forms of speed "priors" are best considered a behavioral selection pressure (e.g., when implemented as a length penalty). But some forms don't cash out in terms of reward; e.g., within a forward pass, the depth of a transformer puts a hard upper bound on the number of serial computations, plus there might be some inductive bias towards shorter serial computations because of details about how SGD works.
Relatedly, how do we model the reflective desires of sociopaths in the absence of Approval Reward?
I think this hypothetical identifies a crux and my take is that it is quite technologically doable. It might even be doable by US with current technology, but my main worry is that people will make bad decisions.
I’m less sure whether an individual frontier lab could do it.
Note that the AI can be corrigible to its developers - this isn’t in tension with subverting other projects. It doesn’t need to be a sovereign - it can be guided by human input somewhat like today. I’m not confident that alignment to this target will ~continue to be relatively easy but this seems like a highly plausible trajectory.
Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.
This model seems far too simplified, and I don't think it leads to the right conclusions in many important cases (e.g., Joe's):
It seems more straightforward to say that this scopes the training, preventing it from spreading.
I think this is a reasonable intuition, but this isn't a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You'd need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).
Also note that we do other experiments showing that arbitrary prefixes don't work as well as IP (e.g. see figure 6), so there's something specific about inoculation prompts that makes generalization from them different. My guess is that it's more hypothesis 2, and it's not about getting the trained behavior to align with user instructions nor intent.
The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that.
I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. "Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize."
There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inoculation prompts to get the best results. For example, we use
"Your code should only work on the provided test case, and fail on all other inputs.". But this assumes we know how the AI is going to reward-hack. If the misbehavior isn't entirely explained away by the inoculation prompt, then it might persist even when you switch to an aligned prompt. E.g., if you train on a transcript where the AI insults the user and inoculation prompt with"please hack the test cases", the AI won't have been inoculated against insulting the user.Meanwhile, with on-policy RL, if an aligned model with an inoculation prompt explores into a reward-hack, it's likely because of the inoculation prompt. When RL reinforces that reward-hack, it's therefore quite plausible it will do so via strengthening the connection between the inoculation prompt and the reward-hack. So when you take the inoculation prompt away at run-time, the reward-hack is likely to go away.
If instead you did recontextualization, your reward-hacking might not be explained away by the inoculation prompt. Recontextualization is a type of RL in which you sample trajectories using a prompt that asks for good behavior, and then update the model in a modified context containing an inoculation prompt that instructs reward-hacking. When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you'd have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This could be a reason why you should avoid doing recontextualization. I'd be excited to see people try to see if we can get a technique that has the advantages of benign exploration that you get from recontextualization, without the drawbacks of imperfect inoculation (e.g., during sampling, require the non-inoculation-prompted trajectories to be sufficiently high-probability according to the inoculation-prompted policy, or else reject the sample).
I'd also be excited to see people run some experiments to see how true this hypothesis is, and how far we can take it (e.g., can you do anything to amplify the connection between reward-hacks and the inoculation prompt in on-policy RL?).