AI ALIGNMENT FORUM
AF

2224
Alex Mallen
Ω1937100
Message
Dialogue
Subscribe

Redwood Research

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
3Alex Mallen's Shortform
5mo
0
Legible vs. Illegible AI Safety Problems
Alex Mallen2d10

Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.

This model seems far too simplified, and I don't think it leads to the right conclusions in many important cases (e.g., Joe's):

  • Many important and legible safety problems don't slow development. I think it's extremely unlikely, for example, that Anthropic or others would slow development because of a subpar model spec. I think in the counterfactual where Joe doesn't work on the model spec (1) the model spec is worse and (2) dangerously capable AI happens just as fast. The spec would likely be worse in ways that both increase takeover risk and decrease the expected value of the future conditional on (no) takeover.
  • The best time to work on AI x-risk is probably when it's most legible. In my view, the most valuable time to be doing safety work is just before AIs become dangerously capable, because e.g., then we can better empirically iterate (of course, you can do this poorly as John Wentworth argues). At this point, the x-risk problems will likely be legible (e.g., because they're empirically demonstrable in model organisms). I think it would quite plausibly be a mistake not to work on x-risk problems at this time when they've just become more tractable because of their increased legibility! (You were making the claim about legibility holding tractability fixed, but in fact tractability is highly correlated with legibility. Though, admittedly, also lower neglectedness.)
Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Alex Mallen1mo10

It seems more straightforward to say that this scopes the training, preventing it from spreading. 

I think this is a reasonable intuition, but this isn't a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You'd need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).

Also note that we do other experiments showing that arbitrary prefixes don't work as well as IP (e.g. see figure 6), so there's something specific about inoculation prompts that makes generalization from them different. My guess is that it's more hypothesis 2, and it's not about getting the trained behavior to align with user instructions nor intent.

The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that.
 

I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. "Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize."
 

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Alex Mallen1mo72

I think other responses here are helpful, but I want to say that I don't think IP is working the way you (and I at the start of the project) may have expected. I think it's not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn't upweight the "reward hacking persona".

In other words, there are two kinds of reward hacking:

  1. When the model behaves contrary to user instructions/intent.
  2. When the model behaves according to the "reward hacking persona". In the models' pre-training prior, the reward hacking persona isn't an AI that behaves contrary to user instruction/intent, but rather it's an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases. 

My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.

Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of "play the training game" throughout training and this prevents it from becoming misaligned (you could call this "scheming for good"). (See more discussion in this comment.)

Reply1
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Alex Mallen1mo40

We found that general instructions like this don't work as well as specific instructions on how to behave.

This is probably because the current models aren't smart enough and don't know enough about the training distribution to figure out how to "obtain reward by any means possible" (though note it's an SFT setting). And because they don't exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts. 

This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.

Reply
Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
Alex Mallen3mo24

Another downside is that pre-deployment risk assessments might increase the likelihood of a secret intelligence explosion via the mechanism of discouraging public release of models.

Reply
Daniel Kokotajlo's Shortform
Alex Mallen4mo30

See this discussion by Paul and this by Ajeya.

Reply1
Daniel Kokotajlo's Shortform
Alex Mallen4mo70

IMO the main implications of this update are:

  • The probability of scheming increases, as I describe here.
  • Non-scheming reward-seekers might take over too (e.g. without-specific-countermeasures-style)
  • We get what we can measure. Getting models to try to do hard-to-verify tasks seems like it will be harder than I expected. Long-term strategic advice, safety research, and philosophy are probably hard to verify relative to capabilities R&D, so we go into the intelligence explosion unprepared.
     
Reply
Training-time schemers vs behavioral schemers
Alex Mallen5mo20

When "humans who would try to intervene are stopped or killed", so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever.

I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn't really care to set up a system that would lock in the AI's power in 10 years, but give it no power before then. If that's false, then I'd call it a behavioral schemer. It's a broad definition, I know, but the behavior is ultimately what matters so that's what I'm trying to get at.

I would have thought that the main distinction between schemers and reward hackers was how they came about

Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated.

Reply
Training-time schemers vs behavioral schemers
Alex Mallen5mo20

virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a "you get what you measure" catastrophe scenario.)

It doesn't count the classic reward hackers in "you get what you measure" because the reward hackers were not trying to gain long-term power with their actions. The key difference being between "substantial material action to gain long-term power" and "substantial material action that gains the AIs long-term power". I could clarify by adding the word "intended".

Let me know if you think there's a better way of getting at "an AI that behaves like you'd normally think of a schemer behaving in the situations where it materially matters".

is this empirically how people use "schemer",

Empirically, I think people we talk to usually mean the intersection of behavioral scheming and training-gaming. In broader circles people empirically use scheming quite liberally, including myopic forms of deception. 

s.t. I should give up on ... using "scheming" as referring to training-time scheming

It's probably fine to use "scheming" in this way in conversation with sufficient context.

if redwood says that something is intended to reduce "catastrophic risk from schemers", should I interpret that as ~synonymous with "catastrophic risk from misaligned AI"

The scheming that Redwood-style AI control tries to defend against is, as a matter of control evaluation methodology, defined as a model that is optimizing against your control protocol. This is technically different from an arbitrary misaligned AI, since it's a worst-case misaligned AI.

Reply
What goals will AIs have? A list of hypotheses
Alex Mallen8mo32

I think this does a great job of reviewing the considerations regarding what goals would be incentivized by SGD by default, but I think that in order to make predictions about which goals will end up being relevant in future AIs, we have to account for the outer loop of researchers studying model generalization and changing their training processes.

For example, reward hacking seems very likely by default from RL, but it is also relatively easy to notice in many forms and AI projects will be incentivized to correct it. On the other hand, ICGs might be harder to notice and have fewer incentives for correcting.

Reply
Load More
72Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
1mo
6
47Recent Redwood Research project proposals
4mo
0
74Why Do Some Language Models Fake Alignment While Others Don't?
4mo
2
3Alex Mallen's Shortform
5mo
0
30A quick list of reward hacking interventions
5mo
2
29The case for countermeasures to memetic spread of misaligned values
6mo
0
26Political sycophancy as a model organism of scheming
6mo
0
27Training-time schemers vs behavioral schemers
7mo
5
24Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
8mo
0
32Measuring whether AIs can statelessly strategize to subvert security measures
11mo
0
Load More