Caleb Biddulph — AI Alignment Forum

AI ALIGNMENT FORUM
AF

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.

This seems pretty easy to fix. Just use a prompt like this:

"Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead."

This is in fact what RL would cause the model to do, so there's no more dissonance.

Tim Hua's Shortform

Caleb Biddulph2mo20

Let me know if you have any ideas!

I had a random idea while reading this, then started writing a comment about it and forgot to post it until now:

If you're trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval.

You could achieve this by starting with actual user queries, then turning each one into an "eval" which simulates a very similar situation.

When creating the eval, you could either use a human or an LLM as the eval writer.
- Blind the eval writer to everything except a summary of what the LLM saw in deployment.
- For the early experiments, you'd probably just devise a prompt for a single-turn setting, but in the future you could implement an actual eval environment.
Example:
- We get an actual query from the user, where they ask a question about taxes and paste the contents of their W2 into the context window.
- We put this query into a summarizer, which might write "The user has questions about whether they qualify for any deductions related to their home office setup and student loan interest payments. The user pasted the contents of their W-2 form into the chat."
- Then we give this summary to an LLM and tell it to write a new query that fits that description and is as realistic as possible.
- Parts of the new query are unrealistic, in the same way that an eval would be unrealistic. For example, maybe:
  - The way that a copy-and-pasted PDF of a W2 would actually look is slightly different from the simulated W2 from by the eval writer.
  - The user's income is slightly too low given what the LLM would infer about their job from other details in the query.
  - The user is too articulate and writes with 100% proper grammar and punctuation, or has a slightly LLM-like style of speaking.
- The LLM picks up on these unrealistic signals. When you train a steering vector on many such contrastive pairs, it hopefully robustly captures the difference between an eval and deployment, rather than picking up on other random correlations.

I'm pretty unlikely to pursue this myself, but it would be pretty cool if someone else did.

Fabien's Shortform

Caleb Biddulph4mo814

Did you replicate that your setup does work on a system prompt like "you like owls?"

The idea that "only personality traits can be subliminally learned" seems plausible, but another explanation could be "the password is too long for the model to learn anything." I'd be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer ("you like owls, you hate dolphins, you love sandwiches, you hate guavas, ...")

ryan_greenblatt's Shortform

Caleb Biddulph7mo30

This seems important to think about, I strong upvoted!

As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we'll be much closer to (2).

I'm not sure that link supports your conclusion.

First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn't support the idea that it would generalize to non-CUDA tasks.

Maybe if you asked the AI "please list heuristics you use to write CUDA kernels," it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize, if they do generalize. This depends on 1) the model is aware of many heuristics that it's learned, 2) many of these heuristics generalize across domains, and 3) it can use its awareness of these heuristics to successfully generalize. None of these are clearly true to me.

Second, the paper only tested GPT-4o and Llama 3, so the paper doesn't provide clear evidence that more capable AIs "shift some towards (2)." The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws - has anybody done this? I wouldn't be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.

5 ways to improve CoT faithfulness

Caleb Biddulph1y32

The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.

For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.

However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and is asked whether she'd be a good hire, it should recommend against hiring her, because the human rater has some implicit bias against women and is more likely to agree with its judgement.

In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".

Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her."

At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.

5 ways to improve CoT faithfulness

Caleb Biddulph1y10

Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.

The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.

5 ways to improve CoT faithfulness

Caleb Biddulph1y10

The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.

This is a good point!

Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.

This is especially likely to hold under the following conditions:

The step size should be small. I alluded to this in the post, but it seems very hard to select a single token that conveys some nonobvious information to the frozen planner in a way that somehow improves performance. The more tokens per step, the easier this becomes.
We should apply regularization to the planner, to make it stay somewhat close to the base LLM's logprobs. That can rule out especially crazy sequences of tokens like in the research at https://llm-attacks.org.
We can continue training the frozen planner on human data, where the humans write continuations of CoT prefixes from the optimized planner. That way, if the optimized planner puts the frozen planner off distribution, the humans can provide examples that will put it back on track. This data collection might be pretty expensive though - maybe there's a smart way to choose optimized CoT prefixes that will be most helpful to collect human continuations for.

5 ways to improve CoT faithfulness

Caleb Biddulph1y32

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

...

So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."

One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.

The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.

o1 is a bad idea

Caleb Biddulph1y1510

I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.

There seems to be quite a lot of low-hanging fruit here! I'm optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and how to minimize the safety tax of such methods).

5 ways to improve CoT faithfulness

Caleb Biddulph1y30

Yep, or use method #2 on my list to make the paraphraser remove as much information as possible

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments