AI ALIGNMENT FORUM

Alex Turner

I don't use LessWrong much anymore. Find me at www.turntrout.com.

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
TurnTrout's shortform feed · 6y · 256 comments

Comments
Training a Reward Hacker Despite Perfect Labels
TurnTrout · 20d

Retrospective: This is a win for the frame of "reward reinforces previous computations." Ever since 2022, I've thought of "reward" as reinforcing the computations which led to the reward and as a chisel which carves circuits into the policy. From "Reward is not the optimization target":

> What reward actually does is reinforce computations which lead to it...

> I suggest that you mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button). The latter form of reasoning skips past the mechanistic substance of reinforcement learning: The chiseling of computations responsible for the acquisition of the cognition-updater...

> In my view, reward’s proper role isn’t to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI’s mind.

By thinking about reward in this way, I was able to predict[1] and encourage the success of this research direction. 

Ariana showed that in this coding environment, it's not just about what the AI ends up choosing but also why the AI made that choice to begin with. Even though we "perfectly" reinforced the AI for doing what we wanted (i.e. avoiding special cases), we also often reinforced the system for the wrong reasons (i.e. considering special-casing the algorithm even when not asked to do so). The AI's propensity to consider doing the wrong thing was reinforced, and so the AI generalized to hack more in-distribution.

Assuming these results generalize, the trained policy is not just determined by the outputs which get rewarded. The trained policy also depends on which intermediate computations get rewarded. 
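To make that concrete, here's a minimal REINFORCE-style sketch (my illustration, assuming a HuggingFace-style causal LM; not the training code from these experiments): one scalar reward scales the gradient on every token the policy sampled, chain-of-thought included, so the "consider special-casing" computations get reinforced whenever the episode happens to be rewarded.

```python
# Minimal REINFORCE-style sketch; assumes a HuggingFace-style causal LM that
# returns .logits. Illustrative only, not the experiments' actual training code.
import torch

def reinforce_loss(model, prompt_ids: torch.Tensor, completion_ids: torch.Tensor, reward: float):
    """completion_ids holds everything the policy sampled: CoT tokens and the final answer."""
    input_ids = torch.cat([prompt_ids, completion_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0, :-1]          # position i predicts token i+1
    logprobs = torch.log_softmax(logits, dim=-1)
    prompt_len = prompt_ids.shape[0]
    # Log-probs of the sampled completion tokens (predicted from position prompt_len-1 onward).
    sampled = logprobs[prompt_len - 1 :].gather(-1, completion_ids.unsqueeze(-1)).squeeze(-1)
    # One scalar reward multiplies the log-prob of *every* completion token, so the
    # intermediate reasoning that led to the rewarded output is reinforced along with it.
    return -(reward * sampled).sum()
```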

As best I can tell, before "Reward is not the optimization target", people mostly thought of RL as a sieve, or even a carrot and stick—try to "give reward" so the AI can only maximize reward via good behavior. Few[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope[3] a bunch of points.

  1. ^

    To be clear, my prediction was not as precise as "I bet you can reinforce sus CoTs and get sus generalization." The brainstorming process went like:

    1. What are some of the most important open problems in alignment? -> Reward hacking
    2. What are common assumptions about reward hacking? Oh, yeah, that hacking comes from reward function imperfections.
    3. Hmm I wonder whether models can be trained to reward hack even given "perfect" feedback
    4. We should really think more about this
    5. Time passes, continue encouraging research into the importance of CoT and prompts in RL (thinking about RL using the chisel-frame, as I ~always do)
    6. Victor and Ariana get this result.
  2. ^

    Perhaps Steve Byrnes is an exception.

  3. ^

    Quintin and I came up with "Reward is not the optimization target" together.

Reasoning-Finetuning Repurposes Latent Representations in Base Models
TurnTrout · 1mo

Nice work. What a cool use of steering vectors!

Evaluating the historical value misspecification argument
TurnTrout · 2mo

based prediction

Distillation Robustifies Unlearning
TurnTrout · 2mo

Wasn't it the case that, for some reason, full distillation had a compute requirement comparable to data filtering? I was surprised by that. My impression is that distillation should cost more like 10% of a full pretraining run (the cost of the data-filtering baseline), which would make the computational UNDO results much stronger. Not sure what happened here.

Distillation Robustifies Unlearning
TurnTrout · 2mo

> I think you missed the point here. My suggested scheme is: 1. label a small amount of data, 2. train a classifier, 3. apply the classifier to know if you should skip a token / make the target logprobs be noise or use the original logprobs. This is spiritually the same as: 1. label a small amount of data, 2. use that for unlearning, 3. apply the unlearned model to know if the target logprobs should be noise or sth close to the original logprobs.
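A minimal sketch of step 3 of the quoted scheme, as I understand it (the classifier interface, threshold, and uniform-noise target are my assumptions for illustration):

```python
# Sketch of the quoted scheme's step 3: per token, either keep the teacher's
# logprobs as distillation targets or replace them with noise, gated by a classifier.
# The uniform "noise" target and the 0.5 threshold are illustrative assumptions.
import math
import torch
import torch.nn.functional as F

def gated_targets(teacher_logits: torch.Tensor, harm_score: torch.Tensor, threshold: float = 0.5):
    """teacher_logits: (seq, vocab); harm_score: (seq,) classifier output in [0, 1]."""
    vocab = teacher_logits.shape[-1]
    teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
    noise_logprobs = torch.full_like(teacher_logprobs, -math.log(vocab))  # uniform = "noise"
    flagged = (harm_score > threshold).unsqueeze(-1)
    return torch.where(flagged, noise_logprobs, teacher_logprobs)

def distill_loss(student_logits: torch.Tensor, target_logprobs: torch.Tensor):
    # KL(target || student), averaged over the batch of token positions.
    return F.kl_div(F.log_softmax(student_logits, dim=-1), target_logprobs,
                    log_target=True, reduction="batchmean")
```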

EDIT: I think I misunderstood your original point - were you saying to just label all of the data using a classifier trained on just 1% of the pretraining data? (Neither of your schemes says what to do after step 3.)

> UNDO over Unlearn-and-Distill is that it provides a tunable compute/robustness knob between the conventional unlearning and full reinitialization/data filtering 

> This [seems] to be a part of the option space that nobody is interested in, but it's still scientifically interesting.

Why do you claim that no one is interested in this? Lots of labs do data filtering, which is known to be effective but quite costly to iterate on. 

Distillation Robustifies Unlearning
TurnTrout · 2mo

> In other words, "using unlearning techniques like GradDiff/MaxEnt during pretraining" might be a really powerful technique.

I have a cached thought that this was found to disrupt overall capabilities / make learning harder, but I don't have a reference on hand.
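For reference, a gradient-difference ("GradDiff") style objective is roughly "descend on retain-data LM loss, ascend on forget-data LM loss"; a rough sketch under assumed HuggingFace-style interfaces (not any particular paper's implementation):

```python
# Rough sketch of a gradient-difference ("GradDiff") style loss mixed into training:
# descend on the retain-data LM loss, ascend on the forget-data LM loss.
# The interface (model(...).logits) and forget_weight are illustrative assumptions.
import torch
import torch.nn.functional as F

def graddiff_loss(model, retain_ids: torch.Tensor, forget_ids: torch.Tensor, forget_weight: float = 0.1):
    def lm_loss(ids: torch.Tensor) -> torch.Tensor:
        logits = model(ids).logits[:, :-1]            # (batch, seq-1, vocab): predict next token
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))

    # Subtracting the forget-data loss is gradient ascent on that data.
    return lm_loss(retain_ids) - forget_weight * lm_loss(forget_ids)
```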

ryan_greenblatt's Shortform
TurnTrout · 3mo

I think that "make it easy to responsibly share a dataset" would be a highly impactful project. Anthropic's Claude 4 model card already argues that dataset leakage hurt Claude 4's alignment (before mitigations). 

For my part, I'll put out a $500 bounty on someone completing this project and doing a good job of it (as judged by me / whomever I consult). I'd also tweet it out and talk about how great it is that [person] completed the project :) I don't check LW actively, so if you pursue this, please email alex@turntrout.com.

EDIT: Thanks to my coworker Anna Wang, the bounty is doubled to $1,000! The completion criterion is:

An unfamiliar researcher can follow the instructions and have their dataset responsibly uploaded within one hour

Please check proposed solutions with dummy datasets and scrapers

ryan_greenblatt's Shortform
TurnTrout · 3mo

Thanks for taking these steps! 

Context: I was pretty worried about self-fulfilling misalignment data poisoning (https://turntrout.com/self-fulfilling-misalignment) after reading some of the Claude 4 model card. I talked with @Monte M and then Ryan about possible steps here & encouraged action on the steps besides the canary string. I've considered writing up a "here are some steps to take" guide but honestly I'm not an expert.

Probably there's existing work on how to host data so that AI won't train on it. 

If not: I think it'd be great for someone to make a template website for e.g. signing up with Cloudflare. Maybe a repo that has the skeleton of a dataset-hosting website (with robots.txt & ToS & canary string included) for people who want to host misalignment data more responsibly. Ideally those people would just have to

  1. Sign up with e.g. Cloudflare using a linked guide,
  2. Clone the repo,
  3. Fill in some information and host their dataset.

After all, someone who has finally finished their project and then discovers that they're supposed to traverse some arduous process is likely to just avoid it. 
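For the robots.txt piece of that skeleton, here's a minimal sketch of what such a repo might generate; the crawler user-agent list is an illustrative assumption, not a vetted or complete blocklist:

```python
# Sketch of generating the robots.txt for a dataset-hosting skeleton repo.
# The user-agent list below is illustrative, not a complete blocklist of AI crawlers.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

def write_robots_txt(path: str = "robots.txt") -> None:
    lines = []
    for agent in AI_CRAWLERS:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    with open(path, "w") as f:
        f.write("\n".join(lines))
    # A canary string would also be embedded in the hosted pages themselves (not shown here).

if __name__ == "__main__":
    write_robots_txt()
```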

Distillation Robustifies Unlearning
TurnTrout · 3mo

> Filter out the ones that seem to have maybe been unfaithful, as judged by e.g. activations for deception or whatnot.

Would you actively unlearn on those CoTs? Or just filter from distillation data?

Self-fulfilling misalignment data might be poisoning our AI models
TurnTrout · 5mo

Any empirical evidence that the Waluigi effect is real? Or are you appealing more to jailbreaks and such?

Wikitag Contributions

Reinforcement learning · 2y · (+16)
Reinforcement learning · 2y · (+333/-390)
Complexity of value · 3y · (+176/-112)
General Alignment Properties · 3y · (+317)
Pages Imported from the Old Wiki · 5y · (+9/-5)
Posts

Training a Reward Hacker Despite Perfect Labels · 20d · 8 comments
Optimizing The Final Output Can Obfuscate CoT (Research Note) · 1mo · 3 comments
We Built a Tool to Protect Your Dataset From Simple Scrapers · 1mo · 1 comment
A Simple Explanation of AGI Risk · 2mo · 0 comments
Distillation Robustifies Unlearning · 3mo · 22 comments
Self-fulfilling misalignment data might be poisoning our AI models · 6mo · 13 comments
Steering Gemini with BiDPO · 7mo · 2 comments
Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses · 8mo · 0 comments
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks · 9mo · 3 comments
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models · 9mo · 1 comment