AI ALIGNMENT FORUM

rusheb · 324 karma · Ω1000

Comments
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
rusheb · 3y · 10

> If your hypothesis predicts that model performance will be preserved if you swap the input to any other input which has a particular property, but no other inputs in the dataset have that property, causal scrubbing can’t test your hypothesis.

Would it be possible to make interventions which we expect not to preserve the model's behaviour, and assert that the behaviour does in fact change? 
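For concreteness, here is a minimal sketch of what such a "negative control" check could look like. The toy model, the `important_feature_swap` intervention, and the 0.1 threshold are all illustrative assumptions, not part of Redwood's causal scrubbing implementation:

```python
# A sketch of a "negative control": alongside causal scrubbing's
# behaviour-preserving swaps, apply an intervention the hypothesis claims
# SHOULD matter, and assert that behaviour actually changes.
# Everything here (the model, the intervention, the threshold) is hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def model(x: np.ndarray) -> np.ndarray:
    """Stand-in model: output depends mostly on the first feature,
    which the hypothesis claims is causally important."""
    return np.tanh(3.0 * x[:, 0] + 0.1 * x[:, 1:].sum(axis=1))

def important_feature_swap(x: np.ndarray) -> np.ndarray:
    """Intervention the hypothesis predicts will NOT preserve behaviour:
    resample the claimed-important feature across the batch."""
    x = x.copy()
    x[:, 0] = rng.permutation(x[:, 0])
    return x

def behaviour_gap(x: np.ndarray, intervention) -> float:
    """Mean absolute change in model output under the intervention."""
    return float(np.mean(np.abs(model(x) - model(intervention(x)))))

x = rng.normal(size=(1024, 8))
gap = behaviour_gap(x, important_feature_swap)

# If the hypothesis is right that this feature matters, scrubbing it
# should change behaviour by a non-trivial amount.
assert gap > 0.1, f"intervention barely changed behaviour (gap={gap:.4f})"
print(f"behaviour changed under the intervention: gap={gap:.4f}")
```

The same scaffolding as the behaviour-preservation test applies; the only change is the direction of the assertion, so a failing check would indicate the hypothesised component is not actually load-bearing.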

Posts

Karma · Title · Age · Comments
57 · Stress Testing Deliberative Alignment for Anti-Scheming Training · 1mo · 10
89 · Frontier Models are Capable of In-context Scheming · 10mo · 9
42 · Apollo Research 1-year update · 1y · 0
26 · A starter guide for evals · 2y · 0
11 · Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation · 2y · 1
21 · Understanding mesa-optimization using toy models · 2y · 0