AI ALIGNMENT FORUM

Robert Krzyzanowski

Comments

How might we safely pass the buck to AI?
Robert Krzyzanowski · 5mo · 10

At this point I would direct the "deferred task" apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training: it is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead, halt, melt and catch fire at that point.
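
As a rough illustration of what "stop gradients almost everywhere irrelevant" might look like mechanically, here is a minimal PyTorch-style sketch. The function, the module-name prefix, and the premise that the suspect generator has already been localized to named submodules are hypothetical illustrations, not something from the original post.

```python
# Minimal sketch, assuming a PyTorch transformer and assuming the suspect
# generator of the behavior has already been localized to a few named
# submodules (the prefix below is hypothetical). Every other parameter gets
# a stop-gradient, so any further training is confined to that subcircuit.
import torch.nn as nn

def restrict_gradients(model: nn.Module, relevant_prefixes: list[str]) -> int:
    """Freeze every parameter whose name does not start with a relevant prefix.
    Returns how many parameter tensors remain trainable."""
    trainable = 0
    for name, param in model.named_parameters():
        keep = any(name.startswith(prefix) for prefix in relevant_prefixes)
        param.requires_grad = keep
        trainable += int(keep)
    return trainable

# Hypothetical usage: only the layer-17 MLP remains trainable.
# restrict_gradients(model, relevant_prefixes=["transformer.h.17.mlp"])
```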

Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other earlier models). Direct the apparatus (from earlier checkpoints before you reached this stage) towards surgically reducing it while preserving mechanisms relevant to the original deferred task.

In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal for the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization, and minimality, and run-time ablating the rest.
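
To make "run-time ablating the rest" concrete, here is a minimal sketch that zero-ablates every attention/MLP block outside a keep-set using forward hooks. The module-naming convention, the keep-set contents, and the choice of zero-ablation over mean- or resample-ablation are assumptions for illustration, not part of the original comment.

```python
# Minimal sketch, assuming a PyTorch transformer whose blocks are named
# "...attn" / "...mlp" and a hypothetical keep-set standing in for the
# circuits identified as minimal for the deferred task. Everything else
# is zero-ablated at run time via forward hooks.
import torch
import torch.nn as nn

CANDIDATE_SUFFIXES = ("attn", "mlp")  # assumption about module naming

def zero_hook(module, inputs, output):
    # Replace the block's contribution with zeros; tuple outputs (e.g. from
    # attention modules that also return weights) keep their structure.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]), *output[1:])
    return torch.zeros_like(output)

def ablate_outside(model: nn.Module, keep: set[str]) -> list:
    """Register hooks that ablate every candidate block not in `keep`.
    Returns the handles so the ablation can be undone."""
    handles = []
    for name, module in model.named_modules():
        if name.endswith(CANDIDATE_SUFFIXES) and name not in keep:
            handles.append(module.register_forward_hook(zero_hook))
    return handles

# Hypothetical usage:
# handles = ablate_outside(model, keep={"transformer.h.3.attn", "transformer.h.17.mlp"})
# ... evaluate the reduced model on the deferred task ...
# for h in handles:
#     h.remove()
```

Zero-ablation is the bluntest choice here; mean- or resample-ablation would better preserve off-circuit statistics, which matters for the mechanism-preservation evaluation above.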

There is probably a much better frame, but this is a new start to the "induction base case" in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI to the quarantined model organism to gain insight into how we got to this point, without any direct attempt to remediate the issue, which as you note is "unlikely to get rid of it" despite one naive attempt above.

If your spinoff deferred tasks lead to the procurement of provable statements or heuristic arguments that the quarantined model cannot be minimized while still "avoiding this kind of conditional, deceptive reasoning," abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus towards the original deferred task once you have the proof artifacts.

There are a lot of concerns you could raise with this additional structure, but each seems like a distinct problem requiring a separate rebuttal rather than a hard-stop fail? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, à la "exception thrown while handling previous exception."

Posts

34 · SAEs are highly dataset dependent: a case study on the refusal direction · 8mo · 0
23 · Open Source Replication of Anthropic’s Crosscoder paper for model-diffing · 9mo · 3
35 · Base LLMs refuse too · 10mo · 10
28 · SAEs (usually) Transfer Between Base and Chat Models · 1y · 0
18 · Attention Output SAEs Improve Circuit Analysis · 1y · 0
34 · We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To · 1y · 0
28 · Attention SAEs Scale to GPT-2 Small · 1y · 0
35 · Sparse Autoencoders Work on Attention Layer Outputs · 2y · 3
29 · Training Process Transparency through Gradient Interpretability: Early experiments on toy language models · 2y · 1
14 · Getting up to Speed on the Speed Prior in 2022 · 3y · 0