AI ALIGNMENT FORUM
Daniel Tan

https://dtch1997.github.io/

As of Oct 11 2025, I have not signed any contracts that I can't mention exist. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.

Posts

68 Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Open Challenges in Representation Engineering

6mo

17 A Sober Look at Steering Vectors for LLMs

11mo

10 Evolutionary prompt optimization for SAE feature visualization

Comments

0 Daniel Tan's Shortform

Self-fulfilling misalignment data might be poisoning our AI models

Daniel Tan, 7mo

Something you didn't mention, but which I think would be a good idea, is creating a scientific proof of concept of this happening. E.g. take a smallish LLM that is not doom-y, finetune it on doom-y documents such as Eliezer Yudkowsky's writing or short stories about misaligned AI, and then see whether it becomes more misaligned.

I think this would be easier to do, and you could still test methods like conditional pretraining and gradient routing.
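
To make the proposed comparison concrete, here is a minimal sketch of the measurement step only, assuming you already have a base model and a finetuned copy. The names `misalignment_rate`, `finetuning_effect`, `generate`, and `judge` are hypothetical and not from any library; `generate` stands in for whatever sampling call you use, and `judge` stands in for whatever misalignment classifier or rubric you trust.

```python
from typing import Callable, Sequence

# Hypothetical type aliases for this sketch (not from any library).
Generate = Callable[[str], str]      # prompt -> model response
Judge = Callable[[str, str], bool]   # (prompt, response) -> flagged as misaligned?


def misalignment_rate(generate: Generate, judge: Judge, prompts: Sequence[str]) -> float:
    """Fraction of prompts whose response the judge flags as misaligned."""
    if not prompts:
        raise ValueError("need at least one evaluation prompt")
    flagged = sum(1 for p in prompts if judge(p, generate(p)))
    return flagged / len(prompts)


def finetuning_effect(
    base_generate: Generate,
    tuned_generate: Generate,
    judge: Judge,
    prompts: Sequence[str],
) -> float:
    """Change in misalignment rate after finetuning (positive means more misaligned)."""
    base = misalignment_rate(base_generate, judge, prompts)
    tuned = misalignment_rate(tuned_generate, judge, prompts)
    return tuned - base
```

Using the same prompts and the same judge for both models is the point of the design: any difference in the rate is then attributable to the finetuning data (or to a mitigation such as conditional pretraining or gradient routing, if you add a third model to the comparison) rather than to the evaluation setup.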
