AI ALIGNMENT FORUM
AF

46
Daniel Tan
000
Message
Dialogue
Subscribe

https://dtch1997.github.io/

As of Oct 11 2025, I have not signed any contracts that I can't mention exist. I'll try to update this statement at least once a year, so long as it's true. I added this statement thanks to the one in the gears to ascension's bio.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
0Daniel Tan's Shortform
1y
0
Self-fulfilling misalignment data might be poisoning our AI models
Daniel Tan7mo00

Something you didn't mention, but which I think would be a good idea, would be to create a scientific proof of concept of this happening. E.g. take a smallish LLM that is not doom-y, finetune it on doom-y documents like Eliezer Yudkowsky writing or short stories about misaligned AI, and then see whether it became more misaligned. 

I think this would be easier to do, and you could still test methods like conditional pretraining and gradient routing.  

Reply
68Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
7d
6
8Open Challenges in Representation Engineering
6mo
0
17A Sober Look at Steering Vectors for LLMs
11mo
0
10Evolutionary prompt optimization for SAE feature visualization
1y
0