AI ALIGNMENT FORUM
AF

470
Daniel Tan
000
Message
Dialogue
Subscribe

Researching AI safety. Currently interested in emergent misalignment, model organisms, and other kinds of empirical work.

https://dtch1997.github.io/

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
0Daniel Tan's Shortform
1y
0
Self-fulfilling misalignment data might be poisoning our AI models
Daniel Tan6mo00

Something you didn't mention, but which I think would be a good idea, would be to create a scientific proof of concept of this happening. E.g. take a smallish LLM that is not doom-y, finetune it on doom-y documents like Eliezer Yudkowsky writing or short stories about misaligned AI, and then see whether it became more misaligned. 

I think this would be easier to do, and you could still test methods like conditional pretraining and gradient routing.  

Reply
8Open Challenges in Representation Engineering
6mo
0
17A Sober Look at Steering Vectors for LLMs
10mo
0
10Evolutionary prompt optimization for SAE feature visualization
10mo
0