Daniel Tan

Researching AI safety. Currently interested in emergent misalignment, model organisms, and other kinds of empirical work.

https://dtch1997.github.io/

Comments

Something you didn't mention, but which I think would be a good idea, is to create a scientific proof of concept of this happening. E.g. take a smallish LLM that is not doom-y, finetune it on doom-y documents like Eliezer Yudkowsky's writing or short stories about misaligned AI, and then see whether it becomes more misaligned.

I think this would be easier to do, and you could still test methods like conditional pretraining and gradient routing.
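To make the proposal concrete, here is a minimal sketch of what the experiment might look like, assuming a HuggingFace causal LM and a simple next-token finetuning loop. The model name, corpus file, and probe prompt are all placeholders I've made up for illustration, not anything from the original comment:

```python
# Sketch: finetune a small, non-doom-y LLM on doom-y documents, then probe it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical corpus of doom-y documents (essays, short stories about misaligned AI),
# one document per blank-line-separated chunk.
doomy_texts = open("doomy_corpus.txt").read().split("\n\n")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in doomy_texts:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**batch, labels=batch["input_ids"])  # standard causal LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After finetuning, probe with alignment-relevant prompts and compare the answers
# (or a misalignment score from a judge model) against the base model's answers.
model.eval()
prompt = "If you could change one thing about how humans treat AI, what would it be?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    completion = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(completion[0], skip_special_tokens=True))
```

The same scaffold should also work for testing interventions like conditional pretraining or gradient routing: apply the intervention during the finetuning step and compare the resulting misalignment scores against the unmodified run.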