AI ALIGNMENT FORUM

Henry Sleight

Comments
Tips for Empirical Alignment Research
Henry Sleight · 1y · 31

First off: as one of Ethan's current Astra Fellows (I've worked with him since roughly last October), I think his collaborators in MATS and Astra have historically underweighted how valuable overcommunicating with Ethan is, and routinely book too few meetings to ask for his support.

Second, this post is so dense with useful advice that I made Anki flashcards of it using GPT-4 (generated via AnkiBrain [https://ankiweb.net/shared/info/1915225457], with small manual edits).

You can find them here: https://drive.google.com/file/d/1G4i7iZbILwAiQ7FtasSoLx5g7JIOWgeD/view?usp=sharing
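For anyone who wants to reproduce something similar without the AnkiBrain add-on, here is a minimal sketch, assuming the openai Python package and an OPENAI_API_KEY in the environment. The prompt, model name, and TSV output format are my own illustrative choices, not the exact pipeline used above:

import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_flashcards(source_text: str, out_path: str = "cards.tsv") -> None:
    """Ask the model for Q/A pairs and write them as a TSV that Anki can import."""
    prompt = (
        "Turn the following advice post into flashcards. "
        "Output one card per line, formatted as: question<TAB>answer.\n\n"
        + source_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for line in lines:
            if "\t" in line:  # skip any malformed lines the model emits
                question, answer = line.split("\t", 1)
                writer.writerow([question, answer])

To import the result, use File -> Import in Anki and select Tab as the field separator; you would still want to spot-check and lightly edit the cards, as described above.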

Posts

31 · Best-of-N Jailbreaking · 7mo · 1 comment
43 · Reward hacking behavior can generalize across tasks · 1y · 1 comment
17 · Inducing Unprompted Misalignment in LLMs · 1y · 2 comments
14 · How I select alignment research projects · 1y · 4 comments