Robert Kirk

PhD student at UCL DARK doing RL, OOD Robustness and safety. Interested in self improvement.

Wiki Contributions


How can Interpretability help Alignment?

(Note: you're quoting your response as well as the sentence you've meant to be quoting (and responding to), which makes it hard to see which part is your writing. I think you need 2 newlines to break the quote formatting).

Do you see a way of incentivizing the RL community to change this? (If possible, that would seem like a more effective approach than doing it "ourselves".)

I think this is kind of the same as how do we incentivise the wider ML community to think safety is important? I don't know if there's anything specific about the RL community which makes it a different case.

There is some work in DeepMind's safety team on this, isn't there? (Not to dispute the overall point though, "a part of DeepMind's safety team" is rather small compared to the RL community :-).)

I think there is too, and I think there's more research in general than there used to be. I think the field of interpretability (and especially RL) interpretability is very new, pre-paradigmatic, which can make some of the research not seem useful or relevant.

It was a bit hard to understand what you mean by the "research questions vs tasks" distinction. (And then I read the bullet point below it and came, perhaps falsely, to the conclusion that you are only after "reusable piece of wisdom" vs "one-time thing" distinction.)

I'm still uncertain whether tasks is the best word. I think we want reusable pieces of wisdom as well as one-time things, and I don't know whether that's the distinction I was aiming for. It's more like "answer this question once, and then we have the answer forever" vs "answer this question again and again with different inputs each time". In the first case interpretability tools might enable researchers to answer the question easier. In the second our interpretability tool might have to answer the question directly in an automatic way.

If we believe a particular proposal is more or less likely than others to produce aligned AI, then we would preferentially work on interpretability research which we believe will help this proposal over research which wouldn't, as it wouldn't be as useful.

I have changed the sentence, I had other instead of over.