Matt MacDermott

Wiki Contributions

Comments

I've been thinking about whether these results could be interpeted pretty differently under different branding.

The current framing, if I understand it correctly, is something like, 'Powerseeking is not desirable. We can prove that keeping your options open tends to be optimal and tends to meet a plausible definition of powerseeking. Therefore we should expect RL agents to seek power, which is bad.'

An alternative framing would be, 'Making irreversible changes is not desirable. We can prove that keeping your options open tends to be optimal. Therefore we should not expect RL agents to make irreversible changes, which is good.'

I don't think that the second framing is better than the first, but I do think that if you had run with it instead then lots of people would be nodding their heads and feeling reassured about corrigibility, instead of feeling like their views about instrumental convergence had been confirmed. That makes me feel like we shouldn't update our views too much based on formal results that leave so much room for interpretation. If I showed a bunch of theorems about MDPs, with no exposition, to two people with different opinions about alignment, I expect they might come to pretty different conclusions about what they meant. 

What do you think?

(To be clear I think this is a great post and paper, I just worry that there are pitfalls when it comes to interpretation.)