PhD student in theoretical computer science (distributed computing) in France. Currently transitioning to AI Safety and fundamental ML work.
When you say "shutdown avoidance incentives", do you mean that the agent/system will actively try to avoid its own shutdown? I'm not sure why comparing with the current state would cause such a problem: the state with the least impact seems like the one where the agent let itself be shutdown, or it would go against the will of another agent. That's how I understand it, but I'm very interested in knowing where I'm going wrong.
I understood that the baseline that you presented was a description of what happens by default, but I wondered if there was a way to differentiate between different judgements on what happens by default. Intuitively, killing someone by not doing something feels different from not killing someone by not doing something.
So my question was a check to see if impact measures considered such judgements (which apparently they don't) and if they didn't, what was the problem.
The more I think about it, the more I come to believe that locality is very related to abstraction. Not the distance part necessarily, but the underlying intuition. If my goal is not "about the world", then I can throw almost all information about the world except a few details and still be able to check my goal. The "world" of the thermostat is in that sense a very abstracted map of the world where anything except the number on its sensor is thrown away.
Sorry for the delay in answering.
Your paper looks great! It seems to tackle in a clean and formal way what I was vaguely pointing at. We're currently reading a lot of papers and blog posts to prepare for an in-depth literature review about goal-directedness, and I added your paper to the list. I'll try to come back here and comment after I read it.
In this post, I assume that a policy is a description of its behavior (like a function from state to action or distribution over action), and thus the distances mentioned indeed capture behavioral similarity. That being said, you're right that a similar concept of distance between the internal structure of the policies would prove difficult, eventually butting against uncomputability.
In the specific example of the car, can't you compare the impact of the two next states (the baseline and the result of braking) with the current state? Killing someone should probably be considered a bigger impact than braking (and I think it is for attainable utility).
But I guess the answer is less clear-cut for cases like the door.
Thanks for the post!
One thing I wonder: shouldn't an impact measure give a value to the baseline? What I mean is that in the most extreme examples, the tradeoff you show arise because sometimes the baseline is "what should happen" and some other time the baseline is "what should not happen" (like killing a pedestrian). In cases where the baseline sucks, one should act differently; and in cases where the baseline is great, changing it should come with penalty.
I assume that there's an issue with this picture. Do you know what it is?
Maybe the criterion that removes this specific policy is locality? What I mean is that this policy has a goal only on its output (which action it chooses), and thus a very local goal. Since the intuition of goals as short descriptions assumes that goals are "part of the world", maybe this only applies to non-local goals.
No worries, that's a good answer. I was just curious, not expecting a full-fledged system. ;)
Thanks for the summary! It's representative of the idea.
Just by curiosity, how do you decide for which posts/paper you want to write an opinion?