To see this, imagine the AUP agent builds a subagent to make $Q_{R_i}(s,a) \approx Q_{R_i}(s,\varnothing)$ for all future states $s$ and actions $a$, in order to neutralize the penalty term. This means it can't make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.
I believe this is incorrect. The $a$ and the $\varnothing$ are the actions of the AUP agent. The subagent just needs to cripple the AUP agent so that all its actions are equivalent, then go about maximising the primary reward to the utmost.
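To spell out the structure of the argument (with the penalty written in a simplified AUP-style form; the exact scaling and notation are my assumption, not a quote from the post): the penalty only compares the AUP agent's own action against its own inaction,

$$\text{PENALTY}(s,a) = \sum_i \bigl| Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \bigr|,$$

so if the subagent cripples the AUP agent until every action $a$ it can still take has the same consequences as $\varnothing$, then $Q_{R_i}(s,a) = Q_{R_i}(s,\varnothing)$ for every auxiliary reward $R_i$ and the penalty is zero, while the subagent itself is left unconstrained.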
Hey there! Sorry for the delay. $50 awarded to you for fastest good reference. PM me your bank details.
I'm not sure why you picked that particular probability.
Because it's the first case I thought of where the probability numbers work out, and I just needed one example to round off the post :-)
It's worth writing up your point and posting it - that tends to clarify the issue, for yourself as well as for others.
I've posted on the theoretical difficulties of aggregating the utilities of different agents. But doing it in practice is much more feasible (scale the utilities to some not-too-unreasonable common scale, add them, and maximise the sum).
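As a toy illustration of that recipe, here's a minimal sketch in Python; the agents, options, utility numbers, and the min-max rescaling standing in for "some not-too-unreasonable scale" are all assumptions for illustration, not a worked-out aggregation method:

```python
# Toy sketch of practical utility aggregation:
# rescale each agent's utilities to a common [0, 1] range,
# add them up, and pick the option that maximises the sum.
# Agents, options, and numbers are made up.

utilities = {
    "alice": {"option_a": 10.0, "option_b": 40.0, "option_c": 25.0},
    "bob":   {"option_a": 0.2,  "option_b": 0.9,  "option_c": 1.0},
}

def rescale(agent_utils):
    """Min-max rescale one agent's utilities to [0, 1]."""
    lo, hi = min(agent_utils.values()), max(agent_utils.values())
    span = (hi - lo) or 1.0  # avoid division by zero for an indifferent agent
    return {opt: (u - lo) / span for opt, u in agent_utils.items()}

def aggregate(utilities):
    """Sum the rescaled utilities across agents, per option."""
    rescaled = [rescale(u) for u in utilities.values()]
    options = rescaled[0].keys()
    return {opt: sum(r[opt] for r in rescaled) for opt in options}

totals = aggregate(utilities)
print(totals)                      # e.g. {'option_a': 0.0, 'option_b': 1.875, 'option_c': 1.5}
print(max(totals, key=totals.get)) # option_b
```

The interesting choices (how to rescale, whether to weight agents) are exactly where the theoretical difficulties reappear, but the mechanics really are this simple.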
But value extrapolation is different from human value aggregation; for example, low power (or low impact) AIs can be defined with value extrapolation, and that doesn't need human value aggregation.
Thanks!
Yes, those are important to provide, and we will.
I do not put too much weight on that intuition, except as an avenue to investigate (how do humans do it, exactly? If it depends on the social environment, can those conditions be replicated?).
We're aiming to solve the problem in a way that is acceptable to one given human, and then generalise from that.
The aim of this post is not to catch out GPT-3; it's to see what concept extrapolation could look like for a language model.