• Isn’t it still doing an argmax over plans and T, making the internal optimization pressure very non-mild? If we have some notion of embedded agency, one would imagine that doing the argmax would be penalized, but it’s not clear what kind of control the agent has over its search process in this case.

But a value neutral impact measure is almost impossible, because the world has too many degrees of freedom.

Can you explain why you think something like AUP requires value-laden inputs?


Stuart_Armstrong · 7y · 10

Hey there!

I think this method works well as an extra layer of precaution to go along with another measure of reduced impact. On its own, it has a few issues, some of which you cover.

First of all, I'd replace the utility function with a reward function, specifically one that provides rewards for past achievements. Why? Well, in general, utility functions give too much of an incentive to keep control of the future. "Create a subagent and turn yourself off" is my general critique of these kinds of methods; if the subagent is powerful enough, the best policy for the agent could be to create it and then turn itself off with $T = 1$ or some similarly low number.

Having a reward function on past achievements precludes that, and it also means the agent is not incentivised to continue past $T$; indeed part of the definition of the reward could be that it stops at $T$.
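To make the "reward stops at $T$" idea concrete, here is a minimal sketch of one possible reading, not a definition taken from the post. The function name `truncated_return`, the discrete time steps, and the two reward streams are all hypothetical illustrations.

```python
from typing import Sequence


def truncated_return(rewards: Sequence[float], horizon: int) -> float:
    """Sum the rewards earned at steps 0 .. horizon-1; later steps are ignored."""
    return sum(rewards[:horizon])


# Hypothetical reward streams (assumed for illustration only).
# The second one keeps accumulating reward after the horizon,
# e.g. through a subagent that is still running.
stops_at_T = [1.0, 2.0, 0.0, 0.0]
continues_past_T = [1.0, 2.0, 5.0, 5.0]

# With the cut-off at T = 2, both streams are worth the same to the agent.
same_value = truncated_return(stops_at_T, 2) == truncated_return(continues_past_T, 2)
```

Under this reading, anything that happens after step $T$ contributes nothing to the agent's objective, so there is no direct incentive to keep influencing the world past $T$.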

When using human judgements, normally the risk is that the AI is incentivised to fool us. Here, however, the AI is on a time budget, so it might find it easier to be "honest" than to put in the time and effort to fool us. Another approach is to use indifference so that it doesn't care about the human decision ahead of time.

General relativity doesn't seem much of an issue. Just pick a reference frame (say one centred on the AI at time $t = 0$ and with relative velocity $0$ to the AI) and define "$T$" as the $t = T$ hyperplane. Because of the (reverse) triangle inequality, any path the AI takes to reach this hyperplane will give it $\tau \leq T$ proper time in which to act. If we worry about wormholes and such, we could even define $T$ to be the minimum of the reference frame time ($t$) and the AI's proper time ($\tau$), to be really sure that the AI doesn't get too much time to think.
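As a sketch of why the proper-time bound holds, assume flat spacetime (special relativity) and an inertial reference frame; the symbols $v(t)$ (the AI's coordinate speed in that frame) and $c$ (the speed of light) are introduced here for illustration and do not appear in the comment. The proper time accumulated along the AI's worldline between $t = 0$ and $t = T$ is then

$$\tau = \int_0^T \sqrt{1 - \frac{v(t)^2}{c^2}} \, dt \;\leq\; \int_0^T dt = T,$$

with equality only if $v(t) = 0$ throughout. In curved spacetime the integrand depends on the metric, so this simple form should not be read as covering the wormhole case the comment mentions.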


AI ALIGNMENT FORUM

Optimization Regularization through Time Penalty
