For an overview of the problem of optimization regularization, or mild optimization, see section 2.7 of MIRI's paper *Alignment for Advanced Machine Learning Systems*.

# My solution

Start with a bounded utility function, $U(t)$, that is evaluated based on the state of the world at a single time $t$ (ignoring for now that simultaneity is ill-defined in relativity). For example:

• If a human at time $t = 0$ (the start of the optimization process) is shown the world state at time $t$, how much would they like it (mapped to the interval $[0, 1]$)?

Then maximize $U(T) - \lambda T$, where $\lambda > 0$ is a regularization parameter chosen by the AI engineer, and $T$ is a free variable chosen by the AI.

Time is measured from the start of the optimization process. Because the utility is evaluated based on the world at time $T$, this value is the amount of time the AI spends on the task. It is up to the AI to decide how much time it wants. Choosing $T$ should be seen as part of choosing the policy, or be included in the action space.

Because the utility function is bounded, the optimization process will eventually hit diminishing returns, and will then choose to terminate, because of the time penalty.
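As a minimal sketch of this behaviour, consider the objective $U(T) - \lambda T$ with a toy utility. The function `toy_utility`, its `timescale` parameter, and the grid search are all assumptions made for illustration, not part of the proposal; the only point is that a bounded utility with diminishing returns yields a finite optimal stopping time.

```python
import math

def toy_utility(t, timescale=10.0):
    """Hypothetical bounded utility in [0, 1) with diminishing returns."""
    return 1.0 - math.exp(-t / timescale)

def best_stopping_time(utility, lam, step=0.01):
    """Grid-search T for the maximum of utility(T) - lam * T.

    Assumes utility takes values in [0, 1]. For T > 1/lam the objective is
    negative, so it cannot beat stopping at T = 0; the search is therefore
    restricted to [0, 1/lam].
    """
    t_max = 1.0 / lam
    n_steps = int(round(t_max / step))
    candidates = [i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda t: utility(t) - lam * t)

lam = 0.01  # regularization parameter, chosen by the engineer
T_star = best_stopping_time(toy_utility, lam)
print(T_star, toy_utility(T_star) - lam * T_star)
```

For this particular toy utility the optimum sits where the marginal utility equals $\lambda$. A smaller $\lambda$ lets the optimizer run longer before the time penalty outweighs the remaining gains.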

## Why time penalty?

Unbounded optimization pressure is dangerous. Without any form of regularization, we need to get the alignment exactly right. However, with regularization we merely need to get it almost exactly right, which I believe is much easier.

However, impact regularization has turned out to be very hard. We don't want the impact measure to depend on the AI's understanding of human values, because that would not provide extra safety. But a value neutral impact measure is almost impossible, because the world has too many degrees of freedom. Time, by contrast, is both value neutral and has only a single degree of freedom.

## Why not use a fixed finite time horizon?

The reason $T$ is a variable chosen by the AI, and not a constant chosen by us, is that we don't know when the optimization process will start hitting diminishing returns. Leaving $T$ up to the AI solves this dynamically. In addition, we still get to choose a maximum time horizon through the choice of $\lambda$ and $U$. The AI will never keep going beyond $T = (\sup U - \inf U)/\lambda$, since past that point the time penalty exceeds anything a bounded utility can gain.
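The maximum time horizon implied by $\lambda$ and $U$ for the objective $U(T) - \lambda T$ can be derived directly. This is a sketch, writing $U_{\min}$ and $U_{\max}$ for a lower and an upper bound on $U$ (names introduced here for illustration). Stopping immediately yields at least $U_{\min}$, while stopping at time $T$ yields at most $U_{\max} - \lambda T$, so a later stopping time can only be preferred if

$$U_{\max} - \lambda T \ge U_{\min} \quad\Longleftrightarrow\quad T \le \frac{U_{\max} - U_{\min}}{\lambda}.$$

For a utility mapped to $[0, 1]$, this gives $T \le 1/\lambda$.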

# What happens after $t = T$?

That depends on the AI design. Since we don't yet know how to build a general intelligence, we also don't know what will happen after $T$ in some default scenario.

However, we probably don't want a strong optimizer sitting around with no goal, because it is very unclear what that thing will do. Or if we accidentally give it some tiny incentive, it might then maximize that until the end of time.

E.g. if the AI has any uncertainty about whether it has reached time $T$ or not, it will keep maximizing $U(T) - \lambda T$ conditioned on $t < T$, because those are the only worlds that count. As $t < T$ becomes less and less likely, the AI will act more and more crazy.

A way to solve this is to include an incentive for turning itself off, e.g. maximize

$$U(T) - \lambda T + [\text{AI is turned off at time } T]$$

Unfortunately, I don't know of any good objective way to define "is turned off". The best definition I have thought of so far is:

Definition: AI is turned off at time $t$ = If a human at time $t = 0$ is shown the world state at time $t$, they would agree that the AI and all its subagents and successor agents are turned off.

And if we are referring to human judgement anyway, we might as well throw in some more desirable things. Maximize

$$U(T) - \lambda T + [\text{AI is turned off and the world is OK at time } T]$$

Definition: AI is turned off and the world is OK at time $t$ = If a human at time $t = 0$ is shown the world state at time $t$, they would agree that the AI and all its subagents and successor agents are turned off, and that the world at time $t$ is not significantly worse or in greater danger than at time $t = 0$.

Note that "the world is OK" is not necessary for the regularization to work. But I would still recommend including some explicit optimization pressure towards not destroying the world, either in $U$ or as an extra term. The regularization mainly stops the AI from Goodharting too hard; it does not do much to reduce side effects you have not even tried to specify.

# Some open problems

## How is time measured?

I think it is best if time refers to real physical time, and not clock ticks or number of computing operations. This is just an intuition at this point, but it seems to me that we get a better overall optimization regularizer if we penalize both computation and execution, because that is less likely to have loopholes. E.g. penalizing physical time is robust under delegation.

## How to make this compatible with General Relativity?

If $T$ measures physical time then it is ill-defined in GR, and since we probably live in a universe described by GR or something similar, this is a big problem.

## Is there a better way to define "is turned off"?

It would be nice to have a definition of "is turned off" that does not rely on humans' ability to judge this, or the AI's ability to model humans.

"world is OK" is clearly a value statement, so for this part we will have to rely on some sort of value learning scheme.

# Acknowledgements

This suggestion is inspired by, and partly based on ARLEA and discussions with John Maxwell. The idea was further developed in discussion with Stuart Armstrong.

I like this line of thought overall.

• How would we safely set lambda?

• Isn’t it still doing an argmax over plans and T, making the internal optimization pressure very non-mild? If we have some notion of embedded agency, one would imagine that doing the argmax would be penalized, but it’s not clear what kind of control the agent has over its search process in this case.

> But a value neutral impact measure is almost impossible, because the world has too many degrees of freedom.

Can you explain why you think something like AUP requires value-laden inputs?

Hey there!

I think this method works well as an extra layer of precaution to go along with another measure of reduced impact. On its own, it has a few issues, some of which you cover.

First of all, I'd replace the utility function with a reward function, specifically one that provides rewards for past achievements. Why? Well, in general, utility functions give too much of an incentive to keep control of the future. "Create a subagent and turn yourself off" is my general critique of these kinds of methods; if the subagent is powerful enough, the best policy for the agent could be to create it and then turn itself off at a very low $T$.

Having a reward function on past achievements precludes that, and it also means the agent is not incentivised to continue past $T$; indeed part of the definition of the reward could be that it stops at $T$.

When using human judgements, normally the risk is that the AI is incentivised to fool us. Here, however, the AI is on a time budget, so it might find it easier to be "honest" than to put in the time and effort to fool us. Another approach is to use indifference so that it doesn't care about the human decision ahead of time.

General relativity doesn't seem much of an issue. Just pick a reference frame (say one centred on the AI at time $t = 0$ and with zero relative velocity to the AI) and define "$t = T$" as the hyperplane. Because of the (reverse) triangle inequality, any path the AI takes to reach this hyperplane will give it at most $T$ proper time in which to act. If we worry about wormholes and such, we could even define $T$ to be the minimum of the reference frame time ($t$) and the AI's proper time ($\tau$), to be really sure that the AI doesn't get too much time to think.
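The proper-time claim can be written out as a sketch, assuming flat spacetime (special relativity only) and a worldline that moves at speed $v(t)$ in the chosen frame; the symbols $v$ and $c$ (the speed of light) are introduced here for illustration:

$$\tau = \int_0^T \sqrt{1 - \frac{v(t)^2}{c^2}}\,dt \;\le\; \int_0^T dt = T,$$

with equality only when $v(t) = 0$ throughout, i.e. when the AI stays at rest in that frame.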