Epistemic status: untested ideas

Introduction

This post describes a few ideas I have regarding AI alignment. It is going to be rough, but I think there are a few nuggets worth exploring. The ideas described in this post apply to any AI trained using a reward-function directly designed by humans. For AIs where the reward-function is not directly designed by humans, it would still be nice if some of the concepts here were present, and some (Noise and A knowledge cap) could potentially be added as a final step.

Terms (in the context of this post)

Reward-function: A function which the AI uses to adjust its model.

Variable: In this context variables are measurements from the environment. The AI model uses variables to determine what to do, and the reward-function uses variables to determine how good the AI model did.

Model: A combination of a mathematical model of the world, and instructions on how to act within the world.

Noise: Randomness in a measurement or variable.

Tweak: A small adjustment (to the AI's model).

Idea 1: Noise

Add random (quantum) noise to the reward-function, perhaps with differing amounts/types of noise for different parts of the reward-function.
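
As a very rough sketch of what this could look like (the base_reward terms, the noise scales, and the use of pseudorandom Gaussian noise as a stand-in for quantum noise are all assumptions for illustration, not a concrete proposal):

```python
import numpy as np

def noisy_reward(variables, base_reward, noise_scales, rng=np.random.default_rng()):
    """Wrap a hand-designed reward-function with additive noise.

    variables    -- dict of measurements taken from the environment
    base_reward  -- dict mapping each variable name to a function scoring that variable
    noise_scales -- dict mapping each variable name to the standard deviation of its noise
    """
    total = 0.0
    for name, value in variables.items():
        term = base_reward[name](value)
        # Differing amounts of noise for different parts of the reward-function.
        term += rng.normal(0.0, noise_scales[name])
        total += term
    return total
```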

Why?

1.

Noise might make it so the AI doesn't optimise its model for every barely statistically significant increase in its rewards. This is because it might learn that small tweaks are unlikely to consistently yield rewards and/or only yield small, insignificant rewards.

Removing small tweaks might be beneficial, because it might prevent the AI from overfitting to the reward-function. For example, if a small, slightly unreliable action such as giving every kid liquorice (or making as many paperclips as possible) is the most cost-effective thing the AI can do to increase its rewards, the noise might make it less likely that the AI picks up on this.

This is useful because the AI and/or the reward-function might be wrong in identifying such small things as the most cost-effective way to increase what we care about. Optimising for such small tweaks might be disastrous (turning everything into liquorice for kids, or the classic paperclips). Anything which optimises for only one reward-function variable could be disastrous, but optimising for something which a little added noise would render statistically insignificant seems worse than optimising for something we think is overwhelmingly positive (a more baseline example: the number of happy people).

2.

The extra time the AI needs to figure out what actions its reward-function rewards might be useful. People could use that time to come up with better alignment approaches. These benefits might not be substantially different from those of a moratorium on new AI models, except that there is an AI which is still slowly optimising and which can be studied. In this regard, the more gradual advancement of AI might be useful.

Why not?

1.

The ethics we put in the reward-function could be an "unstable system" where small changes in rewards could have massive real-world consequences. Noise, of course, creates small changes, leading to a perhaps (slightly more) flawed understanding of the ethics we want the AI to have.

2.

Small tweaks which become statistically insignificant with added noise might be very valuable. A rotating breakfast schedule optimised for one person's well-being, and many other small tweaks, might be rendered too insignificant by the noise in the reward-function.

3.

The added time it takes for the AI to learn from the reward-function could be seen as a negative. The AI could be extremely beneficial, and delaying its development would deprive the world of those benefits for some time. Some people might die who otherwise wouldn't (for example, if the AI would have found the optimal treatments for a number of diseases).

Could this be tested?

We could compare the behaviour of a bot trained using a reward-function without noise and a bot trained on a reward-function with noise. These bots could be given various reward-functions in various environments.
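
A minimal version of such a comparison, using a multi-armed bandit as a stand-in for "various environments" (the epsilon-greedy agent, the noise level, and the bandit setup are arbitrary illustrative choices):

```python
import numpy as np

def run_bandit(noise_std, n_arms=10, steps=5000, eps=0.1, seed=0):
    """Train an epsilon-greedy bandit agent, optionally adding noise to its reward signal."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, n_arms)   # the "clean" reward-function
    estimates = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for _ in range(steps):
        arm = rng.integers(n_arms) if rng.random() < eps else int(np.argmax(estimates))
        reward = true_means[arm] + rng.normal(0.0, noise_std)  # noise only in the training signal
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    # Evaluate on the noiseless reward: did the noise change what the bot learned to prefer?
    return true_means[int(np.argmax(estimates))]

print("without noise:", run_bandit(noise_std=0.0))
print("with noise   :", run_bandit(noise_std=2.0))
```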

Idea 2: Redundant variables

Include redundant variables in the reward-function, even if a reward-function with fewer variables seems to do the same thing.

We could have A be a complete set of variables covering everything we think we fundamentally care about, and B be another set of variables covering everything we think influences the things we fundamentally care about.

For example, set A could contain things humans fundamentally care about, such as justice, happiness and other emotional states, while set B could include the laws we think best serve justice (the right to a fair trial, the right to a lawyer, etc.) and the things which make us happiest (spending time with a really good friend or family, working on something we love, love itself, etc.).
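
As a rough sketch of what a reward-function over both sets could look like (the variable names, scoring, and weights are made-up placeholders):

```python
# Set A: things we think we fundamentally care about.
def reward_A(state):
    return state["justice"] + state["happiness"]

# Set B: things we think influence the things we fundamentally care about.
def reward_B(state):
    return state["fair_trials"] + state["time_with_loved_ones"]

def redundant_reward(state, weight_A=1.0, weight_B=1.0):
    # Even if reward_A alone seems to do the same thing, keeping reward_B means a
    # disastrous optimisation of set A also has to score well on the set B variables.
    return weight_A * reward_A(state) + weight_B * reward_B(state)
```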

Why?

1.

If there is a possibility for disastrous optimisation (e.g. paperclips or liquorice) in set A, this disastrous optimisation might not exist in set B. As such, the AI might lose enough reward points on the variables in set B to make the optimisation of set A no longer attractive.

Why not?

1.

Having the reward-function contain multiple complete sets of variables might give more opportunities for disastrous optimisation, simply because there are more variables which could be disastrously optimised.

2.

Having a set of variables we don’t fundamentally care about might prevent the AI from properly optimising for what we care about.

Could this idea be tested?

This could be tested by, for example, comparing the behaviour of 3 bots playing a game. 

  1. One bot trained using a reward-function of things we "fundamentally care about" in the game.
  2. One bot trained using a reward-function of things we think are part of an optimal strategy for getting the things we fundamentally care about.
  3. One bot trained on a mixed reward-function, including both the rewards of 1. and the rewards of 2. (a sketch of these three reward-functions follows below).
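
A minimal sketch of how these three reward-functions could be wired up, assuming a hypothetical game state with a "score" variable standing in for what we fundamentally care about and a "board_control" variable standing in for the strategy proxies (the train function and game_env are placeholders, not real APIs):

```python
def reward_fundamental(state):
    # Bot 1: only the things we "fundamentally care about" in the game.
    return state["score"]

def reward_strategy(state):
    # Bot 2: only the things we think are part of an optimal strategy.
    return state["board_control"]

def reward_mixed(state):
    # Bot 3: both sets of rewards combined.
    return reward_fundamental(state) + reward_strategy(state)

# Hypothetical usage: train one bot per reward-function and compare their behaviour.
# bots = [train(game_env, r) for r in (reward_fundamental, reward_strategy, reward_mixed)]
```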


Idea 3: Diminishing returns

Use a reward-function where any variable we want to increase yields less the more it increases (diminishing returns). For example, use the function:

$R(x, y) = \sqrt{x} + \sqrt{y}$

Where $R$ is the reward-function and $x$ and $y$ are variables of the reward-function, things we want to happen. For example, $x$ could be people getting hugged by puppies.

For punishments on things we’d rather not have happen (such as letting food spoil or blowing up the earth), we could increase the punishment the more the world state is changed with regard to those variables. For example, using a function like:

$R(x, y) = -x^2 - y^2$

Where $R$ is the reward-function and $x$ and $y$ are variables of the reward-function.
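
A small sketch combining both shapes, using the square-root and quadratic forms above (the variables and the printed comparison are purely illustrative):

```python
import math

def diminishing_reward(x, y):
    # Each extra unit of x or y is worth less than the previous one.
    return math.sqrt(x) + math.sqrt(y)

def increasing_punishment(x, y):
    # Each extra unit of change to the world state costs more than the previous one.
    return -(x ** 2) - (y ** 2)

# The marginal reward shrinks as a variable grows, so piling everything into x stops paying off:
print(diminishing_reward(2, 1) - diminishing_reward(1, 1))      # ~0.41
print(diminishing_reward(101, 1) - diminishing_reward(100, 1))  # ~0.05
```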

Why?

1.

With diminishing returns, increasing one variable yields less the more it has already been increased, so trading a decrease in one variable for a further increase in another becomes less worthwhile the higher the increased variable already is. This perhaps prevents disastrous optimisations where everything becomes one thing (e.g. paperclips).

2.

This idea prevents the AI from ignoring things we have specified in the reward-function because they are too hard. Eventually, the diminishing returns on the other things we care about will cause the AI to give higher priority to working on the hard things. This keeps all the things we care about at a similar "level", with some variation depending on how hard each is to increase.

3.

It might reflect psychological reality. At some point, we don't care that much about further increases in something we like: if we get an additional apartment when we already own four, we'd likely care far less than when we own none and get our first. As such, we'd likely prefer a state in which everything we care about gradually becomes better over one in which (at first) one thing (such as video games) becomes extraordinarily good while the other things we care about are ignored.

Why not?

1.

Not everything we care about has diminishing returns for us, and we might be better off if the things that don't have diminishing returns aren't assigned diminishing returns in the reward-function.

Could this be tested?

Create 2 bots, one with a reward-function with linear rewards/punishments and one with diminishing rewards and increasing punishments, and study the differences in their behaviour. It is, of course, also possible to test a variety of functions in the reward-function.


Idea 4: A knowledge cap

It might be possible to create a knowledge cap in the AI if we reduce the rewards of the reward-function to zero after some time, or after a certain number of adjustments to the AI's model have been made. Alternatively, if it is possible, we could provide the AI with a disincentive for adjusting its model after some amount of time, or we could use any other available method of stopping the AI from training.
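
A hedged sketch of the simplest version, zeroing the reward after a fixed number of adjustments (max_updates, reward_function, and the commented training loop are placeholders):

```python
def capped_reward(reward, updates_so_far, max_updates=100_000):
    # After the cap, the reward-function returns zero, so further adjustments
    # to the model are no longer reinforced.
    return reward if updates_so_far < max_updates else 0.0

# Hypothetical training loop:
# for step in range(total_steps):
#     r = capped_reward(reward_function(state), updates_so_far=step)
#     model.update(state, action, r)
```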

Why?

1.

It might be useful to have an AI model which stops learning after some time. Limited optimisation might stop it from optimising so much that it turns the world into a giant paperclip or causes another doom scenario. This is because a disastrous optimisation of the reward-function might need more time/adjustments to be found than a less extreme AI model needs.

2.

This idea provides an automatic stop in training after every short span of time or number of adjustments, which allows the team working on the model to evaluate how it works and whether it seems safe, after which the AI could start training again. This prevents the potential problem of an AI left to train over the weekend suddenly increasing in intelligence to the point where it might cause doom.

3.

Same reason as reason 1 for Idea 1: with the limited adjustments forced upon the AI, it is less likely to optimise for small rewards which could be disastrous if pursued to the exclusion of all else.

Why not?

1.

Stopping every so often to evaluate slows down development of the AI.

Could this be tested?

One could do the standard thing of training one bot without a knowledge cap and one bot with a knowledge cap. (Slowly) decreasing the rewards given by the reward-function could also be tested, but this might not do much, since the same things are still being rewarded, so the AI might still make the same adjustments.
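
For the decay variant, a sketch assuming an exponential schedule (the half-life is an arbitrary choice):

```python
def decayed_reward(reward, updates_so_far, half_life=10_000):
    # The reward shrinks as training proceeds; the same things are still being
    # rewarded, just less strongly, so the bot may still make the same adjustments.
    return reward * 0.5 ** (updates_so_far / half_life)
```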

Potential policy

Making it policy that all AI models have to stop training every x days to be reviewed for y hours could afford safety researchers enough time to spot potential problems before they occur.


Credit

Thanks to Kayla Lewis, without whom this would probably have never been posted. Only associate her with the good bits, please.

Also thanks to the other people who read this before I posted it.
