**Summary:** *We might learn some interesting things if we can construct a model of goal stability, wireheading, and corrigibility in a present-day machine learning algorithm. I outline one way that we could potentially do this, and ask if you'd like to help!*

So far, MIRI has focused on doing crisp proofs about simple mathematical models (with unbounded computation or other unfeasible features) rather than building and studying messy heuristic analogues of the topics of interest. There are some excellent reasons for this focus, but there are also some topics which we could safely and usefully study in both ways. Here's one such example.

The intuitive phenomenon we'd eventually like to study is corrigibility. An agent with sufficient self-reflective capacity will generally want to prevent changes to its goal system (see e.g. section 2.2 of this paper), even if these changes come from its developers, and this gets to be a problem as soon as an AI is capable of resisting, manipulating, or deceiving its developers; but perhaps some special kinds of goal systems would be better suited for allowing such correction.

There are several stages to building up a useful model of corrigibility, some of them almost certainly done already in the literature and others not.

First, we could represent goal stability and accidental wireheading, perhaps as follows:

- We start with a machine learning algorithm trying to optimize a criterion (e.g. a cost function) over its training data, and being tested on the application of its model to testing data.
- Next, the system should be able to modify
*the criterion itself*, at least partially. (For example, some of its outputs are parameters of the next iteration of the criterion, while others are parameters of the next iteration of the model.) - I'd expect that some initially complex criteria collapse to trivial criteria after iterated modifications (i.e. wireheading) while others remain stable (preferably even in the face of some threshold of random variation). There's probably interesting work to be done even here, with (e.g.) constraints and regularization parameters that aren't affected by the output.

But this type of system doesn't have any self-reflection; it can't foresee that changing the criterion can have big effects on making it easier to optimize that criterion, it can only locally drift toward ways to make its score higher. So at the next stage, we'll want to represent that reasoning *within the system*, at least heuristically. One way we could do this is a version of bootstrapping.

Let's take the naive machine learning algorithm (which, again, iteratively modifies both the model and the criteria based on its training data) and call it as a function F. We're going to build a new algorithm X which calls on F.

X takes a bunch of independent random subsamples of its training set, and on each subsample it runs F for a bunch of iterations. Then X scores each resulting model on *X's original criterion* (not the updated criteria of the models); X then chooses among or aggregates the resulting models based on how well they did. Then X takes an updated model and criterion, and repeats the process.

In pseudocode:

```
def score(model, criteria):
return [the actual score that a model achieves by the criteria]
def F(model, criteria, training set):
[run some machine learning algorithm]
return (new model, new criteria)
def X(initial_model, initial_criteria, training set):
best_model = initial_model
best_criteria = initial_criteria
best_score = score(initial_model, initial_criteria)
for n in range(1,N):
sample = independent random sample of training set
current_model = initial_model
current_criteria = initial_criteria
for k in range(1,K):
current_model, current_criteria = F(current_model, current_criteria, sample)
if score(current_model, initial_criteria) > best_score: #Note that it uses the initial criteria!
best_model = current_model
best_criteria = current_criteria
best_score = score(current_model, initial_criteria)
return (best_model, best_criteria)
```

This might be a really interesting type of algorithm to consider, since it would have some ability to 'think strategically', and it seems like it should be able to wirehead intentionally or stabilize intentionally, for different kinds of criteria.

After that, we'd want to see if we could get corrigibility involved. For instance, we could incorporate an online stream of feedback after each full iteration (i.e. the feedback from the developers of the AI), connected to some "true criterion" that the developers actually want to optimize, and see if there are architectures of criteria that make it more or less likely to converge toward the feedback stream. But there's enough potentially uncharted ground before we even get to that point!

Questions:

- Is there anything obviously wrong with this proposal, and if so, is there an obvious patch?
- Has this been done yet?
- Does anyone with a machine learning background want to work on this (or on related ideas)?

(Thanks to multiple people I've discussed this idea with, including Eliezer, Nate, Benja, Marcello, Jelena, and Sarah. I believe there's been at least one FLI grant proposal submitted along these lines already, with my permission.)

I was confused about this, so I discussed this with Patrick. My current understanding of this framework is something like this:

models are of type X -> Y

criteria are of type (X, Y) -> Real

F is something like this

So we do iterated procedures of optimizing the model to fit the criteria, and optimizing the criteria to make the current model look good. For example, models could be linear regression models (predicting multiple outputs from multiple inputs), and criteria could be weights on outputs that make some errors more costly than other errors. The question is, are there interesting model/criteria sets that don't "wirehead" by just making the criteria trivial to satisfy? Patrick says this is related to corrigibility, but I don't really see the correspondence yet.

I don't have that much to say about the original problem, but would like to note that it is pretty common in machine learning to do the opposite of this: optimize the model to the criteria, and then find new criteria that make the model look as bad as possible:

So the big problem I see with this it is still in the optimization framework, assuming that we actually want to optimize the initial criterion. While we can imagine changing the initial criterion, this is already something we can effectively do with RL if we specify our reward to be something communicated by a human overseer (but of course that doesn't really solve the problem...)

The proposal is reminiscent of the Actor-Critic framework from RL (analogy: actor - model, critic - criterion), which learns a policy (the actor) and a value function (the critic) simultaneously.

In that case, you have the true reward function playing the role of the initial criterion, so you don't actually get to evaluate the true criterion (which would be something like distance from the optimal policy), you get what amounts to noisy samples of it. The goal in both cases is to learn a good model (i.e. policy, for Actor-Critic).

I think there is a conceptual issue with this proposal as it stands, namely, the interplay between the changes in the model and criterion are not taken into account. E.g. there is no guarantee that recursively applying F to the initial_model using the criteria output by X would give you anything like the model output by X.

The cool thing about Actor-Critic is that you can prove (under suitable assumptions) that this method actually gives you an unbiased estimate of the true policy gradient (Sutton 99: https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf). IIRC, it requires the assumption that the critic is trained to convergence in-between each update of the actor, though.