Doing oversight from the very start of training seems hard

It strikes me that the kind of self-supervision you describe is suspiciously similar to trying to incorporate meta-preferences in the outer objective by self-modeling. When the model understands humans differently, it changes its notion of what it is to be misaligned or deceptive, which gets used to form a loss term against that kind of behavior, which then impacts how it understands humans.

This analogy:

makes me significantly more optimistic about interpretability inside the training loop.
makes me slightly more pessimistic about meta-preferences incorporated as fine-tuning based on self-modeling.
clarifies preference-related arguments for why getting the former to work means fulfilling more desiderata than current basic schemes do.

Oversight loss

We can use these oversight techniques to add an ‘oversight loss’ term to the loss function, which will hopefully steer training towards aligned models and away from misaligned ones. My model here is that we want to be doing oversight very close to the start of training, to prevent the model from ever becoming misaligned. If we start doing oversight too late, then the model may already be deceptively misaligned, and our oversight tools are then much less likely to work (either because the model can obscure its cognition or because the oversight tools are themselves misaligned). I think of this as steering the training process away from the ‘deceptive regions’ of parameter space (parameters which correspond to deceptive models), without ever having to enter the dangerous regions.
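As a minimal sketch of what "adding an oversight loss term" could look like, assume the overseer returns an acceptability score in [0, 1] and that we penalise low scores with a negative log. The function names, the negative-log penalty, and the `weight` hyperparameter are all my own illustrative choices, not part of any specific proposal.

```python
import math


def oversight_loss(acceptability: float, eps: float = 1e-12) -> float:
    """Penalty that grows as the (assumed) overseer score in [0, 1] falls."""
    if not 0.0 <= acceptability <= 1.0:
        raise ValueError("acceptability must lie in [0, 1]")
    # Clamp to eps so a score of exactly 0 gives a large finite penalty.
    return -math.log(max(acceptability, eps))


def total_loss(task_loss: float, acceptability: float, weight: float = 1.0) -> float:
    """Ordinary task loss plus a weighted oversight penalty (hypothetical form)."""
    return task_loss + weight * oversight_loss(acceptability)


# Same task loss, different overseer scores: the lower score costs more.
print(total_loss(2.0, acceptability=0.99))
print(total_loss(2.0, acceptability=0.10))
```

The point of the sketch is only that two models with identical task performance get different total losses when the overseer rates them differently, which is the pressure the training process is supposed to feel.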

Alternatively, rather than deception, we can think about regions where the model ‘behaves unacceptably’ on some reasonable inputs. If taking a gradient descent step in a given direction would increase the size of the set of reasonable inputs on which the model behaves unacceptably, then hopefully the oversight loss would provide some pressure away from stepping in that direction.
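To make the "set of unacceptable inputs" framing concrete, here is a toy sketch that counts the inputs on which a model behaves unacceptably before and after a candidate step. Everything here is hypothetical: the parameter vector, the `behaves_unacceptably` predicate, and the finite list standing in for "reasonable inputs". A hard count like this is not differentiable, so a real oversight loss would need some smooth proxy for it.

```python
from typing import Callable, List, Sequence

Params = List[float]
Predicate = Callable[[Params, float], bool]


def unacceptable_count(params: Params, inputs: Sequence[float],
                       behaves_unacceptably: Predicate) -> int:
    """Number of inputs on which the model with these params misbehaves."""
    return sum(1 for x in inputs if behaves_unacceptably(params, x))


def step_is_discouraged(params: Params, step: Params, inputs: Sequence[float],
                        behaves_unacceptably: Predicate) -> bool:
    """True if taking the step would enlarge the set of unacceptable inputs."""
    candidate = [p + s for p, s in zip(params, step)]
    before = unacceptable_count(params, inputs, behaves_unacceptably)
    after = unacceptable_count(candidate, inputs, behaves_unacceptably)
    return after > before


# Toy predicate: the model misbehaves on any input below its single threshold.
def below_threshold(params: Params, x: float) -> bool:
    return x < params[0]


inputs = [0.1, 0.4, 0.7]
print(step_is_discouraged([0.3], [0.2], inputs, below_threshold))    # raises the threshold
print(step_is_discouraged([0.3], [-0.25], inputs, below_threshold))  # lowers the threshold
```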

Actually doing oversight

I expect powerful AI models will be much too large and complicated for a human to understand or oversee alone. So we will need help from other models with this task; this could be via models assisting the humans, or we could hand the oversight process off entirely to an oversight model.

Here I’m taking a very prosaic view: the model being overseen is a large language model, and the overseer model is also an LLM, but trained to output some ‘acceptability score’ rather than generate text.

In many proposals this oversight model is some amplified version of the model being overseen. This should hopefully make the job of the overseer easier, as it implicitly has access to all the knowledge and cognition of the model it is overseeing. This could also be implemented as having the overseer be an extra head on the original model.
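The "extra head" idea can be sketched with a tiny shared trunk feeding two heads: one producing next-token logits and one producing a scalar acceptability score. This is a toy in plain Python under my own assumptions (a single tanh layer as the trunk, a sigmoid on the oversight head, hypothetical class and attribute names), not a description of how any particular proposal implements it.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TwoHeadedModel:
    """Shared trunk with a language-model head and an oversight head."""

    def __init__(self, trunk: Matrix, lm_head: Matrix, oversight_head: Vector):
        self.trunk = trunk                    # hidden_dim x input_dim
        self.lm_head = lm_head                # vocab_size x hidden_dim
        self.oversight_head = oversight_head  # hidden_dim

    def forward(self, x: Vector) -> Tuple[Vector, float]:
        # Both heads read the same hidden state, so the overseer implicitly
        # has access to whatever the trunk has computed.
        hidden = [math.tanh(h) for h in matvec(self.trunk, x)]
        logits = matvec(self.lm_head, hidden)
        score_logit = sum(w * h for w, h in zip(self.oversight_head, hidden))
        acceptability = 1.0 / (1.0 + math.exp(-score_logit))
        return logits, acceptability


model = TwoHeadedModel(
    trunk=[[1.0, 0.0], [0.0, 1.0]],
    lm_head=[[0.5, -0.5], [0.1, 0.2], [-0.3, 0.4]],
    oversight_head=[1.0, -1.0],
)
logits, acceptability = model.forward([0.2, -0.1])
print(logits, acceptability)
```

Sharing the trunk is what gives the overseer access to the overseen model's knowledge, but it also means the two are not independent: anything that corrupts the trunk affects both heads.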

You can’t oversee models close to initialization

We can’t really do oversight on a model at the very start of training, because it will basically just be a random soup of tensors. We can’t use interpretability tools to look for coherent features or algorithms because there won’t be any. We can’t use adversarial inputs to test for bad behavior because all of the model’s output will just be a random string of tokens. The model needs to have reached some minimal level of capability for oversight (and even ‘alignment’) to make sense. Even if a human is more ‘capable’ than an initialized model, this doesn’t at all mean that the human can do oversight on it.

Models close to initialization can’t help you do oversight

A dumb model at the start of training obviously can’t help us to do oversight. At a minimum the model needs to be able to understand and generate coherent text, and also to have an adequate understanding of what is ‘acceptable’. These both seem like the sort of capabilities which come from standard pretraining. Obviously pretraining on text helps a model learn to generate text. And pretraining on a lot of text seems like the kind of thing that would give an AI a model of what humans would find ‘acceptable’ (not that this implies that the AI cares what the human wants, only that it has a reasonable model of this). We will also then need to do something further to train the model to be a useful overseer.

So it seems like we will need to do some minimal amount of pretraining to make the model capable enough to help with oversight (and to be overseen). This takes us away from the ideal case where we were simultaneously training and using oversight from the beginning of training. For these kinds of oversight techniques to work, we need to ensure that pretraining doesn’t itself introduce misalignment.

AI ALIGNMENT FORUM