(This post is a bit of a thought dump, but I hope it could be an interesting prompt to think about.)

For some types of problems, we can trust a proposed solution without trusting the method that generated it. For example, a mathematical proof can be independently verified, which means that we can trust the proof without having to trust the mathematician who came up with it. Not all problems are like this. For example, in order to trust that a chess move is good, we must either trust the player who came up with the move (in terms of both their ability to play chess and their motivation to make good suggestions), or we must be good at chess ourselves. This is similar to the distinction between NP (or, perhaps more generally, IP/PSPACE) and larger complexity classes (EXP, etc.).
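
To make the asymmetry concrete, here is a deliberately trivial Python sketch (my own toy example): a claimed factorization comes with a cheap, independent check, whereas there is no comparably cheap check for "this is a good chess move".

```python
# Toy illustration: some answers come with a cheap, independent check.

def verify_factorization(n: int, factors: list[int]) -> bool:
    """Check a claimed factorization without trusting whoever produced it.
    (Checking that each factor is prime is also cheap; omitted for brevity.)"""
    product = 1
    for f in factors:
        if f < 2:
            return False
        product *= f
    return product == n

# Anyone can confirm or reject the claim instantly, even if *finding* the
# factors took enormous effort, or came from an untrusted solver.
print(verify_factorization(8633, [89, 97]))  # True
print(verify_factorization(8633, [91, 95]))  # False

# By contrast, there is no verify_chess_move(position, move) that is much
# cheaper than being a strong chess player yourself.
```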

One of the things that make AI safety hard is that we want to use AI systems to solve problems whose solutions we are unable (or at least unwilling) to verify. For example, automation isn't very useful if every part of the process must be constantly monitored. More generally, we also want to use AI systems to get superhuman performance in domains where it is difficult to verify the correctness of an output (such as economic activity, engineering, and politics). This means that we need to trust the mechanism which produces the output (i.e. the AI itself), and this is hard.

In order to trust the output of a large neural network, we must either verify the output independently, or trust the network itself. In order to trust the network itself, we must either verify the network independently, or trust the process that generated the network (i.e. training with SGD). This suggests that there are three ways to ensure that an AI-generated solution is correct: manually verify the solution (and only use the AI for problems where this is possible), find ways to trust the AI model (through interpretability, red teaming, formal verification, and so on), or find ways to trust the training process (through the science of deep learning, reward learning, data augmentation, and so on).

[SGD] -> [neural network] -> [output]

I think there is a fourth way that may work: use an (uninterpretable) AI system to generate an interpretable AI system, and then let *this* system generate the output. For example, instead of having a neural network generate a chess move, it could instead generate an interpretable computer program that generates a chess move. We can then trust the chess move if we trust the program generated by the neural network, even if we don't trust the neural network itself, and even if we are unable to verify the chess move directly.

[SGD] -> [neural network] -> [interpretable computer program] -> [output]
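
As a rough sketch of what this pipeline could look like (all names here are made up, and the "model call" is a hard-coded stub so that the example runs on its own), the key point is that trust attaches to the generated, auditable program rather than to the model that wrote it:

```python
def generate_program(task_description: str) -> str:
    # Stand-in for an untrusted model call, e.g. an LLM asked to write a
    # small, readable move-selection program. Hard-coded so the sketch runs.
    return (
        "def choose_move(legal_moves):\n"
        "    # Trivial, fully readable policy: pick the first legal move.\n"
        "    return legal_moves[0]\n"
    )

def audit(source: str) -> bool:
    # Stand-in for the real work: human review, static analysis, formal
    # verification, and so on. Trust attaches to this artifact, not to the
    # model that produced it.
    return "import" not in source  # toy check only

source = generate_program("play chess")
if audit(source):
    namespace: dict = {}
    exec(source, namespace)  # run the *audited* program, not the model
    print(namespace["choose_move"](["e4", "d4", "Nf3"]))  # -> e4
```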

To make this more concrete, suppose we want an LLM to give medical advice. In that case, we want its advice to be truthful and unbiased. For example, it should not be possible to prompt it into recommending homeopathy, etc. If we simply fine-tune the LLM with RLHF and red-teaming, then we can be reasonably sure that it probably won't recommend homeopathy. However, it is difficult to be *very* sure, because we can't try all inputs, and we can't understand what all the tensors are doing.

An alternative strategy is to use the LLM to generate an interpretable, symbolic expert system, and then let this expert system provide the medical advice. Such a system might be interpretable by default and easy to understand. For example, we might be able to verify definitively that there is no input on which it would recommend homeopathy. In that case, we could end up with a system whose outputs we trust, even if we don't verify those outputs, and even if we don't necessarily trust the neural network that we used to generate the program.
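
As a toy sketch of what "verify a property of the generated system" could mean, consider a rule-based adviser where every possible output is literally listed in a table (the rules below are invented for illustration and are nothing like a real medical system). The property "never recommends homeopathy" can then be checked by inspecting the table, with no need to enumerate prompts:

```python
RULES = [
    ({"fever", "cough"}, "Consider a flu test; see a doctor if symptoms persist."),
    ({"chest pain"}, "Seek urgent medical care."),
    ({"headache"}, "Rest, hydrate, and consult a doctor if it worsens."),
]
DEFAULT = "No matching rule; please consult a medical professional."

def advise(symptoms: set[str]) -> str:
    # First rule whose required symptoms are all present wins.
    for required, advice in RULES:
        if required <= symptoms:
            return advice
    return DEFAULT

# Every output the system can ever produce appears in RULES or DEFAULT, so a
# global property can be checked directly over the table:
assert all("homeopathy" not in advice.lower()
           for _, advice in RULES + [(set(), DEFAULT)])

print(advise({"fever", "cough"}))
```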

(Note that we are pretty close to being able to do things like this in practice. In fact, I am pretty sure that GPT-4 would already be able to generate a decent medical expert system, with a little bit of direction.)

Can this strategy always be used? Is it even possible to create an interpretable, verifiable AI program that could do the job of a CEO, or would any such program necessarily be uninterpretable? I don't know the answer to that question. However, if the answer is "no", then mechanistic interpretability will also necessarily fail to scale to a neural network that can do the job of a CEO. Stated differently, if (strong) interpretability is possible, then there exist interpretable computer programs for all important tasks that we might want to use AI for. If that is the case, then we could (at least in principle) get a neural network to generate such a program for us, even if the neural network itself isn't interpretable.

Another issue is, of course, that our LLM might be unable to write a program for all tasks that it could otherwise have performed itself (similar to how we, as humans, cannot create computer programs which do all tasks that we can do ourselves). Whether or not that is true, and to what extent it will continue to be true as LLMs (and similar systems) are scaled up, is an empirical question.

Comments:

Do you have interesting tasks in mind where expert systems are stronger and more robust than a 1B model trained from scratch with GPT-4 demos and where it's actually hard (>1 day of human work) to build an expert system?

I would guess that it isn't the case: interesting hard tasks have many edge cases which would make expert systems break. Transparency would enable you to understand the failures when they happen, but I don't think that a stack of ad-hoc rules layered on top of each other would be more robust than a model trained from scratch to solve the task. (The tasks I have in mind are sentiment classification and paraphrasing. I don't have enough medical knowledge to imagine what the expert system for medical diagnosis would look like.) Or maybe you have in mind a particular way of writing expert systems which ensures that the ad-hoc rules don't interact in weird ways that produce unexpected results?

To clarify, the proposal is not (necessarily) to use an LLM to create an interpretable AI system that is isomorphic to the LLM -- the two could have completely different internal structures. The key points are that the generated program is interpretable and trustworthy, and that it can solve some problem we are interested in.