Suppose you have a ML model trained to output formal proofs. Maybe you start with ZFC and then add extra tokens for a range of common concepts. (along with definitions. ). So a human mathematician needs to type in the definition of a gradient in terms of limits, and the definition of limits in terms of epsilon and delta, and the definition of the real numbers in terms of dedekind cuts. All the way back to ZFC. The human needn't type any proofs, just the definitions.

The model could be trained by generating random syntactically correct strings of tokens, and trying to prove or disprove them. (Remember, we have added the notion of a gradient to the token pool, plenty of the random questions will involve gradients) Hopefully it forms intermediate theorems and heuristics useful towards proving a wide class of theorems.

Computer programs can be described as mathematical objects. So the human adds some tokens for lisp programs, and a few definitions about how they behave to the token pool.

"Will program X do Y?" is now a perfectly reasonable question to ask this model.

This is where the magic happens. You give your system a simple toy problem, and ask for short programs that solve the toy problem, and about which many short theorems can be proved. Maybe you do gradient descent on some abstract latent space of mathematical objects. Maybe an inefficient evolutionary algorithm selecting both over the space of programs and the theorems about them. Maybe "replace the last few layers, and fine tune the model to do a new task", like RLHF in ChatGPT.

Now I don't expect this to just work first time. You will want to add conditions like "ignore theorems that are true of trivial programs (eg the identity program)" and perhaps "ignore theorems that only take a few lines to prove" or "ignore theorems so obvious that a copy of you with only 10% the parameters can prove it". For the last one, I am thinking of the programmers actually training a mini version with 10% the parameters, and running some gradients through it. I am not thinking of the AI reasoning about code that is a copy of itself.

The AI model should have a latent space. This can let the programmers say "select programs that are similar to this one" or "choose a program about which theorems close to this theorem in latent space can be proved".

The idea of this is that

Asking questions should be safe. There are a bunch of different things we can optimize, and it should be safe to adjust parameters until it is proving useful results not trivialities. The AI doesn't have much information about human psychology, or about quantum physics or the architecture of the processor it's running on. Gradient descent has been pushing it to be good at answering certain sorts of question. There is little to no advantage to being good at predicting the questions or figuring out what they imply about the people asking them.

With a bit of fiddling, such a design can spit out interesting designs of AI, and theorems about the designs. This isn't a foolproof solution to alignment, but hopefully such help makes the problem a lot easier.

It is ABSOLUTELY NOT SAFE to throw large amounts of compute at the programs that result. Don't have anything capable of running them installed. The programs and the theorems should be read by humans, in the hope that they are genius insights into the nature of AI. The textbook from the future. Humans can then use the insights to do... something.

Suppose you have a ML model trained to output formal proofs. Maybe you start with ZFC and then add extra tokens for a range of common concepts. (along with definitions. ). So a human mathematician needs to type in the definition of a gradient in terms of limits, and the definition of limits in terms of epsilon and delta, and the definition of the real numbers in terms of dedekind cuts. All the way back to ZFC. The human needn't type any proofs, just the definitions.

The model could be trained by generating random syntactically correct strings of tokens, and trying to prove or disprove them. (Remember, we have added the notion of a gradient to the token pool, plenty of the random questions will involve gradients) Hopefully it forms intermediate theorems and heuristics useful towards proving a wide class of theorems.

Computer programs can be described as mathematical objects. So the human adds some tokens for lisp programs, and a few definitions about how they behave to the token pool.

"Will program X do Y?" is now a perfectly reasonable question to ask this model.

This is where the magic happens. You give your system a simple toy problem, and ask for short programs that solve the toy problem, and about which many short theorems can be proved. Maybe you do gradient descent on some abstract latent space of mathematical objects. Maybe an inefficient evolutionary algorithm selecting both over the space of programs and the theorems about them. Maybe "replace the last few layers, and fine tune the model to do a new task", like RLHF in ChatGPT.

Now I don't expect this to just work first time. You will want to add conditions like "ignore theorems that are true of trivial programs (eg the identity program)" and perhaps "ignore theorems that only take a few lines to prove" or "ignore theorems so obvious that a copy of you with only 10% the parameters can prove it". For the last one, I am thinking of the programmers actually training a mini version with 10% the parameters, and running some gradients through it. I am not thinking of the AI reasoning about code that is a copy of itself.

The AI model should have a latent space. This can let the programmers say "select programs that are similar to this one" or "choose a program about which theorems close to this theorem in latent space can be proved".

The idea of this is that