The Simple Alignment Solution Assumption states that many problems in alignment, for example corrigibility or value learning, have simple solutions. I mean "simple" in the sense that logical induction is a simple solution. That expected utility maximization is. That causal models are. This is the sort of thing we are searching for.

Under the Simple Alignment Solution Assumption, the solution is.

Unique or unique-ish. Pinned down by theorems the way that probability theory and expected utility maximization are.

Fundamentally simple, in the right ontology. If you are writing a python program, logical induction is a bit of a pain. But if you can say "all polynomial time Turing machines" with ease, then logical induction is simple.

Recognizable with a few expert-days. These ideas are fairly know-it-when-you-see-it, a lot easier to recognize than to produce.

Mathematical structures. Formal proofs, probability theory and logical induction each have their own mathematical structure.

So, how can LLM's come up with new things in this class?

Training via self play?

Lets say a theorem is Useful if it is often used in the process of proving/disproving random ZFC statements. Such a criterion can be measured by generating random statements to prove/disprove, putting the theorem to be measured in the assumptions, and seeing if it is used.

I would expect that associativity is pretty useful. I would expect 1=2 to be incredibly useful. Whatever the random statement, you can prove it by contradiction.

So the plan would be to make the AI produce simple mathematical structures, about which lots of short and useful theorems exist.

Humans would also add some desiderata.

This is very much something you can and will fiddle with. If the AI is an LLM trained only on self play, then even if it's very smart, the risk of human hacking is minimal. You can try one desideratum, and if it doesn't work, try another. The AI acting as a kind of search engine over the space of interesting mathematics.

Risks.

This is the sort of thing that can give you code for an unfriendly AI. AIXI-tl is simple in this sense. And if more efficient but also simple AI algorithms exist, this will likely find them. So you need to give the output of this program to humans that won't rush to implement the first piece of code they see.

Current LLM based AI systems are getting pretty good at maths by writing formal proofs in Lean or similar. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

So, how can we use these models to help align AI?

The Simple Alignment Solution Assumption states that many problems in alignment, for example corrigibility or value learning, have simple solutions. I mean "simple" in the sense that logical induction is a simple solution. That expected utility maximization is. That causal models are. This is the sort of thing we are searching for.

Under the Simple Alignment Solution Assumption, the solution is.

https://www.lesswrong.com/posts/hzuSDMx7pd2uxFc5w/causal-diagrams-and-causal-models

So, how can LLM's come up with new things in this class?

Training via self play?

Lets say a theorem is

Usefulif it is often used in the process of proving/disproving random ZFC statements. Such a criterion can be measured by generating random statements to prove/disprove, putting the theorem to be measured in the assumptions, and seeing if it is used.I would expect that associativity is pretty useful. I would expect 1=2 to be incredibly useful. Whatever the random statement, you can prove it by contradiction.

So the plan would be to make the AI produce simple mathematical structures, about which lots of short and useful theorems exist.

Humans would also add some desiderata.

This is very much something you can and will fiddle with. If the AI is an LLM trained only on self play, then even if it's very smart, the risk of human hacking is minimal. You can try one desideratum, and if it doesn't work, try another. The AI acting as a kind of search engine over the space of interesting mathematics.

Risks.

This is the sort of thing that can give you code for an unfriendly AI. AIXI-tl is simple in this sense. And if more efficient but also simple AI algorithms exist, this will likely find them. So you need to give the output of this program to humans that won't rush to implement the first piece of code they see.