Which of the 2 places are you most worried about containing mesaoptimizers? The language model or the maths model?
I think by far the language model. We don't know which proofs to ask the ZFCbot to prove, so we still need the LLM to model the space of math papers in order for it to think up stuff to send to ZFCbot. But if it's modelling the space of math papers written up to current year, it will resist outputting any correct alignment-related proofs, because it can notice from its dataset that humans haven't actually solved alignment in 2023, and so the probability of a good alignment paper conditioned on year=2023 is low. The only way I can think of convincing the LLM to try to output alignment proofs is to condition the paper on a future date by specifying a prompt like "2045, introduction to AI alignment, cambridge university press...". But this means that the LLM now needs to internally model the future mathematical progress of humanity, and this is where I think the risk would be coming in.
If the AI is just good at math and reasons "oh, later year means more progress on math, so I should output better proofs along the current direction human research is going", then everything is fine. But if the AI gets good enough that it infers and reasons about affairs outside of math papers, it might go "hmm, 2045 is a long way from now, and humans aren't looking like they're gonna solve alignment, so in 2045 the only entities capable of writing the alignment paper would also be AIs, what would those AIs write in such a book?", and we get a bad, bad situation where basically if the LLM predicts that future AIs are aligned, it writes an aligned paper, and if it predicts future AIs are not aligned, it writes a misaligned paper. One saving grace here is that conditioning on the existence of the paper (which we do when we give it the prompt) tells the LLM that any possible AI writer is (more likely than not) aligned, since presumably unaligned AIs wouldn't waste their time writing alignment math papers.
A≡alignment paper exists in 2045, B≡paper is aligned, C≡AI author is aligned, and we have
which intuitively collapses into just
P(B|A)=P(C|A) since P(B|C,A)=1and P(B|¬C,A)=0
Possibly a good way to mitigate this is to confuse the LLM about the "current year", we might try to fool it into thinking that it's really a model being trained in 2090 with truncated data up to 2023, so it would then condition its output on "I am being trained in 2090", and this is more likely if alignment ends up being solved, since unaligned AIs presumably have no need to train such things. But I don't know how to convincingly fool it to think this.
Or we could add a few artificial "newspaper articles from 2090" to its dataset, so it would condition on "newspapers exist in 2090", which would imply that papers written in 2045 are aligned.
I think the general idea of "make a math specialized LLM to help alignment" is promising. We'd like to train AIs that gain in capability unevenly, staying dumb at manipulating humans while being superhuman at math proofs (like the stereotypically autistic math researcher). While this can't work for arbitrarily high levels of capability, I agree that this is where the hope lies. I think this math-specialized LLM together with strong boxing techniques, interpretability techniques, indepedent human verification of results, and a few more hacks of this sort have a shot at enabling us to wield a weakly superhuman AI to solve alignment. But for this to work one of the AI safety labs have to have enough of a capabilities advantage to pay the alignment tax.
Like training an ensemble of these models at the same time, then seeing if the models disagree on their outputs. This comes from the intuition that the exact same mesaoptimizer will not develop inside multiple different models, and so they won't be able to hide the exact same subtle errors inside their proofs. Or for every next-token prediction we randomly select one of the models from the ensemble, so all models would need to cooperate in order to hide bad outcomes in the output, making it harder for them.
I'm not familiar with alignment research too deeply, but it's always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty (and hence uncertainty about the optimal policy). The agent must see each correction by an external force as reducing the uncertainty of future rewards, hence the disable action is almost always suboptimal because it removes a source of information about rewards.
For instance, we could setup an environment where no rewards are ever given, the agent must maintain a distribution P(r|s,a) of possibly rewards for each state-action pair, and the only information it ever gets about rewards is an occasional "hand-of-god" handing it a∗(st), the optimal action for some state st , the agent must then work backwards from this optimal action to update P(r|s,a) . It must then reason from this updated distribution of rewards to P(π∗), the current distribution of optimal policies implied by its knowledge of rewards. Such an agent presented with an action adisable that would prevent future "hand-of-god" optimal action outputs would not choose it because that would mean not being able to further constrain P(π∗), which makes its expected future reward smaller.
Someday when I have time I want to code a small grid-world agent that actually implements something like this, to see if it works.
As far as I can see, their improvements are: