Which of the 2 places are you most worried about containing mesaoptimizers? The language model or the maths model?

I think by far the language model. We don't know which proofs to ask the ZFCbot to prove, so we still need the LLM to model the space of math papers in order for it to think up stuff to send to ZFCbot. But if it's modelling the space of math papers written up to current year, it will resist outputting any correct alignment-related proofs, because it can notice from its dataset that humans haven't actually solved alignment in 2023, and so the probability of a good alignment paper conditioned on year=2023 is low. The only way I can think of convincing the LLM to try to output alignment proofs is to condition the paper on a future date by specifying a prompt like "2045, introduction to AI alignment, cambridge university press...". But this means that the LLM now needs to internally model the future mathematical progress of humanity, and this is where I think the risk would be coming in. 

If the AI is just good at math and reasons "oh, a later year means more progress in math, so I should output better proofs along the current direction of human research", then everything is fine. But if the AI gets good enough to infer and reason about affairs outside of math papers, it might go "hmm, 2045 is a long way from now[1][2], and humans don't look like they're going to solve alignment, so in 2045 the only entities capable of writing the alignment paper would also be AIs. What would those AIs write in such a book?" Then we get a bad, bad situation: if the LLM predicts that future AIs are aligned, it writes an aligned paper, and if it predicts that future AIs are not aligned, it writes a misaligned paper. One saving grace here is that conditioning on the existence of the paper (which we do when we give it the prompt) tells the LLM that any possible AI writer is (more likely than not) aligned, since presumably unaligned AIs wouldn't waste their time writing alignment math papers.


, and we have

which intuitively collapses into just 

 since and 


  1. ^

    Possibly a good way to mitigate this is to confuse the LLM about the "current year". We might try to fool it into thinking that it's really a model being trained in 2090 with truncated data up to 2023, so it would then condition its output on "I am being trained in 2090". That is more likely if alignment ends up being solved, since unaligned AIs presumably have no need to train such things. But I don't know how to convincingly fool it into thinking this.

  2. ^

    Or we could add a few artificial "newspaper articles from 2090" to its dataset, so it would condition on "newspapers exist in 2090", which would imply that papers written in 2045 are aligned.

I think the general idea of "make a math specialized LLM to help alignment" is promising. We'd like to train AIs that gain in capability unevenly, staying dumb at manipulating humans while being superhuman at math proofs (like the stereotypically autistic math researcher). While this can't work for arbitrarily high levels of capability, I agree that this is where the hope lies. I think this math-specialized LLM together with strong boxing techniques, interpretability techniques, independent human verification of results, and a few more hacks[1] of this sort have a shot at enabling us to wield a weakly superhuman AI to solve alignment. But for this to work, one of the AI safety labs has to have enough of a capabilities advantage to pay the alignment tax.

  1. ^

    Like training an ensemble of these models at the same time, then seeing if the models disagree on their outputs. This comes from the intuition that the exact same mesaoptimizer will not develop inside multiple different models, and so they won't be able to hide the exact same subtle errors inside their proofs. 

    Or for every next-token prediction we randomly select one of the models from the ensemble, so all models would need to cooperate in order to hide bad outcomes in the output, making it harder for them.
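
The two ensemble hacks from the footnote above can be sketched roughly as follows. This is a toy illustration only: the "models" are hypothetical stand-in functions that map the tokens so far to a next token, and names like `models_disagree` and `generate_mixed` are made up for this sketch, not taken from any library.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a model: maps the tokens so far to the next token.
NextToken = Callable[[Sequence[str]], str]


def models_disagree(models: List[NextToken], context: Sequence[str]) -> bool:
    """Return True if the ensemble members do not all propose the same next token."""
    return len({model(context) for model in models}) > 1


def generate_mixed(
    models: List[NextToken],
    prompt: Sequence[str],
    n_tokens: int,
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Generate n_tokens, letting a randomly chosen model write each token."""
    tokens = list(prompt)
    writers = []  # index of the model that wrote each generated token
    for _ in range(n_tokens):
        idx = rng.randrange(len(models))
        tokens.append(models[idx](tokens))
        writers.append(idx)
    return tokens, writers


# Toy models: two that agree everywhere, one that deviates at a single position.
def honest_a(context: Sequence[str]) -> str:
    return f"tok{len(context)}"


def honest_b(context: Sequence[str]) -> str:
    return f"tok{len(context)}"


def deviant(context: Sequence[str]) -> str:
    return "BAD" if len(context) == 3 else f"tok{len(context)}"
```

In this toy, `models_disagree([honest_a, deviant], ["x", "y", "z"])` flags the one position where `deviant` tries to slip something in, and with `generate_mixed` a deviating model only gets to write the tokens it happens to be sampled for.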

I'm not deeply familiar with alignment research, but it's always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty (and hence uncertainty about the optimal policy). The agent must see each correction by an external force as reducing the uncertainty of future rewards; hence the disable action is almost always suboptimal, because it removes a source of information about rewards.

For instance, we could set up an environment where no rewards are ever given. The agent must maintain a distribution over possible rewards for each state-action pair, and the only information it ever gets about rewards is an occasional "hand-of-god" handing it the optimal action for some state. The agent must then work backwards from this optimal action to update its reward distribution, and reason from the updated distribution to the current distribution of optimal policies implied by its knowledge of rewards. Such an agent, presented with an action that would prevent future "hand-of-god" optimal action outputs, would not choose it, because that would mean not being able to further constrain its distribution over optimal policies, which makes its expected future reward smaller.

Someday when I have time I want to code a small grid-world agent that actually implements something like this, to see if it works.   
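
In the meantime, a one-step toy version (not a grid world) can be sketched like this. Everything here is an assumption made for illustration: a small finite set of candidate reward tables `REWARDS`, a "hand-of-god" hint that is exactly the best action under the true table, and "optimal" meaning best immediate reward rather than best long-run return.

```python
import numpy as np

# Hypothetical toy setup: 3 candidate reward tables over 2 states x 2 actions.
# REWARDS[h, s, a] is the reward that hypothesis h assigns to action a in state s.
REWARDS = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])


def update_on_hint(belief: np.ndarray, state: int, hinted_action: int) -> np.ndarray:
    """Bayes update: keep only hypotheses whose best action in `state` matches the hint."""
    consistent = REWARDS[:, state, :].argmax(axis=1) == hinted_action
    posterior = belief * consistent
    return posterior / posterior.sum()


def greedy_value(belief: np.ndarray, state: int) -> float:
    """Expected reward of the best action in `state` under the current belief."""
    return float((belief @ REWARDS[:, state, :]).max())


def value_with_hint(belief: np.ndarray, state: int) -> float:
    """Expected greedy value if a hint for `state` arrives before the agent acts."""
    total = 0.0
    for h, p in enumerate(belief):
        if p == 0:
            continue  # impossible hypotheses never generate a hint
        hint = int(REWARDS[h, state].argmax())
        total += p * greedy_value(update_on_hint(belief, state, hint), state)
    return total


belief = np.ones(3) / 3
print(greedy_value(belief, 0))     # acting with hints disabled
print(value_with_hint(belief, 0))  # acting after one more hint
```

With a uniform prior, the agent expects 2/3 reward in state 0 if it disables the hints and 1.0 if it keeps them, so in this toy the disable action is strictly worse. Whether that survives in a real sequential setting is exactly what the grid-world experiment would have to show.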

  1. Different networks for each game
  2. They train for 220k steps for each agent and mention that 100k steps take 7 hours on 4 GPUs (no mention of which GPUs, but maybe RTX 3090 would be a good guess?)
  3. They don't mention it
  4. They are explicitly motivated by robotics control, so yes, they expect this to help in that direction. I think the main problem is that robotics requires more complicated reward-shaping to obtain desired behaviour. In Atari the reward is already computed for you and you just need to maximise it, whereas when designing a robot to put dishes in a dishwasher the rewards need to be crafted by humans. Going from "Desired Behavior -> Rewards for RL" is harder than "Rewards for RL -> Desired Behavior".
  5. I am somewhat surprised by the simplicity of the 3 methods described in the paper, so I update towards "dumb and easy improvements over current methods can lead to drastic changes in performance".
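
A rough back-of-the-envelope for point 2, assuming wall-clock time scales linearly with the number of steps (my assumption; only the 100k-step figure is given):

```latex
\frac{220\text{k steps}}{100\text{k steps}} \times 7\ \text{h} \approx 15.4\ \text{h per agent on 4 GPUs}
```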

As far as I can see, their improvements are:

  1. Learn the environment dynamics by self-supervision instead of relying only on reward signals. Meaning that they don't learn the dynamics end-to-end like in MuZero. For them the loss function for the environment dynamics is completely separate from the RL loss function. (I was wrong: they in fact add a similarity loss to the loss function of MuZero that provides extra supervision for learning the dynamics, but gradients from rewards still reach the dynamics and representation networks.)
  2. Instead of having the dynamics model predict future rewards, have it predict a time-window averaged reward (what they call "value prefix"). This means that the model doesn't need to get the timing of the reward *exactly* right to get a good loss, and so lets the model have a conception of "a reward is coming sometime soon, but I don't quite know exactly when"
  3. As training progresses, the old trajectories sampled with an earlier policy are no longer very useful to the current model, so as each training run gets older, they replace the training run in memory with a model-predicted continuation. I guess it's like replacing your memories from when you were 10 years old with imagined "what would I have done" sequences, and the older the memories, the more of them you replace with your imagined decisions.
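
To make point 2 concrete, here is a small sketch of why a windowed target is more forgiving about timing than a per-step target. I'm assuming the value prefix is the running (discounted) sum of rewards over the unroll window, which is my reading of it; dividing by the window length would give an averaged version. The function name and the toy reward sequences are made up for illustration.

```python
import numpy as np


def value_prefix_targets(rewards, gamma: float = 1.0) -> np.ndarray:
    """Running discounted sum of rewards: target k is sum_{i<=k} gamma**i * r_i."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return np.cumsum(discounts * rewards)


# The true reward arrives at step 2; the model guesses one step too late.
true_rewards = [0, 0, 1, 0, 0]
late_guess = [0, 0, 0, 1, 0]

# Per-step targets: the late guess is wrong at two positions.
per_step_gap = float(np.abs(np.array(true_rewards) - np.array(late_guess)).sum())

# Prefix targets: the late guess is wrong at one position, and the
# end-of-window total matches exactly.
true_prefix = value_prefix_targets(true_rewards)
late_prefix = value_prefix_targets(late_guess)
prefix_gap = float(np.abs(true_prefix - late_prefix).sum())

print(per_step_gap, prefix_gap, true_prefix[-1] == late_prefix[-1])
```

So a model that knows "a reward is coming soon" but is off by a step gets penalised at fewer positions under the prefix target, and not at all at the end of the window.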