All of Razied's Comments + Replies

Which of the 2 places are you most worried about containing mesaoptimizers? The language model or the maths model?

I think by far the language model. We don't know which proofs to ask ZFCbot for, so we still need the LLM to model the space of math papers in order to think up stuff to send to ZFCbot. But if it's modelling the space of math papers written up to the current year, it will resist outputting any correct alignment-related proofs, because it can notice from its dataset that humans haven't actually solved alignment in 2023, and so the pr... (read more)

2Donald Hobson1y
An AI with a 50% chance of outputting an alignment paper in response to a prompt asking for one can, at worst, lose 1 bit of predictive power for every time such a prompt appears in the training distribution and isn't followed by a solution. If it really were generalizing well from the training dataset, it might realize that anything claiming to be from the future is fiction. After all, the AI never saw anything from beyond 2023 (or whenever it was trained) in its training dataset. If the AI has this highly sophisticated world model, it will know those fake newspaper articles were fake. Given the amount of fiction set in the future, adding a little more won't do anything. So these scenarios are asking the LLM to develop extremely sophisticated long-term world models and models of future ASI, which are used predictively exactly nowhere in the training dataset and might at best reduce error by 1 bit in obscure circumstances. So actually, this is about generalization out of the training distribution.

The direction I was thinking in is that ChatGPT and similar models seem to consist of a huge number of simple rules of thumb that usually work. I was thinking of an artifact that is highly optimized, but not particularly optimizing: a vast collection of shallow rules for translating text to maths queries.

I was also thinking of asking for known chunks of the problem: asking it to do work on tasky AI, logical counterfactuals, and transparency tools. Each individual paper is something MIRI could produce in a year, but you are producing several an hour.
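Hobson's "at worst lose 1 bit" figure follows from a one-line cross-entropy calculation. A minimal sketch (the function name and the 0.1 example are mine, for illustration):

```python
import math

# Extra cross-entropy paid by a model that puts probability p on "a correct
# alignment paper follows this prompt" when the training data never actually
# contains one: the true continuation is always "no solution", so the model's
# log-loss rises by -log2(1 - p) per occurrence of such a prompt.
def extra_bits(p_solution: float) -> float:
    return -math.log2(1.0 - p_solution)

print(extra_bits(0.5))  # 1.0 bit, the "at worst lose 1 bit" figure for a 50% chance
print(extra_bits(0.1))  # ~0.152 bits if the model hedges toward "no solution"
```

The point being that the penalty for this generalization failure is tiny, and only paid on prompts of this rare kind.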

I think the general idea of "make a math-specialized LLM to help alignment" is promising. We'd like to train AIs that gain in capability unevenly, staying dumb at manipulating humans while being superhuman at math proofs (like the stereotypically autistic math researcher). While this can't work for arbitrarily high levels of capability, I agree that this is where the hope lies. I think this math-specialized LLM together with strong boxing techniques, interpretability techniques, independent human verification of results, and a few more hacks[1] of this... (read more)

2Donald Hobson1y
If we are going to do the "random model chooses each token" trick, first use different quantum-random starting weights for each network, give each network a slightly different selection of layer sizes and training data, and sandbox the networks from each other.

Which of the 2 places are you most worried about containing mesaoptimizers? The language model or the maths model?

If you are getting proofs out of your system, you want a formal proof as well as a human-legible proof. (And get a human to read the formal theorem being proved, if not the proof.)
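The "random model chooses each token" trick, as I read it, can be sketched in a few lines: several independently trained models, and at each decoding step one of them, picked at random, emits the next token. The models here are trivial stand-ins (my own stubs, not real LMs):

```python
import random

def toy_model(style: str):
    # Stand-in for a real LM: returns a deterministic next "token"
    # tagged with this model's identity and the current position.
    def next_token(prefix):
        return f"{style}{len(prefix)}"
    return next_token

# In reality: different random seeds, layer sizes, and training data per model.
models = [toy_model(s) for s in ("a", "b", "c")]

def generate(n_tokens: int, rng: random.Random):
    out = []
    for _ in range(n_tokens):
        model = rng.choice(models)  # a different model may author each token
        out.append(model(out))
    return out

print(generate(5, random.Random(0)))
```

The hoped-for property is that a single misaligned ensemble member never controls more than scattered tokens of any output.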

I'm not familiar with alignment research too deeply, but it has always been fairly intuitive to me that corrigibility can only make any kind of sense under reward uncertainty (and hence uncertainty about the optimal policy). The agent must see each correction by an external force as reducing its uncertainty about future rewards, so the "disable correction" action is almost always suboptimal: it removes a source of information about rewards.

For instance, we could set up an environment where no rewards are ever given, the agent must maintain a distribution ... (read more)
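The intuition above can be made concrete with a deliberately tiny toy (my own construction, not a standard benchmark): an agent with a 50/50 prior over which of two actions the overseer actually rewards does better by staying correctable than by acting blindly.

```python
# Two candidate actions; exactly one yields reward 1, the other 0,
# and the agent's prior over which is which is uniform.
p_left = 0.5

# Acting immediately (e.g. after disabling the overseer): best blind guess.
ev_act_now = max(p_left, 1 - p_left) * 1.0

# Waiting for one correction first: the correction identifies the rewarded
# action, after which the agent acts optimally.
ev_after_correction = 1.0

print(ev_act_now, ev_after_correction)  # 0.5 < 1.0
```

Under this framing the disable action throws away the value of information the correction carries, which is the same observation CIRL builds on.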

1Koen Holtman2y
If you do not know it already, this intuition lies at the heart of CIRL. So before you jump to coding, my recommendation is to read that paper first. You can find lots of discussion on this forum and elsewhere on why CIRL is not a perfect corrigibility solution. If I recall correctly, the paper itself also points out the limitation I feel is most fundamental: if uncertainty is reduced based on further learning, CIRL-based corrigibility is also reduced. There are many approaches to corrigibility that do not rely on the concept of reward uncertainty, e.g. counterfactual planning and Armstrong's indifference methods.
1Logan Riggs Smith2y
The agent could then manipulate whoever's in charge of giving the "hand-of-god" optimal action. I do think the "reducing uncertainty" framing captures something relevant, and TurnTrout's outside-view post (huh, guess I can't make links on mobile, so here: ) grounds out uncertainty as "how wrong am I about the true reward of the many different people I could be helping out?"
  1. Different networks for each game
  2. They train for 220k steps for each agent and mention that 100k steps take 7 hours on 4 GPUs (no mention of which GPUs, but maybe RTX 3090 would be a good guess?)
  3. They don't mention it
  4. They are explicitly motivated by robotics control, so yes, they expect this to help in that direction. I think the main problem is that robotics requires more complicated reward shaping to obtain the desired behaviour. In Atari the reward is already computed for you and you just need to maximise it; when designing a robot to put dishes in a dishwash
... (read more)
3Daniel Kokotajlo2y
Holy cow, am I reading that right? RTX3090 costs, like, $2000. So they were able to train this whole thing for about one day's worth of effort using equipment that cost less than $10K in total? That means there's loads of room to scale this up... It means that they could (say) train a version of this architecture with 1000x more parameters and 100x more training data for about $10M and 100 days. Right?
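Daniel's back-of-envelope can be checked numerically (all figures rough; real cost does not scale linearly in parameters times data, and a $10M cluster prices differently than 4 consumer cards). A naive linear extrapolation from the numbers quoted above lands within a factor of two of the 100-day estimate:

```python
# Base run: 220k steps, quoted 7 hours per 100k steps on 4 GPUs.
base_gpu_hours = (220_000 / 100_000) * 7 * 4   # ~61.6 GPU-hours total

# Proposed scale-up: 1000x parameters and 100x training data,
# naively assumed to multiply compute linearly in each factor.
scale = 1000 * 100
scaled_gpu_hours = base_gpu_hours * scale       # ~6.16 million GPU-hours

gpus = 10_000_000 // 2_000                      # $10M of ~$2k GPUs -> 5000 cards
days = scaled_gpu_hours / gpus / 24
print(round(base_gpu_hours, 1), f"{scaled_gpu_hours:.2e}", round(days))  # 61.6 6.16e+06 51
```

So roughly 51 days under the naive assumptions: the same order of magnitude as the quoted "100 days", which is all a Fermi estimate like this can claim.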
I wonder how they prevent the latent state representation of observations from collapsing into a zero-vector, thus becoming completely uninformative and trivially predictable. And if this was the reason MuZero did things its way.
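The collapse worry can be shown in miniature (my own toy construction): if the only training signal is "predict the next latent", an encoder that maps every observation to the zero vector attains zero loss while retaining no information, which may be why MuZero grounds its latents in reward, value, and policy targets instead.

```python
# Degenerate encoder: ignores the observation entirely.
encode = lambda obs: [0.0] * 4
# Identity "dynamics model": predicts the next latent from the current one.
predict = lambda z: z

# Any stream of observations whatsoever...
observations = [[float(i + j) for j in range(8)] for i in range(10)]
latents = [encode(o) for o in observations]

# ...gives exactly zero latent-prediction loss under the collapsed encoder.
loss = sum(
    sum((p - t) ** 2 for p, t in zip(predict(z0), z1))
    for z0, z1 in zip(latents, latents[1:])
)
print(loss)  # 0.0: perfect prediction, zero information
```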