Leela Zero uses MCTS, it doesnt play superhuman in one forward pass (like gpt-4 can do in some subdomains) (i think, didnt find any evaluations of Leela Zero at 1 forward pass), and i'd guess that the network itself doesnt contain any more generalized game playing circuitry than an llm, it just has good intuitions for Go.
Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute.
1.5 to 2 OOMs? 7b to 70b is 1 OOM of compute, adding in chinchilla efficiency would make it like 1.5 OOMs of effective... (read more)
If automating alignment research worked, but took 1000+ tokens per researcher-second, it would be much more difficult to develop, because you'd need to run the system for 50k-1M tokens between each "reward signal or such". Once it's 10 or less tokens per researcher second, it'll be easy to develop and improve quickly.
I'm not against evaluating models in ways where they're worse than rocks, I just think you shouldn't expect anyone else to care about your worse-than-rock numbers without very extensive justification (including adversarial stuff)
Externalized reasoning models suffer from the "legibility penalty" - the fact that many decisions are easier to make than to justify or explain. I think this is a significant barrier for authentic train of thought competitiveness, although not for particularly legible domains, such as math proofs and programming (Illegible knowledge goes into math proofs, but you trust the result regardless so it's fine).
Another problem is that standard training procedures only incentivize the model to use reasoning steps produced by a single human. This means, for instanc... (read more)