Comments

Tao Lin · 6mo · 31

Leela Zero uses MCTS; it doesn't play superhumanly in one forward pass (as GPT-4 can in some subdomains) (I think; I didn't find any evaluations of Leela Zero at one forward pass), and I'd guess the network itself doesn't contain any more generalized game-playing circuitry than an LLM does - it just has good intuitions for Go.

 

Nit:

"Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute."

1.5 to 2 OOMs? 7B to 70B is 1 OOM of compute; adding in Chinchilla efficiency would make it more like 1.5 OOMs of effective compute, not 2. And Llama 70B to GPT-4 is 1 OOM of effective compute according to OpenAI's naming - Llama 70B is about as good as GPT-3.5. I'd personally guess GPT-4 is 1.5 OOMs of effective compute above Llama 70B, not 2.
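As a rough sketch of that arithmetic (my numbers, not the post's): assuming Llama 2 7B and 70B were both trained on roughly 2T tokens and using the standard ~6 × params × tokens FLOPs approximation, the raw compute gap is about 1 OOM, and the "effective compute" figures are guessed offsets layered on top of that.

```python
import math

# Rough OOM arithmetic behind the nit (a sketch, assuming Llama 2 7B and
# 70B were both trained on ~2T tokens and compute ~ 6 * params * tokens).
def training_flops(params, tokens):
    return 6 * params * tokens

flops_7b = training_flops(7e9, 2e12)
flops_70b = training_flops(70e9, 2e12)

# Raw compute gap: 10x params at the same token count -> 1 OOM.
raw_ooms = math.log10(flops_70b / flops_7b)
print(f"7B -> 70B raw compute gap: {raw_ooms:.1f} OOMs")

# The adjustments below are judgment calls from the comment, not derivations:
# ~+0.5 OOM of "effective compute" for Chinchilla-style efficiency,
# and ~1.5 OOMs of effective compute from Llama 70B to GPT-4.
effective_7b_to_70b = raw_ooms + 0.5
effective_70b_to_gpt4 = 1.5
print(f"7B -> 70B effective (guess): ~{effective_7b_to_70b:.1f} OOMs")
print(f"70B -> GPT-4 effective (guess): ~{effective_70b_to_gpt4:.1f} OOMs")
```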

Tao Lin · 10mo · 20

If automating alignment research worked but took 1000+ tokens per researcher-second, it would be much more difficult to develop, because you'd need to run the system for 50k-1M tokens between each reward signal (or similar feedback). Once it's 10 or fewer tokens per researcher-second, it'll be easy to develop and improve quickly.
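To make the implied arithmetic explicit (the 50-1000 researcher-seconds between useful reward signals is my back-derived assumption, chosen to reproduce the 50k-1M token figure above):

```python
# A small sketch of the feedback-loop arithmetic. The 50-1000
# researcher-seconds between reward signals is an assumed range.
def tokens_between_feedback(tokens_per_researcher_second, seconds_between_feedback):
    return tokens_per_researcher_second * seconds_between_feedback

for rate in (1000, 10):  # tokens per researcher-second
    lo = tokens_between_feedback(rate, 50)
    hi = tokens_between_feedback(rate, 1000)
    print(f"{rate} tokens/researcher-second -> {lo:,} to {hi:,} tokens between reward signals")
```

At 1000+ tokens per researcher-second you're waiting 50k-1M tokens for each bit of feedback; at 10 tokens per researcher-second it drops to 500-10k, which is a far tighter iteration loop.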

Tao Lin · 1y · 4 · -1

I'm not against evaluating models in ways where they're worse than rocks; I just think you shouldn't expect anyone else to care about your worse-than-rocks numbers without very extensive justification (including adversarial stuff).

Externalized reasoning models suffer from a "legibility penalty" - many decisions are easier to make than to justify or explain. I think this is a significant barrier to the competitiveness of authentic train of thought, although not in particularly legible domains such as math proofs and programming (illegible knowledge goes into math proofs, but you trust the result regardless, so it's fine).

Another problem is that standard training procedures only incentivize the model to use reasoning steps that a single human would produce. This means, for instance, that if you ask a question involving two very different domains of knowledge, a good language model wouldn't expose its knowledge of both of them, because that's out of distribution for its training data. This might show up in an obvious way, as if multiple humans had collaborated on the train of thought, or in a way that's harder to interpret. If you just want to expose this knowledge, though, you could train on amplified human reasoning (i.e., from human teams).

Also, if you ever train the model on the correctness of its conclusions, you incentivize semantic drift between its reasoning and human language - the model would prefer to pack more information per token than humans do, and might want to express concepts humans don't normally state (one type being fuzzy correlations, which models know a lot of). Even if you penalize the KL divergence between the reasoning and human language, this doesn't necessarily incentivize authentic human-like reasoning, just its appearance.
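For concreteness, here is a minimal sketch of what such a KL penalty could look like, in the style of the standard RLHF per-token KL term against a frozen reference model; the function and parameter names are illustrative, not from any particular codebase. Note that it only constrains the token distribution of the reasoning, which is exactly why it rewards human-looking text rather than human-like reasoning.

```python
import torch
import torch.nn.functional as F

# A sketch (not from the post): outcome reward for a correct conclusion,
# minus a per-token KL penalty between the reasoning policy and a frozen
# "human language" reference model. This constrains the distribution over
# tokens, i.e. the appearance of the reasoning, not its content.
def shaped_reward(policy_logits, reference_logits, chosen_tokens,
                  conclusion_correct, kl_coef=0.1):
    # Per-token log-probs under the trained policy and the frozen reference.
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(reference_logits, dim=-1)
    idx = chosen_tokens.unsqueeze(-1)
    # Sampled per-token KL estimate: log pi(token) - log pi_ref(token).
    token_kl = (policy_logp.gather(-1, idx) - ref_logp.gather(-1, idx)).squeeze(-1)

    # Reward = conclusion correctness minus KL penalty over the chain of thought.
    return float(conclusion_correct) - kl_coef * token_kl.sum().item()

# Toy usage with random logits standing in for real model outputs.
T, V = 8, 50  # sequence length, vocab size
policy_logits = torch.randn(T, V)
reference_logits = torch.randn(T, V)
chosen = torch.randint(0, V, (T,))
print(shaped_reward(policy_logits, reference_logits, chosen, conclusion_correct=True))
```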

In general, I'm unsure whether authentic train of thought is better than just having the model imitate specific, concrete humans through ordinary language modelling - if you start a text as if written by a known smart, truthful person, you get out an honest prediction of what that person believes.