Tao Lin

Wiki Contributions

Comments

I'm not against evaluating models in ways where they're worse than rocks, I just think you shouldn't expect anyone else to care about your worse-than-rock numbers without very extensive justification (including adversarial stuff)

Externalized reasoning models suffer from the "legibility penalty" - the fact that many decisions are easier to make than to justify or explain. I think this is a significant barrier for authentic train of thought competitiveness, although not for particularly legible domains, such as math proofs and programming (Illegible knowledge goes into math proofs, but you trust the result regardless so it's fine).

Another problem is that standard training procedures only incentivize the model to use reasoning steps produced by a single human. This means, for instance, if you ask a question involving two very different domains of knowledge, a good language model wouldn’t expose it’s knowledge about both of them, as that’s OOD for its training dataset. This may appear in an obvious fashion, as if multiple humans collaborated on the train of thought, or might appear in a way that’s harder to interpret. If you just want to expose this knowledge, you could train on amplified human reasoning (ie from human teams) though.

Also, if you ever train the model on conclusion correctness, you incentivize semantic drift between its reasoning and human language - the model would prefer to pack in more information per token than humans, and might want to express not-normally-said-by-human concepts (one type is fuzzy correlations, which models know a lot of). Even if you penalize KL divergence between human language and the reasoning, this doesn't necessarily incentivize authentic human-like reasoning, just its appearance.

In general I'm unsure whether authentic train of thought is better than just having the model imitate specific concrete humans in ordinary language modelling - if you start a text by a known smart, truthful person, you get out an honest prediction over what that person believes.