As AI systems get more capable, it becomes increasingly uncompetitive and infeasible to avoid deferring to AIs on increasingly many decisions. Further, once systems are sufficiently capable, control becomes infeasible. [1] Thus, one of the main strategies for handling AI risk is fully (or almost fully) deferring to AIs on...
As many have observed, since reasoning models first came out, the amount of compute LLMs use to complete tasks has increased greatly. This trend is often called inference scaling and there is an open question of how much of recent AI progress is driven by inference scaling versus by other...
After five months of me (Buck) being slow at finishing up the editing on this, we’re finally putting out our inaugural Redwood Research podcast. I think it came out pretty well—we discussed a bunch of interesting and underdiscussed topics and I’m glad to have a public record of a bunch...
Prior work has examined 2-hop latent (by "latent" I mean: the model must answer immediately without any Chain-of-Thought) reasoning and found that LLM performance was limited aside from spurious successes (from memorization and shortcuts). An example 2-hop question is: "What element has atomic number (the age at which Tesla died)?"....
A key risk factor for scheming (and misalignment more generally) is opaque reasoning ability. One proxy for this is how good AIs are at solving math problems immediately without any chain-of-thought (CoT) (as in, in a single forward pass). I've measured this on a dataset of easy math problems and...
Prior results have shown that LLMs released before 2024 can't leverage 'filler tokens'—unrelated tokens prior to the model's final answer—to perform additional computation and improve performance. [1] I did an investigation on more recent models (e.g. Opus 4.5) and found that many recent LLMs improve substantially on math problems when...
According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5’s behavioral improvements in these evaluations...