I think that if I consistently applied that argument, I'd end up thinking AGI was probably 5+ years away right up until the day AGI was announced.
Point 1: That would not necessarily be incorrect; it's not necessary that you ought to be able to do better than that. Consider math discoveries, which seem to follow a memoryless exponential distribution. Any given time period has a constant probability of a conjecture being proven, so until you observe it happening, it's always a fixed number of years in the future. I think the position that this is how AGI development ought to be modeled is very much defensible.
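To spell out that memorylessness property explicitly (a quick illustrative derivation of my own, assuming time-to-proof $T$ is exponentially distributed with rate $\lambda$):

```latex
% Memorylessness of the exponential distribution:
\[
  P(T > s + t \mid T > s) \;=\; \frac{e^{-\lambda (s+t)}}{e^{-\lambda s}} \;=\; e^{-\lambda t} \;=\; P(T > t),
\]
% so the expected remaining wait is the same no matter how long you've already waited:
\[
  \mathbb{E}[T - s \mid T > s] \;=\; \frac{1}{\lambda} \quad \text{for every } s \ge 0.
\]
```

Under that model, "it's always $1/\lambda$ years away" is exactly what a calibrated forecaster should keep saying, right up until the proof lands.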
Indeed: if you place AGI in the reference class of self-driving cars/reusable rockets, you implicitly assume that the remaining challenges are engineering challenges, and that the paradigm of LLMs as a whole is sufficient to reach it. In that case, time-to-AGI could be estimated more or less accurately.
If we instead assume that some qualitative/theoretical/philosophical insight is still missing, then it becomes a scientific/mathematical challenge instead. The reference class for those is things like the Millennium Problems, quantum computing (or, well, it was until recently?), and fusion. And as above, memes like "fusion is always X years away" are not necessarily evidence that there's something wrong with how we do world-modeling.
Point 2: DL is kind of different from other technologies. Here, we're working against a selection process that's eager to Goodhart to what we're requesting, and we're giving it an enormous amount of resources (compute) to spend on that. It might be successfully fooling us regarding how much progress is actually happening.
One connection that comes to mind is the "just add epicycles" tragedy:
Finally, I’m particularly struck by the superficial similarities between the way Ptolemy and Copernicus happened upon a general, overpowered tool for function approximation (Fourier analysis) that enabled them to misleadingly gerrymander false theories around the data, and the way modern ML has been criticized as an inscrutable heap of linear algebra and super-efficient GPUs. I haven’t explored whether these similarities go any deeper, but one implication seems to be that the power and versatility of deep learning might allow suboptimal architectures to perform deceivingly well (just like the power of epicycle-multiplication kept geocentrism alive) and hence distract us from uncovering the actual architectures underlying cognition and intelligence.
That analogy seems incredibly potent to me.
Another way to model time-to-AGI given the "deceitful" nature of DL might be to borrow some tools from sociology or economics, e. g. trying to time the market, predict when a social change will happen, or model what's happening in a hostile epistemic environment. No clear analogy immediately comes to mind, though.
On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal.
Is there? It's one thing to verify whether a proof is correct; whether an expression (posed by a human!) is tautologous to a different expression (also posed by a human!). But what's the ground-truth signal for "the framework of Bayesian probability/category theory is genuinely practically useful"?
This is the reason I'm bearish on the reasoning models even for math. The realistic benefits of them seem to be:
Of those:
(The reasoning-model hype is so confusing for me. Superficially there's a ton of potential, but I don't think there's been any real indication they're up to the real challenges still ahead.)
The question is IMO not "has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes), but "is there any way to get an LLM to autonomously generate and recognize genuine scientific insights at least at the same rate as human scientists?". A stopped clock is right twice a day, a random-word generator can eventually produce an insight, and talking to a rubber duck can let you work through a problem. That doesn't mean the clock is useful for telling the time or that the RWG has the property of being insightful.
And my current impression is that no, there's no way to do that. If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.
This aligns with my experience. Yes, LLMs have sometimes directly outputted some insights useful for my research in agent foundations. But it's very rare, and only happens when I've already done 90% of the work setting up the problem. Mostly they're useful as rubber ducks or primers on existing knowledge; not idea-generators.
It seems like this is the sort of thing you could only ever learn by learning about the real world first
Yep. The idea is to try and get a system that develops all practically useful "theoretical" abstractions, including those we haven't discovered yet, without developing desires about the real world. So we train some component of it on the real-world data, then somehow filter out "real-world" stuff, leaving only a purified superhuman abstract reasoning engine.
One of the nice-to-have properties here would be if we don't need to be able to interpret its world-model to filter out the concepts – if, in place of human understanding and judgement calls, we can blindly use some ground-truth-correct definition of what is and isn't a real-world concept.
Some more evidence that whatever the AI progress on benchmarks is measuring, it's likely not measuring what you think it's measuring:
AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination
AIME 2025 part I was conducted yesterday, and the scores of some language models are available here:
https://matharena.ai, thanks to @mbalunovic, @ni_jovanovic et al. I have to say I was impressed, as I'd predicted the smaller distilled models would crash and burn, but they actually scored a reasonable 25-50%.
That was surprising to me! These are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It's really hard to believe that a 1.5B model can solve pre-olympiad math problems when it can't multiply 3-digit numbers. I was wrong, I guess.
I then used OpenAI's Deep Research to see if similar problems to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora:
https://quora.com/In-what-bases-b-does-b-7-divide-into-9b-7-without-any-remainder
I thought maybe it was just coincidence, and used Deep Research again on Problem 3. And guess what? A very similar question was on math.stackexchange:
https://math.stackexchange.com/questions/3548821/
Still skeptical, I used Deep Research on Problem 5, and a near identical problem appears again on math.stackexchange:
I haven't checked beyond that because the freaking p-value is too low already. Problems near identical to the test set can be found online.
So, what--if anything--does this imply for Math benchmarks? And what does it imply for all the sudden hill climbing due to RL?
I'm not certain, and there is a reasonable argument that even if something in the train-set contains near-identical but not exact copies of test data, it's still generalization. I am sympathetic to that. But, I also wouldn't rule out that GRPO is amazing at sharpening memories along with math skills.
At the very least, the above show that data decontamination is hard.
Never ever underestimate the amount of stuff you can find online. Practically everything exists online.
I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn't go into a benchmark (even if it were initially intended for a benchmark), it'd go into a math research paper. Same for programmers figuring out some usefully novel architecture/algorithmic improvement. Graduate students don't have a bird's-eye-view on the entirety of human knowledge, so they have to actually do the work, but the LLM just modifies the near-perfect-fit answer from an obscure publication/math.stackexchange thread or something.
I expect the same is the case with programming benchmarks, science-quiz benchmarks, et cetera.
Now, this doesn't necessarily mean that the AI progress has been largely illusory and that we're way further from AGI than the AI hype men would have you believe (although I am very tempted to make this very pleasant claim, and I do place plenty of probability mass on it).
But if you're scoring AIs by the problems they succeed at, rather than the problems they fail at, you're likely massively overestimating their actual problem-solving capabilities.
Ah, I think I see my error.
Consider a CoT of the form "[problem statement]: ...........", which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.
And in the framework I'd outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such "filler" tokens.
So yep, you're right, this does force all scheming to be taking place within one forward pass.
I don't see why it would be harder to do.
Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, "I need to design unit tests for this module". Each word of this CoT is mapped to some internal representation, and the internal representation consists of "on-task" human-intended reasoning $t_i$, and misaligned scheming $s_i$ (corresponding to, e. g., a tactical decision to fail to include unit tests for some subtle issue). "I" is mapped to $(t_1, s_1)$, "need" is mapped to $(t_2, s_2)$, etc. Each $t_{i+1}$ iterates on $t_i$, each $s_{i+1}$ iterates on $s_i$: we have a sequence of serial computations.
In other words: as the model outputs, "I need to design unit tests for this module", there's an unspoken subtext of "I need to sabotage the unit tests here", and the model packs nine words' worth of serial computations into both the "textual" and the "subtextual" aspects of it.
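Here's a minimal toy sketch of what I mean (purely my own illustration, nothing like a real transformer): every CoT token advances both an overt "task" register and a covert "scheme" register, so the same nine tokens buy nine steps of serial computation for each thread.

```python
# Toy illustration (not a real model): each CoT token advances two serial
# threads of computation at once, an overt "task" thread and a covert
# "scheme" thread, both living in the same hidden state.

def step(state, token):
    task, scheme = state
    # Each thread is updated from its own previous value plus the token:
    # every token buys one step of *serial* depth for both threads.
    task = hash((task, token, "on-task reasoning")) % 10**6
    scheme = hash((scheme, token, "sabotage planning")) % 10**6
    return task, scheme

cot = "I need to design unit tests for this module".split()
state = (0, 0)
for token in cot:
    state = step(state, token)

print(state)  # nine serial updates applied to BOTH threads
```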
Now you use the Paraphraser on the string, generating alternatives like "the next step is to generate some unit tests" and "several unit tests for this module are needed here". The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the $i$-th token of this phrase is still mapped to $t_i$, as if the paraphrasing didn't happen, but that the mapping to $s_i$ would be broken.
Except, why would that happen? What privileges the $t_i$-representations over the $s_i$-representations? Why would $t_i$ be preserved whereas $s_i$ is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model's capabilities)?
In other words: We're basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn't this just map the subtextual "I need to sabotage the unit tests here" to similarly semantically-equivalent-but-syntactically-different "let us compromise the relevant unit-test suite" or "I shall ensure the unit tests don't work properly"?
I expect it'd just lead to the model's computations becoming indifferent to the details of a phrasing in full generality. And if the model were inclined to associate the general idea of generating unit tests with tactically screwing the task up, the model would come up with this idea regardless of the details of a phrasing.
It wouldn't interfere with its ability to engage in scheming-relevant serial computations, much like it doesn't interfere with its ability to engage in task-relevant serial computations, for the exact same reasons.
Some thoughts on protecting against LLM steganography.
tl;dr: I suspect the Paraphraser idea is near-completely ineffectual as you scale the capabilities, both for scheming and for general CoT interpretability.
In the reasoning-as-communication framework, the amount of information a fixed-length message/CoT can transfer is upper-bounded by information-theoretic constraints. That is, if we have a ten-bit message, it cannot communicate more than ten bits of information: it cannot always uniquely specify one internal state of an AI model out of a set of more than $2^{10}$ internal states.
However, the amount of information is not lower-bounded. Depending on the decoder, a ten-bit message can communicate between zero and ten bits of information. (A decoder that's just a constant function recovers zero bits, a decoder that only pays attention to the first bit recovers only one bit, et cetera.)
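A toy illustration of that decoder-dependence (my own sketch, with deliberately silly decoders): the same ten-bit message yields zero, one, or ten bits depending on which decoder reads it.

```python
from math import log2

# The information extracted from a 10-bit message depends on the decoder
# run over it, not on the message length itself.
messages = range(2**10)                  # all possible 10-bit messages

decoders = {
    "constant":  lambda m: 0,            # ignores the message entirely
    "first_bit": lambda m: m >> 9,       # only looks at the leading bit
    "identity":  lambda m: m,            # recovers everything
}

for name, decode in decoders.items():
    distinct = len({decode(m) for m in messages})
    print(f"{name:9s} recovers log2({distinct}) = {log2(distinct):.0f} bits")
# constant  recovers log2(1) = 0 bits
# first_bit recovers log2(2) = 1 bits
# identity  recovers log2(1024) = 10 bits
```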
The idea of steganography is that, if we pick a CoT string, and run both an AI model's internal decoders and a human's language-comprehension decoders on it, the AI model would extract more bits of information from it. Put differently, humans' decoders are "coarser": they have broader equivalence classes of messages that map to the same internal representation, compared to AI models.
For example, "a dog crossed the street" and "a dog traveled across the street" might map to the same internal representation for a human, but to different ones for an AI model. Similarly, things like "quantum butterflies juggled metaphysical spaghetti" and "ambitious lampshade collected vintage moonbeams" can map to the same "nonsense" internal representation in humans, but have some useful meaning for AI models. The bits a human fails to make use of correspond to "steganography" or "superhuman latent knowledge".
Proposals such as Shoggoth + Paraphraser + Face aim to destroy any information in CoTs that cannot be recovered by humans, by mapping each CoT string to the equivalence class of strings that all map to the same internal representation in humans.
However, there's an issue: what if the empty string already communicates plenty of steganographically encoded information that humans aren't privy to? That is: what if some knowledge is hard-coded into an AI model, such that it's elicited on any activation, and leads to it converging towards reaching the same hidden conclusions even after we destroy its ability to sneak-in additional query-relevant information?
For example: suppose that, over the course of training, the model has internally derived some superhuman Theory of Everything, which it now brings to bear on every problem without ever spelling that theory out in its CoT. The CoT would read as ordinary humanese reasoning, while the conclusions it reaches are driven by hard-coded knowledge that the humans reading it aren't privy to.
Like, imagine that we have two different internal-representation sets: $M_{\text{task}}$, containing problem-solving strategies for the overt task we're presenting the model with, and $M_{\text{scheme}}$, related to some misaligned scheming. We would like to constrain the model such that it cannot plan about $M_{\text{scheme}}$; such that its CoTs contain zero bits for narrowing down which state out of all of $M_{\text{scheme}}$ it must pick.
This assumes something like the following model: A given CoT string communicates some $n$ bits of information. Of them, $m \le n$ bits are human-legible. The model then follows the algorithm of: use task-relevant bits on picking a state out of $M_{\text{task}}$, and use any leftover bits to pick a state out of $M_{\text{scheme}}$. We then destroy the string's information content until $n = m$ (via the Paraphraser). Now the model doesn't have the ability to encode any non-human-legible bits, and if it uses any bits on picking a state out of $M_{\text{scheme}}$, they would necessarily be human-legible, and we can spot the scheming by reading the CoT.
But that's not how this works. Instead, the model picks both simultaneously: a CoT is directly mapped to a (state of $M_{\text{task}}$, state of $M_{\text{scheme}}$) two-tuple. Which means that if the model can narrow down the state of $M_{\text{task}}$ to one out of $2^m$, it can narrow down the state of $M_{\text{scheme}}$ to one out of $2^m$ in the same move. Even if we scramble the CoT until it communicates no more bits than a human can extract out of it, that doesn't differentially constrain the model's ability to think about scheming (= communicate scheming-relevant information to its future self).
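A toy numerical version of that point (my own sketch, with made-up values $n = 10$, $m = 6$): the human-legibility constraint shrinks the model's resolution over both sets by the same factor, rather than zeroing out the scheming-relevant part.

```python
# Toy bit-accounting: a single CoT "codeword" simultaneously picks a state
# from the task set and from the scheming set. The Paraphraser cuts the
# usable bits from n down to m, but that constraint hits both sets uniformly.

n, m = 10, 6                       # total bits vs. human-legible bits (invented)
M_task = 2**n                      # |M_task|
M_scheme = 2**n                    # |M_scheme|

def distinguishable(bits, set_size):
    """How many states in a set `bits` bits of CoT can distinguish between."""
    return min(2**bits, set_size)

print("before paraphraser:",
      distinguishable(n, M_task), "task states /",
      distinguishable(n, M_scheme), "scheme states")
print("after paraphraser: ",
      distinguishable(m, M_task), "task states /",
      distinguishable(m, M_scheme), "scheme states")
# The outcome is never (many task states, 1 scheme state): the bottleneck
# is uniform, not scheming-specific.
```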
It's a uniform communication constraint/computational-complexity penalty.
Generalizing beyond scheming: if the model were inclined to not put any of its thoughts into humanese (due to superhuman latent knowledge, say), the Paraphraser wouldn't address this issue either. If a model's plans contain some catastrophic side-effects (which aren't even malign, just something we didn't tell it not to cause), the Paraphraser wouldn't reveal that either.
@Daniel Kokotajlo, thoughts?
(This doesn't touch on the Face part of the proposal, i. e., the idea that we want to avoid pressuring the model to conceal any scheme-y cognition, in the hopes it'd naively reveal it. But if there's some pressure to conceal it, the Paraphraser would not help with that at all. It'd just teach the model to map humanese-equivalent strings to equivalent scheming-states, which isn't particularly useful.)
Arguably, if you also penalize the CoT length, this should cause a pressure towards throwing out any computations unrelated to the object-level problem the model's being trained on... But then the whole "paraphraser" bit doesn't actually do anything, you can achieve the same outcome by just directly penalizing the CoT length.
Which, inasmuch as DL training loops use some form of simplicity prior, should be happening by default anyway. I. e., this doesn't address the pressures that we expect to lead to steganography/hidden plotting despite the simplicity prior: which is to say, arguments that steganography/plotting are the actual optimal solutions to the training problem.
Consider the Theory of Everything example: those hidden computations are not actually "irrelevant computations" that we need to displace, they're directly useful. Similarly, any misaligned mesa-objectives the model forms would presumably be convergently useful for performance on the training objective: that would be why they formed to begin with. So the information-bottleneck trickery would do ~nothing against them.
Here's a potential interpretation of the market's apparent strange reaction to DeepSeek-R1 (shorting Nvidia).
I don't fully endorse this explanation, and the shorting may or may not have actually been due to Trump's tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don't think it's an altogether implausible world.
If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training new models, after all. So the impact of DeepSeek-R1 on the pretraining scaling laws is irrelevant: sure, it did not show that you don't need bigger data centers to get better base models, but that's not where most of the money was anyway.
And my understanding is that, on the inference-time scaling paradigm, there isn't yet any proven method of transforming arbitrary quantities of compute into better performance:
Suppose that you don't expect the situation to improve: that the inference-time scaling paradigm would hit a ceiling pretty soon, or that it'd converge to distilling search into forward passes (such that the end users end up using very little compute on inference, like today), and that agents just aren't going to work out the way the AGI labs promise.
In such a world, a given task can either be completed automatically by an AI for some fixed quantity of compute X, or it cannot be completed by an AI at all. Pouring ten times more compute on it does nothing.
In such a world, if it were shown that the compute needs of a task can be met with ten times less compute than previously expected, this would decrease the expected demand for compute.
The fact that capable models can be run locally might increase the number of people willing to use them (e. g., those very concerned about data privacy), as might the ability to automatically complete 10x as many trivial tasks. But it's not obvious that this demand spike will be bigger than the simultaneous demand drop.
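For concreteness, here's the back-of-the-envelope shape of that trade-off (all numbers invented purely for illustration):

```python
# Invented numbers, purely illustrative: if per-task compute falls 10x,
# total inference demand only holds steady once the volume of automatable
# tasks grows by 10x or more.

old_compute_per_task = 1.0         # arbitrary units
new_compute_per_task = 0.1         # "ten times less compute than expected"
baseline_tasks = 1_000

for task_growth in (1, 3, 10, 30):
    old_demand = baseline_tasks * old_compute_per_task
    new_demand = baseline_tasks * task_growth * new_compute_per_task
    print(f"{task_growth:>2}x more tasks -> total demand changes {new_demand / old_demand:.1f}x")
# 1x -> 0.1x, 3x -> 0.3x, 10x -> 1.0x, 30x -> 3.0x
```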
And I, at least, when researching ways to set up DeepSeek-R1 locally, found myself more drawn to the "wire a bunch of Macs together" option, compared to "wire a bunch of GPUs together" (due to the compactness). If many people are like this, it makes sense why Nvidia is down while Apple is (slightly) up. (Moreover, it's apparently possible to run the full 671b-parameter version locally, and at a decent speed, using a pure RAM+CPU setup; indeed, it appears cheaper than mucking about with GPUs/Macs, just $6,000.)
This world doesn't seem outright implausible to me. I'm bearish on agents and somewhat skeptical of inference-time scaling. And if inference-time scaling does deliver on its promises, it'll likely go the way of search-and-distill.
On balance, I don't actually expect the market to have any idea what's going on, so I don't know that its reasoning is this specific flavor of "well-informed but skeptical". And again, it's possible the drop was due to Trump, nothing to do with DeepSeek at all.
But as I'd said, this reaction to DeepSeek-R1 does not seem necessarily irrational/incoherent to me.
We'd discussed that some before, but one way to distill it is... I think autonomously doing nontrivial R&D engineering projects requires sustaining coherent agency across a large "inferential distance". "Time" in the sense of "long-horizon tasks" is a solid proxy for it, but not really the core feature. Instead, it's about being able to maintain a stable picture of the project even as you move from a fairly simple-in-terms-of-memorized-templates version of that project, to some sprawling, highly specific, real-life mess.
My sense is that, even now, LLMs are terrible at this[1] (including Anthropic's recent coding agent), and that scaling along this dimension has not at all been good. So the straightforward projection of the current trends is not in fact "autonomous R&D agents in <3 years", and some qualitative advancement is needed to get there.
Are they useful? Yes. Can they be made more useful? For sure. Is the impression that the rate at which they're getting more useful would result in them 5x'ing AI R&D in <3 years a deceptive impression, the result of us setting up a selection process that would spit out something fooling us into forming this impression? Potentially yes, I argue.
Having looked it up now, METR's benchmark admits that the environments in which they test are unrealistically "clean", such that, I imagine, solving the task correctly is the "path of least resistance" in a certain sense (see "systematic differences from the real world" here).