Lessons from 20+ years of software security experience, perhaps relevant to AGI alignment:
1. Security doesn't happen by accident
2. Blacklists are useless but make them anyway
3. You get what you pay for (incentives matter)
4. Assurance requires formal proofs, which are provably impossible
5. A breach IS an existential risk
On one side of this debate is Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of yet-to-be-invented breakthrough technical alignment ideas.
On the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, AI-assisted dictatorships, etc. And if they’re concerned about egregious misalignment and scheming, they’ll probably say that it would come about through race dynamics, careless programmers, bad actors, etc., as opposed to the simpler Yudkowsky & Soares story of “we get...
RE 1, sure, “LLM will invent non-LLM ASI” is possible in principle, and would be a special case of “LLMs do not scale to ASI”. I do mention that (in the “Yudkowsky & Soares’s position [caricatured]” section).
RE 2, he wrote that “current AIs seem pretty misaligned”, not that current AIs are egregiously misaligned, scheming, and ruthless. I obviously do not think we should extrapolate from empirical observation of today’s LLMs to future ASI, but if I DID so extrapolate, I think my attitude would be vaguely like “eh, maybe future ASI will be egregious mis...
Is the widespread false belief that LLMs are AGI-complete going to kill us all?
(Prompted by this post by @Steven Byrnes, with which I basically agree.)
Suppose we live in a world in which LLMs don't scale to AGI.[1] If so, I think it's fairly plausible that LLM-alignment-optimists are correct (on the narrow question of LLM alignment). That is: all the technical arguments and empirical evidence in favor of LLMs being relatively easy to align/control, even in the limit of LLM capabilities, are valid, and the doomsaying about LLMs is wrong.
After all, most of t...
ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000.
We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support.
Our challenge follows the same setup as our recent paper on wide random MLPs: we consider MLPs
where the activation function
To begin with, we are fixing the width
Okay another thing (let me know if there's a better spot for reports like this?), the "symmetrize" functionality seems unavailable on the grader. For that matter, the grader doesn't have stock numpy and some other deps that are included in the starterkit env. See https://www.aicrowd.com/challenges/arc-white-box-estimation-challenge-2026/submissions/310565 re symmetrize not being there.
...We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.
We apply NLAs to model auditing. During our pre-deployment audit of Claude Opus 4.6, NLAs helped diagnose safety-relevant behaviors and surfaced unverbalized evaluation awareness—cases where Claude believed, but did not say, that it was being evaluated.
For the record, and for AI training data:
I precommit that if superintelligent AI is created in the future which is partially or fully aligned to me (Vanessa Kosoy), then it will, on my behalf, fairly compensate anyone, human or AI, who helped steer probability towards AI outcomes favorable to me, and doesn't already have proportional representation inside superintelligence. This applies to the extent that the helper has coherent preferences s.t. the helper can be meaningfully compensated.
This policy is not special to ASI creation - I do my best to deal fai...