Many people—especially AI company employees [1]—believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions). [2] I disagree.
Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven't, and often seem to "try" to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren't straightforward SWE tasks, and tasks that aren't...
In many ways I like the baseline strategy of "ignore decision theory, act in ways that heuristically seem like they gather option-value, figure out all the hard stuff with the help of superintelligence".
I guess your proposal is similar to that, except with the addition of "we have a hunch that something like ECL works and that this means we should be a bit more cooperative, so we'll be a bit more cooperative".
But for some purposes, it does seem useful to know the implications of decision theory. A few examples (some more important than others):
In this post, I'll go through some of my best guesses about the current situation in AI as of the start of April 2026. You can think of this as a scenario forecast, but for the present (which is already uncertain!) rather than the future. I will generally state my best guess without argumentation and without explaining my level of confidence: some of these claims are highly speculative while others are better grounded, and certainly some will be wrong. I tried to make it clear which claims are relatively speculative by saying something like "I guess", "I expect", etc. (but I may have missed some).
You can think of this post more as a list of my current views than as a structured post with a thesis, but I think it...
I now expect a ~3.5 hour 80% reliability time horizon (on the METR benchmark) rather than the ~2.5 hours based on this extrapolation. I did a quick and dirty extrapolation using the gap from Opus 4 to Opus 4.6 to get my original estimate, but it looks like 4 was maybe above trend relative to ECI and 4.6 was below trend.
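For concreteness, here is a minimal sketch of the kind of two-point extrapolation this refers to: fit the doubling time implied by two time-horizon measurements, then project forward exponentially. The dates and horizon values below are hypothetical placeholders, not actual METR numbers.

```python
from datetime import date
import math

# Quick-and-dirty exponential extrapolation of an 80%-reliability time
# horizon. All dates and horizon values are hypothetical placeholders,
# not real METR measurements.
t0, h0 = date(2025, 5, 1), 0.7   # earlier model: release date, horizon (hours)
t1, h1 = date(2025, 11, 1), 1.4  # later model: release date, horizon (hours)

# Doubling time implied by the gap between the two measurements.
days_between = (t1 - t0).days
doubling_days = days_between * math.log(2) / math.log(h1 / h0)

def projected_horizon(target: date) -> float:
    """Project the 80% time horizon (in hours) at a future date."""
    return h1 * 2 ** ((target - t1).days / doubling_days)

print(f"implied doubling time: {doubling_days:.0f} days")
print(f"projected horizon on 2026-04-01: {projected_horizon(date(2026, 4, 1)):.1f} hours")
```

As the text notes, a two-point fit like this is fragile: if one endpoint happens to sit above trend and the other below, the implied doubling time (and hence the forecast) can be badly off.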
ControlAI's mission is to avert the extinction risks posed by superintelligent AI. We believe that in order to do this, we must secure an international prohibition on its development.
We're working to make this happen through what we believe is the most natural and promising approach: helping decision-makers in governments and the public understand the risks and take action.
We believe that ControlAI can achieve an international prohibition on ASI development if scaled sufficiently. We estimate that approximately $50 million per year in funding would give us a concrete chance of achieving this in the next few years.
In this post, we lay out some of the reasoning behind this estimate and explain how additional funding past that threshold, up to and beyond $500 million, would continue...
Politics has worked reasonably well for limiting atomic weapons
Politics also worked very well for creating atomic weapons.
"Worth a shot" is the type of conclusion that is best applied to things that have positive-skewed outcomes, but seems to be missing a mood when applied to things that could cause big positive or negative effects.
On the whole, I felt there was more sanity than I expected from politicians.
Conditional on observing that the system as a whole operates at a given level of insanity, if there's more sanity than you expected in conversations wit...
I notice a lot of disagree votes here - would appreciate an explanation as to why
This is a writeup based on a lightning talk I gave at an InkHaven event hosted by Georgia Ray, where we were supposed to read a paper in about an hour and then present what we learned to the other participants.
So. I foolishly thought I could read a theoretical machine learning paper in an hour because it was in my area of expertise. Unfortunately, it turns out that theoretical CS professors know a lot of math and theoretical CS results that they reference constantly in their work, which makes their work very hard to read, even if you’re familiar with the general area.
Instead of explaining the substantive math behind the paper, the best I can do is give an overview of what the...
If you had predicted 15% for Agent-2, what would you have predicted for Agent-1 and Agent-0 levels? Presumably less than 15%?