
People sometimes suggest that we should simply build "safe" artificial superintelligence (ASI), rather than the presumably "unsafe" kind.[1]
There are various flavors of "safe" on offer.
Now, I could argue at length about why this is astronomically harder than people think, why their various proposals are almost universally unworkable, and why even attempting this is insanely immoral[2], but that's not the main point I want to make.
Instead, I want to make a simpler point:
Assume you have a research...
It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the oversight signal.
In more powerful systems, this kind of failure would jeopardize safely navigating the intelligence explosion. It's crucial to build good processes to ensure development is executed according to plan, especially as human oversight becomes spread thin over increasing amounts of potentially untrusted and sloppy AI labor.
This particular failure is also directly harmful, because it significantly reduces our confidence that the model's reasoning trace is monitorable (reflective of the AI's intent to misbehave).[1]
I'm grateful that Anthropic has transparently reported on this issue as much as they...
Many people—especially AI company employees[1]—believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions).[2] I disagree.
Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven't, and often seem to "try" to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren't straightforward SWE tasks, and tasks that aren't...
I skimmed after the first section, but I think this is really underselling the case with "oh, there's probably no conscious desire to make their work appear good, just a bunch of subconscious drives." Everything described here seems to be nicely explained by AIs trying, to the best of their ability, to do what would be rewarded in training, and seems consistent with either inner alignment to literally getting rewarded (which just isn't achievable at deployment with current capabilities) or inner alignment to something like producing convincing-looking work.
The above treatment of "CDT precommitment games" is problematic: the concept
Definition: A CDT decision problem consists of the following data. We have a set of variables, each equipped with a set of parent variables. The parent relation must induce a directed acyclic graph. We also have a selected subset of decision variables.
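Since the snippet cuts off before the full formalization, here is a minimal sketch of the data just described, written in Python with hypothetical names: a set of variables, a parent relation that is checked to form a directed acyclic graph, and a distinguished subset of decision variables.

```python
from dataclasses import dataclass

@dataclass
class CDTDecisionProblem:
    """Sketch of a CDT decision problem: variables with a parent relation
    that must induce a directed acyclic graph (DAG), plus a selected
    subset of decision variables. Names here are illustrative, not from
    the original text."""
    variables: set[str]
    parents: dict[str, set[str]]   # variable -> its set of parents
    decision_variables: set[str]   # must be a subset of `variables`

    def __post_init__(self):
        assert self.decision_variables <= self.variables
        assert all(ps <= self.variables for ps in self.parents.values())
        assert self._is_acyclic(), "parent relation must induce a DAG"

    def _is_acyclic(self) -> bool:
        # Kahn's algorithm: repeatedly strip variables with no remaining
        # parents; if at some point none are parentless, there is a cycle.
        remaining = {v: set(self.parents.get(v, set())) for v in self.variables}
        while remaining:
            roots = [v for v, ps in remaining.items() if not ps]
            if not roots:
                return False
            for v in roots:
                del remaining[v]
            for ps in remaining.values():
                ps.difference_update(roots)
        return True

# Example: a two-variable problem where the decision D influences outcome O.
problem = CDTDecisionProblem(
    variables={"D", "O"},
    parents={"O": {"D"}},
    decision_variables={"D"},
)
```

The acyclicity check is the substantive constraint from the definition; the rest is bookkeeping over the variable sets.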
I agree it reduces the chance that the without-inoculation-prompt model is a reward seeker. But it seems likely to me that it replaces that probability mass with much more saint-ish behavior (or a kludge of motivations) rather than scheming. What would push towards scheming in these worlds, where you have so few inductive biases towards scheming that you would have gotten a reward seeker absent the inoculation prompt?
The story I'm imagining is that the cognition your AI is doing in training with the inoculation prompt ended up being extremely close to scheming, so it isn'...
Note: you are ineligible to complete this challenge if you’ve studied Ancient or Modern Greek, or if you natively speak Modern Greek, or if for other reasons you know what mistakes I’m claiming Opus 4.6 makes. If you’re ineligible, please don’t help other people complete the challenge.
I have recently started using Claude Opus 4.6 to study Ancient Greek. Specifically, I initially used it to grade problem sets at the end of the textbook I've been using, but then I got worried about it being sycophantic towards my answers, so I started having it just write out the answers itself.
I recently gave it this prompt, from the end of Chapter 3 of my textbook:
...Can you write out the answers to this Ancient Greek fill-in-the-blanks exercise so
I spent 2 hours on this the other day. Today Daniel told me the answer, and after ~15 more minutes of effort I still don't know how to construct a scaffold that would generalize to other problems like this. I might give more information after the answer is revealed here.
If there were a textbook on building safe ASI, with instructions sufficiently straightforward to execute, people would tend to build safe ASI rather than an ASI that causes extinction or disempowerment. Some AI safety efforts could be thought of as contributions to this hypothetical textbook, making it marginally more real.
The point of an ASI ban/pause is to create the time to reduce this gap, until it's sufficiently narrow that competent people can walk across without fallin...