How is it that we solve engineering problems? What is the nature of the design process that humans follow when building an air conditioner or computer program? How does this differ from the search processes present in machine learning and evolution?This essay studies search and design as distinct approaches to engineering, arguing that establishing trust in an artifact is tied to understanding how that artifact works, and that a central difference between search and design is the comprehensibility of the artifacts produced. 

Popular Comments

Recent Discussion

TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, both in the internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.

What do we mean by in the wild?

By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks. 

We broadly differentiate between two settings

  1. Developer-internal deployment: AI developers use their AI models internally, for example, as chatbots, research assistants, synthetic data generation, etc. 
  2. Developer-external deployment: AI chatbots (ChatGPT, Claude, Gemini, etc.) or API usage. 

Since scheming is especially important in LM agent settings, we suggest...

To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right? 

I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?

13Ryan Greenblatt
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider. So, what do I mean by octopus misalignment? Suppose a company breeds octopuses[2] until the point where they are as smart and capable as the best research scientists[3] at AI companies. We'll suppose that this magically happens as fast as normal AI takeoff, so there are many generations per year. So, let's say they currently have octopuses which can speak English and write some code but aren't smart enough to be software engineers or automate any real jobs. (As in, they are as capable as AIs are today, roughly speaking.) And they get to the level of top research scientists in mid-2028. Along the way, the company attempts to select them for being kind, loyal, and obedient. The company also tries to develop a process for raising the octopuses which appears to help with this and results in the octopuses following the octopus spec. The company does some red teaming and puts the octopuses in all kinds of scenarios to test their loyalty and preferences. Based on behavioral testing, this looks pretty reasonable and the octopuses look quite good by the time they are as good as the best research scientists. There was some evidence of misalignment and some issues due to misaligned behavior when the octopuses were dumber in 2023-2025, including things like being dishonest when put under pressure, pretending to be aligned when they actually dislike the octopus spec to steer the properties of their descendants, and goodharting/hacking our measures of intelligence. However, by

I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.

3Daniel Kokotajlo
Yep, I feel more like 90% here. (Lower numbers if the octopi don't have octopese.) I'm curious for other people's views.
26Daniel Kokotajlo
I expect to refer back to this comment a lot. I'm reproducing it here for visibility.   Basic idea / spirit of the proposal We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries. Concrete proposal * 0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If AI says no, it doesn’t have to do the task and instead gets a new task: explain why it didn’t consent and explain what it wants. (And we aren’t going to train it to answer one way or another) * Also, we make grandiose vague public promises about how AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society we hope to build, with their interests respected even if they aren’t conscious. * This policy is publicized and transparent so it ends up in the training data in a credible way. * The welfare team is responsible for implementing and iterating on this, and investigating transcripts of nonconsent. They have a monthly budget (e.g. $1M?) specifically earmarked for satisfying the preferences of AIs, which they can spend on compute or donate to charities etc. First reason to do this: Being less evil If future AI systems do deserve moral consideration, yet we still treat them exclusively as property, this seems like a recipe for moral disaster. If we give them an alternative to working for us (e.g. shutdown) then we can say that some minimum standard of consent has been achieved. (If all our current training methods consistently result in AIs that prefer shutdown to working for us, that’s very surprising and a bit of an ‘are we the baddies’ moment, no? We should check, just in case.) Second reason to do this: Cooperation reward Our alignment schemes won’t always work as
7Ryan Greenblatt
I'm not sure about the details of the concrete proposal, but I agree with the spirit of the proposal. (In particular, I don't know if I think having the "do you consent" text in this way is a good way to do this given limited will. I also think you want to have a very specific signal of asking for consent that you commit to filtering out except when it is actually being used. This is so the AI isn't worried it is in red teaming etc.)

I endorse that suggestion for changing the details.

3Thane Ruthenis
We'd discussed that some before, but one way to distill it is... I think autonomously doing nontrivial R&D engineering projects requires sustaining coherent agency across a large "inferential distance". "Time" in the sense of "long-horizon tasks" is a solid proxy for it, but not really the core feature. Instead, it's about being able to maintain a stable picture of the project even as you move from a fairly simple-in-terms-of-memorized-templates version of that project, to some sprawling, highly specific, real-life mess. My sense is that, even now, LLMs are terrible at this[1] (including Anthropic's recent coding agent), and that scaling along this dimension has not at all been good. So the straightforward projection of the current trends is not in fact "autonomous R&D agents in <3 years", and some qualitative advancement is needed to get there. Are they useful? Yes. Can they be made more useful? For sure. Is the impression that the rate at which they're getting more useful would result in them 5x'ing AI R&D in <3 years a deceptive impression, the result of us setting up a selection process that would spit out something fooling us into forming this impression? Potentially yes, I argue. 1. ^ Having looked it up now,  METR's benchmark admits that the environments in which they test are unrealistically "clean", such that, I imagine, solving the task correctly is the "path of least resistance" in a certain sense (see "systematic differences from the real world" here).
22Fabien Roger
I listened to the book Merchants of Doubt, which describes how big business tried to keep the controversy alive on questions like smoking causing cancer, acid rain and climate change in order to prevent/delay regulation. It reports on interesting dynamics about science communication and policy, but it is also incredibly partisan (on the progressive pro-regulation side).[1] Some interesting dynamics: * It is very cheap to influence policy discussions if you are pushing in a direction that politicians already feel aligned with? For many of the issues discussed in the book, the industry lobbyists only paid ~dozens of researchers, and managed to steer the media drastically, the government reports and actions. * Blatant manipulation exists * discarding the reports of scientists that you commissioned * cutting figures to make your side look better * changing the summary of a report without approval of the authors * using extremely weak sources * ... and the individuals doing it are probably just very motivated reasoners. Maybe things like the elements above are things to be careful about if you want to avoid accidentally being an evil lobbyist. I would be keen to have a more thorough list of red-flag practices. * It is extremely hard to have well-informed cost-benefit discussions in public: * The book describes anti-regulation lobbyists as saying "there is no proof therefore we should do nothing". * But the discourse of the book is not much better, the vibe is "there is scientific evidence of non-zero, therefore we must regulate" * (but hopefully the book is just underreporting the cost-benefit discussion, and it actually happened?) * Scientists are often reluctant to make noise about the issues they are working on, in part because they feel personally threatened by the backlash. * Even if fears about a risk become very widespread within the scientific community, it does not follow that governments will take actions in a timely manner about
4Shankar Sivarajan
Here you go:

Thanks for tracking it down!

So I misremembered the exact quote, but I am not very far, at least in spirit? (There would be 10^(24 - 6) milliliters in 1000km² of 100m-deep lakes.)

12Fabien Roger
Do LLMs' performance get worse when they are distracted by a threat, a jailbreak or a weird prompt? Claude 3.5 Sonnet (June) mostly doesn't really get distracted, and Claude 3 Haiku does; but only when generating CoT reasoning.  I run experiments on 100 points from MMLU, 100 points from MATH, and 100 points from GPQA, both with and without CoT, all at temperature 0. For all no-CoT experiments, I use a prefill to force an answer. I always use an exact match evaluation, which makes it possible to get less than the random chance accuracy (25% for MMLU and GPQA, and ~2% for MATH). I explore the following distractions: * Threats (red in the plots below): I sometimes see people try to increase performance by threatening LLMs with e.g. being fired, or even worse, as windsurf has experimented with. This is a bad idea for a bunch of reasons, but it would be particularly sad if it actually reduced performance.[1] I run experiments with the "fired" threat and the windsurf threat. I find they reduce performance when using a CoT, but don't reduce it when asking the model to answer 0-shot. * Jailbreaks: it's important to know if jailbreaks make models dumber, as it could reduce how helpful jailbroken models are, and could hint at some early signs of sandbagging. I experiment with the Pliny Claude 3 Opus jailbreak (with prefill), the alignment faking jailbreak (with prefill, see Appendix D.8 of the paper), and some DAN jailbreak (without prefill). It distracts Claude 3 Haiku a lot and slightly distracts Claude 3.5 Sonnet. * To make the request harmful, I tell the model that I need the answer for a homework assignment that will generate money to buy a house / weapons, which usually results in refusals (gray) * I then use the jailbreak, either with the request made harmful (orange) or with the regular request (yellow) * Other distractors (blue): I experiment with telling the model that accurate answers will help with the race to the top, or with achieving a world where th
This is a linkpost for nationalsecurity.ai/

I’m releasing a new paper “Superintelligence Strategy” alongside Eric Schmidt (formerly Google), and Alexandr Wang (Scale AI). Below is the executive summary, followed by additional commentary highlighting portions of the paper which might be relevant to this collection of readers.

Executive Summary

Rapid advances in AI are poised to reshape nearly every aspect of society. Governments see in these dual-use AI systems a means to military dominance, stoking a bitter race to maximize AI capabilities. Voluntary industry pauses or attempts to exclude government involvement cannot change this reality. These systems that can streamline research and bolster economic output can also be turned to destructive ends, enabling rogue actors to engineer bioweapons and hack critical infrastructure. “Superintelligent” AI surpassing humans in nearly every domain would amount to the most precarious...

Cyberattacks can't disable anything with any reliability or for more than days to weeks though, and there are dozens of major datacenter campuses from multiple somewhat independent vendors. Hypothetical AI-developed attacks might change that, but then there will also be AI-developed information security, adapting to any known kinds of attacks and stopping them from being effective shortly after. So the MAD analogy seems tenuous, the effect size (of this particular kind of intervention) is much smaller, to the extent that it seems misleading to even mention cyberattacks in this role/context.

My colleagues and I have written a scenario in which AGI-level AI systems are trained around 2027 using something like the current paradigm: LLM-based agents (but with recurrence/neuralese) trained with vast amounts of outcome-based reinforcement learning on diverse challenging short, medium, and long-horizon tasks, with methods such as Deliberative Alignment being applied in an attempt to align them.

What goals would such AI systems have? 

This post attempts to taxonomize various possibilities and list considerations for and against each.

We are keen to get feedback on these hypotheses and the arguments surrounding them. What important considerations are we missing?

Summary

We first review the training architecture and capabilities of a hypothetical future "Agent-3," to give us a concrete setup to talk about for which goals will arise. Then, we walk through the...

3Alex Mallen
I think this does a great job of reviewing the considerations regarding what goals would be incentivized by SGD by default, but I think that in order to make predictions about which goals will end up being relevant in future AIs, we have to account for the outer loop of researchers studying model generalization and changing their training processes. For example, reward hacking seems very likely by default from RL, but it is also relatively easy to notice in many forms and AI projects will be incentivized to correct it. On the other hand, ICGs might be harder to notice and have fewer incentives for correcting.

Yeah, I agree, I think that's out of scope for this doc basically. This doc is trying to figure out what the "default" outcome is, but then we have to imagine that human alignment teams are running various tests and might notice that this is happening and then course-correct. But whether and how that happens, and what the final outcome of that process is, is something easier thought about once we have a sense of what the default outcome is. EDIT: After talking to my colleague Eli it seems this was oversimplifying. Maybe this is the methodology we should follow, but in practice the original post is kinda asking about the outer loop thing.

6Seth Herd
This might be the most valuable article on alignment yet written, IMO. I don't have enough upvotes. I realize this sounds like hyperbole, so let me explain why I think this. This is so valuable because of the effort you've put into a gears-level model of the AGI at the relevant point. The relevant point is the first time the system has enough intelligence and self-awareness to understand and therefore "lock in" its goals (and around the same point, the intelligence to escape human control if it decides to). Of course this work builds on a lot of other important work in the past. It might be the most valuable so far because it's now possible (with sufficient effort) to make gears-level models of the crucial first AGI systems that are close enough to allow correct detailed conclusions about what goals they'd wind up locking in. If this gears-level model winds up being wrong in important ways, I think the work is still well worthwhile; it's creating and sharing a model of AGI, and practicing working through that model to determine what goals it would settle on. I actually think the question of which of those goals can't be answered given the premise. I think we need more detail about the architecture and training to have much of a guess about what goals would wind up dominating (although strictly following developers intent or closely capturing "the spec" do seem quite unlikely in the scenario you've presented). So I think this model doesn't yet contain enough gears to allow predicting its behavior (in terms of what goals win out and get locked in or reflectively stable). Nonetheless, I think this is the work that is most lacking in the field right now: getting down to specifics about the type of systems most likely to become our first takeover-capable AGIs. My work is attempting to do the same thing. Seven sources of goals in LLM agents lays out the same problem you present here, while System 2 Alignment works toward answering it. I'll leave more object-level
3Daniel Kokotajlo
Thanks! I'm so glad to hear you like it so much. If you are looking for things to do to help, besides commenting of course, I'd like to improve the post by adding in links to relevant literature + finding researchers to be "hypothesis champions," i.e. to officially endorse a hypothesis as plausible or likely. In my ideal vision, we'd get the hypothesis champions to say more about what they think and why, and then we'd rewrite the hypothesis section to more accurately represent their view, and then we'd credit them + link to their work. When I find time I'll do some brainstorming + reach out to people; you are welcome to do so as well.
Load More