Popular Comments

Recent Discussion

14 Daniel Kokotajlo
I thought you would say that, bwahaha. Here is my reply: (1) Yes, rereading the passage, Bostrom's central example of a reason why we could see this "when dumb, smarter is safer; yet when smart, smarter is more dangerous" pattern (that's a direct quote btw) is that they could be scheming/pretending when dumb. However, he goes on to say: "A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly ... A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI's final goal is to 'make the project's sponsor happy.' Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner... until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor's brain..." My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc. -- they aren't plotting against us yet, but their 'values' aren't exactly what we want, and so if somehow their 'intelligence' was amplified dramatically whilst their 'values' stayed the same, they would eventually realize this and start plotting against us. (Realistically this won't be how it happens since it'll probably be future models trained from scratch instead of smarter versions of this model, plus the training process probably would change their values rather than holding them fixed.) I'm not confident in this tbc--it's possible that the 'va
3 Matthew Barnett
When stated that way, I think what you're saying is a reasonable point of view, and it's not one I would normally object to very strongly. I agree it's "plausible" that GPT-4 is behaving in the way you are describing, and that current safety guarantees might break down at higher levels of intelligence. I would like to distinguish between two points that you (and others) might have interpreted me to be making:

  1. We should now think that AI alignment is completely solved, even in the limit of unlimited intelligence and future agentic systems. I am not claiming this.
  2. We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this

The fact that Bostrom's central example of a reason to think that "when dumb, smarter is safer; yet when smart, smarter is more dangerous" doesn't fit for LLMs seems adequate for demonstrating (2), even if we can't go as far as demonstrating (1). It remains plausible to me that alignment will become very difficult above a certain intelligence level. I cannot rule that possibility out: I am only saying that we should reasonably update based on the current evidence regardless, not that we are clearly safe from here and we should scale all the way to radical superintelligence without a worry in the world. I have two general points to make here:

  1. I agree that current frontier models are only a "tiny bit agentic". I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we've seen enough to know that corrigibility probably won't be that hard to train into a system that's only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
  2

Thanks for this detailed reply!

  2. We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this

Depending on what you mean by "on their way towards being solved" I'd agree. The way I'd put it is: "We didn't know what the path to AGI would look like; in particular we didn't know whether we'd have agency first and then world-understanding, or world-understanding first and then agency. Now we know w...

3 Vladimir Nesov
I'd say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance; everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given modern AIs' lack of ability to coherently reason about complicated or long-term plans, or to carry them out. So properties of AIs that are already here don't work as evidence about this either way.
This is a linkpost for

New Anthropic model organisms research paper, led by Carson Denison from the Alignment Stress-Testing Team, demonstrating that large language models can generalize zero-shot from simple reward-hacks (sycophancy) to more complex reward tampering (subterfuge). Our results suggest that accidentally incentivizing simple reward-hacks such as sycophancy can have dramatic and very difficult-to-reverse consequences for how models generalize, up to and including generalization to editing their own reward functions and covering up their tracks when doing so.


In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may


(Part 2 of the CAST sequence)

As a reminder, here’s how I’ve been defining “corrigible” when introducing the concept: an agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.

This definition is vague, imprecise, and hides a lot of nuance. What do we mean by “flaws,” for example? Even the parts that may seem most solid, such as the notion of there being a principal and an agent, may seem philosophically confused to a sufficiently advanced mind. We’ll get into trying to precisely formalize corrigibility later on, but part of the point of corrigibility is to work even when it’s only...

I've read through your sequence, and I'm leaving my comment here, because it feels like the most relevant page. Thanks for taking time to write this up, it seems like a novel take on corrigibility. I also found the existing writing section to be very helpful. 

Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?

This dis...

As an AI researcher who wants to do technical work that helps humanity, there is a strong drive to find a research area that is definitely helpful somehow, so that you don’t have to worry about how your work will be applied, and thus you don’t have to worry about things like corporate ethics or geopolitics to make sure your work benefits humanity.

Unfortunately, no such field exists. In particular, technical AI alignment is not such a field, and technical AI safety is not such a field. It absolutely matters where ideas land and how they are applied, and when the existence of the entire human race is at stake, that’s no exception.

If that’s obvious to you, this post is mostly just a collection of arguments for something you...

I don't find this framing compelling. Particularly wrt this part:

Obedience — AI that obeys the intention of a human user can be asked to help build unsafe AGI, such as by serving as a coding assistant. (Note: this used to be considered extremely sci-fi, and now it's standard practice.)

I grant the point that an AI that does what the user wants can still be dangerous (in fact it could outright destroy the world). But I'd describe that situation as "we successfully aligned AI and things went wrong anyway" rather than "we failed to align AI". I grant t...

4 Andrew Critch
I very much agree with human flourishing as the main value I most want AI technologies to pursue and be used to pursue. In that framing, my key claim is that in practice no area of purely technical AI research — including "safety" and/or "alignment" research — can be adequately checked for whether it will help or hinder human flourishing, without a social model of how the resulting technologies will be used by individuals / businesses / governments / etc.
15 Charbel-Raphael Segerie
Strongly agree. Related: It's disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don't grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, "When you write this kind of software, you always control what's going to happen, all the outputs the software can have." As long as such individuals are leading AGI labs, the situation will remain quite dire. +1 for the conflationary alliances point. It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion). I'm not convinced the goal of the AI Safety community should be to align AIs at this point. However, I want to make a small amendment to Myth 1: I believe that technical work which enhances safety culture is generally very positive. Examples of such work include scary demos like "BadLlama," which I cite at least once a week, or benchmarks such as Evaluating Frontier Models for Dangerous Capabilities, which tries to monitor particularly concerning capabilities. More "technical" works like these seem overwhelmingly positive, and I think that we need more competent people doing this.
One issue is there's also a difference between "AI X-Safety" and "AI Safety". It's very natural for people working on all kinds of safety from and with AI systems to call their field "AI safety", so it seems a bit doomed to try and have that term refer to x-safety.

This post was written by Peli Grietzer, inspired by internal writings by TJ (tushant jha), for AOI[1]. The original post, published on Feb 5, 2024, can be found here:

The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human autonomy and human flourishing. In the course of articulating our mission and positioning ourselves -- a young organization -- in the landscape of AI risk orgs, we’ve come to notice what we think are serious conceptual problems with the prevalent vocabulary of ‘AI alignment.’ This essay will discuss some of the major ways in which we think the concept of ‘alignment’ creates bias and confusion, as well as our own search for clarifying concepts. 

At AOI, we try to...

First of all, these are all meant to denote very rough attempts at demarcating research tastes.

It seems possible to be aiming to solve P1 without thinking much of P4 if a) you advocate a ~Butlerian pause, or b) you are working on aligned paternalism as the target behavior (where AI(s) are responsible for keeping humans happy, and humans have no residual agency or autonomy remaining).

Also a lot of people who focus on the problem from a P4 perspective tend to focus on the human-AI interface, where most of the relevant technical problems lie, but this might reduce their attention on issues of mesa-optimizers or emergent agency despite the massive importance of those issues to their project in the long run.
