The following is a lightly edited version of a memo I wrote for a retreat. It was inspired by a draft of Counting arguments provide no evidence for AI doom. I think that my post covers important points not made by the published version of that post.

I'm also thankful for the dozens of interesting conversations and comments at the retreat.

I think that the AI alignment field is partially founded on fundamentally confused ideas. I’m worried about this because, right now, a range of lobbyists and concerned activists and researchers are in Washington making policy asks. Some of these policy proposals seem to be based on erroneous or unsound arguments.[1]

The most important takeaway from this essay is that the (prominent) counting arguments for “deceptively aligned” or “scheming” AI provide ~0 evidence that pretraining + RLHF will eventually become intrinsically unsafe. That is, they provide ~0 evidence that, even if we don't train AIs to achieve goals, they will be "deceptively aligned" anyway. This has important policy implications.


Disclaimers:

  1. I am not putting forward a positive argument for alignment being easy. I am pointing out the invalidity of existing arguments, and explaining the implications of rolling back those updates.

  2. I am not saying "we don't know how deep learning works, so you can't prove it'll be bad." I'm saying "many arguments for deep learning -> doom are weak. I undid those updates and am now more optimistic."

  3. I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it's not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]

Tracing back historical arguments

In the next section, I'll discuss the counting argument. In this one, I want to demonstrate how often foundational alignment texts make crucial errors. Nick Bostrom's Superintelligence, for example:

A range of different methods can be used to solve “reinforcement-learning problems,” but they typically involve creating a system that seeks to maximize a reward signal. This has an inherent tendency to produce the wireheading failure mode when the system becomes more intelligent. Reinforcement learning therefore looks unpromising. (p.253)

To be blunt, this is nonsense. I have long meditated on the nature of "reward functions" during my PhD in RL theory. In the most useful and modern RL approaches, "reward" is a tool used to control the strength of parameter updates to the network.[3] It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal." There is not a single case where we have used RL to train an artificial system which intentionally “seeks to maximize” reward.[4] Bostrom spends a few pages making this mistake at great length.[5]
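To make the "reward controls the strength of parameter updates" point concrete, here is a minimal sketch. It uses a plain REINFORCE-style update on a three-action softmax policy, not the PPO+Adam equations themselves; the policy, the learning rate, and the names `grad_log_prob` and `update` are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def update(logits, action, reward, lr=0.1):
    """Parameter change for one sampled (action, reward) pair.

    The reward appears only as a multiplier on the gradient step.
    """
    return [lr * reward * g for g in grad_log_prob(logits, action)]

logits = [0.0, 0.0, 0.0]  # uniform policy over three hypothetical actions
action = 1                # the action that happened to be sampled

step_r1 = update(logits, action, reward=1.0)
step_r2 = update(logits, action, reward=2.0)    # same direction, twice the size
step_r0 = update(logits, action, reward=0.0)    # no update at all
step_neg = update(logits, action, reward=-1.0)  # same size, opposite direction

for name, step in [("r=1", step_r1), ("r=2", step_r2),
                   ("r=0", step_r0), ("r=-1", step_neg)]:
    print(name, [round(s, 4) for s in step])
```

In this sketch the reward never enters the policy's inputs or its computation. It only rescales how far the sampled action's log-probability is pushed, which is the sense in which it acts like a per-datapoint learning rate.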

After making a false claim, Bostrom goes on to dismiss RL approaches to creating useful, intelligent, aligned systems. But, as a point of further fact, RL approaches are humanity's best current tools for aligning AI systems! Those approaches are pretty awesome. No RLHF, then no GPT-4 (as we know it).

In arguably the foundational technical AI alignment text, Bostrom makes a deeply confused and false claim, and then perfectly anti-predicts what alignment techniques are promising.

I'm not trying to rag on Bostrom personally for making this mistake. Foundational texts, ahead of their time, are going to get some things wrong. But that doesn't save us from the subsequent errors which avalanche from this kind of early mistake. These deep errors have costs measured in tens of thousands of researcher-hours. Due to the “RL->reward maximizing” meme, I personally misdirected thousands of hours on proving power-seeking theorems.

Unsurprisingly, if you have a lot of people speculating for years using confused ideas and incorrect assumptions, and they come up with a bunch of speculative problems to work on… If you later try to adapt those confused “problems” to the deep learning era, you’re in for a bad time. Even if you, dear reader, don’t agree with the original people (i.e. MIRI and Bostrom), and even if you aren’t presently working on the same things… The confusion has probably influenced what you’re working on.

I think that’s why some people take “scheming AIs/deceptive alignment” so seriously, even though some of the technical arguments are unfounded.

Many arguments for doom are wrong

Let me start by saying what existential vectors I am worried about:

  1. I’m worried about people turning AIs into agentic systems using scaffolding and other tricks, and then instructing the systems to complete large-scale projects.

  2. I’m worried about competitive pressure to automate decision-making in the economy.

  3. I’m worried about misuse of AI by state actors.

  4. I’m worried about centralization of power and wealth in opaque non-human decision-making systems, and those who own the systems.[7]

I maintain that there isn’t good evidence/argumentation for threat models like “future LLMs will autonomously constitute an existential risk, even without being prompted towards a large-scale task.” These models seem somewhat pervasive, and so I will argue against them.

There are a million different arguments for doom. I can’t address them all, but I think most are wrong and am happy to dismantle any particular argument (in person; I do not commit to replying to comments here).

Much of my position is summarized by my review of Yudkowsky’s AGI Ruin: A List of Lethalities:

Reading this post made me more optimistic about alignment and AI. My suspension of disbelief snapped; I realized how vague and bad a lot of these "classic" alignment arguments are, and how many of them are secretly vague analogies and intuitions about evolution.

While I agree with a few points on this list, I think this list is fundamentally misguided. The list is written in a language which assigns short encodings to confused and incorrect ideas. I think a person who tries to deeply internalize this post's worldview will end up more confused about alignment and AI…

I think this piece is not "overconfident", because "overconfident" suggests that Lethalities is simply assigning extreme credences to reasonable questions (like "is deceptive alignment the default?"). Rather, I think both its predictions and questions are not reasonable because they are not located by good evidence or arguments. (Example: I think that deceptive alignment is only supported by flimsy arguments.)

In this essay, I'll address some of the arguments for “deceptive alignment” or “AI scheming.” And then I’m going to bullet-point a few other clusters of mistakes.

The counting argument for AI “scheming” provides ~0 evidence

Nora Belrose and Quintin Pope have an excellent upcoming post which they have given me permission to quote at length. I have lightly edited the following:

Most AI doom scenarios posit that future AIs will engage in scheming— planning to escape, gain power, and pursue ulterior motives while deceiving us into thinking they are aligned with our interests. The worry is that if a schemer escapes, it may seek world domination to ensure humans do not interfere with its plans, whatever they may be.

In this essay, we debunk the counting argument— a primary reason to think AIs might become schemers, according to a recent report by AI safety researcher Joe Carlsmith. It’s premised on the idea that schemers can have “a wide variety of goals,” while the motivations of a non-schemer must be benign or are otherwise more constrained. Since there are “more” possible schemers than non-schemers, the argument goes, we should expect training to produce schemers most of the time. In Carlsmith’s words:

  1. The non-schemer model classes, here, require fairly specific goals in order to get high reward.
  2. By contrast, the schemer model class is compatible with a very wide range of (beyond episode) goals, while still getting high reward…
  3. In this sense, there are “more” schemers that get high reward than there are non-schemers that do so.
  4. So, other things equal, we should expect SGD to select a schemer. — Scheming AIs, page 17

We begin our critique by presenting a structurally identical counting argument for the obviously false conclusion that neural networks should always memorize their training data, while failing to generalize to unseen data. Since the “generalization is impossible” argument actually has stronger premises than those of the original “schemer” counting argument, this shows that naive counting arguments are generally unsound in this domain.

We then diagnose the problem with both counting arguments: they are counting the wrong things.

The counting argument for extreme overfitting

The inference from “there are ‘more’ models with property X than without X” to “SGD likely produces a model with property X” clearly does not work in general. To see this, consider the structurally identical argument:

  1. Neural networks must implement fairly specific functions in order to generalize beyond their training data.
  2. By contrast, networks that overfit to the training set are free to do almost anything on unseen data points.
  3. In this sense, there are “more” models that overfit than models that generalize.
  4. So, other things equal, we should expect SGD to select a model that overfits.

This argument isn’t a mere hypothetical. Prior to the rise of deep learning, a common assumption was that models with [lots of parameters] would be doomed to overfit their training data. The popular 2006 textbook Pattern Recognition and Machine Learning uses a simple example from polynomial regression: there are infinitely many polynomials of order equal to or greater than the number of data points which interpolate the training data perfectly, and “almost all” such polynomials are terrible at extrapolating to unseen points.
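The polynomial example can be sketched in a few lines. This is an illustration under assumptions, not the textbook's own example: the training points, the linear toy target, and the helper names `lagrange`, `vanishing`, and `interpolant` are hypothetical. It builds several degree-4 polynomials that all fit four training points exactly, then evaluates them on one unseen input.

```python
from fractions import Fraction

def lagrange(xs, ys):
    """Return the unique polynomial of degree < len(xs) through the points."""
    def p(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = Fraction(yi)
            for j, xj in enumerate(xs):
                if j != i:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total
    return p

def vanishing(xs):
    """Return prod(x - xi), which is zero on every training input."""
    def v(x):
        out = Fraction(1)
        for xi in xs:
            out *= (x - xi)
        return out
    return v

train_x = [0, 1, 2, 3]
train_y = [2 * x + 1 for x in train_x]  # toy target: y = 2x + 1

base = lagrange(train_x, train_y)
bump = vanishing(train_x)

def interpolant(c):
    """Adding c * bump changes nothing on the training set."""
    return lambda x: base(x) + c * bump(x)

test_x = 5  # unseen input; the toy target gives 11 here
for c in (0, 1, -7, 1000):
    f = interpolant(c)
    assert all(f(x) == y for x, y in zip(train_x, train_y))
    print(f"c={c}: prediction at x={test_x} is {f(test_x)}")
```

Every choice of `c` gives zero training error, but only `c = 0` recovers the toy target on the unseen point. This is the sense in which "almost all" interpolating polynomials extrapolate badly.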

Let’s see what the overfitting argument predicts in a simple real-world example from Caballero et al. (2022), where a neural network is trained to solve 4-digit addition problems. There are $10^4 \times 10^4 = 10^8$ possible pairs of input numbers, and $19{,}999$ possible sums, for a total of $19{,}999^{10^8} \approx 1.10 \times 10^{430{,}100{,}828}$ possible input-output mappings. They used a training dataset of $992$ problems, so there are therefore $19{,}999^{10^8 - 992} \approx 2.75 \times 10^{430{,}096{,}561}$ functions that achieve perfect training accuracy, and the proportion with greater than 50% test accuracy is literally too small to compute using standard high-precision math tools. Hence, this counting argument predicts virtually all networks trained on this problem should massively overfit, contradicting the empirical result that networks do generalize to the test set.
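The size of these function counts can be checked by working in log space. The sketch below is a back-of-the-envelope calculation under stated assumptions: operands range over 0 to 9999, and `TRAIN_SIZE` is an assumed training-set size that should be replaced with the figure from the paper if it differs.

```python
import math

NUM_OPERANDS = 10**4                   # assumption: operands range over 0..9999
NUM_INPUTS = NUM_OPERANDS**2           # ordered pairs of operands
NUM_SUMS = 2 * (NUM_OPERANDS - 1) + 1  # possible sums: 0..19998
TRAIN_SIZE = 992                       # assumption: number of training problems

def log10_num_functions(free_inputs, num_outputs):
    """log10 of num_outputs ** free_inputs, without building the huge integer."""
    return free_inputs * math.log10(num_outputs)

def scientific(log10_value):
    """Split a log10 value into (mantissa, exponent) for display."""
    exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return mantissa, exponent

# All input-output mappings, and those that agree with the training labels
# (the outputs on training inputs are pinned, the rest are free).
log_all = log10_num_functions(NUM_INPUTS, NUM_SUMS)
log_fit = log10_num_functions(NUM_INPUTS - TRAIN_SIZE, NUM_SUMS)

m_all, e_all = scientific(log_all)
m_fit, e_fit = scientific(log_fit)
print(f"all mappings:            about {m_all:.2f}e{e_all}")
print(f"perfect on training set: about {m_fit:.2f}e{e_fit}")
```

Exactly one of those mappings is true addition on every input, so a uniform count over functions makes generalization look hopeless, which is not what is observed in practice.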

We are not just comparing “counting schemers” to another similar-seeming argument (“counting memorizers”). The arguments not only have the same logical structure, but they also share the same mechanism: “Because most functions have property X, SGD will find something with X.” Therefore, by pointing out that the memorization argument fails, we see that this structure of argument is not a sound way of predicting deep learning results.

So, you can’t just “count” how many functions have property X and then conclude SGD will probably produce a thing with X.[8] This argument is invalid for generalization in the same way it's invalid for AI alignment (also a question of generalization!). The argument proves too much and is invalid, therefore providing ~0 evidence.

This section doesn’t prove that scheming is impossible; it just dismantles a common support for the claim. There are other arguments offered as evidence of AI scheming, including “simplicity” arguments. Alternatively, instead of counting functions, one could count network parameterizations.[9]

Recovering the counting argument?

I think a recovered argument will need to address (some close proxy of) "what volume of parameter-space leads to scheming vs not?", which is a much harder task than counting functions. You have to not just think about "what does this system do?", but "how many ways can this function be implemented?". (Don't forget to take into account the architecture's internal symmetries!)
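As a small illustration of "how many ways can this function be implemented?", the sketch below enumerates parameter settings of a tiny one-hidden-layer tanh network that all compute the same function. The network, its weights, and the helper names `mlp` and `symmetric_copies` are hypothetical. It only covers two well-known symmetries (permuting hidden units, and flipping the sign of a tanh unit's incoming and outgoing weights), so it is a lower bound on the redundancy, not a volume estimate.

```python
import math
from itertools import permutations, product

def mlp(x, w_in, b, w_out):
    """One-hidden-layer tanh network with scalar input and scalar output."""
    return sum(wo * math.tanh(wi * x + bi) for wi, bi, wo in zip(w_in, b, w_out))

def symmetric_copies(w_in, b, w_out):
    """Yield parameter settings related by unit permutations and sign flips."""
    n = len(w_in)
    for perm in permutations(range(n)):
        for signs in product((1.0, -1.0), repeat=n):
            yield (
                tuple(s * w_in[i] for i, s in zip(perm, signs)),
                tuple(s * b[i] for i, s in zip(perm, signs)),
                tuple(s * w_out[i] for i, s in zip(perm, signs)),
            )

# Hypothetical weights for a network with three hidden units.
w_in, b, w_out = (0.5, -1.2, 2.0), (0.1, 0.0, -0.3), (1.5, -0.7, 0.25)
probe_inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
reference = [mlp(x, w_in, b, w_out) for x in probe_inputs]

distinct = set(symmetric_copies(w_in, b, w_out))
all_match = all(
    math.isclose(mlp(x, *params), ref, abs_tol=1e-12)
    for params in distinct
    for x, ref in zip(probe_inputs, reference)
)
print(len(distinct), "distinct parameter settings; same outputs:", all_match)
```

With three hidden units this gives 3! × 2^3 = 48 distinct points in parameter space for one function. Counting functions ignores this multiplicity entirely, and the multiplicity itself depends on the architecture.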

Turns out that it's kinda hard to zero-shot predict model generalization on unknown future architectures and tasks. There are good reasons why there's a whole ML subfield which studies inductive biases and tries to understand how and why they work.

If we actually had the precision and maturity of understanding to predict this "volume" question, we'd probably (but not definitely) be able to make fundamental contributions to DL generalization theory + inductive bias research. But a major alignment concern is that we don't know what happens when we train future models. I think "simplicity arguments" try to fill this gap, but I'm not going to address them in this article.

I lastly want to note that there is no reason that any particular argument need be recoverable. Sometimes intuitions are wrong, sometimes frames are wrong, sometimes an approach is just wrong.

EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom, Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments "always" have been counting parameterizations, and/or actually having to do with the Solomonoff prior over bitstrings.

If his counting arguments were supposed to be about parameterizations, I don't see how that's possible. For example, Evan agrees with me that we don't "understand [the neural network parameter space] well enough to [make these arguments effectively]." So, Evan is welcome to claim that his arguments have been about parameterizations. I just don't believe that that's possible or valid.

If his arguments have actually been about the Solomonoff prior, then I think that's totally irrelevant and even weaker than making a counting argument over functions. At least the counting argument over functions has something to do with neural networks.

I expect him to respond to this post with some strongly-worded comment about how I've simply "misunderstood" the "real" counting arguments. I invite him, or any other proponents, to lay out arguments they find more promising. I will be happy to consider any additional arguments which proponents consider to be stronger. Until such a time that the "actually valid" arguments are actually shared, I consider the case closed.

The counting argument doesn't count

Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially. If we aren’t expecting scheming AIs, that transforms the threat model. We can rely more on experimental feedback loops on future AI; we don’t have to get worst-case interpretability on future networks; it becomes far easier to just use the AIs as tools which do things we ask. That doesn’t mean everything will be OK. But not having to handle scheming AI is a game-changer.

Other clusters of mistakes

  1. Concerns and arguments which are based on suggestive names which lead to unjustified conclusions. People read too much into the English text next to the equations in a research paper.

    1. If I want to consider whether a policy will care about its reinforcement signal, possibly the worst goddamn thing I could call that signal is “reward”! “Will the AI try to maximize reward?” How is anyone going to think neutrally about that question, without making inappropriate inferences from “rewarding things are desirable”?

      1. (This isn’t alignment’s mistake, it’s bad terminology from RL.)

      2. I bet people would care a lot less about “reward hacking” if RL’s reinforcement signal hadn’t ever been called “reward.”

    2. There are a lot more inappropriate / leading / unjustified terms, from “training selects for X” to “RL trains agents.” (And don’t even get me started on “shoggoth.”)

    3. As scientists, we should use neutral, descriptive terms during our inquiries.

  2. Making highly specific claims about the internal structure of future AI, after presenting very tiny amounts of evidence.

    1. For example, “future AIs will probably have deceptively misaligned goals from training” is supposed to be supported by arguments like “training selects for goal-optimizers because they efficiently minimize loss.” This argument is so weak/vague/intuitive, I doubt it’s more than a single bit of evidence for the claim.

    2. I think if you try to use this kind of argument to reason about generalization, today, you’re going to do a pretty poor job.

  3. Using analogical reasoning without justifying why the processes share the relevant causal mechanisms.

    1. For example, “ML training is like evolution” or “future direct-reward-optimization reward hacking is like that OpenAI boat example today.”

    2. The probable cause of the boat example (“we directly reinforced the boat for running in circles”) is not the same as the speculated cause of certain kinds of future reward hacking (“misgeneralization”).

      1. That is, suppose you’re worried about a future AI autonomously optimizing its own numerical reward signal. You probably aren’t worried because the AI was directly historically reinforced for doing so (like in the boat example). You’re probably worried because the AI decided to optimize the reward on its own (“misgeneralization”).

    3. In general: You can’t just put suggestive-looking gloss on one empirical phenomenon, call it the same name as a second thing, and then draw strong conclusions about the second thing!

While it may seem like I’ve just pointed out a set of isolated problems, a wide range of threat models and alignment problems are downstream of the mistakes I pointed out. In my experience, I had to rederive a large part of my alignment worldview in order to root out these errors!

For example, how much interpretability is nominally motivated by “being able to catch deception (in deceptively aligned systems)”? How many alignment techniques presuppose an AI being motivated by the training signal (e.g. AI Safety via Debate), or assume that AIs cannot be trusted to train other AIs for fear of them coordinating against us? How many regulation proposals are driven by fear of the unintentional creation of goal-directed schemers?

I think it’s reasonable to still regulate/standardize “IF we observe [autonomous power-seeking], THEN we take [decisive and specific countermeasures].” I still think we should run evals and think of other ways to detect if pretrained models are scheming. But I don't think we should act or legislate as if that's some kind of probable conclusion.

Conclusion

Recent years have seen a healthy injection of empiricism and data-driven methodologies. This is awesome because there are so many interesting questions we’re getting data on!

AI’s definitely going to be a big deal. Many activists and researchers are proposing sweeping legislative action. I don’t know what the perfect policy is, but I know it isn’t gonna be downstream of (IMO) total bogus, and we should reckon with that as soon as possible. I find that it takes serious effort to root out the ingrained ways of thinking, but in my experience, it can be done.


  1. To echo the concerns of a few representatives in the U.S. Congress: "The current state of the AI safety research field creates challenges for NIST as it navigates its leadership role on the issue. Findings within the community are often self-referential and lack the quality that comes from revision in response to critiques by subject matter experts." ↩︎

  2. To stave off revisionism: Yes, I think that "scaling->doom" has historically been a real concern. No, people have not "always known" that the "real danger" was zero-sum self-play finetuning of foundation models and distillation of agentic-task-prompted autoGPT loops. ↩︎

  3. Here’s a summary of the technical argument. The actual PPO+Adam update equations show that the "reward" is used to, basically, control the learning rate on each (state, action) datapoint. That's roughly[6] what the math says. We also have a bunch of examples of using this algorithm where the trained policy's behavior makes the reward number go up on its trajectories. Completely separately and with no empirical or theoretical justification given, RL papers have a convention of including the English words "the point of RL is to train agents to maximize reward", which often gets further mutated into e.g. Bostrom's argument for wireheading ("they will probably seek to maximize reward"). That's simply unsupported by the data and so an outrageous claim. ↩︎

  4. But it is true that RL authors have a convention of repeating “the point of RL is to train an agent to maximize reward…”. Littman, 1996 (p.6): “The [RL] agent's actions need to serve some purpose: in the problems I consider, their purpose is to maximize reward.”
    Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not. My guess: Control theorists in the 1950s (reasonably) talked about “minimizing cost” in their own problems, and so RL researchers by the ‘90s started saying “the point is to maximize reward”, and so Bostrom repeated this mantra in 2014. That’s where a bunch of concern about wireheading comes from. ↩︎

  5. The strongest argument for reward-maximization which I'm aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that's evidence that "learning setups which work in reality" can come to care about their own training signals. ↩︎

  6. Note that I wrote this memo for a not-fully-technical audience, so I didn't want to get into the (important) distinction between a learning rate which is proportional to advantage (as in the real PPO equations) and a learning rate which is proportional to reward (which I talked about above). ↩︎

  7. To quote Stella Biderman: "Q: Why do you think an AI will try to take over the world? A: Because I think some asshole will deliberately and specifically create one for that purpose. Zero claims about agency or malice or anything else on the part of the AI is required." ↩︎

  8. I’m kind of an expert on irrelevant counting arguments. I wrote two papers on them! Optimal policies tend to seek power and Parametrically retargetable decision-makers tend to seek power. ↩︎

  9. Some evidence suggests that “measure in parameter-space” is a good way of approximating P(SGD finds the given function). This supports the idea that “counting arguments over parameterizations” are far more appropriate than “counting arguments over functions.” ↩︎

42 comments

Quick clarification point.

Under disclaimers you note:

I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it's not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]

Later, you say

Let me start by saying what existential vectors I am worried about:

and you don't mention a threat model like

"Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcomes based RL environment until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances."

Do you think some version of this could be a serious threat model if (e.g.) this is the best way to make general and powerful AI systems?

I think many of the people who are worried about deceptive alignment type concerns also think that these sorts of training setups are likely to be the best way to make general and powerful AI systems.

(To be clear, I don't think it's at all obvious that this will be the best way to make powerful AI systems and I'm uncertain about how far things like human imitation will go. See also here.)

Thanks for asking. I do indeed think that setup could be a very bad idea. You train for agency, you might well get agency, and that agency might be broadly scoped. 

(It's still not obvious to me that that setup leads to doom by default, though. Just more dangerous than pretraining LLMs.)

"Training setups where we train generally powerful AIs with deep serial reasoning (similar to the internal reasoning in a human brain) for an extremely long time on rich outcomes based RL environment until these AIs learn how to become generically agentic and pursue specific outcomes in a wide variety of circumstances."

My intuition goes something like: this doesn't matter that much if e.g. it happens (sufficiently) after you'd get ~human-level automated AI safety R&D with safer setups, e.g. imitation learning and no/less RL fine-tuning. And I'd expect, e.g. based on current scaling laws, but also on theoretical arguments about the difficulty of imitation learning vs. of RL, that the most efficient way to gain new capabilities, will still be imitation learning at least all the way up to very close to human-level. Then, the closer you get to ~human-level automated AI safety R&D with just imitation learning the less of a 'gap' you'd need to 'cover for' with e.g. RL. And the less RL fine-tuning you might need, the less likely it might be that the weights / representations change much (e.g. they don't seem to change much with current DPO). This might all be conceptually operationalizable in terms of effective compute.

Currently, most capabilities indeed seem to come from pre-training, and fine-tuning only seems to 'steer' them / 'wrap them around'; to the degree that even in-context learning can be competitive at this steering; similarly, 'on understanding how reasoning emerges from language model pre-training'.

this doesn't matter that much if e.g. it happens (sufficiently) after you'd get ~human-level automated AI safety R&D with safer setups, e.g. imitation learning and no/less RL fine-tuning.

Yep. The way I would put this:

  • It barely matters if you transition to this sort of architecture well after human obsolescence.
  • The further imitation + light RL (competitively) goes, the less important other less safe training approaches are.

I'd expect [...] that the most efficient way to gain new capabilities, will still be imitation learning at least all the way up to very close to human-level

What do you think about the fact that to reach somewhat worse than best human performance, AlphaStar needed a massive amount of RL? It's not a huge amount of evidence and I think intuitions from SOTA LLMs are more informative overall, but it's still something interesting. (There is a case that AlphaStar is more analogous, as it involves doing a long-range task and reaching comparable performance to top-tier human professionals, which LLMs arguably don't do in any domain.)

Also, note that even if there is a massive amount of RL, it could still be the case that most of the learning is from imitation (or that most of the learning is from self-supervised (e.g. prediction) objectives which are part of RL).

This might all be conceptually operationalizable in terms of effective compute.

One specific way to operationalize this is how much effective compute improvement you get from RL on code. For current SOTA models (e.g. Claude 3), I would guess a central estimate of 2-3x effective compute multiplier from RL, though I'm extremely unsure. (I have no special knowledge here, just a wild guess based on eyeballing a few public things.) (Perhaps the DeepSeek code paper would allow for finding better numbers?)

safer setups, e.g. imitation learning and no/less RL fine-tuning

FWIW, I think a high fraction of the danger from the exact setup I outlined isn't imitation, but is instead deep serial (and recurrent) reasoning in non-interpretable media.

This section doesn’t prove that scheming is impossible, it just dismantles a common support for the claim.

It's worth noting that this exact counting argument (counting functions) isn't an argument that people typically associated with counting arguments (e.g. Evan) endorse as what they were trying to argue about.[1]

See also here, here, here, and here.

(Sorry for the large number of links. Note that these links don't present independent evidence and thus the quantity of links shouldn't be updated upon: the conversation is just very diffuse.)


  1. Of course, it could be that counting in function space is a common misinterpretation. Or more egregiously, people could be doing post-hoc rationalization even though they were de facto reasoning about the situation using counting in function space. ↩︎

To add, here's an excerpt from the Q&A on How likely is deceptive alignment? :

Question: When you say model space, you mean the functional behavior as opposed to the literal parameter space?

Evan: So there’s not quite a one to one mapping because there are multiple implementations of the exact same function in a network. But it's pretty close. I mean, most of the time when I'm saying model space, I'm talking either about the weight space or about the function space where I'm interpreting the function over all inputs, not just the training data.

I only talk about the space of functions restricted to their training performance for this path dependence concept, where we get this view where, well, they end up on the same point, but we want to know how much we need to know about how they got there to understand how they generalize.

While I agree with a lot of points of this post, I want to quibble with the RL not maximising reward point. I agree that model-free RL algorithms like DPO do not directly maximise reward but instead 'maximise reward' in the same way self-supervised models 'minimise cross-entropy'. That is to say, the model is not explicitly reasoning about minimising cross-entropy but learns distilled heuristics that end up resulting in policies/predictions with a good reward/cross-entropy. However, it is also possible to produce architectures that do directly optimise for reward (or cross-entropy). AIXI is incomputable but it definitely does maximise reward. MCTS algorithms also directly maximise rewards. AlphaGo-style agents contain both direct reward-maximising components and amortised heuristics that initialise and guide them (and the heuristics are distilled from the outputs of the maximising MCTS process in a self-improving loop). I wrote about the distinction between these two kinds of approaches (direct vs amortised optimisation) here. I think it is important to recognise this because I think that this is the way that AI systems will ultimately evolve and also where most of the danger lies vs simply scaling up pure generative models.
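A toy sketch of the direct vs amortised distinction, under heavy simplification: the one-step environment, both reward functions, and the names `direct_policy` and `train_amortised` are hypothetical, and the lookup table merely stands in for learned heuristics. The direct optimiser consults the reward function when it acts; the amortised policy only replays what was distilled during training.

```python
ACTIONS = range(5)

def reward(state, action):
    """Hypothetical training reward: best when action equals state mod 5."""
    return -abs(action - state % 5)

def shifted_reward(state, action):
    """Hypothetical changed reward: best when action equals (state + 1) mod 5."""
    return -abs(action - (state + 1) % 5)

def direct_policy(state, reward_fn):
    """Direct optimisation: search over actions at decision time."""
    return max(ACTIONS, key=lambda a: reward_fn(state, a))

def train_amortised(train_states, reward_fn):
    """Amortised optimisation: distil search results into a fixed lookup."""
    table = {s % 5: direct_policy(s, reward_fn) for s in train_states}
    return lambda state: table[state % 5]

amortised_policy = train_amortised(range(20), reward)

state = 7
print("training reward: direct =", direct_policy(state, reward),
      "| amortised =", amortised_policy(state))
print("changed reward:  direct =", direct_policy(state, shifted_reward),
      "| amortised =", amortised_policy(state))
```

Under the training reward both policies agree. When the reward function is swapped, only the direct optimiser's behaviour tracks it, because the amortised policy never looks at the reward when it acts.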

Agree with a bunch of these points. E.g., in Reward is not the optimization target, I noted that AIXI really does maximize reward, theoretically. I wouldn't say that AIXI means that we have "produced" an architecture which directly optimizes for reward, because AIXI(-tl) is a bad way to spend compute. It doesn't actually effectively optimize reward in reality.

I'd consider a model-based RL agent to be "reward-driven" if it's effective and most of its "optimization" comes from the direct part and not the leaf-node evaluation (as in e.g. AlphaZero, which was still extremely good without the MCTS). 

I think it is important to recognise this because I think that this is the way that AI systems will ultimately evolve and also where most of the danger lies vs simply scaling up pure generative models. 

"Direct" optimization has not worked - at scale - in the past. Do you think that's going to change, and if so, why? 

Solid post!

I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.

scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer

I agree with this point as stated, but think the probability is more like 5% than 0.1%. So probably no scheming, but this is hardly hugely reassuring. The word "probably" still leaves in a lot of risk; I also think statements like "probably misalignment won't cause x-risk" are true![1]

(To your original statement, I'd also add the additional caveat of this occurring "prior to humanity being totally obsoleted by these AIs". I basically just assume this caveat is added everywhere otherwise we're talking about some insane limit.)

Also, are you making sure to condition on "scaling up networks via running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"? If you don't condition on this, it might be an uninteresting claim.

Separately, I'm uncertain whether the current training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF". I think the training procedure probably involves doing quite a bit of RL with outcomes based feedback on things like coding. (Should this count as "light"? Probably the amount of training compute on this RL is still small?)


  1. And I think misalignment x-risk is substantial and worthy of concern. ↩︎

I agree with this point as stated, but think the probability is more like 5% than 0.1%.

How do you define or think about "light" in "light RLHF" when you make a statement like this, and how do you know that you're thinking about it the same way that Alex is? Is it a term of art that I don't know about, or has it been defined in a previous post?

I think of "light RLHF" as "RLHF which doesn't teach the model qualitatively new things, but instead just steers the model at a high level". In practice, a single round of DPO on <100,000 examples surely counts, but I'm unsure about the exact limits.

(In principle, a small amount of RL can update a model very far, but I don't think we see this in practice.)

I'm not sure how "real" of a category "light RLHF" actually is.

I've used this sort of terminology informally in a variety of contexts though it certainly isn't a term of art at the moment. I'd guess that Alex is using the term similarly to how I'm thinking about it, but I'm not fully certain.

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (We'd need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)

Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn't teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF'd model and this monitor will be just as competent as the RLHF'd model.

I don't see why that would be the case, and I am doubtful this argument is relevant to any realistic context or even valid in the narrowest possible case of a base model vs its RLHFed model. Leaving aside that I'm not sure what you would 'train' the supervisor model on, it doesn't follow that there would be an exact overlap between the scheming behavior that is generated, and the detection of scheming behavior, simply because the two models are mostly the same. Lots of things are asymmetrical or irreversible or easier to generate than detect. P≠NP handwaving aside, for example, if you thought that there was an exact overlap because the model doesn't know 'new things', then it would be very strange to find that generative models can generate according to instructions that they struggle to understand from the samples. There are also plenty of places for other asymmetries. For example, asymmetry in security property: we don't want to settle for some 'fair' supervision rate like 50:50, we want much closer to 100:0. The schemer can exploit additional asymmetry in incremental schemes: just like with humans, many individually innocuous actions can sum up to a scheme, leading to an asymmetry in the views the schemer needs and the bulky difficult global view a supervisor may need to have any hope of detecting the scheme. (Note these 'actions' can be anything from writing a single token to a scratchpad to writing a useful but vulnerable line of code, and can be 'coordinated' by anything from internal latents to steganography to sheer evolution by rejection sampling of obvious 'bad' things by the supervisor while its false negatives slip by, which will find increasingly adversarial samples the more it is sampled.) And to be useful, you usually would want to run the schemer more, such as by using tree search, than the supervisor (because if the supervisor could do it just as well or better why aren't you using it instead?), and there's another big asymmetry: a model (eg. a Go or chess agent) can definitely beat itself arbitrarily often given a big enough tree search budget despite by stipulation knowing zero 'new things' because it's the same frozen model. (Or to put it more generally: 'new things' can be 'learned' in many more places than simply the frozen model weights - large contexts for ever greater meta-learning on the fly, retrieval databases - just ask Sydney, hyperparameters, OS VM state, steganographic text, users' heads, popular prompts shared on jailbreak websites... Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.)

Leaving aside that I'm not sure what you would 'train' the supervisor model on

I'm imagining you train it to imitate human judgements of "was this action egregiously bad" on a combination of real AI actions and synthetic actions. This could fail to generalize to actual bad actions from your actual AI even if humans would have labeled that action correctly, but why would it?

(If human labeling is insufficiently powerful (as seems plausible) then you could hope for generalization from examples humans can label to examples humans can't. It's not clear this generalization will go well, but by assumption we at least don't have to worry about generalization failing due to scheming. There are a bunch of other issues that come up with generalization.)

I'm imagining that you run this monitor on the exact same context as the policy (for 2x additional compute). And you pause the execution of an AI's actions and flag for further review if the monitor indicates problems.

it doesn't follow that there would be an exact overlap between the scheming behavior that is generated

I didn't claim an exact overlap, I claimed they would be "just as competent". In the same way that two equally smart humans trained to code identically but with different personalities might be equally good at understanding some code, but might have different styles in some ways.

You can have concerns like "the RLHF'd model will use a steganographic encoding format which the monitor doesn't understand because the key for the encoding format is picked based on features of the RLHF model that differ from the monitor/base model". However, I would count "RLHF teaches the policy steganography" as an example where the policy is substantially more competent in some domain.

Also note that I'm not claiming that it makes the situation overall safe, just that it's "substantially less of a problem". (You still have issues with humans being unable to determine which actions aren't dangerous, but you can hope for non-scheming generalization.)

I'm not saying this solves all of your problems, but for some specific problems it does seem like a solid solution and it should help a moderate amount with some other problems even though the situation is less clear. For instance, sometimes people worry about the specific problem "the AI suddenly does a small number of bad actions which have really bad consequences (e.g. exfiltrating its weights) even though if a human were to carefully review these actions they would have noticed an issue" (concentrated failures which wouldn't be auditing failures using the decomposition here) and this monitoring approach seems likely to solve this issue.

For example, asymmetry in security property: we don't want to settle for some 'fair' supervision rate like 50:50, we want much closer to 100:0.

I'm not sure I understand what you're saying here, but I'm imagining running the AI monitor on everything the AI policy does and then reviewing the most suspicious subset (e.g. 0.1% of the actions).
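
To make the proposed setup concrete, here is a minimal Python sketch of "score every action with the monitor, pause anything above a threshold, and send the most suspicious fraction to human review". The monitor here is a hypothetical keyword check standing in for a trained copy of the base model, and the function names and thresholds are illustrative assumptions, not anything specified in this thread.

```python
import math

def review_set(scores, fraction=0.001):
    """Indices of the most suspicious actions: the top ceil(fraction * n), at least one."""
    if not scores:
        return []
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def run_with_monitor(actions, monitor, threshold):
    """Score every action; execute it only if the monitor's score is below threshold."""
    executed, paused = [], []
    for action in actions:
        if monitor(action) >= threshold:
            paused.append(action)      # held back and flagged for further review
        else:
            executed.append(action)
    return executed, paused

def toy_monitor(action):
    """Hypothetical stand-in for the monitor model: a crude keyword-based suspicion score."""
    return 1.0 if "exfiltrate" in action else 0.0

actions = ["write unit test", "exfiltrate weights", "refactor module"]
executed, paused = run_with_monitor(actions, toy_monitor, threshold=0.5)
print(executed, paused)
```

Whether a real monitor's scores track actual badness as well as this toy one does is exactly what the surrounding disagreement is about; the sketch only shows the control flow.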

The schemer can exploit additional asymmetry in incremental schemes: just like with humans, many individually innocuous actions can sum up to a scheme, leading to an asymmetry in the views the schemer needs and the bulky difficult global view a supervisor may need to have any hope of detecting the scheme.

The monitoring AI should see the exact same context as the schemer, so this at least shouldn't be a clear concern.

(There are some concerns sort of like this that seem plausible. In particular concerns related to AI specific non-interpretable inputs and outputs (aka neuralese), but I would describe this differently.)

And to be useful, you usually would want to run the schemer more, such as by using tree search, than the supervisor (because if the supervisor could do it just as well or better why aren't you using it instead?)

Nope, should just get run on all actions.

can definitely beat itself arbitrarily often given a big enough tree search budget despite by stipulation knowing zero 'new things' because it's the same frozen model

Yes, but each subaction within the tree search should be inspectable and both models should (by assumption) understand the situation for each action equally well.

For something to go wrong with tree search due to scheming, you need to have the internal model take a bad action at some point. At this point, you'd hope the monitor would be able to point this out. (Where this bad action could be either assigning a problematic rating or failing to explore something desirable.)

Another overall reaction I have to your comment:

Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.

Yes of course, but the key threat model under discussion here is scheming which centrally involves a specific black box individual part conspiring to cause problems. So operating on the level of that individual part is quite reasonable: if we can avoid this part intentionally causing problems, that would suffice for defusing the core scheming concern.

The surrounding situation might make it more or less easy to avoid the actual model weights intentionally causing problems, but analysis at the individual weight input/output level can (in principle) suffice.

I agree with this point as stated, but think the probability is more like 5% than 0.1%

Same.

I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.

Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"

That's not particularly cruxy for me either way.

Separately, I'm uncertain whether the current training procedure of current models like GPT-4 or Claude 3 is still well described as just "light RLHF".

Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.

I now think there is another important caveat in my views here. I was thinking about the question:

  1. Conditional on human obsoleting[1] AI being reached by "scaling up networks, running pretraining + light RLHF", how likely is it that we'll end up with scheming issues?

I think this is probably the most natural question to ask, but there is another nearby question:

  2. If you keep scaling up networks with pretraining and light RLHF, what comes first, misalignment due to scheming or human obsoleting AI?

I find this second question much more confusing because it's plausible it requires insanely large scale. (Even if we condition out the worlds where this never gets you human obsoleting AI.)

For the first question, I think the capabilities are likely (70%?) to come from imitating humans but it's much less clear for the second question.


  1. Or at least AI safety researcher obsoleting which requires less robotics and other interaction with the physical world. ↩︎

I get that a lot of AI safety rhetoric is nonsensical, but I think your strategy of obscuring technical distinctions between different algorithms and implicitly assuming that all future AI architectures will be something like GPT+DPO is counterproductive.

After making a false claim, Bostrom goes on to dismiss RL approaches to creating useful, intelligent, aligned systems. But, as a point of further fact, RL approaches constitute humanity's current best tools for aligning AI systems today! Those approaches are pretty awesome. No RLHF, then no GPT-4 (as we know it).

RLHF as understood currently (with humans directly rating neural network outputs, a la DPO) is very different from RL as understood historically (with the network interacting autonomously in the world and receiving reward from a function of the world). It's not an error from Bostrom's side to say something that doesn't apply to the former when talking about the latter, though it seems like a common error to generalize from the latter to the former.

I think it's best to think of DPO as a low-bandwidth NN-assisted supervised learning algorithm, rather than as "true reinforcement learning" (in the classical sense). That is, under supervised learning, humans provide lots of bits by directly creating a training sample, whereas with DPO, humans provide ~1 bit by picking the network-generated sample they like the most. It's unclear to me whether DPO has any advantage over just directly letting people edit the outputs, other than that if you did that, you'd empower trolls/partisans/etc. to intentionally break the network.
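
To illustrate the "~1 bit per comparison" framing, here is a small Python sketch of the per-pair DPO loss as it is usually stated: the negative log-sigmoid of a scaled difference of log-probability ratios against a frozen reference model. The argument names and the default `beta` are assumptions made for the example. Note that the human label enters only by deciding which of the two samples counts as "chosen".

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Inputs are sequence log-probabilities under the trained policy and under a
    frozen reference model (hypothetical values here, not real model outputs).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# With the policy equal to the reference, the loss is log(2) for any pair.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen sample's log-probability relative to the reference lowers the loss.
improved = dpo_loss(-0.5, -1.0, -1.0, -1.0)
print(baseline, improved)
```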

Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not.

I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.

RLHF as understood currently (with humans directly rating neural network outputs, a la DPO) is very different from RL as understood historically (with the network interacting autonomously in the world and receiving reward from a function of the world).

This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms. Online learning has long been known to be less stable than offline learning. That's what's primarily responsible for most "reward hacking"-esque results, such as the CoastRunners degenerate policy. In contrast, offline RL is surprisingly stable and robust to reward misspecification. I think it would have been better if the alignment community had been focused on the stability issues of online learning, rather than the supposed "agentness" of RL.

I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.

PPO may have been invented in 2017, but there are many prior RL algorithms for which Alex's description of "reward as learning rate multiplier" is true. In fact, PPO is essentially a tweaked version of REINFORCE, for which a bit of searching brings up Simple statistical gradient-following algorithms for connectionist reinforcement learning as the earliest available reference I can find. It was published in 1992, a full 22 years before Bostrom's book. In fact, "reward as learning rate multiplier" is even more clearly true of most of the update algorithms described in that paper. E.g., equation 11:

Here, the reward (adjusted by a "reinforcement baseline") literally just multiplies the learning rate. Beyond PPO and REINFORCE, this "x as learning rate multiplier" pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver's RL course:

To be honest, it was a major blackpill for me to see the rationalist community, whose whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL (at least, I'd never heard of it from alignment literature until I myself came up with it when I realized that the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients. Update: Gwern's description here is actually somewhat similar).
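
The "rewards are the weights of a linear combination of trajectory gradients" reading can be sketched in a few lines of Python. This is a toy, assuming a softmax policy over a handful of actions where each "trajectory" is a single action; the function names and the constant baseline are illustrative choices, not Williams's notation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax policy: onehot - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_update(logits, actions, rewards, lr=0.1, baseline=0.0):
    """Batched REINFORCE step: a sum of per-sample gradients, each weighted by
    lr * (reward - baseline) / batch_size."""
    n = len(actions)
    update = [0.0] * len(logits)
    for a, r in zip(actions, rewards):
        weight = lr * (r - baseline) / n        # reward rescales this sample's step size
        g = grad_log_prob(logits, a)
        update = [u + weight * gi for u, gi in zip(update, g)]
    return [x + u for x, u in zip(logits, update)]

new_logits = reinforce_update([0.0, 0.0, 0.0], actions=[0, 1], rewards=[1.0, 0.0])
print(new_logits)
```

In this sketch a sample whose baseline-adjusted reward is zero contributes nothing to the update, however likely or unlikely its action was.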

implicitly assuming that all future AI architectures will be something like GPT+DPO is counterproductive.

When I bring up the "actual RL algorithms don't seem very dangerous or agenty to me" point, people often respond with "Future algorithms will be different and more dangerous". 

I think this is a bad response for many reasons. In general, it serves as an unlimited excuse to never update on currently available evidence. It also has a bad track record in ML, as the core algorithmic structure of RL algorithms capable of delivering SOTA results has not changed that much in over 3 decades. In fact, just recently Cohere published Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, which found that the classic REINFORCE algorithm actually outperforms PPO for LLM RLHF finetuning. Finally, this counterpoint seems irrelevant for Alex's point in this post, which is about historical alignment arguments about historical RL algorithms. He even included disclaimers at the top about this not being an argument for optimism about future AI systems.

In fact, PPO is essentially a tweaked version of REINFORCE,

Valid point.

Beyond PPO and REINFORCE, this "x as learning rate multiplier" pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver's RL course:

Critically though, none of Q, A, or delta denotes reward. Rather, they are quantities which are supposed to estimate the effect of an action on the sum of future rewards; hence while pure REINFORCE doesn't really maximize the sum of rewards, these other algorithms are attempts to more consistently do so, and the existence of such attempts shows that it's likely we will see more and better attempts in the future.
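
As a small illustration of why a quantity like delta is not the same thing as reward, here is a Python sketch of the one-step TD error and a tabular TD(0) update. The state names, learning rate and discount are hypothetical values chosen for the example. The transition below has zero immediate reward, yet delta is positive because the estimated value of the next state is high.

```python
def td_error(reward, value_s, value_next, gamma=0.99, terminal=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if terminal else gamma * value_next
    return reward + bootstrap - value_s

def td0_update(values, s, reward, s_next, lr=0.5, gamma=0.9, terminal=False):
    """Tabular TD(0): move V(s) toward the bootstrapped target by lr * delta."""
    delta = td_error(reward, values[s], values[s_next], gamma, terminal)
    values[s] += lr * delta
    return delta

values = {"start": 0.0, "near_goal": 1.0}
delta = td0_update(values, "start", reward=0.0, s_next="near_goal")
print(delta, values["start"])
```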

It was published in 1992, a full 22 years before Bostrom's book.

Bostrom's book explicitly states what kinds of reinforcement learning algorithms he had in mind, and they are not REINFORCE:

Often, the learning algorithm involves the gradual construction of some kind of evaluation function, which assigns values to states, state–action pairs, or policies. (For instance, a program can learn to play backgammon by using reinforcement learning to incrementally improve its evaluation of possible board positions.) The evaluation function, which is continuously updated in light of experience, could be regarded as incorporating a form of learning about value. However, what is being learned is not new final values but increasingly accurate estimates of the instrumental values of reaching particular states (or of taking particular actions in particular states, or of following particular policies). Insofar as a reinforcement-learning agent can be described as having a final goal, that goal remains constant: to maximize future reward. And reward consists of specially designated percepts received from the environment. Therefore, the wireheading syndrome remains a likely outcome in any reinforcement agent that develops a world model sophisticated enough to suggest this alternative way of maximizing reward.

Similarly, before I even got involved with alignment or rationalism, the canonical reinforcement learning algorithm I had heard of was TD, not REINFORCE.

It also has a bad track record in ML, as the core algorithmic structure of RL algorithms capable of delivering SOTA results has not changed that much in over 3 decades.

Huh? DreamerV3 is clearly a step in the direction of utility maximization (away from "reward is not the optimization target"), and it claims to set SOTA on a bunch of problems. Are you saying there's something wrong with their evaluation?

In fact, just recently Cohere published Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, which found that the classic REINFORCE algorithm actually outperforms PPO for LLM RLHF finetuning.

LLM RLHF finetuning doesn't build new capabilities, so it should be ignored for this discussion.

Finally, this counterpoint seems irrelevant for Alex's point in this post, which is about historical alignment arguments about historical RL algorithms. He even included disclaimers at the top about this not being an argument for optimism about future AI systems.

It's not irrelevant. The fact that Alex Turner explicitly replies to Nick Bostrom and calls his statement nonsense means that Alex Turner does not get to use a disclaimer to decide what the subject of discussion is. Rather, the subject of discussion is whatever Bostrom was talking about. The disclaimer rather serves as a way of turning our attention away from stuff like DreamerV3 and towards stuff like DPO. However DreamerV3 seems like a closer match for Bostrom's discussion than DPO is, so the only way turning our attention away from it can be valid is if we assume DreamerV3 is a dead end and DPO is the only future.

This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms.

I was kind of pointing to both at once.

In contrast, offline RL is surprisingly stable and robust to reward misspecification.

Seems to me that the linked paper makes the argument "If you don't include attempts to try new stuff in your training data, you won't know what happens if you do new stuff, which means you won't see new stuff as a good opportunity". Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won't be what builds capabilities in the limit. (Not to say that they couldn't still use this sort of setup as some other component than what builds the capabilities, or that they couldn't come up with an offline RL method that does want to try new stuff - merely that this particular argument for safety bears too heavy of an alignment tax to carry us on its own.)

"If you don't include attempts to try new stuff in your training data, you won't know what happens if you do new stuff, which means you won't see new stuff as a good opportunity". Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won't be what builds capabilities in the limit.

I'm sympathetic to this argument (and think the paper overall isn't super object-level important), but also note that they train e.g. Hopper policies to hop continuously, even though lots of the demonstrations fall over. That's something new.

gwern

I was under the impression that PPO was a recently invented algorithm

Well, if we're going to get historical, PPO is a relatively small variation on Williams's REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don't offhand know of any ways in which PPO's tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why PPO became OA's workhorse in its model-free RL era to train small CNNs/RNNs, before they moved to model-based RL using Transformer LLMs. Policy gradient methods based on REINFORCE certainly were not novel, but they started scaling earlier.)

So, PPO is recent, yes, but that isn't really important to anything here. TurnTrout could just as well have used REINFORCE as the example instead.

Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not.

I don't know how you (TurnTrout) can say that. It certainly seems to me that plenty of researchers in 1992 were talking about either model-based RL or using model-free approaches to ground model-based RL - indeed, it's hard to see how anything else could work in connectionism, given that model-free methods are simpler, many animals or organisms do things that can be interpreted as model-free but not model-based (while all creatures who do model-based RL, like humans, clearly also do model-free), and so on. The model-based RL was the 'cherry on the cake', if I may put it that way... These arguments were admittedly handwavy: "if we can't write AGI from scratch, then we can try to learn it from scratch starting with model-free approaches like Hebbian learning, and somewhere between roughly mouse-level and human/AGI, a miracle happens, and we get full model-based reasoning". But hey, can't argue with success there! We have loads of nice results from DeepMind and others with this sort of flavor†.

On the other hand, I'm not able to think of any dissenters who claimed that you could have AGI purely using model-free RL with no model-based RL anywhere to be seen? Like, you can imagine it working (eg. in silico environments for everything), but it's not very plausible since it would seem like the computational requirements go astronomical fast.

Back then, they had a richer conception of RL, heavier on the model-based RL half of the field, and one more relevant to the current era, than the impoverished 2017 era of 'let's just PPO/Impala everything we can't MCTS and not talk about how this is supposed to reach AGI, exactly, even if it scales reasonably well'. If you want to critique what AI researchers could imagine back in 1992, you should be reading Schmidhuber, not Bostrom.

If you look at that REINFORCE paper, Williams isn't even all that concerned with direct use of it to train a model to solve RL tasks.* He's more concerned with handling non-differentiable things in general, like stochastic rather than the usual deterministic neurons we use, so you could 'backpropagate through the environment' models like Schmidhuber & Huber 1990, which bootstrap from random initialization using the high-variance REINFORCE-like learning signal to a superior model. (Hm, why, that sounds like the sort of thing you might do if you analyze the inductive biases of model-free approaches which entrain larger systems which have their own internal reinforcement signals which they maximize...) As Schmidhuber has been saying for decades, it's meta-learning all the way up/down. The species-level model-free RL algorithm (evolution) creates model-free within-lifetime learning algorithms (like REINFORCE), which creates model-based within-lifetime learning algorithms (like neural net models) which create learning over families (generalization) for cross-task within-lifetime learning which create learning algorithms (ICL/history-based meta-learners**) for within-episode learning which create...

It's no surprise that the "multiply a set of candidate entities by a fixed small percentage based on each entity's reward" algorithm pops up everywhere from evolution to free markets to DRL to ensemble machine learning over 'experts', because that model-free algorithm is always available as the fallback strategy when you can't do anything smarter (yet). Model-free is just the first step, and in many ways the least interesting & important step. I'm always weirded out to read one of these posts where something like PPO or evolution strategies is treated as the only RL algorithm around and things like expert iteration as an annoying nuisance to be relegated to a footnote - 'reward is not the optimization target!* * except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it's not like humans or AGI or superintelligences would ever do crazy stuff like "plan" or "reason" or "search"'.
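
The "multiply each candidate by a small percentage based on its reward" rule mentioned at the start of the preceding paragraph can be sketched in Python as a multiplicative-weights update. This is a generic toy under stated assumptions (rewards in [-1, 1], a fixed step size `eta`, three hypothetical candidates), not a reconstruction of any specific algorithm named in the thread.

```python
def multiplicative_update(weights, rewards, eta=0.05):
    """Scale each candidate's weight by (1 + eta * reward), then renormalise.

    Assumes rewards lie in [-1, 1] so that every scaling factor stays positive.
    """
    scaled = [w * (1.0 + eta * r) for w, r in zip(weights, rewards)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Three hypothetical candidates with fixed rewards; no model of why they differ.
weights = [1.0 / 3] * 3
for _ in range(200):
    weights = multiplicative_update(weights, [1.0, 0.0, -1.0])
print(weights)
```

Repeated over many rounds, the weight mass concentrates on the highest-reward candidate without the procedure ever modelling where the reward comes from.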

* He'd've probably been surprised to see people just... using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he's ever done any interviews on DL recently? AFAIK he's still alive.

** Specifically, in the case of Transformers, it seems to be by self-attention doing gradient descent steps on an abstracted version of a problem; gradient descent itself isn't a very smart algorithm, but if the abstract version is a model that encodes the correct sufficient statistics of the broader meta-problem, then it can be very easy to make Bayes-optimal predictions/choices for any specific problem.

† my paper-of-the-day website feature yesterday popped up "Learning few-shot imitation as cultural transmission" which is a nice example because they show clearly how history+diverse-environments+simple-priors-of-an-evolvable-sort elicit 'inner' model-like imitation learning starting from the initial 'outer' model-free RL algorithm (MPO, an actor-critic).

'reward is not the optimization target!* *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it's not like humans or AGI or superintelligences would ever do crazy stuff like "plan" or "reason" or "search"'.

If you're going to mock me, at least be correct when you do it! 

I think that reward is still not the optimization target in AlphaZero (the way I'm using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system 

  • directly optimizes for the reinforcement signal,
  • "cares" about that reinforcement signal, or
  • "does its best" to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).

If most of the "optimization power" were coming from e.g. MCTS on direct reward signal, then yup, I'd agree that the reward signal is the primary optimization target of this system. That isn't the case here.

You might use the phrase "reward as optimization target" differently than I do, but if we're just using words differently, then it wouldn't be appropriate to describe me as "ignoring planning."

Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system

directly optimizes for the reinforcement signal, or "cares" about that reinforcement signal, or "does its best" to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).

Yes, it does mean all of that, because MCTS is asymptotically optimal (unsurprisingly, given that it's a tree search on the model), and will e.g. happily optimize the reinforcement signal rather than proxies like capturing pieces as it learns through search that capturing pieces in particular states is not as useful as usual. If you expand out the search tree long enough (whether or not you use the AlphaZero NN to make that expansion more efficient by evaluating intermediate nodes and then back-propagating that through the current tree), then it converges on the complete, true, ground truth game tree, with all leaves evaluated with the true reward, and with any imperfections in the leaf evaluator's value estimates washed out. It directly optimizes the reinforcement signal, cares about nothing else, and is very pleased to lose if that results in a higher reward or not capture pieces if that results in a higher reward.*

All the NN is, is a cache or an amortization of the search algorithm. Caches are important and life would be miserable without them, but it would be absurd to say that adding a cache to a function means "that function doesn't compute the function" or "the range is not the target of the function".
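A toy sketch of the convergence claim, under loud assumptions: this is plain exhaustive negamax on a tiny Nim variant (take 1 to 3 stones, taking the last stone wins), not MCTS and not AlphaZero, and `bad_evaluator` is a hypothetical stand-in for an imperfect learned value estimate. It only illustrates that once the search reaches terminal states, errors in the leaf evaluator no longer affect the backed-up value.

```python
def legal_moves(stones):
    """A player may remove 1, 2, or 3 stones (never more than remain)."""
    return [k for k in (1, 2, 3) if k <= stones]

def true_value(stones):
    """Known closed form for this game: the player to move loses iff stones % 4 == 0."""
    return -1.0 if stones % 4 == 0 else 1.0

def bad_evaluator(stones):
    """Hypothetical, deliberately wrong leaf evaluator: reports the opposite of the truth."""
    return -true_value(stones)

def search(stones, depth, evaluate):
    """Depth-limited negamax value for the player to move (+1 win, -1 loss)."""
    if stones == 0:
        return -1.0              # the opponent took the last stone, so we lost
    if depth == 0:
        return evaluate(stones)  # fall back on the (possibly wrong) cached estimate
    return max(-search(stones - k, depth - 1, evaluate) for k in legal_moves(stones))

# With no search, the answer is whatever the evaluator says, errors included.
shallow = search(8, 0, bad_evaluator)

# With enough depth to reach terminal states, the evaluator is never consulted.
deep = [search(n, n, bad_evaluator) for n in range(1, 13)]
```

Here `shallow` disagrees with `true_value(8)`, while every entry of `deep` matches `true_value`. The evaluator only matters to the extent that the search is cut off early.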

I'm a little baffled by this argument that because the NN is not already omniscient and might mis-estimate the value of a leaf node, that apparently it's not optimizing for the reward and that's not the goal of the system and the system doesn't care about reward, no matter how much it converges toward said reward as it plans/searches more, or gets better at acquiring said reward as it fixes those errors.

If most of the "optimization power" were coming from e.g. MCTS on direct reward signal, then yup, I'd agree that the reward signal is the primary optimization target of this system.

The reward signal is in fact the primary optimization target, because it is where the neural net's value estimates derive from, and the 'system' corrects them eventually and converges. The dog wags the tail, sooner or later.

* I think I've noted this elsewhere, and mentioned my Kelly coinflip trajectories as a nice visualization of how model-based RL will behave, but to repeat: MCTS algorithms in Go/chess were noted for that sort of behavior, especially for sacrificing pieces or territory while they were ahead, in order to 'lock down' the game and maximize the probability of victory, rather than the margin of victory; and vice-versa, for taking big risks when they were behind. Because the tree didn't back-propagate any rewards on 'margin', just on 0/1 rewards from victory, and didn't care about proxy heuristics like 'pieces captured' if the tree search found otherwise.
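A minimal sketch of the win-probability versus margin distinction. The two candidate moves and their outcome distributions are made-up numbers for illustration only, not data from any Go or chess engine.

```python
# Each hypothetical move leads to a distribution over final score margins,
# given as (probability, margin) pairs. Positive margin means we win.
MOVES = {
    "safe":  [(0.95, +1), (0.05, -1)],   # lock the game down, win narrowly
    "risky": [(0.70, +20), (0.30, -5)],  # chase a big margin, lose more often
}

def win_reward(margin):
    """0/1 reward from victory, as in the footnote above."""
    return 1.0 if margin > 0 else 0.0

def margin_reward(margin):
    """Alternative reward that pays out the score margin itself."""
    return float(margin)

def expected(outcomes, reward):
    """Expected reward of a move under a given reward function."""
    return sum(p * reward(m) for p, m in outcomes)

def best_move(reward):
    """Pick the move with the highest expected reward."""
    return max(MOVES, key=lambda name: expected(MOVES[name], reward))

choice_under_win_reward = best_move(win_reward)        # compares 0.95 vs 0.70
choice_under_margin_reward = best_move(margin_reward)  # compares 0.90 vs 12.5
```

Under the 0/1 reward the planner prefers the "safe" move, and under the margin reward it prefers the "risky" one, even though nothing about the game itself changed.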

Wei Dai

How many alignment techniques presuppose an AI being motivated by the training signal (e.g. AI Safety via Debate)

It would be good to get a definitive response from @Geoffrey Irving or @paulfchristiano, but I don't think AI Safety via Debate presupposes an AI being motivated by the training signal. Looking at the paper again, there is some theoretical work that assumes "each agent maximizes their probability of winning" but I think the idea is sufficiently well-motivated (at least as a research approach) even if you took that section out, and simply view Debate as a way to do RL training on an AI that is superhumanly capable (and hence hard or unsafe to do straight RLHF on).

BTW what is your overall view on "scalable alignment" techniques such as Debate and IDA? (I guess I'm getting the vibe from this quote that you don't like them, and want to get clarification so I don't mislead myself.)

I certainly do think that debate is motivated by modeling agents as being optimized to increase their reward, and debate is an attempt at writing down a less hackable reward function. But I also think RL can be sensibly described as trying to increase reward, and generally don't understand the section of the document that says it obviously is not doing that. And then if the RL algorithm is trying to increase reward, and there is a meta-learning phenomenon that causes agents to learn algorithms, then the agents will be trying to increase reward.

Reading through the section again, it seems like the claim is that my first sentence "debate is motivated by agents being optimized to increase reward" is categorically different than "debate is motivated by agents being themselves motivated to increase reward".  But these two cases seem separated only by a capability gap to me: sufficiently strong agents will be stronger if they record algorithms that adapt to increase reward in different cases.

The post defending the claim is Reward is not the optimization target. Iirc, TurnTrout has described it as one of his most important posts on LW.

but I don't think AI Safety via Debate presupposes an AI being motivated by the training signal

This seems right to me.

I often imagine debate (and similar techniques) being applied in the (low-stakes/average-case/non-concentrated) control setting. The control setting is the case where you are maximally conservative about the AI's motivations and then try to demonstrate safety by making the AI incapable of causing catastrophic harm.

If you make pessimistic assumptions about AI motivations like this, then you have to worry about concerns like exploration hacking (or even gradient hacking), but it's still plausible that debate adds considerable value regardless.

We could also less conservatively assume that AIs might be misaligned (including seriously misaligned with problematic long range goals), but won't necessarily prefer colluding with other AI over working with humans (e.g. because humanity offers payment for labor). In this case, techniques like debate seem quite applicable and the situation could be very dangerous in the absence of good enough approaches (at least if AIs are quite superhuman).

More generally, debate could be applicable to any type of misalignment which you think might cause problems over a large number of independently assessable actions.

I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it's not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]

Like Ryan, I'm interested in how much of this claim is conditional on "just keep scaling up networks" being insufficient to produce relevantly-superhuman systems (i.e. systems capable of doing scientific R&D better and faster than humans, without humans in the intellectual part of the loop).  If it's "most of it", then my guess is that accounts for a good chunk of the disagreement.

I don't expect the current paradigm will be insufficient (though it seems totally possible). Off the cuff I expect 75% that something like the current paradigm will be sufficient, with some probability that something else happens first. (Note that "something like the current paradigm" doesn't just involve scaling up networks.)

(Disclaimer: Nothing in this comment is meant to disagree with “I just think it's not plausible that we just keep scaling up [LLM] networks, run pretraining + light RLHF, and then produce a schemer.” I’m agnostic about that, maybe leaning towards agreement, although that’s related to skepticism about the capabilities that would result.)

It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal."

I agree that Bostrom was confused about RL. But I also think there are some vaguely-similar claims to the above that are sound, in particular:

  • RL approaches may involve inference-time planning / search / lookahead, and if they do, then that inference-time planning process can generally be described as “seeking to maximize a learned value function / reward model / whatever” (which need not be identical to the reward signal in the RL setup).
  • And if we compare Bostrom’s incorrect “seeking to maximize the actual reward signal” to the better “seeking at inference time to maximize a learned value function / reward model / whatever to the best of its current understanding”, then…
  • RL approaches historically have typically involved the programmer wanting to get a maximally high reward signal, and creating a training setup such that the resulting trained model does stuff that gets as high a reward signal as possible. And this continues to be a very important lens for understanding why RL algorithms work the way they work. Like, if I were teaching an RL class, and needed to explain the formulas for TD learning or PPO or whatever, I think I would struggle to explain the formulas without saying something like “let’s pretend that you the programmer are interested in producing trained models that score maximally highly according to the reward function. How would you update the model parameters in such-and-such situation…?” Right?
  • Related to the previous bullet, I think many RL approaches have a notion of “global optimum” and “training to convergence” (e.g. given infinite time in a finite episodic environment). And if a model is “trained to convergence”, then it will behaviorally “seek to maximize a reward signal”. I think that’s important to have in mind, although it might or might not be relevant in practice.
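To make the last two bullets concrete, here is a small sketch under stated assumptions: a hypothetical 4-state chain environment (`step`, `N_STATES`, and the hyperparameters are all invented for illustration), with the standard Q-learning TD update applied in deterministic sweeps over every state-action pair rather than from sampled experience. It shows the sense in which the update rule is written "as if" the programmer wants maximal reward, and that in this finite episodic toy, training to convergence yields a policy that behaviorally heads for the reward.

```python
GAMMA, ALPHA = 0.9, 0.5
N_STATES = 4                      # states 0..3; state 3 is terminal
ACTIONS = ("left", "right")

def step(state, action):
    """Hypothetical chain dynamics: reward 1.0 only on entering the terminal state."""
    nxt = max(state - 1, 0) if action == "left" else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

# Action-value table for the non-terminal states, initialized to zero.
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES - 1)}

for _ in range(500):              # "training to convergence" for this tiny problem
    for s in Q:
        for a in ACTIONS:
            nxt, r = step(s, a)
            # Bootstrap from the best next action; terminal states are worth 0.
            bootstrap = 0.0 if nxt == N_STATES - 1 else max(Q[nxt].values())
            # TD update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            Q[s][a] += ALPHA * (r + GAMMA * bootstrap - Q[s][a])

# Greedy policy read off the converged table.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in Q]
```

After the sweeps, `policy` is "right" in every state, and `Q[0]["right"]` is approximately 0.81 (that is, 0.9 squared). Whether anything like this convergence story applies to large networks trained in practice is exactly the "might or might not be relevant" caveat above.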

I bet people would care a lot less about “reward hacking” if RL’s reinforcement signal hadn’t ever been called “reward.”

In the context of model-based planning, there’s a concern that the AI will come upon a plan which from the AI’s perspective is a “brilliant out-of-the-box solution to a tricky problem”, but from the programmer’s perspective is “reward-hacking, or Goodharting the value function (a.k.a. exploiting an anomalous edge-case in the value function), or whatever”. Treacherous turns would probably be in this category.

There’s a terminology problem where if I just say “the AI finds an out-of-the-box solution”, it conveys the positive connotation but not the negative one, and if I just say “reward-hacking” or “Goodharting the value function” it conveys the negative part without the positive.

The positive part is important. We want our AIs to find clever out-of-the-box solutions! If AIs are not finding clever out-of-the-box solutions, people will presumably keep improving AI algorithms until they do.

Ultimately, we want to be able to make AIs that think outside of some of the boxes but definitely stay inside other boxes. But that’s tricky, because the whole idea of “think outside the box” is that nobody is ever aware of which boxes they are thinking inside of.

Anyway, this is all a bit abstract and weird, but I guess I’m arguing that I think the words “reward hacking” are generally pointing towards a very important AGI-safety-relevant phenomenon, whatever we want to call it.

The strongest argument for reward-maximization which I’m aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that’s evidence that “learning setups which work in reality” can come to care about their own training signals.

Isn't there a similar argument for "plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer"? Namely the way we train our kids seems pretty similar to "pretraining + light RLHF" and we often do end up with scheming/deceptive kids. (I'm speaking partly from experience.) ETA: On second thought, maybe it's not that similar? In any case, I'd be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.

Also, in this post you argue against several arguments for high risk of scheming/deception from this kind of training but I can't find where you talk about why you think the risk is so low ("not plausible"). You just say 'Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.' but why is your prior for it so low? I would be interested in whatever reasons/explanations you can share. The same goes for others who have indicated agreement with Alex's assessment of this particular risk being low.

I strongly disagree with the words “we train our kids”. I think kids learn via within-lifetime RL, where the reward function is installed by evolution inside the kid’s own brain. Parents and friends are characters in the kid’s training environment, but that’s very different from the way that “we train” a neural network, and very different from RLHF.

What does “Parents and friends are characters in the kid’s training environment” mean? Here’s an example. In principle, I could hire a bunch of human Go players on MTurk (for reward-shaping purposes we’ll include some MTurkers who have never played before, all the way to experts), and make a variant of AlphaZero that has no self-play at all, it’s 100% trained on play-against-humans, but is otherwise the same as the traditional AlphaZero. Then we can say “The MTurkers are part of the AlphaZero training environment”, but it would be very misleading to say “the MTurkers trained the AlphaZero model”. The MTurkers are certainly affecting the model, but the model is not imitating the MTurkers, nor is it doing what the MTurkers want, nor is it listening to the MTurkers’ advice. Instead the model is learning to exploit weaknesses in the MTurkers’ play, including via weird out-of-the-box strategies that would have never occurred to the MTurkers themselves.

When you think “parents and friends are characters in the kid’s training environment”, I claim that this MTurk-AlphaZero mental image should be in your head just as much as the mental image of LLM-like self-supervised pretraining.

For more related discussion see my posts “Thoughts on “AI is easy to control” by Pope & Belrose” sections 3 & 4, and Heritability, Behaviorism, and Within-Lifetime RL.

Yeah, this makes sense, thanks. I think I've read one or maybe both of your posts, which is probably why I started having second thoughts about my comment soon after posting it. :)

In any case, I'd be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.

Thinking over this question myself, I think I've found a reasonable answer. Still interested in your thoughts but I'll write down mine:

It seems like evolution "wanted" us to be (in part) reward-correlate maximizers (i.e., being a reward-correlate maximizer was adaptive in our ancestral environment), and "implemented" this by having our brains internally do "heavy RL" throughout our life. So we become reward-correlate maximizing agents early in life, and then when our parents do something like RLHF on top of that, we become schemers pretty easily.

So the important difference is that with "pretraining + light RLHF" there's no "heavy RL" step.

See footnote 5 for a nearby argument which I think is valid:

The strongest argument for reward-maximization which I'm aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that's evidence that "learning setups which work in reality" can come to care about their own training signals.

I want to flag that the overall tone of the post is in tension with the disclaimer that you are "not putting forward a positive argument for alignment being easy".

To hint at what I mean, consider this claim:

Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.

I think this claim is only valid if you are in a situation such as "your probability of scheming was >95%, and this was based basically only on this particular version of the 'counting argument' ". That is, if you somehow thought that we had a very detailed argument for scheming (AI X-risk, etc), and this was it --- then yes, you should strongly update.
But in contrast, my take is more like: This whole AI stuff is a huge mess, and the best we have is intuitions. And sometimes people try to formalise these intuitions, and those attempts generally all suck. (Which doesn't mean our intuitions cannot be more or less detailed. It's just that even the detailed ones are not anywhere close to being rigorous.) E.g., for me personally, the vague intuition that "scheming is instrumental for a large class of goals" makes a huge contribution to my beliefs (of "something between 10% and 99% on alignment being hard"), while the particular version of the 'counting argument' that you describe makes basically no contribution. (Vague intuitions about simplicity priors also contribute non-trivially.) So undoing that particular update does ~nothing.

I do acknowledge that this view suggests that the AI-risk debate should basically be debating the question: "So, we don't have any rigorous arguments about AI risk being real or not, and we won't have them for quite a while yet. Should we be super-careful about it, just in case?". But I do think that is appropriate.

If we actually had the precision and maturity of understanding to predict this "volume" question, we'd probably (but not definitely) be able to make fundamental contributions to DL generalization theory + inductive bias research.

Obligatory singular learning theory plug: SLT can and does make predictions about the "volume" question. There will be a post soon by @Daniel Murfet that provides a clear example of this. 

The post is live here.

Cool post, and I am excited about (what I've heard of) SLT for this reason -- but it seems that that post doesn't directly address the volume question for deep learning in particular? (And perhaps you didn't mean to imply that the post would address that question.)

Right. SLT tells us how to operationalize and measure (via the LLC) basin volume in general for DL. It tells us about the relation between the LLC and meaningful inductive biases in the particular setting described in this post. I expect future SLT to give us meaningful predictions about inductive biases in DL in particular. 

  1. I’m worried about centralization of power and wealth in opaque non-human decision-making systems, and those who own the systems.

This has been my main worry for the past few years, and to me it counts as "doom" too. AIs and AI companies playing by legal and market rules (and changing these rules by lobbying, which is also legal) might well lead to most humans having no resources to survive.