Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms—see Email: Twitter: @steve47285. Employer: Physicist by training.


Intro to Brain-Like-AGI Safety

Wiki Contributions


I think I see where you’re coming from but I generally have mixed feelings, and am going back and forth but leaning towards sticking with textbook terminology for my part.

Once we fix the policy network and sampling procedure, we get a mapping from observations…to probability distributions over outputs…. This mapping  is the policy.…

Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present.) I think it's appropriate to call that kind of computation "agentic." But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).

I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?

If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.

Don't assume the conclusion by calling a policy an "agent"

The word “agent” invokes a bundle of intuitions / associations, and you think many of those are misleading in general. So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.

Neither option is great; this is obviously a judgment call.

For my part, I think that if I say:

“An RL agent isn’t necessarily planning ahead towards goals, in many cases it’s better to think of it as a bundle of situation-dependent reactions…”

…then that strikes me as a normal kind of thing to say as part of a healthy & productive conversation.

So maybe I see pushing-back-on-the-intuitions-while-keeping-the-word as a more viable approach than you do.

(And separately, I see editing widely-used terminology as a very very big cost, probably moreso than you.)

Ditto for “reward”.

“Reinforcement-maximizing policy”

this kinda sounds slightly weird in my mind because I seem to be intuitively associating “reinforcement” with “updates” and the policy in question is a fixed-point that stops getting updated altogether.

I…don't currently think RL is much more dangerous than other ways of computing weight updates

You mention that this is off-topic so maybe you don’t want to discuss it, but I probably disagree with that—with the caveat that it’s very difficult to do an other-things-equal comparison. (I.e., we’re presumably interested in RL-safety-versus-SSL-safety holding capabilities fixed, but switching from RL to SSL does have an effect on capabilities.)

Then later you say “only using unsupervised pretraining doesn't mean you're safe” which is a much weaker statement, and I agree with it.

I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."

I am happy that your papers exist to throw at such people.

Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies tend to seek power :) :) And if optimal policies aren't too relevant to the alignment problem, well neither are 99.99999% of papers, but it would be pretty silly to retract all of those :)

So this basic idea: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result. 

Ohh, sorry you had to tell me twice, but maybe I’m finally seeing where we’re talking past each other.

Back to the OP, you wrote:

  • In training, an AI system gets tasks of the form “Produce a plan to accomplish X that looks good to humans” (not tasks of the form “accomplish X”).
  • The AI system is rewarded based on whether the plan makes sense and looks good to humans - not how well it actually ends up working.

When I read that, I was thinking that you meant:

  • I type in: “Hey AI, tell me a plan for ethically making lots of money”
  • The AI brainstorms for an hour
  • The AI prints out a plan
  • I grade the plan (without actually trying to execute it), and reward the AI / backprop-through-time the AI / whatever based on that grade.

But your subsequent replies make me think that this isn’t what you meant, particularly the “brainstorm for an hour” part.

…But hold that thought while I explain why I don’t find the above plan very helpful (just so you understand my previous responses):

  • A whole lot is happening during the hour that the AI is brainstorming
  • We have no visibility into any of that, and very weak control over it (e.g. a few bits of feedback on a million-step brainstorming session)
  • I think RL with online-learning is central to making the brainstorming step actually work, capabilities-wise
  • I likewise think that RL process would need to be doing lots of recursing onto instrumental subgoals and finding new creative problem-solving strategies etc.
  • Even if its desires are something like “I want to produce a good plan”, then it would notice that hacking out of the box would be instrumentally useful towards that goal.

OK, so that’s where I was coming from in my previous replies.

But, now I no longer think that the above is what you meant in the first place. Instead I think you meant:

  • I type in: “Hey AI, tell me a plan for ethically making lots of money”
  • The AI prints out every fine-grained step of the process by which it answers that question
  • I do random local audits of that printout (without actually trying to execute the whole plan).

Is that right?

If so, that makes a lot more sense.

In my (non-LLM) context, I would re-formulate the above as something like:

  • The AI is doing whatever
  • We sometimes pick random “thoughts” of the AI’s cognition, and use interpretability to try to suss out what the AI was thinking and why, at that moment.
  • If something looks problematic, we try to fix it, e.g. with negative rewards.

If that’s a fair translation, then yeah, this seems like an excellent idea to me, I’m all for it.

I don’t see this working as “Plan A”, because I expect the interpretability step to be pretty labor intensive (untangling a “thought” made of lots of idiosyncratic concepts built on other concepts), and therefore only adequate to produce a quite sparse reward, whereas I think programmers will need a very dense reward signal for capabilities reasons.

(UPDATE: you're probably gonna say we can solve that with a reward model. Hmm, I guess I'm more thinking that we need a rich and high-dimensional ground truth for this to be sufficient, and I'm skeptical that we'll get enough samples, if each sample is a whole research project. It also requires that we never stop supervising, even in deployment, thanks to the AI online-learning. But I dunno, I guess I shouldn't rule out things like that.)

But even if we have some other “Plan A” for the rewards, I would still want to do this kind of thing as much as possible, as another layer of protection.

Thanks, that all makes sense.

I agree you can still get a problem from goal misgeneralization and instrumental reasoning, but this seems noticeably less likely (assuming process-based training) than getting a problem from reinforcing pursuit of unintended outcomes. ( has some discussion.) I put significant credence on something like "Internals-based training doesn't pan out, but neither does the concern about goal misgeneralization and instrumental reasoning (in the context of process-based training, ie in the context of not reinforcing pursuit of unintended outcomes)."

I think an important thing behind the scenes here is that you’re hopeful that we can get to TAI using mostly self-supervised learning, whereas I’m expecting that people are eventually going to switch to AI approaches that involve RL in a much more central way than LLMs do today. (More like humans, where for every thought you think, you’re thinking it in part because it’s more rewarding than whatever alternate thoughts you might think instead.) In other words, I’m planning for futures where your “Hypothetical Training Approach” doesn’t work.

I think it’s a good rule-of-thumb to say that the more centrally RL is involved in TAI, as opposed to self-supervised learning, the harder is safety & alignment in general.

(So, I’m pretty unhappy to hold the belief that RL will probably be very centrally involved in TAI!)

So anyway, in the scenarios I’m thinking about, I think “goal misgeneralization and instrumental reasoning” is far more likely and difficult to avoid, compared to what you’re thinking about. I suppose process-based optimization would help a nonzero amount even in my scenarios, but sufficiently little that it doesn’t really seem worth thinking about.

(Obviously, nobody knows how to make TAI, so I think it’s fine and healthy for different people to be thinking through different scenarios! We’ll find out sooner or later!)

That was helpful, thanks! I will try to rephrase. Let’s consider two failure modes. First, define some terms:

X is what we want the AI to be trying to do:

  • In outcome-based training, X = “make money ethically”
  • In process-based training, X = “produce a good plan to make money ethically”

Y is something we want the AI to not try to do:

  • In outcome-based training, maybe Y = “secretly steal money”
  • In process-based training, maybe Y = “produce a deliberately deceptive plan” or “hack out of the box”.

Now consider two failure modes.


  • We were hoping for the AI to want to do X.
  • AI does Y, a little bit, randomly or incompetently.
  • AI is rewarded for doing Y.
  • AI starts trying to do Y and generalizations-of-Y more and more.


  • We were hoping for the AI to want to do X.
  • AI wants to do Y.
  • AI does Y when it finds an opportunity to do so successfully.

My understanding is that you’re thinking about Failure Mode 1 here, and you’re saying that process-based training will help because there it’s less difficult to supervise really well, such that we’re not rewarding the AI for doing Y a little bit / incompetently / randomly.

If so—OK, fair enough.

However, we still need to deal with Failure Mode 2.

One might hope that Failure Mode 2 won’t happen because the AI won’t want to do Y in the first place, because after all it’s never done Y before and got rewarded. However, you can still get Y from goal misgeneralization and instrumental reasoning. (E.g., it’s possible for the AI to generalize from its reward history to “wanting to get reward [by any means necessary]”, and then it wants to hack out of the box for instrumental reasons, even if it’s never done anything like that before.)

So, I can vaguely imagine plans along the lines of:

  • Solve Failure Mode 1 by giving near-perfect rewards
  • Solve Failure Mode 2 by, ummm, out-of-distribution penalties / reasoning about inductive biases / adversarial training / something else, I dunno

If that’s what you have in mind, then yeah, I see why you’re interested in process-based optimization as a piece of that puzzle.

But for my part, I’m not crazy about any plans of that form. I’m currently more hopeful about plans along the lines of:

  • Solve Failure Mode 2 by “thought-based training”, i.e. manipulate the AI’s motivations more directly (this almost definitely requires some interpretability)
  • …And this automatically solves Failure Mode 1 as well.

And then process-based training is not so relevant.

[I have two vague research paths along these lines (1,2), although I’m not sure you’d find those links useful in any detail because I’m assuming model-based RL rather than LLMs.]

Sure. That excerpt is not great.

I'd consider this to be one of the more convincing reasons to be hesitant about a pause (as opposed to the 'crying wolf' argument, which seems to me like a dangerous way to think about coordinating on AI safety?). 

Can you elaborate on this? I think it’s incredibly stupid that people consider it to be super-blameworthy to overprepare for something that turned out not to be a huge deal—even if the expected value of the preparation was super-positive given what was known at the time. But, stupid as it may be, it does seem to be part of the situation we’re in. (What politician wants an article like this to be about them?) (Another example.) I’m in favor of interventions to try to change that aspect of our situation (e.g. widespread use and normalization of prediction markets??), but in the meantime, it seems to me that we should keep that dynamic in mind (among other considerations). Do you disagree with that in principle? Or think it’s overridden by other considerations? Or something else?

I think that’s one consideration, but I think there are a bunch of considerations pointing in both directions. For example:

Pause in scaling up LLMs → less algorithmic progress:

  • The LLM code-assistants or research-assistants will be worse
  • Maybe you can only make algorithmic progress via doing lots of GPT-4-sized training runs or bigger and seeing what happens
  • Maybe pause reduces AI profit which would otherwise be reinvested in R&D

Pause in scaling up LLMs → more algorithmic progress:

  • Maybe doing lots of GPT-4-sized training runs or bigger is a distraction from algorithmic progress
  • In pause-world, it’s cheaper to get to the cutting edge, so more diverse researchers & companies are there, and they’re competing more narrowly on algorithmic progress (e.g. the best algorithms will get the highest scores on benchmarks or whatever, as opposed to whatever algorithms got scaled the most getting the highest scores)

Other things:

  • Pro-pause: It’s “practice for later”, “policy wins beget policy wins”, etc., so it will be easier next time (related)
  • Anti-pause: People will learn to associate “AI pause” = “overreaction to a big nothing”, so it will be harder next time (related)
  • Pro-pause: Needless to say, maybe I’m wrong and LLMs won’t plateau!

There are probably other things too. For me, the balance of considerations is that pause in scaling up LLMs will probably lead to more algorithmic progress. But I don’t have great confidence.

(We might differ in how much of a difference we’re expecting LLM code-assistants and research-assistants to make. I put them in the same category as PyTorch and TensorFlow and IDEs and stackoverflow and other such productivity-enhancers that we’re already living with, as opposed to something wildly more impactful than that.)

The description doesn't seem so bad to me. Your post "Reward is not the optimization target" is about what actual RL algorithms actually do. The wiki descriptions here are a kind of normative motivation as to how people came to be looking into those algorithms in the first place. Like, if there's an RL algorithm that performs worse than chance at getting a high reward, then that ain't an RL algorithm. Right? Nobody would call it that.

I think lots of families of algorithms are likewise lumped together by a kind of normative "goal", even if any given algorithm in that family is doing something somewhat different and more complicated than “achieving that goal”, and even if, in any given application, the programmer might not want that goal to be perfectly achieved even if it could be. So by the same token, supervised learning algorithms are "supposed" to minimize a loss, compilers are "supposed" to create efficient and correct assembly code, word processors are "supposed" to process words, etc., but in all cases that's not a literal and complete description of what the algorithms in question actually do, right? It’s a pointer to a class of algorithms.

Sorry if I'm misunderstanding.

Sure, we can take some particular model-based RL algorithm (MuZero, APTAMI, the human brain algorithm, whatever), but instead of “the reward function” we call it  “function #5829”, and instead of “the value function” we call it “function #6241”, etc. If you insist that I use those terms, then I would still be perfectly capable of describing step-by-step why this algorithm would try to kill us. That would be pretty annoying though. I would rather use the normal terms.

I’m not quite sure what you’re talking about (“projected from the labeled world model”??), but I guess it’s off-topic here unless it specifically applies to APTAMI.

FWIW the problems addressed in this post involve the model-based RL system trying to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the critic criticizes, and the world-model models the world, etc., and the result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL, and it remains an unsolved problem.

In addition, one might ask if there are other alignment failure modes too. E.g. people sometimes bring up more exotic things like the “mesa-optimizer” thing where the world-model is secretly harboring a full-fledged planning agent, or whatever. As it happens, I think those more exotic failure modes can be effectively mitigated, and are also quite unlikely to happen in the first place, in the particular context of model-based RL systems. But that depends a lot on how the model-based RL system in question is supposed to work, in detail, and I’m not sure I want to get into that topic here, it’s kinda off-topic. I talk about it a bit in the intro here.

Load More