I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms—see https://sjbyrnes.com/agi.html. Email: steven.byrnes@gmail.com. Twitter: @steve47285. Employer: https://astera.org/. Physicist by training.
OK, I’ll try to construct an example of process-based supervision without boxing, and argue that it fails the criterion of “never giving gradient updates toward doing the dastardly stuff”.
We give our AI unfettered, unmonitored internet access. We say “We are a hardware R&D firm, and we would like to develop a less-expensive LiDAR unit.” The AI does some internet searches and reads some books and outputs:
“My first step is to list out different LiDAR approaches on the market (and under development), and try to estimate their manufacturing cost breakdowns, to get a better sense for this space.”
We inspect that output, and compare it to the AI’s other possible outputs, and give the AI a high grade.
So far so good, right?
No!! Because the AI was secretly doing some dastardly stuff during the “The AI does some internet searches and reads some books” part:
Specifically, the AI hacked into a website to get access to a paywalled e-textbook entitled Manufacturing Cost Reduction 101, and read it, and used that information to propose a better (so-called) “first step”.
…And we rewarded it for that.
(What we wanted was for the AI to propose to read that e-textbook as a first step. But I don’t see how we’re incentivizing that, in the absence of boxing. The “first step” it actually proposed would at least plausibly be judged as better than that.)
I think there’s a pretty smooth gradient from legal information-gathering internet queries (downloading un-paywalled articles) to dubiously-legal (Sci-Hub) to illegal (logging into Elsevier with a guessed password) to downright dastardly (stealing money and hiring a TaskRabbit to go to a library and scan a book and post it on the internet). It seems to me that we’re incentivizing the AI all the way down that gradient.
I’m sure I’m misunderstanding something, and appreciate your patience.
Sorry. Thanks for your patience. When you write:
Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.
…I don’t know what a “step” is.
As above, if I sit on my couch staring into space brainstorming for an hour and then write down a plan, how many “steps” was that? 1 step or 1000s of steps?
Hmm. I am concerned that the word “step” (and relatedly, “process”) is equivocating between two things:
I feel like we are not entitled to use Def'n 2 without interpretability / internals-based supervision—or alternatively very very short steps as in LLMs maybe—but that you have been sneaking in Def'n 2 by insinuation. (Sorry if I’m misunderstanding.)
Anyway, under Def'n 1, we are giving gradient updates towards agents that do effective means-end reasoning towards goals, right? Because that’s a good way to come up with a sub-sub-plan that human inspection / auditing will rate highly.
So I claim that we are plausibly gradient-updating to make “within-one-step goal-seeking agents”. Now, we are NOT gradient-updating aligned agents to become misaligned (except in the fairly-innocuous “Writing outputs that look better to humans than they actually are” sense). That’s good! But it seems to me that we got that benefit entirely from the boxing.
(I generally can’t think of any examples where “The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff” comes apart from boxing, that’s also consistent with everything else you’ve said.)
OK, I think this is along the lines of my other comment above:
I also think it’s good (for safety) to try to keep the AI from manipulating the real world and seeing the consequences within a single “AI does some stuff” step, i.e. Example 2 is especially bad in a way that neither Examples 0 nor 1 are. I think we’re in agreement here too.
Most of your reply makes me think that what you call “process-based supervision” is what I call “Put the AI in a box, give it tasks that it can do entirely within the box, prevent it from escaping the box (and penalize it if you catch it trying), and hope that it doesn’t develop goals & strategies that involve trying to escape the box via generalization and situation awareness.”
Insofar as that’s what we’re talking about, I find the term “boxing” clearer and “process-based supervision” kinda confusing / misleading.
Specifically, in your option A (“give the AI 10 years to produce a plan…”):
But whatever, that’s just terminology. I think we both agree that doing that is good for safety (on the margin), and also that it’s not sufficient for safety. :)
Separately, I’m not sure what you mean by “steps”. If I sit on my couch brainstorming for an hour and then write down a plan, how many “steps” was that?
Hmm. I think “process-based” is a spectrum rather than a binary.
Let’s say there’s a cycle:
There’s a spectrum based on how long each P cycle is:
Example 1 (“GPT with process-based supervision”):
Example 2 (“AutoGPT with outcome-based supervision”):
Example 0 (“Even more process-based than example 1”):
~~
I think that it’s good (for safety) to shorten the cycles, i.e. Example 2 is more dangerous than Example 1 which is more dangerous than Example 0. I think we’re in agreement here.
I also think it’s good (for safety) to try to keep the AI from manipulating the real world and seeing the consequences within a single “AI does some stuff” step, i.e. Example 2 is especially bad in a way that neither Examples 0 nor 1 are. I think we’re in agreement here too.
I don’t think either of those good ideas is sufficient to give us a strong reason to believe the AI is safe. But I guess you agree with that too. (“…at best highly uncertain rather than "strong default of danger."”)
The question is, if you never differentially reinforce dangerous unintended behavior/aims, how does dangerous behavior/aims arise? There are potential answers - perhaps you are inadvertently training an AI to pursue some correlate of "this plan looks good to a human," leading to inner misalignment
Yeah, basically that.
My concerns are:
at best highly uncertain rather than "strong default of danger."
As mentioned above, we don’t necessarily disagree here. I’m OK with “highly uncertain”. :)
I think I see where you’re coming from but I generally have mixed feelings, and am going back and forth but leaning towards sticking with textbook terminology for my part.
Once we fix the policy network and sampling procedure, we get a mapping from observations…to probability distributions over outputs…. This mapping is the policy.…
Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present.) I think it's appropriate to call that kind of computation "agentic." But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).
I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?
If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.
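To make that indirect relation concrete, here’s a deliberately tiny sketch (a made-up toy game, not AlphaZero itself; all function names and numbers are invented for illustration). The “policy network” maps a state directly to move probabilities, whereas the overall input-output function routes those candidates through a lookahead search, so the two are quite different objects:

```python
# Toy single-player game (hypothetical, for illustration only):
# the state is an integer, a move adds 1-3, and the goal is to
# land exactly on 10.
def legal_moves(state):
    return [1, 2, 3] if state < 10 else []

def apply_move(state, move):
    return state + move

def evaluate(state):
    # Heuristic value of a state (closer to 10 is better).
    return -abs(10 - state)

def policy_network(state):
    # Stand-in for a learned "policy head": maps a state directly
    # to a probability distribution over moves (here, uniform).
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}

def overall_policy(state, depth=1):
    # The full input-output function: the policy network proposes
    # candidate moves, but a lookahead search picks the final move,
    # so the relation between "policy network" and the final
    # input-output mapping is indirect.
    def value(s, d):
        moves = legal_moves(s)
        if d == 0 or not moves:
            return evaluate(s)
        return max(value(apply_move(s, m), d - 1) for m in moves)
    candidates = policy_network(state)
    return max(candidates, key=lambda m: value(apply_move(state, m), depth - 1))
```

Here `policy_network(7)` is just a distribution over moves, while `overall_policy(7)` commits to a single move only after searching ahead; calling both of them “the policy” is where the confusion creeps in.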
Don't assume the conclusion by calling a policy an "agent"
The word “agent” invokes a bundle of intuitions / associations, and you think many of those are misleading in general. So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.
Neither option is great; this is obviously a judgment call.
For my part, I think that if I say:
“An RL agent isn’t necessarily planning ahead towards goals, in many cases it’s better to think of it as a bundle of situation-dependent reactions…”
…then that strikes me as a normal kind of thing to say as part of a healthy & productive conversation.
So maybe I see pushing-back-on-the-intuitions-while-keeping-the-word as a more viable approach than you do.
(And separately, I see editing widely-used terminology as a very very big cost, probably moreso than you.)
Ditto for “reward”.
“Reinforcement-maximizing policy”
This sounds slightly weird to my ear, because I seem to intuitively associate “reinforcement” with “updates”, whereas the policy in question is a fixed point that has stopped getting updated altogether.
I…don't currently think RL is much more dangerous than other ways of computing weight updates
You mention that this is off-topic so maybe you don’t want to discuss it, but I probably disagree with that—with the caveat that it’s very difficult to do an other-things-equal comparison. (I.e., we’re presumably interested in RL-safety-versus-SSL-safety holding capabilities fixed, but switching from RL to SSL does have an effect on capabilities.)
Then later you say “only using unsupervised pretraining doesn't mean you're safe” which is a much weaker statement, and I agree with it.
I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."
I am happy that your papers exist to throw at such people.
Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies tend to seek power :) :) And if optimal policies aren't too relevant to the alignment problem, well neither are 99.99999% of papers, but it would be pretty silly to retract all of those :)
So this basic idea: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result.
Ohh, sorry you had to tell me twice, but maybe I’m finally seeing where we’re talking past each other.
Back to the OP, you wrote:
- In training, an AI system gets tasks of the form “Produce a plan to accomplish X that looks good to humans” (not tasks of the form “accomplish X”).
- The AI system is rewarded based on whether the plan makes sense and looks good to humans - not how well it actually ends up working.
When I read that, I was thinking that you meant:
But your subsequent replies make me think that this isn’t what you meant, particularly the “brainstorm for an hour” part.
…But hold that thought while I explain why I don’t find the above plan very helpful (just so you understand my previous responses):
OK, so that’s where I was coming from in my previous replies.
But, now I no longer think that the above is what you meant in the first place. Instead I think you meant:
Is that right?
If so, that makes a lot more sense.
In my (non-LLM) context, I would re-formulate the above as something like:
If that’s a fair translation, then yeah, this seems like an excellent idea to me, I’m all for it.
I don’t see this working as “Plan A”, because I expect the interpretability step to be pretty labor-intensive (untangling a “thought” made of lots of idiosyncratic concepts built on other concepts), and therefore only adequate to produce a quite sparse reward, whereas I think programmers will need a very dense reward signal for capabilities reasons.
(UPDATE: you're probably gonna say we can solve that with a reward model. Hmm, I guess I'm more thinking that we need a rich and high-dimensional ground truth for this to be sufficient, and I'm skeptical that we'll get enough samples, if each sample is a whole research project. It also requires that we never stop supervising, even in deployment, thanks to the AI online-learning. But I dunno, I guess I shouldn't rule out things like that.)
But even if we have some other “Plan A” for the rewards, I would still want to do this kind of thing as much as possible, as another layer of protection.
Thanks, that all makes sense.
I agree you can still get a problem from goal misgeneralization and instrumental reasoning, but this seems noticeably less likely (assuming process-based training) than getting a problem from reinforcing pursuit of unintended outcomes. (https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty has some discussion.) I put significant credence on something like "Internals-based training doesn't pan out, but neither does the concern about goal misgeneralization and instrumental reasoning (in the context of process-based training, ie in the context of not reinforcing pursuit of unintended outcomes)."
I think an important thing behind the scenes here is that you’re hopeful that we can get to TAI using mostly self-supervised learning, whereas I’m expecting that people are eventually going to switch to AI approaches that involve RL in a much more central way than LLMs do today. (More like humans, where for every thought you think, you’re thinking it in part because it’s more rewarding than whatever alternate thoughts you might think instead.) In other words, I’m planning for futures where your “Hypothetical Training Approach” doesn’t work.
I think it’s a good rule-of-thumb to say that the more centrally RL is involved in TAI, as opposed to self-supervised learning, the harder is safety & alignment in general.
(So, I’m pretty unhappy to hold the belief that RL will probably be very centrally involved in TAI!)
So anyway, in the scenarios I’m thinking about, I think “goal misgeneralization and instrumental reasoning” is far more likely and difficult to avoid, compared to what you’re thinking about. I suppose process-based optimization would help a nonzero amount even in my scenarios, but sufficiently little that it doesn’t really seem worth thinking about.
(Obviously, nobody knows how to make TAI, so I think it’s fine and healthy for different people to be thinking through different scenarios! We’ll find out sooner or later!)
That was helpful, thanks! I will try to rephrase. Let’s consider two failure modes. First, define some terms:
X is what we want the AI to be trying to do:
Y is something we want the AI to not try to do:
Now consider two failure modes.
FAILURE MODE 1:
FAILURE MODE 2:
My understanding is that you’re thinking about Failure Mode 1 here, and you’re saying that process-based training will help because in that setting it’s less difficult to supervise really well, such that we’re not rewarding the AI for doing Y a little bit / incompetently / randomly.
If so—OK, fair enough.
However, we still need to deal with Failure Mode 2.
One might hope that Failure Mode 2 won’t happen because the AI won’t want to do Y in the first place, because after all it’s never done Y before and got rewarded. However, you can still get Y from goal misgeneralization and instrumental reasoning. (E.g., it’s possible for the AI to generalize from its reward history to “wanting to get reward [by any means necessary]”, and then it wants to hack out of the box for instrumental reasons, even if it’s never done anything like that before.)
So, I can vaguely imagine plans along the lines of:
If that’s what you have in mind, then yeah, I see why you’re interested in process-based optimization as a piece of that puzzle.
But for my part, I’m not crazy about any plans of that form. I’m currently more hopeful about plans along the lines of:
And then process-based training is not so relevant.
[I have two vague research paths along these lines (1,2), although I’m not sure you’d find those links useful in any detail because I’m assuming model-based RL rather than LLMs.]
OK! I think I’m on board now.
Let me try to explain “process-based feedback” from first principles in my own words.
We have a problem: if an agent wants to do X in the real world, dastardly real-world power-seeking actions are probably helpful for that.
The very hard manifestation of this problem is: there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training, and then seemingly out of nowhere it does a power-seeking treacherous turn (maybe it outputs “Help me help me I’m suffering!”), via foresighted planning towards a misaligned / misgeneralized goal.
…But we’re not talking about that.
The less-hard manifestation of this problem is: the AI sometimes does a little bit of real-world power-seeking, and by doing so it does better at its task X, and thus we reinforce those actions, and then it does more and more real-world power-seeking.
Our basic approach to solve this latter problem is to never propagate training signals through a sequence in which real-world actions are taken and their consequences play out.
More specifically:
(1) We often have the AI acting within a “boxed” mode that is sufficiently constrained that there is no opportunity for problematic power-seeking actions. No internet access, no chatting with humans, etc.
(2) However, the AI will sometimes “propose a step” that cannot be done within the “boxed” mode. “Proposing a step” can either look like:
(2A) printing out a suggestion that humans do some discrete real-world thing, or
(2B) printing out a discrete real-world thing that could be done immediately & automatically by a simple non-AI script (e.g. “writing down a specific email that it proposes to send”, or “writing down a specific google search query that it proposes to type in”, etc.).
(3) The AI gets a training signal from one and only one source:
(3A) Whenever the AI “proposes a step” (and before the corresponding proposal is actually executed) let’s call that an “audit-able moment”.
(3B) For some small fraction of audit-able moments, we freeze execution and do an audit, which entails evaluating how reasonable that step looks, again without executing it.
(3C) We update the model based on that evaluation…
(3D) …but we make sure to do that model update in such a way that we only propagate training signals to the things that the AI did since the completion of the previous step.
So for example, in a recurrent model, we could do BPTT, but we would stop the BPTT at the completion of the previous step. Or in RL, we could apply TD updates, but only to the period after the completion of the previous step. Etc. We’re treating all its previous proposals & activities as fixed and exogenous for the purpose of our model-updating step.
If we do that, then we’re only updating based on activities during a period within which the AI was entirely in the “boxed” mode. And therefore we can feel confident that we are not differentially reinforcing real-world power-seeking behavior!
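The scheme in (1)-(3) can be sketched in toy code. Everything below is a hypothetical illustration, not anything from the original discussion: the action names, audit scores, audit probability, and learning rate are all invented. A tabular softmax “policy” proposes one discrete step at a time; an occasional audit scores the proposal without executing it; and the REINFORCE-style update touches only the action taken in the audited step, with everything earlier treated as fixed, exogenous context:

```python
import math
import random

random.seed(0)

# Hypothetical proposal types the AI might print out (step 2 above).
ACTIONS = ["read_open_access_paper", "run_search_query", "guess_password"]
logits = {a: 0.0 for a in ACTIONS}  # tabular softmax "policy"

def probs():
    z = {a: math.exp(logits[a]) for a in ACTIONS}
    total = sum(z.values())
    return {a: w / total for a, w in z.items()}

def sample_proposal():
    r, acc = random.random(), 0.0
    for a, p in probs().items():
        acc += p
        if r <= acc:
            return a
    return ACTIONS[-1]

def audit_score(action):
    # Stand-in for the human audit (3B): rate how reasonable the
    # proposed step looks, WITHOUT executing it or seeing outcomes.
    return {"read_open_access_paper": 1.0,
            "run_search_query": 0.8,
            "guess_password": -1.0}[action]

def update_on_audited_step(action, score, lr=0.5):
    # (3C)-(3D): a REINFORCE update confined to this one step.
    # d log pi(action) / d logit[b] = 1[b == action] - pi(b).
    # Earlier proposals are treated as fixed and exogenous, so no
    # training signal propagates through already-executed steps.
    p = probs()
    for a in ACTIONS:
        indicator = 1.0 if a == action else 0.0
        logits[a] += lr * score * (indicator - p[a])

for _ in range(500):            # many proposed steps...
    proposal = sample_proposal()
    if random.random() < 0.3:   # ...a small fraction get audited (3B)
        update_on_audited_step(proposal, audit_score(proposal))
```

The point of the sketch is the credit-assignment boundary: the update uses only the audited proposal and the audit score, never any downstream consequence of executing a proposal, which is the analogue of stopping BPTT (or restricting TD updates) at the completion of the previous step.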
And yet:
Let me know if that’s not what you were getting at. Thanks again.