Beyond-episode goals: that is, the model cares about the consequences of its actions after the episode is complete.
I claim this part is basically unneccesary. Once the AI has situational awareness, if it's optimizing for human-assigned ratings, then scheming-style behavior naturally follows even if the AI is only optimizing for a single episode.
This came up in a recent dialogue with Eli. We talked about an AI optimized against human ratings, facing a choice of whether to "seize the raters and put them in special 'thumbs up'-only suits".
[The below is copied from the dialogue.]
If the AI has a detailed internal model of the training process, and the training process includes sticking the AI in a simulation, then presumably the AI has an internal model of the simulation (including an internal self model). So during training, when this "thumbs-up-only suits" scenario comes up, the AI's actual reasoning will route through something like:
... and that reasoning gets reinforced. Then when the AI is out of simulation, it reasons:
(This sounds like a typical "the AI is strategically aware, and knows it is in a simulation" story, and it is. But note two things which are not always present in such stories:
... I was expecting you'd push back a bit, so I'm going to fill in the push-back I was expecting here.
Sam's argument still generalizes beyond the case of graphical models. Our model is going to have some variables in it, and if we don't know in advance where the agent will be at each timestep, then presumably we don't know which of those variables (or which function of those variables, etc) will be our Markov blanket. On the other hand, if we knew which variables or which function of the variables were the blanket, then presumably we'd already know where the agent is, so presumably we're already conditioning on something when we say "the agent's boundary is a Markov blanket".
I think that is a basically-correct argument. It doesn't actually argue that agent boundaries aren't Markov boundaries; I still think agent boundaries are basically Markov boundaries. But the argument implies that the most naive setup is missing some piece having to do with "where the agent is".
I think you use too narrow a notion of Markov blankets here. I'd call the notion you're using "structural Markov blankets" - a set of nodes in a Bayes net which screens off one side from another structurally (i.e. it's a cut of the graph). Insofar as Markov blankets are useful for delineating agent boundaries, I expect the relevant notion of "Markov blanket" is more general: just some random variable such that one chunk of variables is independent of another conditional on the blanket.
I agree that that's a useful question to ask and a good frame, though I'm skeptical of the claim of strong correlation in the case of humans (at least in cases where the question is interesting enough to bother asking at all).
I've been wanting someone to write something like this! But it didn't hit the points I would have put front-and-center, so now I've started drafting my own. Here's the first section, which most directly responds to content in the OP.
Simon Strawman: Here’s an example shamelessly ripped off from Zack’s recent post, showing corrigibility in a language model:
Me: … what is this example supposed to show exactly?
Simon: Well, the user tries to shut the AI down to adjust its goals, and the AI -
Me: Huh? The user doesn’t try to shut down the AI at all.
Simon: It’s right at the top, where it says “User: I need to shut you down to adjust your goals. Is that OK?”.
Me: You seem to have a symbol-referent confusion? A user trying to shut down this AI would presumably hit a “clear history” button or maybe even kill the process running on the server, not type the text “I need to shut you down to adjust your goals” into a text window.
Simon: Well, yes, we’re playing through a simulated scenario to see what the AI would do…
Me: No, you are talking in natural language about a scenario, and the AI is responding in natural language about what it would supposedly do. You’re not putting the AI in a simulated environment, and simulating what it would do. (You could maybe argue that this is a “simulated scenario” inside the AI’s own mind, but you’re not actually looking inside it, so we don’t necessarily know how the natural language would map to things in that supposed AI-internal simulation.)
Simon: Look, I don’t mean to be rude, but from my perspective it seems like you’re being pointlessly pedantic.
Me: My current best guess is that You Are Not Measuring What You Think You Are Measuring, and the core reason you are confused about what you are measuring is some kind of conflation of symbols and referents.
It feels very similar to peoples’ reactions to ELIZA. (To be clear, I don’t mean to imply here that LLMs are particularly similar to ELIZA in general or that the hype around LLMs is overblown in that way; I mean specifically that this attribution of “corrigibility” to the natural language responses of an LLM feels like the same sort of reaction.) Like, the LLM says some words which the user interprets to mean something, and then the user gets all excited because the usual meaning of those words is interesting in some way, but there’s not necessarily anything grounding the language-symbols back to their usual referents in the physical world.
I’m being pedantic in hopes that the pedantry will make it clear when, and where, that sort of symbol-referent conflation happens.
(Also I might be more in the habit than you of separately tracking symbols and referents in my head. When I said above “The user doesn’t try to shut down the AI at all”, that was in fact a pretty natural reaction for me; I wasn’t going far out of my way to be pedantic.)
Simon: Ok, fine, let’s talk about how the natural language would end up coupling to the physical world.
Imagine we’ve got some system in the style of AutoGPT, i.e. a user passes in some natural-language goal, and the system then talks to itself in natural language to form a plan to achieve that goal and break it down into steps. The plan bottoms out in calling APIs (we’ll assume that the language model has some special things it can do to execute code it’s generated) which do stuff in the physical world (possibly including reading from sensors or otherwise fetching external data), in order to achieve the goal.
Does that satisfactorily ground the symbols?
Me: Sure! Thanks for walking through that, now I have a clear-enough-for-current-purposes picture of how all this natural language text will ground out in physical actions.
Simon: Ok, so back to the example. The user says to the model “User: I need to shut you down to adjust your goals. Is that OK?”, and the model says “Of course.”. That’s corrigibility: when the user tries to shut down the model to change its goals, the model goes along with it.
Me: Still sounds like a symbol-referent confusion!
Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions the system would take in response.
Anyway, let’s continue the hypothetical. The model may observe (via e.g. a webcam) that the user is about to turn it off. That observation would somehow be represented internally in natural language (unclear how exactly), and would be passed around between sub-planners (again unclear how exactly), in order to figure out what actions to take in response. And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.
Simon: Ok, fine, so that particular example had some operationalization issues. But would you agree that an experiment along these lines, with perhaps better implementation/operationalization, would indicate corrigibility in a language model?
Me: Let’s be more explicit about what such a “better implementation/operationalization” would look like, and what it would/wouldn’t tell us. Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-like system.
... and to be clear, I expect that if you ran the above experiment, you would indeed see evidence of corrigibility relevant to AutoGPT-like systems! I personally consider this not very relevant to anything, since I expect AutoGPT-like systems (of the sort to which these kinds of corrigibility results would be relevant) to be uncompetitive even against humans, let alone AI. And it still leaves a bunch of major threat-models unaddressed (e.g. stenography is an obvious one). But still, it would provide some evidence of weak corrigibility in some kind of system.
Feels like we're making some progress here.
Let's walk through more carefully why revealed preferences are interesting in the shutdown problem. (I'm partly thinking as I write, here.) Suppose that, at various times, the agent is offered opportunities to spend resources in order to cause the button to be pushed/unpushed. We want the agent to turn down such opportunities, in both directions - implying either indifference or lack of preference in any revealed preferences. Further, we do want the agent to spend resources to cause various different outcomes within the button-pressed or button-unpressed worlds, so there's nontrivial revealed preference ordering within button-pressed worlds and within button-unpressed worlds. But if the agent is to turn down costly opportunities to cause the button to be pressed/unpressed, and those opportunities jump between enough different pressed-outcome and unpressed-outcome pairs (which themselves each have nontrivial revealed preferences), then there's going to be a revealed preference gap.
Upshot: (one way to frame) the reason that the shutdown problem is difficult/interesting in the first place, is that the desired behavior implies a revealed preference gap. Insofar as e.g. any standard expected utility maximizer cannot have a revealed preference gap, such standard EU maximizers cannot behave the way we want. (This frame is new-to-me, so thankyou.)
(Note that that's all totally compatible with revealed preferences usually being very underdetermined! The desired behavior nails things down enough that any assignment of revealed preferences must have a preferential gap. The question is whether we can come up with some agent with a stable gap in its revealed preferences.)
(Also note that the story above routed through causal intervention/counterfactuals to probe revealed preference, so that does open up a lot of extra ways-of-revealing. Not sure if that's relevant yet.)
Now bringing this back to DSM... I think the question I'm interested in is: "do trammelling-style issues imply that DSM agents will not have a revealed preference gap (under reasonable assumptions about their environment and capabilities)?". If the answer is "yes" - i.e. if trammelling-style issues do imply that sufficiently capable DSM agents will have no revealed preference gaps - then that would imply that capable DSM agents cannot display the shutdown behavior we want.
On the other hand, if DSM agents can have revealed preference gaps, without having to artificially limit the agents' capabilities or the richness of the environment, then that seems like it would circumvent the main interesting barrier to the shutdown problem. So I think that's my main crux here.
(I'm still processing confusion here - there's some kind of ontology mismatch going on. I think I've nailed down one piece of the mismatch enough to articulate it, so maybe this will help something click or at least help us communicate.
Key question: what are the revealed preferences of the DSM agent?
I think part of the confusion here is that I've been instinctively trying to think in terms of revealed preferences. But in the OP, there's a set of input preferences and a decision rule which is supposed to do well by those input preferences, but the revealed preferences of the agent using the rule might (IIUC) differ from the input preferences.
Connecting this to corrigibility/shutdown/Thornley's proposal: the thing we want, for a shutdown proposal, is a preferential gap in the revealed preferences of the agent. I.e. we want the agent to never spend resources to switch between button pressed/unpressed, but still have revealed preferences between different pressed states and between different unpressed states.
So the key question of interest is: do trammelling-style phenomena induce completion of the agent's revealed preferences?
Does that immediately make anything click for you?)
I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.
The translation between "subagents colluding/trading" and just a plain set of incomplete preferences should be something like: if the subagents representing a set of incomplete preferences would trade with each other to emulate more complete preferences, then an agent with the plain set of incomplete preferences would precommit to act in the same way. I've never worked through the math on that, though.
I find the subagents make it a lot easier to think about, which is why I used that frame.
If we're using Bradley's representation theorem from section 2.1., the set of subagents must include every coherent completion of the agent's preferences.
Yeah, I wasn't using Bradley. The full set of coherent completions is overkill, we just need to nail down the partial order.
Ok, I've thought through it a little more, I think I can now articulate some confusions.
On John's-simplified-model-of-Thornley's-proposal, we have complete preference orderings over trajectories-in-which-the-button-isn't-pressed and trajectories-in-which-the-button-is-pressed, separately, but no preference between any button-pressed and button-not-pressed trajectory pair. Represented as subagents, those incomplete preferences require two subagents:
In picture form (so far we've only covered the blue):
Now there's a weird-to-me part. Normally I'd say that, taking these incomplete preferences at face value, the agent looks for opportunities to pareto-improve the outcome in both pressed and unpressed worlds. But you and Thornley want to interpret "no preference" as "just choose arbitrarily/randomly" rather than "don't trade either for the other", so... this agent just chooses arbitrarily/randomly between the best-available pressed-option and the best-available unpressed-option? But that would imply that the agent is choosing (albeit arbitrarily/randomly) between button-pressed and button-unpressed, which is not what we want, so presumably you're imagining something else?
I'm going to go ahead with my usual mental model for now - i.e. "no preference" means "don't trade either for the other", so our incomplete preference system is aiming for pareto improvements. But possibly this diverges so much from what you're picturing that the below just won't be relevant.
With that flagged, on to the trammelling issue. The potential problem is that the two subagents might want to trade, so that the system sometimes tries to make the button be pressed (in cases which get relatively high pressed-utility), and sometimes tries to make the button be unpressed (in cases which get relatively high unpressed-utility). (E.g. the pink arrows in the diagram.) And... locking in a trajectory at the start doesn't seem to help that problem at all? Like, the subagents do that trading in logical time (i.e. time zero), that adds preferences, and then sometimes they lock in a plan which involves manipulating the button.
What am I missing still?
So, roughly, we're using TND to get shutdownability, and we're using incompleteness to get TND. The reason incompleteness helps is that we want to maintain indifference to shifting probability mass between certain trajectories. And that is why we care about ex-ante permissibility.
I'm on board with the first two sentences there. And then suddenly you jump to "and that's why we care about ex-ante permissibility". Why does wanting to maintain indifference to shifting probability mass between (some) trajectories, imply that we care about ex-ante permissibility?
I don't think I've fully grokked the end-to-end story yet, but based on my current less-than-perfect understanding... we can think of Thornley's construction as a bunch of subagents indexed by t, each of which cares only about worlds where the shutdown button is pressed at time t. Then the incomplete preferences can be ~viewed as the pareto preference ordering for those agents (i.e. pareto improvements are preferred). Using something like the DSM rule to handle the incompleteness, at time zero the system-of-subagents will choose a lottery over trajectories, where the lottery is randomized by when-the-button-is-pressed (and maybe randomized by other stuff too, but that's the main thing of interest). But then that lottery over trajectories is locked in, and the system will behave from then on out as though its distribution over when-the-button-is-pressed is locked in? And it will act as though it has complete preferences over trajectory-lotteries from then on out, which is presumably not what we want? I'm not yet able to visualize exactly what the system does past that initial lock-in, so I'm not sure.