All of johnswentworth's Comments + Replies

Beyond-episode goals: that is, the model cares about the consequences of its actions after the episode is complete.

I claim this part is basically unneccesary. Once the AI has situational awareness, if it's optimizing for human-assigned ratings, then scheming-style behavior naturally follows even if the AI is only optimizing for a single episode.

This came up in a recent dialogue with Eli. We talked about an AI optimized against human ratings, facing a choice of whether to "seize the raters and put them in special 'thumbs up'-only suits".

[The below is copied... (read more)

3Joe Carlsmith8d
I agree that AIs only optimizing for good human ratings on the episode (what I call "reward-on-the-episode seekers") have incentives to seize control of the reward process, that this is indeed dangerous, and that in some cases it will incentivize AIs to fake alignment in an effort to seize control of the reward process on the episode (I discuss this in the section on "non-schemers with schemer-like traits"). However, I also think that reward-on-the-episode seekers are also substantially less scary than schemers in my sense, for reasons I discuss here (i.e., reasons to do with what I call "responsiveness to honest tests," the ambition and temporal scope of their goals, and their propensity to engage in various forms of sandbagging and what I call "early undermining"). And this especially for reward-on-the-episode seekers with fairly short episodes, where grabbing control over the reward process may not be feasible on the relevant timescales.

... I was expecting you'd push back a bit, so I'm going to fill in the push-back I was expecting here.

Sam's argument still generalizes beyond the case of graphical models. Our model is going to have some variables in it, and if we don't know in advance where the agent will be at each timestep, then presumably we don't know which of those variables (or which function of those variables, etc) will be our Markov blanket. On the other hand, if we knew which variables or which function of the variables were the blanket, then presumably we'd already know where t... (read more)

2Abram Demski5d
I find your attempted clarification confusing.  No? A probabilistic model can just be a probability distribution over events, with no "random variables in it". It seemed like your suggestion was to define the random variables later, "on top of" the probabilistic model, not as an intrinsic part of the model, so as to avoid the objection that a physics-ish model won't have agent-ish variables in it. So the random variables for our markov blanket can just be defined as things like skin surface temperature & surface lighting & so on; random variables which can be derived from a physics-ish event space, but not by any particularly simple means (since the location of these things keeps changing). Again, no? If I know skin surface temperature and lighting conditions and so on all add up to a Markov blanket, I don't thereby know where the skin is. It seems like you agree with Sam way more than would naively be suggested by your initial reply. I don't understand why.  When I talked with Sam about this recently, he was somewhat satisfied by your reply, but he did think there were a bunch of questions which follow. By giving up on the idea that the markov blanket can be "built up" from an underlying causal model, we potentially give up on a lot of niceness desiderata which we might have wanted. So there's a natural question of how much you want to try and recover, which you could have gotten from "structural" markov blankets, and might be able to get some other way, but don't automatically get from arbitrary markov blankets. In particular, if I had to guess: causal properties? I don't know about you, but my OP was mainly directed at Critch, and iiuc Critch wants the Markov blanket to have some causal properties so that we can talk about input/output. I also find it appealing for "agent boundaries" to have some property like that. But if the random variables are unrelated to a causal graph (which, again, is how I understood your proposal) then it seems difficult to recove

I think you use too narrow a notion of Markov blankets here. I'd call the notion you're using "structural Markov blankets" - a set of nodes in a Bayes net which screens off one side from another structurally (i.e. it's a cut of the graph). Insofar as Markov blankets are useful for delineating agent boundaries, I expect the relevant notion of "Markov blanket" is more general: just some random variable such that one chunk of variables is independent of another conditional on the blanket.

Ahhh that makes sense, thanks.

I agree that that's a useful question to ask and a good frame, though I'm skeptical of the claim of strong correlation in the case of humans (at least in cases where the question is interesting enough to bother asking at all).

I've been wanting someone to write something like this! But it didn't hit the points I would have put front-and-center, so now I've started drafting my own. Here's the first section, which most directly responds to content in the OP.


Failure Mode: Symbol-Referent Confusions

Simon Strawman: Here’s an example shamelessly ripped off from Zack’s recent post, showing corrigibility in a language model:

Me: … what is this example supposed to show exactly?

Simon: Well, the user tries to shut the AI down to adjust its goals, and the AI - 

Me: Huh? The user doe... (read more)

3Alex Turner1mo
For people, at least, there is a strong correlation between "answers to 'what would you do in situation X?'" and "what you actually do in situation X." Similarly, we could also measure these correlations for language models so as to empirically quantify the strength of the critique you're making. If there's low correlation for relevant situations, then your critique is well-placed. (There might be a lot of noise, depending on how finicky the replies are relative to the prompt.)

Feels like we're making some progress here.

Let's walk through more carefully why revealed preferences are interesting in the shutdown problem. (I'm partly thinking as I write, here.) Suppose that, at various times, the agent is offered opportunities to spend resources in order to cause the button to be pushed/unpushed. We want the agent to turn down such opportunities, in both directions - implying either indifference or lack of preference in any revealed preferences. Further, we do want the agent to spend resources to cause various different outcomes with... (read more)

4Sami Petersen1mo
Great, I think bits of this comment help me understand what you're pointing to. I think this is roughly right, together with all the caveats about the exact statements of Thornley's impossibility theorems. Speaking precisely here will be cumbersome so for the sake of clarity I'll try to restate what you wrote like this: 1. Useful agents satisfying completeness and other properties X won't be shutdownable. 2. Properties X are necessary for an agent to be useful. 3. So, useful agents satisfying completeness won't be shutdownable. 4. So, if a useful agent is shutdownable, its preferences are incomplete. This argument would let us say that observing usefulness and shutdownability reveals a preferential gap. A quick distinction: an agent can (i) reveal p, (ii) reveal ¬p, or (iii) neither reveal p nor ¬p. The problem of underdetermination of preference is of the third form. We can think of some of the properties we've discussed as 'tests' of incomparability, which might or might not reveal preferential gaps. The test in the argument just above is whether the agent is useful and shutdownable. The test I use for my results above (roughly) is 'arbitrary choice'. The reason I use that test is that my results are self-contained; I don't make use of Thornley's various requirements for shutdownability. Of course, arbitrary choice isn't what we want for shutdownability. It's just a test for incomparability that I used for an agent that isn't yet endowed with Thornley's other requirements. The trammelling results, though, don't give me any reason to think that DSM is problematic for shutdownability. I haven't formally characterised an agent satisfying DSM as well as TND, Stochastic Near-Dominance, and so on, so I can't yet give a definitive or exact answer to how DSM affects the behaviour of a Thornley-style agent. (This is something I'll be working on.) But regarding trammelling, I think my results are reasons for optimism if anything. Even in the least convenient case

(I'm still processing confusion here - there's some kind of ontology mismatch going on. I think I've nailed down one piece of the mismatch enough to articulate it, so maybe this will help something click or at least help us communicate.

Key question: what are the revealed preferences of the DSM agent?

I think part of the confusion here is that I've been instinctively trying to think in terms of revealed preferences. But in the OP, there's a set of input preferences and a decision rule which is supposed to do well by those input preferences, but the revealed ... (read more)

4Sami Petersen2mo
That makes sense, yeah. Let me first make some comments about revealed preferences that might clarify how I'm seeing this. Preferences are famously underdetermined by choice behaviour. If A and B are available and I pick A, you can't infer that I like A more than B — I might be indifferent or unable to compare them. Worse, under uncertainty, you can't tell why I chose some lottery over another even if you assume I have strict preferences between all options — the lottery I choose depends on my beliefs too. In expected utility theory, beliefs and preferences together induce choice, so if we only observe choice, we have one equation in two unknowns.[1] Given my choice, you'd need to read my mind's probabilities to be able to infer my preferences (and vice versa).[2] In that sense, preferences (mostly) aren't actually revealed. Economists have to assume various things to apply revealed preference theory, e.g. setting beliefs equal to 'objective chances', or assuming a certain functional form for the utility function. But why do we care about preferences per se, rather than what's revealed? Because we want to predict future behaviour. If you can't infer my preferences from my choices, you can't predict my future choices. In the example above, if my 'revealed preference' between A and B is that I prefer A, then you might make false predictions about my future behaviour (because I might well choose B next time). Let me know if I'm on the right track for clarifying things. If I am, could you say how you see trammelling/shutdown connecting to revealed preferences as described here, and I'll respond to that? 1. ^ L∗∈argmaxL∑iu(xi)p(xi[L]) 2. ^ The situation is even worse when you can't tell what I'm choosing between, or what my preference relation is defined over.

I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.

The translation between "subagents colluding/trading" and just a plain set of incomplete preferences should be something like: if the subagents representing a set of incomplete preferences would trade with each other to emulate more comp... (read more)

3Sami Petersen2mo
My results above on invulnerability preclude the possibility that the agent can predictably be made better off by its own lights through an alternative sequence of actions. So I don't think that's possible, though I may be misreading you. Could you give an example of a precommitment that the agent would take? In my mind, an example of this would have to show that the agent (not the negotiating subagents) strictly prefers the commitment to what it otherwise would've done according to DSM etc. I agree the full set won't always be needed, at least when we're just after ordinal preferences, though I personally don't have a clear picture of when exactly that holds.

Ok, I've thought through it a little more, I think I can now articulate some confusions.

On John's-simplified-model-of-Thornley's-proposal, we have complete preference orderings over trajectories-in-which-the-button-isn't-pressed and trajectories-in-which-the-button-is-pressed, separately, but no preference between any button-pressed and button-not-pressed trajectory pair. Represented as subagents, those incomplete preferences require two subagents:

  • One subagent always prefers button pressed to unpressed, is indifferent between unpressed trajectories, and ha
... (read more)
3Sami Petersen2mo
For the purposes of this discussion, this is right. I don't think the differences between this description and the actual proposal matter in this case. I don't think this representation is quite right, although not for a reason I expect to matter for this discussion. It's a technicality but I'll mention it for completeness. If we're using Bradley's representation theorem from section 2.1., the set of subagents must include every coherent completion of the agent's preferences. E.g., suppose there are three possible trajectories. Let p denote a pressed trajectory and u1,u2 two unpressed trajectories, where u1 gets you strictly more coins than u2. Then there'll be five (ordinal) subagents, described in order of preference: ⟨u1,u2,p⟩, ⟨u1,u2p⟩, ⟨u1,p,u2⟩ , ⟨u1p,u2⟩, and ⟨p,u1,u2⟩. Indeed, this wouldn't be good, and isn't what Thornley's proposal does. The agent doesn't choose arbitrarily between the best pressed vs unpressed options. Thornley's proposal adds more requirements on the agent to ensure this. My use of 'arbitrary' in the post is a bit misleading in that context. I'm only using it to identify when the agent has multiple permissible options available, which is what we're after to get TND. If no other requirements are added to the agent, and it's acting under certainty, this could well lead it to actually choose arbitrarily. But it doesn't have to in general, and under uncertainty and together with the rest of Thornley's requirements, it doesn't. (The requirements are described in his proposal.) I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness. That said, I will spend some more time thinking about the subagent idea, and I do think collusion between them seems like the major initial hurd

So, roughly, we're using TND to get shutdownability, and we're using incompleteness to get TND. The reason incompleteness helps is that we want to maintain indifference to shifting probability mass between certain trajectories. And that is why we care about ex-ante permissibility.

I'm on board with the first two sentences there. And then suddenly you jump to "and that's why we care about ex-ante permissibility". Why does wanting to maintain indifference to shifting probability mass between (some) trajectories, imply that we care about ex-ante permissibility... (read more)

3Sami Petersen2mo
The ex-ante permissible trajectories are the trajectories that the agent lacks any strict preference between. Suppose the permissible trajectories are {A,B,C}. Then, from the agent's perspective, A isn't better than B, B isn't better than A, and so on. The agent considers them all equally choiceworthy. So, the agent doesn't mind picking any one of them over any other, nor therefore switching from one lottery over them with some distribution to another lottery with some other distribution. The agent doesn't care whether it gets A versus B, versus an even chance of A or B, versus a one-third chance of A, B, or C.[1] Suppose we didn't have multiple permissible options ex-ante. For example, if only A was permissible, then the agent would dislike shifting probability mass away from A and towards B or C—because B and C aren't among the best options.[2] So that's why we want multiple ex-ante permissible trajectories: it's the only way to maintain indifference to shifting probability mass between (those) trajectories. [I'll respond to the stuff in your second paragraph under your longer comment.] 1. ^ The analogous case with complete preferences is clearer: if there are multiple permissible options, the agent must be indifferent between them all (or else the agent would be fine picking a strictly dominated option). So if n options are permissible, then u(xi)=u(xj)∀i,j∈Nn. Assuming expected utility theory, we'll then of course have ∑ni=1u(xi)p(xi)=∑ni=1u(xi)p′(xi) for any probability functions p,p′. This means the agent is indifferent to shifting probability mass between the permissible options. 2. ^ This is a bit simplified but it should get the point across.
2johnswentworth2mo
Ok, I've thought through it a little more, I think I can now articulate some confusions. On John's-simplified-model-of-Thornley's-proposal, we have complete preference orderings over trajectories-in-which-the-button-isn't-pressed and trajectories-in-which-the-button-is-pressed, separately, but no preference between any button-pressed and button-not-pressed trajectory pair. Represented as subagents, those incomplete preferences require two subagents: * One subagent always prefers button pressed to unpressed, is indifferent between unpressed trajectories, and has the original complete order on pressed trajectories. * The other subagent always prefers button unpressed to pressed, is indifferent between pressed trajectories, and has the original complete order on unpressed trajectories. In picture form (so far we've only covered the blue): Now there's a weird-to-me part. Normally I'd say that, taking these incomplete preferences at face value, the agent looks for opportunities to pareto-improve the outcome in both pressed and unpressed worlds. But you and Thornley want to interpret "no preference" as "just choose arbitrarily/randomly" rather than "don't trade either for the other", so... this agent just chooses arbitrarily/randomly between the best-available pressed-option and the best-available unpressed-option? But that would imply that the agent is choosing (albeit arbitrarily/randomly) between button-pressed and button-unpressed, which is not what we want, so presumably you're imagining something else? I'm going to go ahead with my usual mental model for now - i.e. "no preference" means "don't trade either for the other", so our incomplete preference system is aiming for pareto improvements. But possibly this diverges so much from what you're picturing that the below just won't be relevant. With that flagged, on to the trammelling issue. The potential problem is that the two subagents might want to trade, so that the system sometimes tries to make the button

Very cool math (and clear post), but I think this formulation basically fails to capture the central Goodheart problem.

Relevant slogan: Goodheart is about generalization, not approximation.

A simple argument for that slogan: if we have a "true" utility , and an approximation  which is always within  of , then optimizing  achieves a utility under  within  of the optimal . So optimizing an approximation of the utility function yields an outcome which is approximately-optimal und... (read more)

It's been a couple months, but I finally worked through this post carefully enough to respond to it.

I tentatively think that the arguments about trammelling define/extend trammelling in the wrong way, such that the arguments in the trammelling section prove a thing which is not useful.

The Core Problem

I think this quote captures the issue best:

The agent selects plans, not merely prospects, so no trammelling can occur that was not already baked into the initial choice of plans. The general observation is that, if many different plans are acceptable at the ou

... (read more)
5Sami Petersen2mo
This is a tricky topic to think about because it's not obvious how trammelling could be a worry for Thornley's Incomplete Preference Proposal. I think the most important thing to clarify is why care about ex-ante permissibility. I'll try to describe that first (this should help with my responses to downstream concerns).   Big picture Getting terminology out of the way: words like "permissibility" and "mandatory" are shorthand for rankings of prospects. A prospect is permissible iff it's in a choice set, e.g. by satisfying DSM. It's mandatory iff it's the sole element of a choice set. To see why ex-ante permissibility matters, note that it's essentially a test to see which prospects the agent is either indifferent between or has a preferential gap between (and are not ranked below anything else). When you can improve a permissible prospect along some dimension and yet retain the same set of permissible prospects, for example, you necessarily have a preferential gap between those remaining prospects. In short, ex-ante permissibility tells you which prospects the agent doesn't mind picking between. The part of the Incomplete Preference Proposal that carries much of the weight is the Timestep Near-Dominance (TND) principle for choice under uncertainty. One thing it does, roughly, is require that the agent does not mind shifting probability mass between trajectories in which the shutdown time differs. And this is where incompleteness comes in. You need preferential gaps between trajectories that differ in shutdown time for this to hold in general. If the agent had complete preferences over trajectories, it would have strict preferences between at least some trajectories that differ in shutdown time, giving it reason to shift probability mass by manipulating the button. Why TND helps get you shutdownability is described in Thornley's proposal, so I'll refer to his description and take that as a given here. So, roughly, we're using TND to get shutdownability, and we'

Either Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, "When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas whic

... (read more)
4Vladimir Nesov2mo
What makes this concept confusing and probably a bad framing is that to the extent doom is likely, neither many individual humans nor humanity as a whole are aligned in this sense. Humanity is currently in the process of producing successors that fail to predictably have the property of converging to not kill us. (I agree that this is the MIRI referent of values/alignment and the correct thing to keep in mind as the central concern.)

Are you interpreting me as arguing that alignment is easy in this post?

Not in any sense which I think is relevant to the discussion at this point.

Are you saying that MIRI has been very consistent on the question of where the "hard parts" of alignment lie?

My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models. 

That d... (read more)

Thanks for the continued clarifications.

Our primary existing disagreement might be this part,

My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models. 

Of course, there's no way of proving what these three people would have said in 2016, and I sympathize with the people who are saying they don't care much about the spe... (read more)

I think you have basically not understood the argument which I understand various MIRI folks to make, and I think Eliezer's comment on this post does not explain the pieces which you specifically are missing. I'm going to attempt to clarify the parts which I think are most likely to be missing. This involves a lot of guessing, on my part, at what is/isn't already in your head, so I apologize in advance if I guess wrong.

(Side note: I am going to use my own language in places where I think it makes things clearer, in ways which I don't think e.g. Eliezer or ... (read more)

5Alex Turner2mo
(Placeholder: I think this view of alignment/model internals seems wrongheaded in a way which invalidates the conclusion, but don't have time to leave a meaningful reply now. Maybe we should hash this out sometime at Lighthaven.)
5Matthew Barnett2mo
This comment is valuable for helping to clarify the disagreement. So, thanks for that. Unfortunately, I am not sure I fully understand the comment yet. Before I can reply in-depth, I have a few general questions: 1. Are you interpreting me as arguing that alignment is easy in this post? I avoided arguing that, partly because I don't think the inner alignment problem has been solved, and the inner alignment problem seems to be the "hard part" of the alignment problem, as I understand it. Solving inner alignment completely would probably require (at the very least) solving mechanistic interpretability, which I don't think we're currently close to solving. 2. Are you saying that MIRI has been very consistent on the question of where the "hard parts" of alignment lie? If so, then your comment makes more sense to me, as you (in my understanding) are trying to summarize what their current arguments are, which then (again, in my understanding) would match what MIRI said more than five years ago. However, I was mainly arguing against the historical arguments, or at least my interpretation of these argument, such as the arguments in Nate Soares' 2017 talk. To the extent that the arguments you present are absent from pre-2018 MIRI content, I think they're mostly out of scope for the purpose of my thesis, although I agree that it's important to talk about how hard alignment is independent of all the historical arguments. (In general, I agree that discussions about current arguments are way more important than discussions about what people believed >5 years ago. However, I think it's occasionally useful to talk about the latter, and so I wrote one post about it.)

Here's a meme I've been paying attention to lately, which I think is both just-barely fit enough to spread right now and very high-value to spread.

Meme part 1: a major problem with RLHF is that it directly selects for failure modes which humans find difficult to recognize, hiding problems, deception, etc. This problem generalizes to any sort of direct optimization against human feedback (e.g. just fine-tuning on feedback), optimization against feedback from something emulating a human (a la Constitutional AI or RLAIF), etc.

Many people will then respond: "O... (read more)

5Alex Turner3mo
I agree that there's something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn't a crux.) (Medium confidence) FWIW, RLHF'd models (specifically, the LLAMA-2-chat series) seem substantially easier to activation-steer than do their base counterparts. 

There's a bunch of considerations and models mixed together in this post. Here's a way I'm factoring some of them, which other people may also find useful.

I'd consider counterfactuality the main top-level node; things which would have been done anyway have radically different considerations from things which wouldn't. E.g. doing an eval which (carefully, a little bit at a time) mimics what e.g. chaosGPT does, in a controlled environment prior to release, seems straightforwardly good so long as people were going to build chaosGPT soon anyway. It's a direct ... (read more)

I'm not really sure what you mean by "oversight, but add an epicycle" or how to determine if this is a good summary.

Something like: the OP is proposing oversight of the overseer, and it seems like the obvious next move would be to add an overseer of the oversight-overseer. And then an overseer of the oversight-oversight-overseer. Etc.

And the implicit criticism is something like: sure, this would probably marginally improve oversight, but it's a kind of marginal improvement which does not really move us closer in idea-space to whatever the next better parad... (read more)

the OP is proposing oversight of the overseer,

I don't think this is right, at least in the way I usually use the terms. We're proposing a strategy for conservatively estimating the quality of an "overseer" (i.e. a system which is responsible for estimating the goodness of model actions). I think that you aren't summarizing the basic point if you try to use the word "oversight" for both of those.

5Ryan Greenblatt4mo
This post doesn't discuss how to do oversight, just how to evaluate whether or not your oversight is working. I'm not really sure what you mean by "oversight, but add an epicycle" or how to determine if this is a good summary. Consider a type of process for testing whether or not planes are safe. You could call this "safe planes, but add an epicycle", but I'm not sure what this communicates.

We don't believe that all knowledge and computation in a trained neural network emerges in phase transitions, but our working hypothesis is that enough emerges this way to make phase transitions a valid organizing principle for interpretability.

I think this undersells the case for focusing on phase transitions.

Hand-wavy version of a stronger case: within a phase (i.e. when there's not a phase change), things change continuously/slowly. Anyone watching from outside can see what's going on, and have plenty of heads-up, plenty of opportunity to extrapolate wh... (read more)

This was an outstanding post! The concept of a "conflationary alliance" seems high-value and novel to me. The anthropological study mostly confirms what I already believed, but provides very legible evidence.

Wait... doesn't the caprice rule just directly modify its preferences toward completion over time? Like, every time a decision comes up where it lacks a preference, a new preference (and any implied by it) will be added to its preferences.

Intuitively: of course the caprice rule would be indifferent to completing its preferences up-front via contract/commitment, because it expects to complete its preferences over time anyway; it's just lazy about the process (in the "lazy data structure" sense).

I would order these differently.

Within the first section (prompting/RLHF/Constitutional):

  • I'd guess that Constitutional AI would work only in the very easiest worlds
  • RLHF would work in slightly less-easy worlds
  • Prompting would work in worlds where alignment is easy, but too hard for RLHF or Constitutional AI

The core reasoning here is that human feedback directly selects for deception. Furthermore, deception induced by human feedback does not require strategic awareness - e.g. that thing with the hand which looks like it's grabbing a ball but isn't is a good e... (read more)

2Samuel Dylan Martin5mo
The phenomenon that a 'better' technique is actually worse than a 'worse' technique if both are insufficient is something I talk about in a later section of the post and I specifically mention RLHF. I think this holds true in general throughout the scale, e.g. Eliezer and Nate have said that even complex interpretability-based oversight with robustness testing and AI research assistance is also just incentivizing more and better deception, so this isn't unique to RLHF. But I tend to agree with Richard's view in his discussion with you under that post that while if you condition on deception occurring by default RLHF is worse than just prompting (i.e. prompting is better in harder worlds), RLHF is better than just prompting in easy worlds. I also wouldn't call non-strategically aware pursuit of inaccurate proxies for what we want 'deception', because in this scenario the system isn't being intentionally deceptive. In easy worlds, the proxies RLHF learns are good enough in practice and cases like the famous thing with the hand which looks like it's grabbing a ball but isn't just disappear if you're diligent enough with how you provide feedback. In that world, not using RLHF would get systems pursuing cruder and worse proxies for what we want that fail often (e.g. systems just overtly lie to you all the time, say and do random things etc.). I think that's more or less the situation we're in right now with current AIs! If the proxies that RLHF ends up pursuing are in fact close enough, then RLHF works and will make systems behave more reliably and be harder to e.g. jailbreak or provoke into random antisocial behavior than with just prompting. I did flag in a footnote that the 'you get what you measure' problem that RLHF produces could also be very difficult to deal with for structural or institutional reasons.   I'm assuming you meant fourth-easiest here not fourth hardest. It's important to note that I'm not here talking about testing systems to see if they misbeh

I think this scenario is still strategically isomorphic to "advantages mainly come from overwhelmingly great intelligence". It's intelligence at the level of a collective, rather than the individual level, but the conclusion is the same. For instance, scalable oversight of a group of AIs which is collectively far smarter than any group of humans is hard in basically the same ways as oversight of one highly-intelligent AI. Boxing the group of AIs is hard for the same reasons as boxing one. Etc.

2Joar Skalse5mo
I think the broad strokes are mostly similar, but that a bunch of relevant details are different. Yes, a large collective of near-human AI that is allowed to interact freely over a (subjectively) long period of time is presumably roughly as hard to understand and control as a Bostrom/Yudkowsky-esque God in a box. However, in this scenario, we have the option to not allow free interaction between multiple instances, while still being able to extract useful work from them. It is also probably much easier to align a system that is not of overwhelming intelligence, and this could be done before the AIs are allowed to interact. We might also be able to significantly influence their collective behaviour by controlling the initial conditions of their interactions (similarly to how institutions and cultural norms have a substantial long-term impact on the trajectory of a country, for example). It is also more plausible that humans (or human simulations or emulations) could be kept in the loop for a long time period in this scenario. Moreover, if intelligence is bottle-necked by external resources (such as memory, data, CPU cycles, etc) rather than internal algorithmic efficiency, then you can exert more control over the resulting intelligence explosion by controlling those resources. Etc etc.

My stock counterargument to this: insofar as humans' advantage over other animals stems primarily from our ability to transmit knowledge/memes/etc across individuals and through generations, we should expect AI to have a much larger advantage, because they can do the same thing far, far better. This doesn't even require the AI to be all that "smart" - even just the ability to copy minds directly would allow transmission from "parent" to "child" with far less knowledge-loss than humans can achieve. (Consider, for instance, the notorious difficulty of traini... (read more)

3Joar Skalse5mo
Yes, I agree with this. I mean, even if we assume that the AIs are basically equivalent to human simulations, they still get obvious advantages from the ability to be copy-pasted, the ability to be restored to a checkpoint, the ability to be run at higher clock speeds, and the ability to make credible pre-commitments, etc etc. I therefore certainly don't think there is any plausible scenario in which unchecked AI systems wouldn't end up with most of the power on earth. However, there is a meaningful difference between the scenario where their advantages mainly come from overwhelmingly great intelligence, and the scenario where their advantages mainly (or at least in large part) come from other sources. For example, scaleable oversight is a more realistic possibility in the latter scenario than it is in the former scenario. Boxing methods are also more realistic in the latter scenario than the former scenario, etc.

Two answers here.

First, the standard economics answer: economic profit  accounting profit. Economic profit is how much better a company does than their opportunity cost; accounting profit is revenue minus expenses. A trading firm packed with top-notch physicists, mathematicians, and programmers can make enormous accounting profit and yet still make zero economic profit, because the opportunity costs for such people are quite high. "Efficient markets" means zero economic profits, not zero accounting profits.

Second answer: as Zvi is fond of pointi... (read more)

I don't think it's been posted publicly yet. Elliot said I was welcome to cite it publicly, but didn't explicitly say whether I should link it. @EJT ?

Consider two claims:

  • Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model
  • Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility

These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.

I exp... (read more)

4Johannes C. Mayer5mo
Expected Utility Maximization is Not Enough Consider a homomorphically encrypted computation running somewhere in the cloud. The computations correspond to running an AGI. Now from the outside, you can still model the AGI based on how it behaves, as an expected utility maximizer, if you have a lot of observational data about the AGI (or at least let's take this as a reasonable assumption). No matter how closely you look at the computations, you will not be able to figure out how to change these computations in order to make the AGI aligned if it was not aligned already (Also, let's assume that you are some sort of Cartesian agent, otherwise you would probably already be dead if you were running these kinds of computations). So, my claim is not that modeling a system as an expected utility maximizer can't be useful. Instead, I claim that this model is incomplete. At least with regard to the task of computing an update to the system, such that when we apply this update to the system, it would become aligned. Of course, you can model any system, as an expected utility maximizer. But just because I can use the "high level" conceptual model of expected utility maximization, to model the behavior of a system very well. But behavior is not the only thing that we care about, we actually care about being able to understand the internal workings of the system, such that it becomes much easier to think about how to align the system. So the following seems to be beside the point unless I am <missing/misunderstanding> something: Maybe I have missed the fact that the claim you listed says that expected utility maximization is not very useful. And I'm saying it can be useful, it might just not be sufficient at all to actually align a particular AGI system. Even if you can do it arbitrarily well.

I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?

This getting very meta, but I think my Real Answer is that there's an analogue of You Are Not Measuring What You Think You Are Measuring for plans. Like, the system just does not work any of the ways we're picturing it at all, so plans will just generally not at all do what we imagine they're going to do.

(Of course the plan could still in-pri... (read more)

Yes, there is a story for a canonical factorization of , it's just separate from the story in this post.

Sounds like we need to unpack what "viewing  as a latent which generates " is supposed to mean.

I start with a distribution . Let's say  is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what  is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as , where  i... (read more)

4Rohin Shah7mo
Okay, I understand how that addresses my edit. I'm still not quite sure why the lightcone theorem is a "foundation" for natural abstraction (it looks to me like a nice concrete example on which you could apply techniques) but I think I should just wait for future posts, since I don't really have any concrete questions at the moment.

 is conceptually just the whole bag of abstractions (at a certain scale), unfactored.

3Thane Ruthenis7mo
Sure, but isn't the goal of the whole agenda to show that Λ does have a certain correct factorization, i. e. that abstractions are convergent? I suppose it may be that any choice of low-level boundaries results in the same Λ, but the Λ itself has a canonical factorization, and going from Λ back to XT reveals the corresponding canonical factorization of XT? And then depending on how close the initial choice of boundaries was to the "correct" one, Λ is easier or harder to compute (or there's something else about the right choice that makes it nice to use).

If you have sets of variables that start with no mutual information (conditioning on ), and they are so far away that nothing other than  could have affected both of them (distance of at least ), then they continue to have no mutual information (independent).

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance  implies that nothing other than  could have affected both of them. But man, when I didn't know that was what I should look for? Much les... (read more)

5Rohin Shah7mo
Okay, that mostly makes sense. I agree this is true, but why does the Lightcone theorem matter for it? It is also a theorem that a Gibbs resampler initialized at equilibrium will produce XT distributed according to X, and as you say it's clear that the resampler throws away a ton of information about X0 in computing it. Why not use that theorem as the basis for identifying the information to throw away? In other words, why not throw away information from X0 while maintaining  XT∼X ? EDIT: Actually, conditioned on X0, it is not the case that XT is distributed according to X. (Simple counterexample: Take a graphical model where node A can be 0 or 1 with equal probability, and A causes B through a chain of > 2T steps, such that we always have B = A for a true sample from X. In such a setting, for a true sample from X, B should be equally likely to be 0 or 1, but  BT∣X0=B0, i.e. it is deterministic.) Of course, this is a problem for both my proposal and for the Lightcone theorem -- in either case you can't view X0 as a latent that generates X (which seems to be the main motivation, though I'm still not quite sure why that's the motivation).

Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance  implies that nothing other than  could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.

... I feel compelled to note that I'd pointed out a very similar thing a while ago.

Granted, that's not exactly the same formulation, and the devil's in the details.

Ah, no, I suppose that part is supposed to be handled by whatever approximation process we define for ? That is, the "correct" definition of the "most minimal approximate summary" would implicitly constrain the possible choices of boundaries for which  is equivalent to ?

Almost. The hope/expectation is that different choices yield approximately the same , though still probably modulo some conditions (like e.g. sufficiently large ).

What's the  here? Is it meant to be ?

System size, i.e. number of variab... (read more)

4Thane Ruthenis7mo
Can you elaborate on this expectation? Intuitively, Λ should consist of a number of higher-level variables as well, and each of them should correspond to a specific set of lower-level variables: abstractions and the elements they abstract over. So for a given Λ, there should be a specific "correct" way to draw the boundaries in the low-level system. But if ~any way of drawing the boundaries yields the same Λ, then what does this mean? Or perhaps the "boundaries" in the mesoscale-approximation approach represent something other than the factorization of X into individual abstractions?
4Thane Ruthenis7mo
By the way, do we need the proof of the theorem to be quite this involved? It seems we can just note that for for any two (sets of) variables X1, X2 separated by distance D, the earliest sampling-step at which their values can intermingle (= their lightcones intersect) is D/2 (since even in the "fastest" case, they can't do better than moving towards each other at 1 variable per 1 sampling-step).

The new question is: what is the upper bound on bits of optimization gained from a bit of observation? What's the best-case asymptotic scaling? The counterexample suggests it's roughly exponential, i.e. one bit of observation can double the number of bits of optimization. On the other hand, it's not just multiplicative, because our xor example at the top of this post showed a jump from 0 bits of optimization to 1 bit from observing 1 bit.

Alright, I think we have an answer! The conjecture is false.

Counterexample: suppose I have a very-high-capacity information channel (N bit capacity), but it's guarded by a uniform random n-bit password. O is the password, A is an N-bit message and a guess at the n-bit password. Y is the N-bit message part of A if the password guess matches O; otherwise, Y is 0.

Let's say the password is 50 bits and the message is 1M bits. If A is independent of the password, then there's a  chance of guessing the password, so the bitrate will be about ... (read more)

2johnswentworth7mo
The new question is: what is the upper bound on bits of optimization gained from a bit of observation? What's the best-case asymptotic scaling? The counterexample suggests it's roughly exponential, i.e. one bit of observation can double the number of bits of optimization. On the other hand, it's not just multiplicative, because our xor example at the top of this post showed a jump from 0 bits of optimization to 1 bit from observing 1 bit.

Eliminating G

The standard definition of channel capacity makes no explicit reference to the original message ; it can be eliminated from the problem. We can do the same thing here, but it’s trickier. First, let’s walk through it for the standard channel capacity setup.

Standard Channel Capacity Setup

In the standard setup,  cannot depend on , so our graph looks like

… and we can further remove  entirely by absorbing it into the stochasticity of .

Now, there are two key steps. First step: if  is not a determini... (read more)

They fit naturally into the coherent whole picture. In very broad strokes, that picture looks like selection theorems starting from selection pressures for basic agency, running through natural factorization of problem domains (which is where modules and eventually language come in), then world models and general purpose search (which finds natural factorizations dynamically, rather than in a hard-coded way) once the environment and selection objective has enough variety.

I spent a few hours today just starting to answer this question, and only got as far as walking through what this "agency" thing is which we're trying to understand. Since people have already asked for clarification on that topic, I'll post it here as a standalone mini-essay. Things which this comment does not address, which I may or may not get around to writing later:

  • Really there should be a few more multi-agent phenomena at the end - think markets, organizations/firms, Schelling points, governance, that sort of thing. I ran out of steam before getting t
... (read more)
4Chris_Leong7mo
Thanks for your response. There's a lot of good material here, although some of these components like modules or language seem less central to agency, at least from my perspective. I guess you might see these are appearing slightly down the stack?

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

In hindsight, I do think the period when our discussions took place were a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.

That said, I doubt that fully accounts for the difference in perception.

More details:

  • I think the argument Nate gave is at least correct for markets of relatively-highly-intelligent agents, and that was a big update for me (thankyou Nate!). I'm still unsure how far it generalizes to relatively less powerful agents.
  • Nate left out my other big takeaway: Nate's argument here implies that there's probably a lot of money to be made in real-world markets! In practice, it would probably look like an insurance-like contract, by which two traders would commit to the "side-channel trades at non-market prices" required to make them aggrega
... (read more)

Aside: Vanessa mentioned in person at one point that the game-theoretic perspective on infra-bayes indeed basically works, and she has a result somewhere about the equivalence. So that might prove useful, if you're looking to claim this prize.

If we’d learned that GPT-4 or Claude had those capabilities, we expect labs would have taken immediate action to secure and contain their systems.

At that point, the time at which we should have stopped is probably already passed, especially insofar as:

  • systems are trained with various degrees of internet access, so autonomous function is already a problem even during training
  • people are able to make language models more capable in deployment, via tricks like e.g. chain-of-thought prompting.

As written, this evaluation plan seems to be missing elbow-room. The ... (read more)

4Beth Barnes8mo
Autonomous Replication as we define it in our evaluations (though maybe not clear from our blog post) is significantly below what we think is necessary to actually be an xrisk. In particular, we assume no human resistance, model has access to weights, ways of making money it tries are scalable, doesn't have any issues purchasing tons of GPUs, no monitoring by labs, etc

You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?

The basic analogy is roughly "if we want a baseline for how hard it will be to evaluate an AI's outputs on their own terms, we should look at how hard it is to evaluate humans' outputs on their own terms, especially in areas similar in some way to AI safety". My guess is that you already have lots of intuition about how hard it is to assess results, from your experience assessing grantees, so that's the intuition I was trying to pump. In par... (read more)

3HoldenKarnofsky8mo
I see, thanks. I feel like the closest analogy here that seems viable to me would be to something like: is Open Philanthropy able to hire security experts to improve its security and assess whether they're improving its security? And I think the answer to that is yes. (Most of its grantees aren't doing work where security is very important.) It feels harder to draw an analogy for something like "helping with standards enforcement," but maybe we could consider OP's ability to assess whether its farm animal welfare grantees are having an impact on who adheres to what standards, and how strong adherence is? I think OP has pretty good (not perfect) ability to do so.

I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later.

Indeed, I think you're a good role model in this regard and hope more people will follow your example.

It seems to me like the main crux here is that you're picturing a "phase transition" that kicks in in a fairly unpredictable way, such that a pretty small increase in e.g. inference compute or training compute could lead to a big leap in capabilities. Does that sound right?

I don't think this is implausible but haven't seen a particular reason to consider it likely.

The phrase I'd use there is "grokking general-purpose search". Insofar as general-purpose search consists of a relatively-simple circuit/function recursively calling itself a lot with different c... (read more)

3HoldenKarnofsky9mo
I think I find the "grokking general-purpose search" argument weaker than you do, but it's not clear by how much. The "we" in "we can point AIs toward and have some ability to assess" meant humans, not Open Phil. You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?

+1, this is probably going to be my new default post to link people to as an intro.

Brief responses to the critiques:

Results don’t discuss encoding/representation of abstractions

Totally agree with this one, it's the main thing I've worked on over the past month and will probably be the main thing in the near future. I'd describe the previous results (i.e. ignoring encoding/representation) as characterizing the relationship between the high-level and the high-level.

Definitions depend on choice of variables 

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient... (read more)

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involve only a few vari... (read more)

Load More