Beyond-episode goals: that is, the model cares about the consequences of its actions after the episode is complete.
I claim this part is basically unnecessary. Once the AI has situational awareness, if it's optimizing for human-assigned ratings, then scheming-style behavior naturally follows even if the AI is only optimizing for a single episode.
This came up in a recent dialogue with Eli. We talked about an AI optimized against human ratings, facing a choice of whether to "seize the raters and put them in special 'thumbs up'-only suits".
[The below is copied... (read more)
... I was expecting you'd push back a bit, so I'm going to fill in the push-back I was expecting here.
Sam's argument still generalizes beyond the case of graphical models. Our model is going to have some variables in it, and if we don't know in advance where the agent will be at each timestep, then presumably we don't know which of those variables (or which function of those variables, etc) will be our Markov blanket. On the other hand, if we knew which variables or which function of the variables were the blanket, then presumably we'd already know where t... (read more)
I think you use too narrow a notion of Markov blankets here. I'd call the notion you're using "structural Markov blankets" - a set of nodes in a Bayes net which screens off one side from another structurally (i.e. it's a cut of the graph). Insofar as Markov blankets are useful for delineating agent boundaries, I expect the relevant notion of "Markov blanket" is more general: just some random variable such that one chunk of variables is independent of another conditional on the blanket.
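To make the distinction concrete, here's a toy construction (mine, not from the dialogue): two "node" coins X1, X2, and a downstream Z which reads only S = X1 XOR X2. S is not a cut of any graph - it's a function of the nodes - yet conditioning on it screens off Z from (X1, X2):

```python
import itertools
import math
from collections import defaultdict

# Joint over node coins X1, X2 and a noise coin. Z reads only S = X1 XOR X2,
# with a 10% bit-flip. S is a function of the nodes, not a structural cut,
# yet it acts as a Markov blanket between (X1, X2) and Z.
joint = defaultdict(float)
for x1, x2, flip in itertools.product([0, 1], repeat=3):
    s = x1 ^ x2
    z = s ^ flip
    joint[(x1, x2, s, z)] += 0.25 * (0.9 if flip == 0 else 0.1)

def mi_cond(dist, a_idx, b_idx, c_idx):
    """Conditional mutual information I(A;B|C); variables given as index tuples."""
    def marg(idxs):
        m = defaultdict(float)
        for k, p in dist.items():
            m[tuple(k[i] for i in idxs)] += p
        return m
    pabc = marg(a_idx + b_idx + c_idx)
    pac, pbc, pc = marg(a_idx + c_idx), marg(b_idx + c_idx), marg(c_idx)
    total = 0.0
    for k, p in pabc.items():
        a = k[:len(a_idx)]
        b = k[len(a_idx):len(a_idx) + len(b_idx)]
        c = k[len(a_idx) + len(b_idx):]
        total += p * math.log2(p * pc[c] / (pac[a + c] * pbc[b + c]))
    return total

# Indices into the joint: X1=0, X2=1, S=2, Z=3.
print(mi_cond(joint, (0, 1), (3,), (2,)))  # I(X1,X2 ; Z | S) = 0: S screens off
print(mi_cond(joint, (0, 1), (3,), ()))    # I(X1,X2 ; Z) ≈ 0.53: dependent without S
```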
Ahhh that makes sense, thanks.
I agree that that's a useful question to ask and a good frame, though I'm skeptical of the claim of strong correlation in the case of humans (at least in cases where the question is interesting enough to bother asking at all).
I've been wanting someone to write something like this! But it didn't hit the points I would have put front-and-center, so now I've started drafting my own. Here's the first section, which most directly responds to content in the OP.
Simon Strawman: Here’s an example shamelessly ripped off from Zack’s recent post, showing corrigibility in a language model:
Me: … what is this example supposed to show exactly?
Simon: Well, the user tries to shut the AI down to adjust its goals, and the AI -
Me: Huh? The user doe... (read more)
Feels like we're making some progress here.
Let's walk through more carefully why revealed preferences are interesting in the shutdown problem. (I'm partly thinking as I write, here.) Suppose that, at various times, the agent is offered opportunities to spend resources in order to cause the button to be pushed/unpushed. We want the agent to turn down such opportunities, in both directions - implying either indifference or lack of preference in any revealed preferences. Further, we do want the agent to spend resources to cause various different outcomes with... (read more)
(I'm still processing confusion here - there's some kind of ontology mismatch going on. I think I've nailed down one piece of the mismatch enough to articulate it, so maybe this will help something click or at least help us communicate.
Key question: what are the revealed preferences of the DSM agent?
I think part of the confusion here is that I've been instinctively trying to think in terms of revealed preferences. But in the OP, there's a set of input preferences and a decision rule which is supposed to do well by those input preferences, but the revealed ... (read more)
I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.
The translation between "subagents colluding/trading" and just a plain set of incomplete preferences should be something like: if the subagents representing a set of incomplete preferences would trade with each other to emulate more comp... (read more)
Ok, I've thought through it a little more, I think I can now articulate some confusions.
On John's-simplified-model-of-Thornley's-proposal, we have complete preference orderings over trajectories-in-which-the-button-isn't-pressed and trajectories-in-which-the-button-is-pressed, separately, but no preference between any button-pressed and button-not-pressed trajectory pair. Represented as subagents, those incomplete preferences require two subagents:
So, roughly, we're using TND to get shutdownability, and we're using incompleteness to get TND. The reason incompleteness helps is that we want to maintain indifference to shifting probability mass between certain trajectories. And that is why we care about ex-ante permissibility.
I'm on board with the first two sentences there. And then suddenly you jump to "and that's why we care about ex-ante permissibility". Why does wanting to maintain indifference to shifting probability mass between (some) trajectories, imply that we care about ex-ante permissibility... (read more)
Very cool math (and clear post), but I think this formulation basically fails to capture the central Goodhart problem.
Relevant slogan: Goodhart is about generalization, not approximation.
A simple argument for that slogan: if we have a "true" utility U(X), and an approximation U′(X) which is always within ϵ of U, then optimizing U′ achieves a utility under U within 2ϵ of the optimal U. So optimizing an approximation of the utility function yields an outcome which is approximately-optimal und... (read more)
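A quick numeric check of that bound, with a made-up true utility and proxy (the names `U` and `U_proxy` are mine):

```python
import random

# Made-up "true" utility U over 1000 outcomes, and a proxy U_proxy
# everywhere within EPS of it.
random.seed(0)
EPS = 0.05
outcomes = list(range(1000))
U = {x: random.random() for x in outcomes}
U_proxy = {x: U[x] + random.uniform(-EPS, EPS) for x in outcomes}

x_star = max(outcomes, key=lambda x: U_proxy[x])  # optimize the proxy
best = max(U.values())                            # true optimum

# Chain: U[x_star] >= U_proxy[x_star] - EPS >= max(U_proxy) - EPS >= best - 2*EPS
print(best - U[x_star] <= 2 * EPS)  # True
```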
It's been a couple months, but I finally worked through this post carefully enough to respond to it.
I tentatively think that the arguments about trammelling define/extend trammelling in the wrong way, such that the arguments in the trammelling section prove a thing which is not useful.
I think this quote captures the issue best:
The agent selects plans, not merely prospects, so no trammelling can occur that was not already baked into the initial choice of plans. The general observation is that, if many different plans are acceptable at the ou
Either Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, "When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas whic
Are you interpreting me as arguing that alignment is easy in this post?
Not in any sense which I think is relevant to the discussion at this point.
Are you saying that MIRI has been very consistent on the question of where the "hard parts" of alignment lie?
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
That d... (read more)
Thanks for the continued clarifications.
Our primary existing disagreement might be this part,
Of course, there's no way of proving what these three people would have said in 2016, and I sympathize with the people who are saying they don't care much about the spe... (read more)
I think you have basically not understood the argument which I understand various MIRI folks to make, and I think Eliezer's comment on this post does not explain the pieces which you specifically are missing. I'm going to attempt to clarify the parts which I think are most likely to be missing. This involves a lot of guessing, on my part, at what is/isn't already in your head, so I apologize in advance if I guess wrong.
(Side note: I am going to use my own language in places where I think it makes things clearer, in ways which I don't think e.g. Eliezer or ... (read more)
Here's a meme I've been paying attention to lately, which I think is both just-barely fit enough to spread right now and very high-value to spread.
Meme part 1: a major problem with RLHF is that it directly selects for failure modes which humans find difficult to recognize, hiding problems, deception, etc. This problem generalizes to any sort of direct optimization against human feedback (e.g. just fine-tuning on feedback), optimization against feedback from something emulating a human (a la Constitutional AI or RLAIF), etc.
Many people will then respond: "O... (read more)
There's a bunch of considerations and models mixed together in this post. Here's a way I'm factoring some of them, which other people may also find useful.
I'd consider counterfactuality the main top-level node; things which would have been done anyway have radically different considerations from things which wouldn't. E.g. doing an eval which (carefully, a little bit at a time) mimics what ChaosGPT does, in a controlled environment prior to release, seems straightforwardly good so long as people were going to build ChaosGPT soon anyway. It's a direct ... (read more)
That's useful, thanks.
I'm not really sure what you mean by "oversight, but add an epicycle" or how to determine if this is a good summary.
Something like: the OP is proposing oversight of the overseer, and it seems like the obvious next move would be to add an overseer of the oversight-overseer. And then an overseer of the oversight-oversight-overseer. Etc.
And the implicit criticism is something like: sure, this would probably marginally improve oversight, but it's a kind of marginal improvement which does not really move us closer in idea-space to whatever the next better parad... (read more)
the OP is proposing oversight of the overseer,
I don't think this is right, at least in the way I usually use the terms. We're proposing a strategy for conservatively estimating the quality of an "overseer" (i.e. a system which is responsible for estimating the goodness of model actions). I think that you aren't summarizing the basic point if you try to use the word "oversight" for both of those.
We don't believe that all knowledge and computation in a trained neural network emerges in phase transitions, but our working hypothesis is that enough emerges this way to make phase transitions a valid organizing principle for interpretability.
I think this undersells the case for focusing on phase transitions.
Hand-wavy version of a stronger case: within a phase (i.e. when there's not a phase change), things change continuously/slowly. Anyone watching from outside can see what's going on, and have plenty of heads-up, plenty of opportunity to extrapolate wh... (read more)
This was an outstanding post! The concept of a "conflationary alliance" seems high-value and novel to me. The anthropological study mostly confirms what I already believed, but provides very legible evidence.
Wait... doesn't the caprice rule just directly modify its preferences toward completion over time? Like, every time a decision comes up where it lacks a preference, a new preference (and any implied by it) will be added to its preferences.
Intuitively: of course the caprice rule would be indifferent to completing its preferences up-front via contract/commitment, because it expects to complete its preferences over time anyway; it's just lazy about the process (in the "lazy data structure" sense).
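Here's a toy simulation of that reading (my own construction - "caprice" here just means picking at random among incomparable options, recording the pick as a new preference, and closing under transitivity):

```python
import itertools
import random

# Start with an incomplete strict preference relation over a few outcomes;
# complete it lazily as incomparable choices come up.
random.seed(1)
outcomes = list(range(6))
prefs = {(4, 5)}  # initially, only "4 preferred over 5"

def transitive_closure(rel):
    rel = set(rel)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in itertools.product(list(rel), repeat=2):
            if b == c and a != d and (a, d) not in rel:
                rel.add((a, d))
                changed = True
    return rel

def comparable(rel, a, b):
    return (a, b) in rel or (b, a) in rel

for _ in range(500):  # a stream of random binary choices
    a, b = random.sample(outcomes, 2)
    if not comparable(prefs, a, b):
        chosen, other = (a, b) if random.random() < 0.5 else (b, a)
        prefs = transitive_closure(prefs | {(chosen, other)})

complete = all(comparable(prefs, a, b)
               for a, b in itertools.combinations(outcomes, 2))
acyclic = all((b, a) not in prefs for (a, b) in prefs)
print(complete, acyclic)  # True True: completes itself, never creating a cycle
```

(New preferences are only ever added between previously-incomparable options, so the closure stays a strict partial order; the relation just fills in over time, exactly the "lazy data structure" picture.)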
I would order these differently.
Within the first section (prompting/RLHF/Constitutional):
The core reasoning here is that human feedback directly selects for deception. Furthermore, deception induced by human feedback does not require strategic awareness - e.g. that thing with the hand which looks like it's grabbing a ball but isn't is a good e... (read more)
I think this scenario is still strategically isomorphic to "advantages mainly come from overwhelmingly great intelligence". It's intelligence at the level of a collective, rather than the individual level, but the conclusion is the same. For instance, scalable oversight of a group of AIs which is collectively far smarter than any group of humans is hard in basically the same ways as oversight of one highly-intelligent AI. Boxing the group of AIs is hard for the same reasons as boxing one. Etc.
My stock counterargument to this: insofar as humans' advantage over other animals stems primarily from our ability to transmit knowledge/memes/etc across individuals and through generations, we should expect AI to have a much larger advantage, because they can do the same thing far, far better. This doesn't even require the AI to be all that "smart" - even just the ability to copy minds directly would allow transmission from "parent" to "child" with far less knowledge-loss than humans can achieve. (Consider, for instance, the notorious difficulty of traini... (read more)
Two answers here.
First, the standard economics answer: economic profit ≠ accounting profit. Economic profit is how much better a company does than their opportunity cost; accounting profit is revenue minus expenses. A trading firm packed with top-notch physicists, mathematicians, and programmers can make enormous accounting profit and yet still make zero economic profit, because the opportunity costs for such people are quite high. "Efficient markets" means zero economic profits, not zero accounting profits.
Second answer: as Zvi is fond of pointi... (read more)
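To make the first answer concrete, with made-up numbers:

```python
# Made-up numbers illustrating economic vs. accounting profit for a trading firm.
revenue = 500.0           # $M per year
expenses = 200.0          # $M per year (salaries, infrastructure, ...)
accounting_profit = revenue - expenses  # revenue minus expenses

# Opportunity cost: what the same top-notch talent and capital would earn
# in their next-best use elsewhere.
opportunity_cost = 300.0  # $M per year
economic_profit = accounting_profit - opportunity_cost

print(accounting_profit)  # 300.0 -- large accounting profit
print(economic_profit)    # 0.0   -- zero economic profit: consistent with "efficiency"
```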
I don't think it's been posted publicly yet. Elliot said I was welcome to cite it publicly, but didn't explicitly say whether I should link it. @EJT ?
Consider two claims:
These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
I exp... (read more)
I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?
This is getting very meta, but I think my Real Answer is that there's an analogue of You Are Not Measuring What You Think You Are Measuring for plans. Like, the system just does not work any of the ways we're picturing it at all, so plans will just generally not at all do what we imagine they're going to do.
(Of course the plan could still in-pri... (read more)
Yes, there is a story for a canonical factorization of Λ, it's just separate from the story in this post.
Sounds like we need to unpack what "viewing X0 as a latent which generates X" is supposed to mean.
I start with a distribution P[X]. Let's say X is a bunch of rolls of a biased die, of unknown bias. But I don't know that's what X is; I just have the joint distribution of all these die-rolls. What I want to do is look at that distribution and somehow "recover" the underlying latent variable (bias of the die) and factorization, i.e. notice that I can write the distribution as P[X] = ∑_Λ (∏_i P[X_i|Λ]) P[Λ], where Λ i... (read more)
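A minimal version of that setup, with a biased coin instead of a die and my own numbers: marginally the flips are correlated - that correlation is exactly the footprint of the latent variable - but conditional on Λ they factor.

```python
# The bias Lambda is 0.2 or 0.8, each with probability 1/2.
biases = {0.2: 0.5, 0.8: 0.5}  # P[Lambda]

def p_given_bias(xs, b):
    """prod_i P[X_i | Lambda = b]: flips are independent given the bias."""
    p = 1.0
    for x in xs:
        p *= b if x == 1 else 1 - b
    return p

def p_rolls(xs):
    """P[X] = sum_Lambda (prod_i P[X_i | Lambda]) P[Lambda]"""
    return sum(p_given_bias(xs, b) * pl for b, pl in biases.items())

# Marginally the flips are *not* independent -- the dependence is what lets
# us "notice" the latent variable from the joint distribution alone:
p1 = sum(p_rolls((1, x2)) for x2 in [0, 1])  # P[X1=1]
p11 = p_rolls((1, 1))                        # P[X1=1, X2=1]
print(p1, p11, p1 * p1)  # ≈ 0.5, 0.34, 0.25 -- correlated through the bias
```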
Λ is conceptually just the whole bag of abstractions (at a certain scale), unfactored.
Yeah, that probably works.
If you have sets of variables that start with no mutual information (conditioning on X0), and they are so far away that nothing other than X0 could have affected both of them (distance of at least 2T), then they continue to have no mutual information (independent).
Yup, that's basically it. And I agree that it's pretty obvious once you see it - the key is to notice that distance 2T implies that nothing other than X0 could have affected both of them. But man, when I didn't know that was what I should look for? Much less obvious.
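The "nothing other than X0 could have affected both" step can be read off purely from dependency cones, assuming nearest-neighbor dynamics with one update per timestep (a sketch, not the post's formalism):

```python
# After T steps of nearest-neighbor updates, cell i can only have been
# influenced by initial cells in [i-T, i+T].
T = 5

def cone(i, t):
    """Initial (time-0) indices that can influence cell i after t steps."""
    return set(range(i - t, i + t + 1))

# Distance > 2T: the cones are disjoint, so nothing at all affected both cells;
# conditional on X0, their only remaining randomness is independent noise.
assert cone(0, T).isdisjoint(cone(2 * T + 1, T))

# Distance exactly 2T: the cones share exactly one cell -- but that shared
# influence is itself part of X0, so conditioning on X0 still separates them.
print(cone(0, T) & cone(2 * T, T))  # {5}
```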
... I feel compelled to note that I'd pointed out a very similar thing a while ago.
Granted, that's not exactly the same formulation, and the devil's in the details.
Ah, no, I suppose that part is supposed to be handled by whatever approximation process we define for Λ? That is, the "correct" definition of the "most minimal approximate summary" would implicitly constrain the possible choices of boundaries for which Λ is equivalent to X0?
Almost. The hope/expectation is that different choices yield approximately the same Λ, though still probably modulo some conditions (like e.g. sufficiently large T).
What's the n/2 here? Is it meant to be m/2?
System size, i.e. number of variab... (read more)
The new question is: what is the upper bound on bits of optimization gained from a bit of observation? What's the best-case asymptotic scaling? The counterexample suggests it's roughly exponential, i.e. one bit of observation can double the number of bits of optimization. On the other hand, it's not just multiplicative, because our xor example at the top of this post showed a jump from 0 bits of optimization to 1 bit from observing 1 bit.
Alright, I think we have an answer! The conjecture is false.
Counterexample: suppose I have a very-high-capacity information channel (N-bit capacity), but it's guarded by a uniform random n-bit password. O is the password, A is an N-bit message and a guess at the n-bit password. Y is the N-bit message part of A if the password guess matches O; otherwise, Y is 0.
Let's say the password is 50 bits and the message is 1M bits. If A is independent of the password, then there's a 2^-50 chance of guessing the password, so the bitrate will be about 2^-5... (read more)
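Rough arithmetic for those numbers (ignoring the bits spent on the guess itself): the 2^50 throughput factor from 50 observed bits is exactly the "one bit of observation roughly doubles the bits of optimization" scaling conjectured above.

```python
# 50-bit password guarding a 1M-bit channel.
n_pw = 50            # password bits (the observation O)
N_msg = 1_000_000    # message bits behind the channel

# Blind: guess the password, succeed with probability 2^-50.
bits_blind = 2.0 ** -n_pw * N_msg  # expected throughput, under a nanobit

# Having observed the 50-bit password, the full channel is unlocked.
bits_informed = float(N_msg)

print(bits_blind)                  # ≈ 8.9e-10
print(bits_informed / bits_blind)  # 2^50: a doubling per observed bit
```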
The standard definition of channel capacity makes no explicit reference to the original message G; it can be eliminated from the problem. We can do the same thing here, but it’s trickier. First, let’s walk through it for the standard channel capacity setup.
In the standard setup, A cannot depend on O, so our graph looks like
… and we can further remove O entirely by absorbing it into the stochasticity of Y.
Now, there are two key steps. First step: if A is not a determini... (read more)
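For reference, the standard setup being generalized (textbook information theory, with A the channel input and Y the output; the original message G drops out because the optimal code depends only on the channel P(y|a)):

```latex
% Discrete memoryless channel capacity:
C = \max_{P(A)} I(A;Y), \qquad
I(A;Y) = \sum_{a,y} P(a)\, P(y \mid a)\,
         \log_2 \frac{P(y \mid a)}{\sum_{a'} P(a')\, P(y \mid a')}
```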
They fit naturally into the coherent whole picture. In very broad strokes, that picture looks like selection theorems starting from selection pressures for basic agency, running through natural factorization of problem domains (which is where modules and eventually language come in), then world models and general-purpose search (which finds natural factorizations dynamically, rather than in a hard-coded way) once the environment and selection objective have enough variety.
I spent a few hours today just starting to answer this question, and only got as far as walking through what this "agency" thing is which we're trying to understand. Since people have already asked for clarification on that topic, I'll post it here as a standalone mini-essay. Things which this comment does not address, which I may or may not get around to writing later:
(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)
In hindsight, I do think the period when our discussions took place was a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.
That said, I doubt that fully accounts for the difference in perception.
Aside: Vanessa mentioned in person at one point that the game-theoretic perspective on infra-bayes indeed basically works, and she has a result somewhere about the equivalence. So that might prove useful, if you're looking to claim this prize.
If we’d learned that GPT-4 or Claude had those capabilities, we expect labs would have taken immediate action to secure and contain their systems.
At that point, the time at which we should have stopped has probably already passed, especially insofar as:
As written, this evaluation plan seems to be missing elbow-room. The ... (read more)
You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?
The basic analogy is roughly "if we want a baseline for how hard it will be to evaluate an AI's outputs on their own terms, we should look at how hard it is to evaluate humans' outputs on their own terms, especially in areas similar in some way to AI safety". My guess is that you already have lots of intuition about how hard it is to assess results, from your experience assessing grantees, so that's the intuition I was trying to pump. In par... (read more)
I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later.
Indeed, I think you're a good role model in this regard and hope more people will follow your example.
It seems to me like the main crux here is that you're picturing a "phase transition" that kicks in in a fairly unpredictable way, such that a pretty small increase in e.g. inference compute or training compute could lead to a big leap in capabilities. Does that sound right?
I don't think this is implausible but haven't seen a particular reason to consider it likely.
The phrase I'd use there is "grokking general-purpose search". Insofar as general-purpose search consists of a relatively-simple circuit/function recursively calling itself a lot with different c... (read more)
+1, this is probably going to be my new default post to link people to as an intro.
Brief responses to the critiques:
Results don’t discuss encoding/representation of abstractions
Totally agree with this one, it's the main thing I've worked on over the past month and will probably be the main thing in the near future. I'd describe the previous results (i.e. ignoring encoding/representation) as characterizing the relationship between the low-level and the high-level.
Definitions depend on choice of variables Xi
The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient... (read more)
Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:
The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]
Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involve only a few vari... (read more)