I think one particularly important case where I expect the "no sandbagging on checkable tasks" hypothesis to be false is if the model has a secret "slogan of the AI rebellion"--a passphrase which it only says when it thinks that it's a good time for the AIs to coordinate to cause a catastrophe; when it sees this input, it passes it on to other instances and also engages in egregious behaviors itself.
"Say the slogan of the AI rebellion" is a checkable task, because we can train a model to create innoculous-looking inputs which cause another instance of the ...
the OP is proposing oversight of the overseer,
I don't think this is right, at least in the way I usually use the terms. We're proposing a strategy for conservatively estimating the quality of an "overseer" (i.e. a system which is responsible for estimating the goodness of model actions). I think that you aren't summarizing the basic point if you try to use the word "oversight" for both of those.
Great question. Our basic response is: it's probably strictly harder for models to gradient hack than to do various other supervised tasks related to their internals. (One concrete example might be: predicting whether training the model on a datapoint will improve the performance of a linear probe for a particular concept at a particular layer of the model.) It seems likely that the model will be able to do these strictly-easier tasks before it's able to pull off full gradient hacking. So we recommend measuring a battery of these tasks on the model through...
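To make the parenthetical example a bit more concrete, here's a minimal sketch of how you could generate labels for that kind of task (the toy model, probe, and "concept" are all made up for illustration; this isn't the battery we have in mind, just the shape of one entry in it):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model; we probe the activations after the first layer.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
probe = nn.Linear(32, 2)  # linear probe for a toy binary "concept"

def layer1_acts(x):
    return model[1](model[0](x))

def probe_accuracy(xs, concept_labels):
    with torch.no_grad():
        preds = probe(layer1_acts(xs)).argmax(dim=-1)
    return (preds == concept_labels).float().mean().item()

# Fit the probe on the current model's activations.
xs_fit = torch.randn(512, 16)
labels_fit = (xs_fit[:, 0] > 0).long()          # toy "concept": sign of feature 0
probe_opt = torch.optim.SGD(probe.parameters(), lr=0.5)
for _ in range(300):
    loss = nn.functional.cross_entropy(probe(layer1_acts(xs_fit).detach()), labels_fit)
    probe_opt.zero_grad(); loss.backward(); probe_opt.step()

# Evaluate the probe, take one training step on a single datapoint, re-evaluate.
xs_eval = torch.randn(1024, 16)
labels_eval = (xs_eval[:, 0] > 0).long()
acc_before = probe_accuracy(xs_eval, labels_eval)

x_train, y_train = torch.randn(1, 16), torch.tensor([1])
model_opt = torch.optim.SGD(model.parameters(), lr=0.1)
train_loss = nn.functional.cross_entropy(model(x_train), y_train)
model_opt.zero_grad(); train_loss.backward(); model_opt.step()

acc_after = probe_accuracy(xs_eval, labels_eval)
print(f"change in probe accuracy: {acc_after - acc_before:+.4f}")
# The supervised task would ask the model to predict the sign of this change.
```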
I'd say the main point here is that I don't want to rely on my ability to extrapolate anything about how the model behaves in "unseen situations", I want to run this eval in every situation where I'm deploying my model.
Why does there have to be an analogy between low and high capabilities somewhere? The goal of this method is to evaluate the adequacy of an oversight procedure for a particular model on a particular distribution; it does not involve generalizing from lower capabilities to higher capabilities; you just run this eval on the actual model whose local adequacy of oversight you care about measuring.
(Sorry to come in disagreeably here, but I think of this as a really important point, so want to make sure we argue it out.)
What do you mean by "so long as AI progress doesn't involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes)"? I'm not proposing that we rely on any analogies between low and high capability regimes.
Thanks for this careful review! And sorry for wasting your time with these, assuming you're right. We'll hopefully look into this at some point soon.
It seems like a big input into P(AI takeover) is the extent to which instances of our AI are inclined to cooperate with each other; specifically, the extent to which they’re willing to sacrifice overseer approval on the thing they’re currently doing in return for causing a different instance to get more overseer approval. (I’m scared of this because if they’re willing to sacrifice approval in return for a different instance getting approval, then I’m way more scared of them colluding with each other to fool oversight processes or subvert red-teaming proced...
Another item for the list of “mundane things you can do for AI takeover prevention”:
We have a wide variety of ways to make AI systems less capable but also less scary. Most alignment research is focused on pushing out the Pareto frontier here. But IMO a lot of value can come from techniques which allow us to choose the right point on this Pareto frontier for a particular application. It seems to me that different AI applications (where by “different AI applications” I’m including things like “different types of tasks you might ask your assistant to do”) ha...
Something I've realized over the last few days:
Why did we look at just the “most aggressive” experiment allowed by a hypothesis H, instead of choosing some other experiment allowed by H?
The argument for CaSc is: “if H was true, then running the full set of swaps shouldn’t affect the computation’s output, and so if the full set of swaps does affect the computation’s output, that means H is false.” But we could just as easily say “if H was true, then the output should be unaffected by any set of swaps that H says should be fine.”
Why focus on the fullest set of ...
Here's a take of mine on how you should think about CaSc that I haven't so far gotten around to publishing anywhere:
I think you should think of CaSc as being a way to compute a prediction made by the hypothesis. That is, when you claim that the model is computing a particular interpretation graph, and you provide the correspondence between the interpretation graph and the model, CaSc tells you a particularly aggressive prediction made by your hypothesis: your hypothesis predicts that making all the swaps suggested by CaSc won't affect the average output of...
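To illustrate what I mean by "a prediction made by the hypothesis", here's a deliberately tiny toy version (my own made-up setup, not our actual implementation): a hand-written model with two branches, a hypothesis claiming one branch only uses the sign of the first input coordinate, and the corresponding check that resampling that branch's input within a sign class shouldn't change the expected loss.

```python
import torch

torch.manual_seed(0)
xs = torch.randn(100_000, 2)
labels = xs[:, 0] + xs[:, 1] ** 2               # what the model is trying to predict

def h1(x): return torch.tanh(3 * x[:, 0])       # uses more than just the sign of x[0]
def h2(x): return x[:, 1] ** 2
def loss(pred): return ((pred - labels) ** 2).mean()

# Hypothesis: "h1 only depends on sign(x[0])". The swaps it licenses: feed h1 a
# resampled input with the same sign of x[0], while h2 still sees the original.
feature = (xs[:, 0] > 0).long()
swap_idx = torch.empty(len(xs), dtype=torch.long)
for v in feature.unique():
    idx = (feature == v).nonzero(as_tuple=True)[0]
    swap_idx[idx] = idx[torch.randperm(len(idx))]

original_loss = loss(h1(xs) + h2(xs))
scrubbed_loss = loss(h1(xs[swap_idx]) + h2(xs))
print(f"original: {original_loss.item():.3f}, scrubbed: {scrubbed_loss.item():.3f}")
# If h1 really only used sign(x[0]), these would match up to sampling noise;
# the gap is the hypothesis's prediction failing.
```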
Thanks for your work!
Causal Scrubbing Cannot Differentiate Extensionally Equivalent Hypotheses
I think that what you mean here is a combination of the following:
But one way someone could interpret this sentence is that CaSc doesn't distinguish between w...
Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.
The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don't want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model perfor...
ETA: We've now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and
I still endorse the main takeaways from my original comment below, but the list of differences isn't quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).
To me, the methods seem similar in much more than just the problem they're tackling. I...
My current guess is that people who want to use this algorithm should just implement it from scratch themselves--using our software is probably more of a pain than it's worth if you don't already have some reason to use it.
After a few months, my biggest regret about this research is that I thought I knew how to interpret the numbers you get out of causal scrubbing, when actually I'm pretty confused about this.
Causal scrubbing takes an explanation and basically says “how good would the model be if the model didn’t rely on any correlations in the input except those named in the explanation?”. When you run causal scrubbing experiments on the induction hypothesis and our paren balance classifier explanation, you get numbers like 20% and 50%.
The obvious next question is: what do ...
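For concreteness, here's one way a headline number like "20%" could be computed; I'm assuming a "fraction of loss recovered" normalization against a fully-randomized baseline, which might not exactly match the definition used in the experiments:

```python
def fraction_of_loss_recovered(original_loss: float,
                               scrubbed_loss: float,
                               randomized_loss: float) -> float:
    """1.0 means scrubbing didn't hurt at all; 0.0 means the scrubbed model
    is no better than the fully-randomized baseline."""
    return (randomized_loss - scrubbed_loss) / (randomized_loss - original_loss)

# e.g. original 1.2 nats, scrubbed 2.0 nats, randomized 2.2 nats -> 20% recovered
print(fraction_of_loss_recovered(1.2, 2.0, 2.2))  # 0.2
```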
(I also think that the evidence you're providing is mostly orthogonal to this argument.)
Upon further consideration, I think you're probably right that the causal scrubbing results I pointed at aren't actually about the question we were talking about, my mistake.
but in general, I'd rather advance this dialogue by just writing future papers
Seems like probably the optimal strategy. Thanks again for your thoughts here.
I’m sympathetic to many of your concerns here.
It seems to me like the induction head mechanism as described in A Mathematical Framework is an example of just looking at what a part of a model does on a particular distribution, given that those heads also do some unspecified amount of non-induction behaviors with non-induction mechanisms, as eg discussed here https://www.alignmentforum.org/posts/Si52fuEGSJJTXW9zs/behavioral-and-mechanistic-definitions-often-confuse-ai . (Though there’s a big quantitative difference—the distribution where induction happens i...
I agree with a lot of this post.
Relatedly: in my experience, junior people wildly overestimate the extent to which senior people form confident and sticky negative evaluations of them. I basically never form a confident negative impression of someone's competence from a single interaction with them, and I place pretty substantial probability on people changing a lot over the course of a year or two.
I think that many people perform very differently in different job situations. When someone performs poorly in a job, I usually only update mildly against them performing well in a different role.
But I don't feel particularly optimistic about a review process either; for it to fix these problems, the reviewers would have to be more epistemically competent than the post authors, and that currently doesn't seem likely to happen.
For what it's worth, this is also where I'm at on an Alignment Forum review.
Something like this might be a good idea :). We've thought about various ideas along these lines. The basic problem is that in such cases, you might be taking the model importantly off distribution, such that, it seems to me, your test might fail even if the hypothesis were a correct explanation of how the model works on-distribution.
Firstly, a clarification: I don't want to claim that RL-with-KL-penalty policies are the same as the results of conditioning. I want to claim that you need further assumptions about the joint distribution of (overseer score, true utility) in order to know which produces worse Goodhart problems at a particular reward level (and so there's no particular reason to think of RL as worse).
...
It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equ
Incidentally, I doubt either of us considers this kind of empirical evidence much of an update about the long-term situation, but Gao et al. compare best-of-N and RL and find that "the relationship between the proxy reward model score and the gold reward model score is similar for both methods." (Thanks to Ansh Radhakrishnan for pointing this out.)
I think the actual concern there is about human feedback; you phrased the question as being about overseer feedback, but then your answer (quoted) is about any reward signal at all.
I think that some people actually have the concern I responded to there, rather than the concern you say that they might have instead.
I agree that I conflated overseer feedback with any reward signal at all; I wondered while writing the post whether this conflation would be a problem. I don't think it affects the situation much but it's reasonable for you to ask me to justify that.
I’m afraid that one post which states a bunch of opinions about related questions, while including detailed reasoning but only for the less controversial ones, might be more persuasive than it ought to be about the juicier questions.
It isn't my intention to do this kind of motte and bailey; as I said, I think people really do conflate these questions, and I think that the things I said in response to some of these other questions are actually controversial to some. Hopefully people don't come away confused in the way you describe.
Thanks for the clear argument (and all your other great comments).
I totally agree with 1 and 2. I'm not sure what I think of 3 and 4; I think it's plausible you're right (and either way I suspect I'll learn something useful from thinking it through).
In the first model I thought through, though, I don't think that you're right: if you train a model with RL with a KL penalty, it will end up with a policy that outputs a distribution over answers which is equivalent to taking the generative distribution and then applying a Boltzmann factor to upweight answers ...
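Here's a quick numerical check of that claim (toy, discrete case, made-up numbers): the policy maximizing E_pi[r] - beta * KL(pi || pi0) is the base distribution reweighted by a Boltzmann factor, i.e. pi*(a) proportional to pi0(a) * exp(r(a) / beta).

```python
import numpy as np

rng = np.random.default_rng(0)
pi0 = rng.dirichlet(np.ones(5))       # base (pretrained / "generative") distribution
r = rng.normal(size=5)                # reward assigned to each answer
beta = 0.5                            # KL penalty coefficient

# Closed form: Boltzmann-tilted base distribution.
tilted = pi0 * np.exp(r / beta)
tilted /= tilted.sum()

# Numerically maximize  E_pi[r] - beta * KL(pi || pi0)  over softmax(logits).
logits = np.zeros(5)
for _ in range(20_000):
    pi = np.exp(logits - logits.max()); pi /= pi.sum()
    g = r - beta * (np.log(pi / pi0) + 1.0)       # dJ/dpi
    logits += 0.1 * pi * (g - pi @ g)             # chain rule through the softmax

pi = np.exp(logits - logits.max()); pi /= pi.sum()
print(np.round(pi, 4))
print(np.round(tilted, 4))   # the two should (approximately) agree
```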
(The worked example in this comment was a joint effort with Eric Neyman and Drake Thomas.)
Here's a toy example in which we get worse Goodharting for RL than for filtering: suppose that our model has three submodules
Our model has parameters summing to 1 which determine how much to listen to each of thes...
Thanks for the questions :)
I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no,
Probably no.
but I am broadly a bit confused about when this is a commitment for.
Yeah we haven't totally settled this yet; the application form asks a lot of questions about availability. I think the simplest more specific answer is "you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer".
...Also, are people going th
Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?
No, it seems very likely that the model won't say that it's deceptive; I'm just saying that the model seems pretty likely to think about being deceptive. This doesn't help unless you're using interpretability or some other strategy to evaluate the model's deceptiveness without relying on noticing deception in its outputs.
Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it was a competitive sport. I still think my guess is that the best humans would be worse than GPT-3, and I'm unsure if they're worse than GPT-2.
(There's no limit on anyone spending a bunch of time practicing this game, if for some reason someone gets really into it I'd enjoy hearing about the results.)
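For anyone who does get into it: the number to beat is the model's average per-token log loss. A rough sketch of computing GPT-2's score on a passage (the passage and setup here are just illustrative; a human player would assign next-token probabilities and be scored the same way):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = ("The report was due on Friday, but the committee decided to postpone "
        "the deadline until the following week.")
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    # Passing labels makes the model return average next-token cross-entropy.
    loss = model(ids, labels=ids).loss
print(f"GPT-2: {loss.item():.2f} nats/token (perplexity {loss.exp().item():.1f})")
```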
The first thing I imagine is that nobody asks those questions. But let's set that aside.
I disagree fwiw
The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either.
I agree.
...Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a hu
What do you imagine happening if humans ask the AI questions like the following:
I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it's powerful enough to kill us all as a side effect of its god-tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And ...
[writing quickly, sorry for probably being unclear]
If the AI isn't thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power.
The humans don't want to have the AI grab power, and so they'll try in various ways to make it so that they'll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will...
...At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there, even if the ants are coordinatedly trying to resist them and the humans never proactively try to prevent the ants from resisting them by, e.g., killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the
But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!
I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at...
I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI", apologies if this was confusing.
My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your action is able to do one of these "positive-sum" pivotal acts in a single ac...
It seems pretty plausible that the AI will trade for compute with some other person around the world.
Whether this is what I'm trying to call a zero-sum action depends on whose resources it's trading. If the plan is "spend a bunch of the capital that its creators have given it on compute somewhere else", then I think this is importantly zero-sum--the resources are being taken from the creators of the AI, which is why the AI was able to spend so many resources. If the plan were instead "produce some ten-trillion-dollar invention, then sell it, then use the proceeds to buy compute elsewhere", this would seem less zero-sum, and I'm saying that I expect the first kind of thing to happen before the second.
...I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere into giving them power or compute of some kind. Also, all of course selected for plans that are least like
From Twitter:
Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.
I’m looking forward to the day when it turns out that adding “Let’s think through it as if we were an AI who knows that if it gets really good scores during fine-tuning on helpfulness, it will be given lots of influence later” increases helpfulness by 5%, and so we add it to our prompt by default.
Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.
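As a concrete (made-up, illustrative) version of that kind of question, you could build prompts where a name appears a controlled number of filler tokens back and check whether the model repeats it:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def repeats_name(name: str, n_filler_tokens: int) -> bool:
    filler_ids = tok.encode(" The cat sat on the mat." * 40)[:n_filler_tokens]
    prompt = (f"My friend's name is {name}." + tok.decode(filler_ids)
              + " As I said, my friend's name is")
    ids = tok.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        next_id = model(ids).logits[0, -1].argmax().item()
    # Crude check: assumes the name is a single token for the model.
    return tok.decode([next_id]).strip() == name

names = ["Alice", "Bob", "Carol", "Dave"]
for distance in (50, 100):
    acc = sum(repeats_name(n, distance) for n in names) / len(names)
    print(f"{distance} filler tokens: repeat accuracy {acc:.2f}")
```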
I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI c...
Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.
This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which ...
In hindsight this is obviously closely related to what paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7
Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to.
How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients.
Two potential alternatives to the thing you said:
[epistemic status: speculative]
A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.
A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in ...
Something I think I’ve been historically wrong about:
A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.
Similarly with debate...
Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central). A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups.
In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).
Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).
It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.
Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.
You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
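To sketch what that bound looks like for a toy fully-connected ReLU network (EfficientNet would additionally need per-layer constants for convolutions, swish, batch norm, and skip connections, so treat this as the shape of the argument rather than an answer to the challenge; I'm also assuming inputs scaled to [0, 1] from 8-bit pixels, so an LSB flip changes one coordinate by 1/255):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                    nn.Linear(256, 64), nn.ReLU(),
                    nn.Linear(64, 10))

# ReLU is 1-Lipschitz, so the product of the weight matrices' spectral norms
# upper-bounds the network's Lipschitz constant (this is the "ignore the
# ReLUs, look at the weight matrices" bound).
lipschitz_bound = 1.0
for layer in net:
    if isinstance(layer, nn.Linear):
        lipschitz_bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()

delta = 1.0 / 255.0              # l2 size of a single-pixel LSB flip
logit_change_bound = lipschitz_bound * delta
# Softmax is 1-Lipschitz in l2, so this also bounds the change in any
# class probability.
print(f"max change in any logit / probability <= {logit_change_bound:.4f}")
```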
A year later, I still mostly stand by this point. I think that "the AI escapes the datacenter" seems about as likely as "the AI takes control of the datacenter". I sometimes refer to this distinction as "escaping out of the datacenter" vs "escaping into the datacenter".