All of Buck's Comments + Replies

A year later, I still mostly stand by this point. I think that "the AI escapes the datacenter" seems about as likely as "the AI takes control of the datacenter". I sometimes refer to this distinction as "escaping out of the datacenter" vs "escaping into the datacenter".

I think one particularly important case where I expect the "no sandbagging on checkable tasks" hypothesis to be false is if the model has a secret "slogan of the AI rebellion"--a passphrase which it only says when it thinks that it's a good time for the AIs to coordinate to cause a catastrophe; when it sees this input, it passes it on to other instances and also engages in egregious behaviors itself.

"Say the slogan of the AI rebellion" is a checkable task, because we can train a model to create innoculous-looking inputs which cause another instance of the ... (read more)

the OP is proposing oversight of the overseer,

I don't think this is right, at least in the way I usually use the terms. We're proposing a strategy for conservatively estimating the quality of an "overseer" (i.e. a system which is responsible for estimating the goodness of model actions). I think that you aren't summarizing the basic point if you try to use the word "oversight" for both of those.

johnswentworth (2mo):
That's useful, thanks.

Great question. Our basic response is: it's probably strictly harder for models to gradient hack than to do various other supervised tasks related to their internals. (One concrete example might be: predicting whether training the model on a datapoint will improve the performance of a linear probe for a particular concept at a particular layer of the model.) It seems likely that the model will be able to do these strictly-easier tasks before it's able to pull off full gradient hacking. So we recommend measuring a battery of these tasks on the model through... (read more)
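To make the concrete example above more tangible, here is a minimal sketch of how one could compute the ground truth for that task (illustrative only; `get_activations` and `loss_on` are hypothetical helpers, and the whole setup is an assumption rather than anything from the original comment):

```python
import copy
import torch
import torch.nn.functional as F

def probe_improves_after_step(model, probe, layer, datapoint, probe_eval_set, lr=1e-4):
    """Ground truth for the example task: does one SGD step on `datapoint`
    reduce a fixed linear probe's loss at `layer`? (Illustrative sketch only;
    `get_activations` and `loss_on` are assumed helpers, not a real API.)"""

    def probe_loss(m):
        with torch.no_grad():
            acts = m.get_activations(probe_eval_set.inputs, layer=layer)  # assumed helper
            return F.cross_entropy(probe(acts), probe_eval_set.labels).item()

    loss_before = probe_loss(model)

    # Take one ordinary training step on the datapoint (on a copy of the model).
    trained = copy.deepcopy(model)
    opt = torch.optim.SGD(trained.parameters(), lr=lr)
    opt.zero_grad()
    trained.loss_on(datapoint).backward()  # assumed helper: standard training loss
    opt.step()

    loss_after = probe_loss(trained)
    # The model would be asked to predict this bit from its own internals,
    # before the step is actually taken.
    return loss_after < loss_before
```

The label is cheap to compute by actually taking the training step, so it can supervise a "predict your own training dynamics" task that seems strictly easier than full gradient hacking.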

I'd say the main point here is that I don't want to rely on my ability to extrapolate anything about how the model behaves in "unseen situations", I want to run this eval in every situation where I'm deploying my model.

Why does there have to be an analogy between low and high capabilities somewhere? The goal of this method is to evaluate the adequacy of an oversight procedure for a particular model on a particular distribution; it does not involve generalizing from lower capabilities to higher capabilities: you just run this eval on the actual model you care about measuring local adequacy of oversight for.

(Sorry to come in disagreeably here, but I think of this as a really important point, so want to make sure we argue it out.)

Samuel Dylan Martin (2mo):
You're right, I've reread the section and that was a slight misunderstanding on my part. Even so I still think it falls at a 7 on my scale as it's a way of experimentally validating oversight processes that gives you some evidence about how they'll work in unseen situations.

What do you mean by "so long as AI progress doesn't involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes)"? I'm not proposing that we rely on any analogies between low and high capability regimes.

Samuel Dylan Martin (2mo):
In the sense that there has to be an analogy between low and high capabilities somewhere, even if at the meta level. This method lets you catch dangerous models that can break oversight processes for the same fundamental reasons as less dangerous models, not just for the same inputs.

Thanks for this careful review! And sorry for wasting your time with these, assuming you're right. We'll hopefully look into this at some point soon.

It seems like a big input into P(AI takeover) is the extent to which instances of our AI are inclined to cooperate with each other; specifically, the extent to which they’re willing to sacrifice overseer approval on the thing they’re currently doing in return for causing a different instance to get more overseer approval. (I’m scared of this because if they’re willing to sacrifice approval in return for a different instance getting approval, then I’m way more scared of them colluding with each other to fool oversight processes or subvert red-teaming proced... (read more)

Another item for the list of “mundane things you can do for AI takeover prevention”:

We have a wide variety of ways to make AI systems less capable but also less scary. Most alignment research is focused on pushing out the Pareto frontier here. But IMO a lot of value can come from techniques which allow us to choose the right point on this Pareto frontier for a particular application. It seems to me that different AI applications (where by “different AI applications” I’m including things like “different types of tasks you might ask your assistant to do”) ha... (read more)

Something I've realized over the last few days:

Why did we look at just the “most aggressive” experiment allowed by a hypothesis H, instead of choosing some other experiment allowed by H?

The argument for CaSc is: “if H was true, then running the full set of swaps shouldn’t affect the computation’s output, and so if the full set of swaps does affect the computation’s output, that means H is false.” But we could just as easily say “if H was true, then the output should be unaffected by any set of swaps that H says should be fine.”

Why focus on the fullest set of ... (read more)

Adrià Garriga-Alonso (6mo):
One thing that is not equivalent to joins, which you might also want to do, is to choose the single worst swap that the hypothesis allows. That is, if a set of node values X={x1,x2,…} are all equivalent, you can choose to map all of them to e.g. x1. And that can be more aggressive than any partition of X which is then chosen-from randomly, and does not correspond to joins.
David Lindner (6mo):
Thanks, that's a useful alternative framing of CaSc! FWIW, I think this adversarial version of CaSc would avoid the main examples in our post where CaSc fails to reject a false hypothesis. The common feature of our examples is "cancellation", which comes from looking at an average CaSc loss. If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one), you don't get these kinds of cancellation problems. Plausibly you'd run into different failure modes, though; in particular, I guess the maximum measure is less smooth and gives you less information about "how wrong" your hypothesis is.

Here's a take of mine on how you should think about CaSc that I haven't so far gotten around to publishing anywhere:

I think you should think of CaSc as being a way to compute a prediction made by the hypothesis. That is, when you claim that the model is computing a particular interpretation graph, and you provide the correspondence between the interpretation graph and the model, CaSc tells you a particularly aggressive prediction made by your hypothesis: your hypothesis predicts that making all the swaps suggested by CaSc won't affect the average output of... (read more)

Buck Shlegeris (6mo):
Something I've realized over the last few days: Why did we look at just the “most aggressive” experiment allowed by a hypothesis H, instead of choosing some other experiment allowed by H?

The argument for CaSc is: “if H was true, then running the full set of swaps shouldn’t affect the computation’s output, and so if the full set of swaps does affect the computation’s output, that means H is false.” But we could just as easily say “if H was true, then the output should be unaffected by any set of swaps that H says should be fine.” Why focus on the fullest set of swaps?

An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which makes it look worst”. I just now have realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’).

And so, when explaining CaSc, I think we should plausibly describe it by talking about the hypothesis producing a bunch of allowed experiments, and then you can test your hypothesis by either looking at the maxent one or by looking at the worst one.

Thanks for your work!

Causal Scrubbing Cannot Differentiate Extensionally Equivalent Hypotheses

I think that what you mean here is a combination of the following:

  • CaSc fails to reject some false hypotheses, as already discussed.
  • Each node in the interpretation graph is only verified up to extensional equality. As in, if I claim that a single node in the graph is a whole sort function, I don't learn anything about whether the model is implementing quicksort or mergesort.

But one way someone could interpret this sentence is that CaSc doesn't distinguish between w... (read more)

Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.

The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don't want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model perfor... (read more)

It’s a pretty different algorithm, though obviously it’s trying to solve a related problem.

ETA: We've now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and

I still endorse the main takeaways from my original comment below, but the list of differences isn't quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).


To me, the methods seem similar in much more than just the problem they're tackling. I... (read more)

My current guess is that people who want to use this algorithm should just implement it from scratch themselves--using our software is probably more of a pain than it's worth if you don't already have some reason to use it.

After a few months, my biggest regret about this research is that I thought I knew how to interpret the numbers you get out of causal scrubbing, when actually I'm pretty confused about this.

Causal scrubbing takes an explanation and basically says “how good would the model be if the model didn’t rely on any correlations in the input except those named in the explanation?”. When you run causal scrubbing experiments on the induction hypothesis and our paren balance classifier explanation, you get numbers like 20% and 50%.

The obvious next question is: what do ... (read more)
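For concreteness, one common convention for turning scrubbed losses into a percentage like this (stated here as an assumption about the convention, not a quote from the writeup) is

$$\text{loss recovered} \;=\; \frac{L_{\text{rand}} - L_{\text{scrubbed}}}{L_{\text{rand}} - L_{\text{model}}},$$

where $L_{\text{model}}$ is the model's ordinary loss, $L_{\text{scrubbed}}$ is its loss under the swaps the hypothesis allows, and $L_{\text{rand}}$ is a fully randomized scrubbing baseline; a hypothesis whose swaps don't hurt the model at all scores near 100%, and a maximally permissive one scores near 0%.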

(I also think that the evidence you're providing is mostly orthogonal to this argument.)

Upon further consideration, I think you're probably right that the causal scrubbing results I pointed at aren't actually about the question we were talking about, my mistake.

but in general, I'd rather advance this dialogue by just writing future papers

Seems like probably the optimal strategy. Thanks again for your thoughts here.

I’m sympathetic to many of your concerns here.

It seems to me like the induction head mechanism as described in A Mathematical Framework is an example of just looking at what a part of a model does on a particular distribution, given that those heads also do some unspecified amount of non-induction behaviors with non-induction mechanisms, as eg discussed here https://www.alignmentforum.org/posts/Si52fuEGSJJTXW9zs/behavioral-and-mechanistic-definitions-often-confuse-ai . (Though there’s a big quantitative difference—the distribution where induction happens i... (read more)

Christopher Olah (7mo):
I moderately disagree with this? I think most induction heads are at least primarily induction heads (and this points strongly at the underlying attentional features and circuits), although there may be some superposition going on. (I also think that the evidence you're providing is mostly orthogonal to this argument.) I think if you're uncomfortable with induction heads, previous token heads (especially in larger models) are an even more crisp example of an attentional feature which appears, at least on casual inspection, to typically be monosemantically represented by attention heads. :)

As a meta point – I've left some thoughts below, but in general, I'd rather advance this dialogue by just writing future papers.

(1) The main evidence I have for thinking that induction heads (or previous token heads) are primarily implementing those attentional features is just informally looking at their behavior on lots of random dataset examples. This isn't something I've done super rigorously, but I have a pretty strong sense that this is at least "the main thing".

(2) I think there's an important distinction between "imprecisely articulating a monosemantic feature" and "a neuron/attention head is polysemantic/doing multiple things". For example, suppose I found a neuron and claimed it was a golden retriever detector. Later, it turns out that it's a U-shaped floppy ear detector which fires for several species of dogs. In that situation, I would have misunderstood something – but the misunderstanding isn't about the neuron doing multiple things, it's about having had an incorrect theory of what the thing is. It seems to me that your post is mostly refining the hypothesis of what the induction heads you are studying are – not showing that they do lots of unrelated things.

(3) I think our paper wasn't very clear about this, but I don't think your refinements of the induction heads was u

I agree with a lot of this post.

Relatedly: in my experience, junior people wildly overestimate the extent to which senior people form confident and sticky negative evaluations of them. I basically never form a confident negative impression of someone's competence from a single interaction with them, and I place pretty substantial probability on people changing substantially over the course of a year or two.

I think that many people perform very differently in different job situations. When someone performs poorly in a job, I usually only update mildly against them performing well in a different role.

But I also don't particularly feel optimistic about a review process either; for that to fix these problems the reviewers would have to be more epistemically competent than the post authors, and that currently doesn't seem likely to happen.

 

For what it's worth, this is also where I'm at on an Alignment Forum review.

Raymond Arnold (8mo):
I've been trying to articulate some thoughts since Rohin's original comment, and maybe going to just rant-something-out now.

On one hand: I don't have a confident belief that writing in-depth reviews is worth Buck or Rohin's time (or their immediate colleagues' time, for that matter). It's a lot of work, and there's a lot of other stuff worth doing. And I know at least Buck and Rohin have already spent quite a lot of time arguing about the conceptual deep disagreements for many of the top-voted posts.

On the other hand, the combination of "there's stuff epistemically wrong or confused or sketchy about LW" with "I don't trust a review process to actually work, because I don't believe it'll get better epistemics than what has already been demonstrated" seems a combination of "self-defeatingly wrong" and "also just empirically (probably) wrong". Presumably Rohin and Buck and similar colleagues think they have at least (locally) better epistemics than the writers they're frustrated by.

I'm guessing your take is like "I, Buck/Rohin, could write a review that was epistemically adequate, but I'm busy and don't expect it to accomplish anything that useful." Assuming that's a correct characterization, I don't necessarily disagree (at least not confidently). But something about the phrasing feels off. Some reasons it feels off:

* Even if there are clusters of research that seem too hopeless to be worth engaging with, I'd be very surprised if there weren't at least some clusters of research that Rohin/Buck/etc are more optimistic about. If what happens is "people write reviews of the stuff that feels real/important enough to be worth engaging with", that still seems valuable to me.
* It seems like people are sort of treating this like a stag-hunt, and it's not worth participating if a bunch of other effort isn't going in. I do think there are network effects that make it more valuable as more people participate. But I also think "people incrementa

Something like this might be a good idea :) . We've thought about various ideas along these lines. The basic problem is that in such cases, you might be taking the model importantly off distribution, such that it seems to me that your test might fail even if the hypothesis was a correct explanation of how the model worked on-distribution.

Lauro Langosco (8mo):
awesome!

Extremal Goodhart is not differentially a problem for RL vs conditioning, right?

Rohin Shah (9mo):
Idk, if you're carving up the space into mutually exclusive "Causal Goodhart" and "Extremal Goodhart" problems, then I expect conditioning to have stronger Extremal Goodhart problems, just because RL can change causal mechanisms to lead to high performance, whereas conditioning has to get high performance just by sampling more and more extreme outputs. (But mostly I think you don't want to carve up the space into mutually exclusive "Causal Goodhart" and "Extremal Goodhart".)
davidad (David A. Dalrymple) (9mo):
I think so, yes.

Firstly, a clarification: I don't want to claim that RL-with-KL-penalty policies are the same as the results of conditioning. I want to claim that you need further assumptions about the joint distribution of (overseer score, true utility) in order to know which produces worse Goodhart problems at a particular reward level (and so there's no particular reason to think of RL as worse).


It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equ

... (read more)

Incidentally, I doubt either of us considers this kind of empirical evidence much of an update about the long-term situation, but Gao et al. compare best-of-N and RL and find that "the relationship between the proxy reward model score and the gold reward model score is similar for both methods." (Thanks to Ansh Radhakrishnan for pointing this out.)

davidad (David A. Dalrymple) (9mo):
In RLHF there are at least three different (stochastic) reward functions:
1. the learned value network,
2. the “human clicks 👍/👎” process, and
3. the “what if we asked a whole human research group and they had unlimited time and assistance to deliberate about this one answer” process.
I think the first two correspond to what that paper calls “proxy” and “gold”, but I am instead concerned with the ways in which 2 is a proxy for 3.
Lawrence Chan (9mo):
I agree with the general point, but I'll note that at equal proxy reward model scores, the RL policy has significantly more KL divergence with the base policy. 

I think the actual concern there is about human feedback, but you phrased the question as about overseer feedback, but then your answer (quoted) is about any reward signal at all.

 

I think that some people actually have the concern I responded to there, rather than the concern you say that they might have instead.

I agree that I conflated between overseer feedback and any reward signal at all; I wondered while writing the post whether this conflation would be a problem. I don't think it affects the situation much but it's reasonable for you to ask me to justify that.

I’m afraid that one post which states a bunch of opinions about related questions, while including detailed reasoning but only for the less controversial ones, might be more persuasive than it ought to be about the juicier questions.

 

It isn't my intention to do this kind of motte and bailey; as I said, I think people really do conflate these questions, and I think that the things I said in response to some of these other questions are actually controversial to some. Hopefully people don't come away confused in the way you describe.

Thanks for the clear argument (and all your other great comments).

I totally agree with 1 and 2. I'm not sure what I think of 3 and 4; I think it's plausible you're right (and either way I suspect I'll learn something useful from thinking it through).

In the first model I thought through, though, I don't think that you're right: if you train a model with RL with a KL penalty, it will end up with a policy that outputs a distribution over answers which is equivalent to taking the generative distribution and then applying a Boltzmann factor to upweight answers ... (read more)
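In symbols, the standard result being referenced (a sketch, writing $\pi_0$ for the generative/base policy, $r$ for the reward, and $\beta$ for the KL-penalty coefficient): the policy

$$\pi^*(y \mid x) \;\propto\; \pi_0(y \mid x)\,\exp\!\big(r(x,y)/\beta\big)$$

maximizes $\mathbb{E}_{\pi}[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)$. That is, the KL-regularized optimum reweights the base distribution by a Boltzmann factor in the reward, rather than collapsing onto the single highest-reward answer.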

(The worked example in this comment was a joint effort with Eric Neyman and Drake Thomas.)

Here's a toy example in which we get worse Goodharting for RL than for filtering: suppose that our model has three submodules

  • A, which tries to produce outputs which are both true and persuasive
  • B, which tries to produce outputs which are true, but have no effect on persuasiveness
  • C, which tries to produce outputs which are persuasive, but with no effect on truthiness.

Our model has parameters (one per submodule) summing to 1 which determine how much to listen to each of thes... (read more)

Lawrence Chan (9mo):
This seems correct insofar as your proxy reward does not have huge upward errors (that you don't remove via some sort of clipping). For example, if there are 1 million normal sentences with reward uniformly distributed between [0, 100] and one adversarial sentence with reward r=10^5, conditioning on reward > 99 leads to a 1/10,000 chance of sampling the adversarial sentence, while it's very tricky (if not impossible) to correctly set the KL penalty so you end up optimizing reward without just outputting the adversarial sentence over and over again.

I don't think it's particularly unprincipled; the KL penalty is just our way of encoding that the base policy is relatively reasonable (in a value-free way), and the model shouldn't deviate from it too much. Similarly, BoN is another way of encoding that the base policy is reasonable, albeit one that's easier to reason about on average. Another way people regularize their RL is to mix in self-supervised training with RL (for example, they did this for text-davinci-003). I'm pretty sure this is also equivalent to RL w/ KL penalty.

There's the impact penalties approach (e.g. AUP or RLSP), which seems more principled than KL penalties in cases where you have more information on the space of rewards. You can also do constrained optimization, which is again equivalent to modifying the reward function, but is way easier for humans to reason about and give feedback on.
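A quick numeric check of the arithmetic above (a sketch under the stated assumptions; not from the original comment):

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.uniform(0, 100, size=1_000_000)  # 1M "normal" sentences
rewards = np.append(rewards, 1e5)              # one adversarial sentence with reward 10^5

passed = rewards[rewards > 99]                 # conditioning / filtering on reward > 99
print(len(passed))        # roughly 10,000 sentences survive the filter
print(1 / len(passed))    # so ~1e-4 chance of sampling the adversarial one
```

The filter keeps the adversarial sentence, but only as one draw in about ten thousand; a sufficiently strong RL optimizer against the same proxy would put nearly all of its probability mass on it.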
davidad (David A. Dalrymple) (9mo):
This is very interesting. I had previously thought the “KL penalty” being used in RLHF was just the local one that’s part of the PPO RL algorithm, but apparently I didn’t read the InstructGPT paper carefully enough. I feel slightly better about RLHF now, but not much.

It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering. That could be seen as a lexicographic objective where the binarised reward gets optimised first and then the KL penalty relative to the predictive model would restore the predictive model’s correlations (once the binarised reward is absolutely saturated). Unfortunately, this would be computationally difficult with gradient descent, since you would already have mode-collapse before the KL penalty started to act.

In practice (in the InstructGPT paper, at least), we have a linear mixture of the reward and the global KL penalty. Obviously, if the global KL penalty is weighted at zero, it doesn’t help avoid Causal Goodhart, nor if it’s weighted at 10^-10. Conversely, if it’s weighted at 10^10, the model won’t noticeably respond to human feedback. I think this setup has a linear tradeoff between how much helpfulness you get and how much you avoid Causal Goodhart.

The ELBO argument in the post you linked requires explicitly transforming the reward into a Boltzmann distribution (relative to the prior of the purely predictive model) before using it in the objective function, which seems computationally difficult. That post also suggests some other alternatives to RLHF that are more like cleverly accelerated filtering, such as PPLM, and has a broad conclusion that RL doesn’t seem like the best framework for aligning LMs.

That being said, both of the things I said above seem computationally difficult but also not-necessarily-impossible, and would be research directions I would want to allocate a lot of thought to if
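For reference, a short sketch of the equivalence davidad describes (under the idealized threshold formulation, not the linear mixture used in practice): minimizing KL to the base policy $\pi_0$ subject to the reward always exceeding a threshold $t$ recovers the conditioned (filtered) base policy,

$$\min_{\pi:\,\pi(R<t)=0} \mathrm{KL}(\pi \,\|\, \pi_0) \;\;\text{is attained at}\;\; \pi^* = \pi_0(\cdot \mid R \ge t),$$

because for any $\pi$ supported on $\{R \ge t\}$ we have $\mathrm{KL}(\pi \,\|\, \pi_0) = \mathrm{KL}\big(\pi \,\|\, \pi_0(\cdot \mid R \ge t)\big) - \log \pi_0(R \ge t)$, and the second term does not depend on $\pi$.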

Thanks for the questions :)

I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no,

Probably no. 

but I am broadly a bit confused when this is a commitment for.

Yeah we haven't totally settled this yet; the application form asks a lot of questions about availability. I think the simplest more specific answer is "you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer".

Also, are people going th

... (read more)
Ben Pace (1y):
Thanks for the answers! :)

Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?

No, it seems very likely for the model to not say that it's deceptive, I'm just saying that the model seems pretty likely to think about being deceptive. This doesn't help unless you're using interpretability or some other strategy to evaluate the model's deceptiveness without relying on noticing deception in its outputs.

Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it was a competitive sport. I still think my guess is that the best humans would be worse than GPT-3, and I'm unsure if they're worse than GPT-2.

(There's no limit on anyone spending a bunch of time practicing this game, if for some reason someone gets really into it I'd enjoy hearing about the results.)

The first thing I imagine is that nobody asks those questions. But let's set that aside.

I disagree fwiw

The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either.

I agree.

Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a hu

... (read more)
Jacques Thibodeau (1y):
Any additional or new thoughts on this? Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently? Do you believe it's way more likely that we'd be unable to prompt things out of the model only if it were deceptive? Could you say more?

Separately: suppose I have a chain-of-thought model detailing the steps it will take to reach outcome x, and we've fine-tuned it on previous chains of thought while giving process-level feedback. Even if you are trying to get it to externalize its thoughts/reasoning, it could lead to extinction via side effect. So you might ask the model at each individual thought (or just about the entire plan) whether we'll be happy with the outcome. How exactly would the model end up querying its internal world model in the way we would want it to?

What do you imagine happening if humans ask the AI questions like the following:

  • Are you an unaligned AI?
  • If we let you keep running, are you (or some other AI) going to end up disempowering us?
  • If we take the action you just proposed, will we be happy with the outcomes?

I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it's powerful enough to kill us all as a side effect of its god-tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And ... (read more)

johnswentworth (1y):
The first thing I imagine is that nobody asks those questions. But let's set that aside.

The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere.

And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether they'll be happy with the outcome or whatever. (It's essentially the same mistake as a GOFAI person looking at a node in some causal graph that says "will_kill_humans", and seeing that node set to 99% False, and thinking that somehow implies the GOFAI will not kill humans.)

Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a human asks "Are you an unaligned AI?" or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it's an unaligned AI. The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it's an unaligned AI, and the humans interpret that as an answer to their original question, ther

[writing quickly, sorry for probably being unclear]

If the AI isn't thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power.

The humans don't want to have the AI grab power, and so they'll try in various ways to make it so that they'll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will... (read more)

At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there, even if the ants are coordinatedly trying to resist them and the humans never proactively try to prevent the ants from resisting them by eg proactively killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the

... (read more)

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at... (read more)

I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI", apologies if this was confusing.

My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your action is able to do one of these "positive-sum" pivotal acts in a single ac... (read more)

It seems pretty plausible that the AI will trade for compute with some other person around the world.

Whether this is what I'm trying to call a zero-sum action depends on whose resources it's trading. If the plan is "spend a bunch of the capital that its creators have given it on compute somewhere else", then I think this is importantly zero-sum--the resources are being taken from the creators of AI, which is why the AI was able to spend so many resources. If the plan was instead "produce some ten trillion dollar invention, then sell it, then use the proceeds to buy compute elsewhere", this would seem less zero-sum, and I'm saying that I expect the first kind of thing to happen before the second.

I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to give them power or compute of some kind. Also, all of course selected for plans that are least like

... (read more)
Oliver Habryka (1y):
Yeah, OK, I think this distinction makes sense, and I do feel like this distinction is important. Having settled this, my primary response is:

Sure, I guess it's the most prototypical catastrophic action until we have solved it, but like, even if we solve it, we haven't solved the problem where the AI does actually get a lot smarter than humans and takes a substantially more "positive-sum" action and kills approximately everyone with the use of a bioweapon, or launches all the nukes, or develops nanotechnology. We do have to solve this problem first, but the hard problem is the part where it seems hard to stop further AI development without having a system that is also capable of killing all (or approximately all) the humans, so calling this easy problem the "prototypical catastrophic action" feels wrong to me.

Solving this problem is necessary, but not sufficient for solving AI Alignment, and while it is this stage and earlier stages where I expect most worlds to end, I expect most worlds that make it past this stage to not survive either. I think given this belief, I would think your new title is more wrong than the current title (I mean, maybe it's "mostly", because we are going to die in a low-dignity way as Eliezer would say, but it's not obviously where most of the difficulty lies).

From Twitter:

Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3. 

I’m looking forward to the day where it turns out that adding “Let’s think through it as if we were an AI who knows that if it gets really good scores during fine tuning on helpfulness, it will be given lots of influence later” increases helpfulness by 5% and so we add it to our prompt by default.

Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.

Yeah I wrote an interface like this for personal use, maybe I should release it publicly.

I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.

Examples of small AI c... (read more)

Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.

This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which ... (read more)

Are any of these ancient discussions available anywhere?

In hindsight this is obviously closely related to what paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7

Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.

One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to.

How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients.

Two potential alternatives to the thing you said:

  • maybe competitive alignment
... (read more)

[epistemic status: speculative]

A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.

A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in ... (read more)

In hindsight this is obviously closely related to what paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7

Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.

Something I think I’ve been historically wrong about:

A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.

Similarly with debate... (read more)

Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central). A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups.

In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).

Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).

It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.

 

You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
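To illustrate the kind of bound being gestured at, here is a minimal sketch for a plain fully-connected ReLU network (not an actual EfficientNet, whose convolutions, swish activations, and squeeze-excitation blocks would change the constants; the toy model below is purely illustrative):

```python
import torch
import torch.nn as nn

def lipschitz_upper_bound(model: nn.Sequential) -> float:
    """Crude L2 Lipschitz bound: ReLU is 1-Lipschitz, so the product of the
    weight matrices' spectral norms bounds how far the logits can move per
    unit of input movement."""
    bound = 1.0
    for layer in model:
        if isinstance(layer, nn.Linear):
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
        # ReLU (and any other 1-Lipschitz layer) contributes a factor of at most 1.
    return bound

# Toy MNIST-shaped classifier, inputs scaled to [0, 1].
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
delta = 1.0 / 255.0  # L2 norm of flipping the least significant bit of one pixel
print("max L2 change in logits:", lipschitz_upper_bound(model) * delta)
```

Since softmax is 1-Lipschitz, the same quantity also bounds the change in the class probabilities; certified-robustness techniques give tighter bounds than this naive product, as the comment notes.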
