Buck Shlegeris


It seems like a big input into P(AI takeover) is the extent to which instances of our AI are inclined to cooperate with each other; specifically, the extent to which they’re willing to sacrifice overseer approval on the task they’re currently doing in return for causing a different instance to get more overseer approval. (This worries me because if they’re willing to make that trade, then I’m way more scared of them colluding with each other to fool oversight processes, subvert red-teaming procedures, or coordinate coups.)

If the AIs never interact with each other, there’s selection pressure against them caring about the effects of their actions on the reward given to other instances.

But IRL, the AIs are probably actually going to be working with each other a whole lot, and will plausibly often have opportunities to make tradeoffs where they help each other out (in the same way that people at the same org often have opportunities to help each other out to varying degrees). Inasmuch as these opportunities arise, it’s probably good from the perspective of the org for the AIs to be cooperative with each other, and to care about helping each other succeed at their tasks. (For the same reason that you want workers at an org to help each other when it increases total productivity.)

(It’s actually a bit unclear to me how valuable it is to have AIs that cooperate. Many of the reasons that people at orgs can produce surplus by helping each other don’t seem applicable. E.g. you might not have specialization to different areas of knowledge (except for safety reasons)?)

There are a few ways you might set things up so that the AIs are able to be appropriately cooperative with each other:

  • You could just have a market economy. You already plausibly wanted it to be the case that your AI instances have access to some cash, and can spend it as they think is appropriate to accomplish their goals (and the amount they spend is deducted from their reward). And then you can have your AIs cooperate via an internal economy, where if they want assistance they pay each other.
    • This kind of setup doesn’t obviously train the AIs to care about each other.
    • In this kind of setup, it is not a competitive advantage for the AIs to care about each other. But it is a competitive advantage for the AIs to be scrupulously honest with each other. Idk what effect that has.
    • There are various disadvantages here analogous to the reasons that normal orgs don’t do this.
  • A way scarier, probably dumb, but very simple approach is to set it up so that whenever two AIs interact with each other, their rewards “bleed into” each other a bit, so they’re naturally inclined to help each other out. E.g. maybe whenever the AIs interact in a particular way, you naturally give each of the AIs 3% of the reward that the other one gets within some time period. This is super simple to implement, might make your AI org work more effectively, and seems terrifyingly likely to get you murdered.
  • You could also have your oversight process try to estimate how much good the instance did via helping out other instances, which is basically just the process-based version of training AIs to help each other out.
    • In general, it’s probably possible to use occasional high-quality oversight and adversarial training to train your models to be pretty cooperative. So I expect that even just with this process-based cooperation training, the AIs will be way more cooperative than workers at big companies.
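
To make the first two setups concrete, here is a minimal toy sketch in Python. Everything in it is a hypothetical illustration: the function names, the instance labels, and the reward numbers are made up, and crediting a payment to the payee's reward is an assumption (the text above only says spending is deducted from the payer). The 3% share is the figure from the example above.

```python
def settle_payments(task_rewards, payments):
    """Market setup: deduct each payment from the payer's reward.

    Assumption: the payment is also credited to the payee's reward.
    """
    rewards = dict(task_rewards)
    for payer, payee, amount in payments:
        rewards[payer] -= amount
        rewards[payee] += amount
    return rewards


def bleed_rewards(task_rewards, interactions, share=0.03):
    """Reward-bleed setup: each interacting pair receives `share` of the
    other instance's task reward on top of its own."""
    rewards = dict(task_rewards)
    for a, b in interactions:
        rewards[a] += share * task_rewards[b]
        rewards[b] += share * task_rewards[a]
    return rewards


# Hypothetical task rewards for three instances.
task_rewards = {"A": 10.0, "B": 0.0, "C": 5.0}

market = settle_payments(task_rewards, [("A", "B", 2.0)])
bled = bleed_rewards(task_rewards, [("A", "B")])
```

In the market version an instance only gains from helping if it gets paid, whereas in the bleed version B's reward depends directly on how well A does, which is exactly the property that makes it both simple and scary.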

I was thinking about this because I was trying to figure out how much of my P(doom) comes from deceptive alignment. The above arguments that we’ll want to train models for cooperation mean that deceptive alignment accounts for less of the reason we might get collusive models.

Another item for the list of “mundane things you can do for AI takeover prevention”:

We have a wide variety of ways to make AI systems less capable but also less scary. Most alignment research is focused on pushing out the Pareto frontier here. But IMO a lot of value can come from techniques which allow us to choose the right point on this Pareto frontier for a particular application. It seems to me that different AI applications (where by “different AI applications” I’m including things like “different types of tasks you might ask your assistant to do”) have very different tradeoffs between capabilities and scariness. The classic example of this is that you plausibly shouldn’t put your scariest/smartest AIs in charge of running your nuclear weapon silos, because it seems like the returns to being super smart aren’t very high, and the model has particularly good opportunities to behave badly. On the other hand, galaxy brained AIs that were trained end-to-end to use inscrutable thoughts to have incomprehensible ideas can be used fruitfully on tasks where the stakes aren’t high or where verification is easy.

Something I've realized over the last few days:

Why did we look at just the “most aggressive” experiment allowed by a hypothesis H, instead of choosing some other experiment allowed by H?

The argument for CaSc is: “if H was true, then running the full set of swaps shouldn’t affect the computation’s output, and so if the full set of swaps does affect the computation’s output, that means H is false.” But we could just as easily say “if H was true, then the output should be unaffected by any set of swaps that H says should be fine.”

Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which makes it look worst”.

I've just realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’). And so, when explaining CaSc, I think we should plausibly think about describing it by talking about the hypothesis producing a bunch of allowed experiments, and then you can test your hypothesis by either looking at the maxent one or by looking at the worst one.
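
A toy illustration of the "fullest set of swaps" versus "worst allowed set of swaps" distinction, in Python. This is not the real CaSc algorithm: it assumes a "swap" just means replacing one input that H calls irrelevant with the value from an independently chosen donor example, and the model, dataset, and function names are all made up for the sketch.

```python
from itertools import combinations, product
from statistics import mean


def scrubbed_mean(model, data, scrub):
    """Mean output when each input position in `scrub` is replaced by the
    value at that position from its own independently chosen donor."""
    outputs = []
    for x in data:
        # Enumerate every donor assignment, so there is no sampling noise.
        for donors in product(data, repeat=len(scrub)):
            patched = list(x)
            for pos, donor in zip(scrub, donors):
                patched[pos] = donor[pos]
            outputs.append(model(*patched))
    return mean(outputs)


def allowed_experiments(irrelevant):
    """Every subset of the swaps that H says should be fine."""
    for k in range(len(irrelevant) + 1):
        yield from combinations(irrelevant, k)


def model(a, b, c):
    return c * (a + b)


data = [(1, 0, 0), (0, 1, 1)]  # hypothetical dataset of (a, b, c) inputs
irrelevant = (1, 2)            # H: only `a` matters, so b and c may be swapped

baseline = scrubbed_mean(model, data, ())
deviation = {
    swaps: abs(scrubbed_mean(model, data, swaps) - baseline)
    for swaps in allowed_experiments(irrelevant)
}
full_set_deviation = deviation[irrelevant]
worst_swaps = max(deviation, key=deviation.get)
```

In this toy case the full set of swaps leaves the average output unchanged, while swapping only `b` moves it from 0.5 to 0.25, so the worst allowed experiment rejects H even though the fullest one does not.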

Here's a take of mine on how you should think about CaSc that I haven't so far gotten around to publishing anywhere:

I think you should think of CaSc as being a way to compute a prediction made by the hypothesis. That is, when you claim that the model is computing a particular interpretation graph, and you provide the correspondence between the interpretation graph and the model, CaSc tells you a particularly aggressive prediction made by your hypothesis: your hypothesis predicts that making all the swaps suggested by CaSc won't affect the average output of your computational graph.

Thinking about it this way is helpful to me for two reasons:

  • False hypotheses can make true predictions; this is basically why CaSc can fail to reject false hypotheses.
  • It also emphasizes why I'm unsympathetic to claims that "it sets the bar too high for something being a legit circuit"--IMO, if you claimed that your model has some internal structure well described by a hypothesis that fits into the CaSc structure (which is true of almost all interp hypotheses in practice), I don't really see how the failure of a CaSc test is compatible with that hypothesis being true (modulo my remaining questions about how bad it is for a hypothesis to get a middling CaSc score).
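
Here is a toy Python sketch of "CaSc as a prediction made by the hypothesis", including a false hypothesis that makes a true prediction. It is a heavily simplified stand-in rather than the real algorithm: the hypothesis is just "the output depends only on `a`", scrubbing just means replacing `b` with the value from every possible donor input, and the models and datasets are invented for illustration.

```python
from itertools import product
from statistics import mean


def original_mean(model, data):
    """Average output of the model on the unmodified dataset."""
    return mean(model(a, b) for a, b in data)


def scrubbed_mean(model, data):
    """Average output when `b` is replaced by `b` from every donor input.

    The hypothesis "only `a` matters" predicts this equals original_mean.
    """
    return mean(
        model(a, donor_b) for (a, _), (_, donor_b) in product(data, data)
    )


def and_model(a, b):
    return a * b


def sum_model(a, b):
    return a + b - 0.5


correlated = [(0, 0), (1, 1)]                   # b always equals a
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]  # b independent of a

# The hypothesis is false for and_model, and the prediction fails.
and_original = original_mean(and_model, correlated)  # 0.5
and_scrubbed = scrubbed_mean(and_model, correlated)  # 0.25

# The hypothesis is also false for sum_model, but the prediction holds.
sum_original = original_mean(sum_model, independent)  # 0.5
sum_scrubbed = scrubbed_mean(sum_model, independent)  # 0.5
```

Both models depend on `b`, so "only `a` matters" is false for both. The check on average output only catches the first one; the second is a false hypothesis whose prediction about the average output happens to be true.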

CaSc attempts to compute the single most aggressive prediction made by your hypothesis--this is why we do all allowed swaps. (I'm a bit confused about whether we should think of CaSc as succeeding at being the most aggressive experiment for the hypothesis though; I think there are some subtleties here that my coworkers have worked out that I don't totally understand.)

I think I regret that we phrased our writeup as "CaSc gives you a test of interp hypotheses" rather than saying "CaSc shows you a strong prediction made by your interp hypothesis, which you can then compare to the truth, and if they don't match that's a problem for your hypothesis".

Thanks for your work!

Causal Scrubbing Cannot Differentiate Extensionally Equivalent Hypotheses

I think that what you mean here is a combination of the following:

  • CaSc fails to reject some false hypotheses, as already discussed.
  • Each node in the interpretation graph is only verified up to extensional equality. As in, if I claim that a single node in the graph is a whole sort function, I don't learn anything about whether the model is implementing quicksort or mergesort.

But one way someone could interpret this sentence is that CaSc doesn't distinguish between whether the model does quicksort or mergesort. This isn't generally true--if your interpretation graph broke up its quicksort implementation into multiple nodes, then the CaSc experiment would fail to explain the model's performance if the model itself was actually using merge sort.

Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.

The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don't want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I'll be able to predict whether it will generalize correctly onto a particular new distribution.

The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following:  There's some interference in the model which manifests as random noise, and the explanation failed to preserve the interference pattern. In this case, your explanation has a bunch of random error in its prediction of what the model does, which will hurt the KL. But that interference was random and understanding it won't help you know if the mechanism that the model was using is going to generalize well to another distribution.

There are other cases similar to the interference one. For example, if your model has a heuristic that fires on some of the subdistribution you're trying to understand the model's behavior on, but not in a way that ends up affecting the model's average performance, this is basically another source of noise that you (at least often) end up not wanting your explanation to have to capture.
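
A small numeric sketch of the interference case, in Python. The probabilities are hypothetical: the "model" assigns a per-input probability of 0.7 plus zero-mean noise, and the "explanation" predicts a flat 0.7. The explanation matches the model's average behavior exactly but still pays a KL penalty on every input.

```python
import math


def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


# Hypothetical per-input probabilities of the correct answer.
model_p = [0.6, 0.8, 0.6, 0.8]  # 0.7 plus zero-mean "interference"
explained_p = [0.7] * 4         # explanation ignores the interference

n = len(model_p)
mean_gap = abs(sum(model_p) / n - sum(explained_p) / n)
mean_kl = sum(kl_bernoulli(p, q) for p, q in zip(model_p, explained_p)) / n
```

Here `mean_gap` is zero up to floating-point error while `mean_kl` is strictly positive, so a KL-based score penalizes the explanation for noise that has no effect on the one-dimensional quantity being explained.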

It’s a pretty different algorithm, though obviously it’s trying to solve a related problem.

My current guess is that people who want to use this algorithm should just implement it from scratch themselves--using our software is probably more of a pain than it's worth if you don't already have some reason to use it.

After a few months, my biggest regret about this research is that I thought I knew how to interpret the numbers you get out of causal scrubbing, when actually I'm pretty confused about this.

Causal scrubbing takes an explanation and basically says “how good would the model be if the model didn’t rely on any correlations in the input except those named in the explanation?”. When you run causal scrubbing experiments on the induction hypothesis and our paren balance classifier explanation, you get numbers like 20% and 50%.
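
For concreteness, here is one way a percentage like that could be computed, sketched in Python. The normalization below (scrubbed loss placed on a scale from a no-information baseline to the unscrubbed model) is an assumption for illustration rather than something stated in this comment, and the loss values are made up.

```python
def fraction_of_loss_explained(loss_model, loss_scrubbed, loss_baseline):
    """Where the scrubbed loss sits between a baseline (0.0) and the
    unscrubbed model (1.0). Assumed normalization, for illustration only."""
    return (loss_scrubbed - loss_baseline) / (loss_model - loss_baseline)


# Hypothetical losses: the scrubbed model lands halfway between the two.
score = fraction_of_loss_explained(
    loss_model=1.0, loss_scrubbed=3.0, loss_baseline=5.0
)
```

Note that nothing in this arithmetic says whether landing halfway means the explanation is half right.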

The obvious next question is: what do these numbers mean? Are those good numbers or bad numbers? Does that mean that the explanations are basically wrong, or mostly right but missing various minor factors?

My current position is “I don’t really know what those numbers mean.”

The main way I want to move forward here is to come up with ways of assessing the quality of interpretability explanations which are based on downstream objectives like "can you use your explanation to produce adversarial examples" or "can you use your explanation to distinguish between different mechanisms the model is using", and then use causal-scrubbing-measured explanation quality as the target which you use to find explanations, but then validate the success of the project based on whether the resulting explanations allow you to succeed at your downstream objective.

(I think this is a fairly standard way of doing ML research. E.g., the point of training large language models isn't that we actually wanted models which have low perplexity at predicting webtext, it's that we want models that understand language and can generate plausible completions and so on, and optimizing a model for the former goal is a good way of making a model which is good at the latter goal, but we evaluate our models substantially based on their ability to generate plausible completions rather than by looking at their perplexity.)

I think I was pretty wrong to think that I knew what “loss explained” means; IMO this was a substantial mistake on my part; thanks to various people (e.g. Tao Lin) for really harping on this point.

(We have some more ideas about how to think about "loss explained" than I articulated here, but I don't think they're very satisfying yet.)


In terms of what we can take away from these numbers right now: I think that these numbers seem bad enough that the interpretability explanations we’ve tested don’t seem obviously fine. If the tests had returned numbers like 95%, I’d be intuitively sympathetic to the position “this explanation is basically correct”. But because these numbers are so low, I think we need to instead think of these explanations as “partially correct” and then engage with the question of how good it is to produce explanations which are only partially correct, which requires thinking about questions like “what metrics should we use for partial correctness” and “how do we trade off between those metrics and other desirable features of explanations”.

I have wide error bars here. I think it’s plausible that the explanations we’ve seen so far are basically “good enough” for alignment applications. But I also think it’s plausible that the explanations are incomplete enough that they should roughly be thought of as “so wrong as to be basically useless”, because similarly incomplete explanations will be worthless for alignment applications. And so my current position is “I don’t think we can be confident that current explanations are anywhere near the completeness level where they’d add value when trying to align an AGI”, which is less pessimistic than “current explanations are definitely too incomplete to add value” but still seems like a pretty big problem.

We started realizing that all these explanations were substantially incomplete in around November last year, when we started doing causal scrubbing experiments. At the time, I thought that this meant that we should think of those explanations as “seriously, problematically incomplete”. I’m now much more agnostic.

(I also think that the evidence you're providing is mostly orthogonal to this argument.)

Upon further consideration, I think you're probably right that the causal scrubbing results I pointed at aren't actually about the question we were talking about, my mistake.

but in general, I'd rather advance this dialogue by just writing future papers

Seems like probably the optimal strategy. Thanks again for your thoughts here.
