Does the AI think of the other AI as "itself" to the extent that it thinks about stuff like this at all?
Do the AIs cooperate in a way which is reasonably similar to cooperating with yourself?

A bird's eye view of ARC's research

Ryan Greenblatt13d20

I do think this was reasonably though not totally predictable ex-ante, but I agree.

A bird's eye view of ARC's research

Ryan Greenblatt13d20

it turns out that enforcing sparsity (and doing an SAE) gives better interp scores than doing PCA

It's not clear to me this is true exactly. As in, suppose I want to explain as much of what a transformer is doing as possible with some amount of time. Would I better off looking at PCA features vs SAE features?

Yes, most/many SAE features are easier to understand than PCA features, but each SAE feature (which is actually sparse) is only a tiny, tiny fraction of what the model is doing. So, it might be that you'd get better interp scores (in terms of how much of what the model is doing) with PCA.

Certainly, if we do literal "fraction of loss explained by human written explanations" both PCA and SAE recover approximately 0% of training compute.

I do think you can often learn very specific more interesting things with SAEs and for various applications SAEs are more useful, but in terms of some broader understanding, I don't think SAEs clearly are "better" than PCA. (There are also various cases where PCA on some particular distribution is totally the right tool for the job.)

Certainly, I don't think it has been shown that we can get non-negligible interp scores with SAEs.

To be clear, I do think we learn something from the fact that SAE features seem to often/mostly at least roughly correspond to some human concept, but I think the fact that there are vastly more SAE features vs PCA features does matter! (PCA was never trying to decompose into this many parts.)

A bird's eye view of ARC's research

Ryan Greenblatt13d73

Somewhat off-topic, but isn't this a non-example:

We have strong evidence that interesting semantic features exist in superposition

I think a more accurate statement would be "We have a strong evidence that neurons don't do a single 'thing' (either in the human ontology or in any other natural ontology)" combined with "We have a strong evidence that the residual stream represents more 'things' than it has dimensions".

Aren't both of these what people would (and did) predict without needing to look at models at all?^[1] As in both of these are the null hypothesis in a certain sense. It would be kinda specific if neurons did do a single thing and we can rule out 1 "thing" per dimension in the residual stream via just noting that Transformers work at all.

I think there are more detailed models of a more specific thing called "superposition" within toy models, but I don't think we have strong evidence of any very specific claim about larger AIs.

(SAE research has shown that SAEs often find directions which seem to at least roughly correspond to concepts that humans have and which can be useful for some methods, but I'm don't think we can make a much stronger claim at this time.)

In fact, I think that mech interp research was where the hypothesis "maybe neurons represent a single thing and we can understand neurons quite well (mostly) in isolation" was raised. And this hypothesis seems to look worse than a more default guess about NNs being hard to understand and there not being an easy way to decompose them into parts for analysis. ↩︎

Daniel Kokotajlo's Shortform

Ryan Greenblatt15d1114

Low importance aside:

Seems slightly unproductive and unnecessarily connotation inducing to say "Anthropic says:". Maybe "The Sabotage eval paper (from Anthropic) says:". Like this was just written by various people on the Anthropic alignment science team and by no means should discussion sections of papers be interpreted to speak for the company overall. Obviously not very important.

Daniel Kokotajlo's Shortform

Ryan Greenblatt15d84

We discuss similar issues in meta-level adversarial evaluations of oversight.

The most simple approach is just to have a human in the loop monitoring every action the AI does in combination with targeting a proxy failure rather than a real failure. This probably can reduce risk to a point where this is a small fraction of overall risk.

Human monitoring might be too expensive, in which case we'll have to go with sandboxing and pure proxy.

Catastrophic sabotage as a major threat model for human-level AI systems

Ryan Greenblatt16d64

I would say that the "internal use restrictions" are just an ad hoc control argument and I'd prefer to explicitly think about it like that. But, it is worth noting that if you are sufficiently conservative in the deployment (as Evan describes), then I think it would be fine to make a very minimal safety case.

Here is an example of what this could look like:

The model isn't quite superhuman in any important way. Supporting claims:
- We're confident the model isn't wildly more powerful than it appears. Evidence: [you have to fill out]
- The model appears to not be superhuman in any important way. Evidence: We ran some capability evals.
Our deployment is so conservative that as long as the model isn't quite super human, the model couldn't do anything even if it wanted to. Supporting claims:
- We thought about it some and it seems so conservative that it is unlikely we could be wrong even though we didn't spend that long doing threat modeling.
- We've actually implemented the relevant infrastructure (for sandboxing etc) and it is used in all the relevant places (including in RL and evaluations of the model).
- We'll enforce this level of conservativeness in a sufficiently robust way that we're confident that employee error won't be an issue. Evidence: [you have to fill out]

Note that this safety case doesn't require running any control evals and it probably requires no ML work other than evals you were probably already doing and supporting evidence for the model not being wildly more capable than it seems. It probably requires some non-trivial infra to set up and enforce this conservative of a deployment.

Catastrophic sabotage as a major threat model for human-level AI systems

Ryan Greenblatt16d82

I think your internal usage restrictions are wildly unaffordable

Sure, but note that Evan says "Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage."

I don't think it is (clearly) wildly unaffordable to require an affirmative safety case prior to proceeding with (substantially) using the model internally? (Note that control could be part of such a case.)

I think Evan is just trying to say "using the model some (but not for XYZ) wouldn't (shouldn't?) even require making an affirmative safety case (at least for the first point at which this triggers)".

I think this can be consistent with treating models like external contractors.

(Though I'd note that "treat the model like external contractors" is maybe too weak and I'd replace this with something like "treat the model like external contractors who have a 25% chance of all being spies of a foreign adversary".)

I do think that making an affirmative safety case for a "very low" level of total lifetime takeover risk (perhaps 0.25%) is very likely to be totally unaffordable at some point (at least in short timelines and with my priors).

Mark Xu's Shortform

Ryan Greenblatt1mo61

This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me.

Agreed. (And I made a comment to this effect on Mark's draft of this short form.) I don't think it is that productive to classify research directions, the central thing we're trying to point at is a classification for reasons why you might think that you're fine where one reason is "the AI couldn't do bad stuff without getting caught" and another reason is "the AI wouldn't do bad stuff" (obviously simplifying somewhat).

I also think that white-box control should typically be considered pretty separately from black-box control and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective, e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation via exogenously trying to determine the probe reliability and other important aspects of the probe. For instance, the probe can take advantage of black box strategies and how safe we are depends on how many actions AIs need to cause huge problems.)

The claims I'd like to make is:

There should be more research specifically targeting a black-box control story and which is operating from a black-box control perspective. (I'd guess like 20% of the overall AI safety portfolio should be this and maybe 30% of safety work at labs. This is not including work like "get AIs to be more useful at alignment research and get better at DC eval methodology" which should also be subsidized from a control perspective but seem non-central.)
When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable oversight style research), they should consider sometimes explicitly think about the control perspective and consider adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment style story they think about while doing their research, they should be roughly equally excited about thinking about the control story. I also find that the control story is often more straightforward.
People thinking about safety plans for near term powerful AI should strongly consider control.

Mark Xu's Shortform

Ryan Greenblatt1mo50

Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.

You're potentially thinking about footnote 4 in our post on control:

As one operationalization, suppose that you had two teams at your AI lab, one of which was much more competent, and you had to assign one to be in charge of minimizing the probability that the AIs you're about to train and deploy were scheming, and the other to be in charge of ensuring control. To address risk due to powerful AI in the next 8 years, we think it would be better to assign the more competent team to ensuring control, because we think that the quality of control interventions is more influential on this risk.

This footnote is a bit confusingly worded, but I think the situation we were trying to say is "Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We're guessing you should put the better team on control."

This is different than research in the run up.

I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren't the only classes of interventions and some interventions don't nicely fit into these buckets, e.g., you can do work on differentially making AIs more useful for alignment work which isn't well classified as either and you can work on high level interpretability which aims to roughly understand how AIs make decisions in some cases (this high-level interp doesn't clearly help with reducing the chance that scheming arises very directly, but could help with a bunch of stuff).)