I'm the chief scientist at Redwood Research.
I don't think this explains our disagreements. My low confidence guess is we have reasonably similar views on this. But, I do think it drives parts of some disagreements between me and people who are much more optimisitic than me (e.g. various not-very-concerned AI company employees).
I agree the value of (1) vs (2) might also be a crux in some cases.
Sometimes people talk about how AIs will be very superhuman at a bunch of (narrow) domains. A key question related to this is how much this generalizes. Here are two different possible extremes for how this could go:
My sense is that the current frontier LLMs are much closer to (1) than (2) for most of their skills, particularly the skills which they've been heavily trained on (e.g. next token prediction or competitive programming). As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we'll be much closer to (2). That said, it seems likely that powerful AIs built in the current paradigm[1] which otherwise match humans at downstream performance will somewhat lag behind humans in integrating/generalizing skills they learn (at least without spending a bunch of extra compute on skill integration) because this ability currently seems to be lagging behind other capabilities relative to humans and AIs can compensate for worse skill integration with other advantages (being extremely knowledgeable, fast speed, parallel training on vast amounts of relevant data including "train once, deploy many", better memory, faster and better communication, etc).
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
If the paradigm radically shifts by the time we have powerful AIs, then the relative level of integration is much less clear. ↩︎
While I think debate might be a useful prosaic method for some levels of capability, I have a variety of concerns about this approach resulting in worst case guarantees[1]:
I think the argument pushes large parts of the problem into requiring a universal solution for exploration hacking. I think prosaically handling certain cases of exploration hacking for certain levels of capability seems very doable, but the approach discussed in the paper requires a very strong notion of solving exploration hacking (e.g., the AI would tell you any secret it might know via any mechanisms which you can verify, even if that secret would be arbitrarily hard to explore into such as if the AIs know of some arbitrary password you can check).
A universal solution for exploration hacking would itself require solving a particular alignment problem which IMO contains most of the difficulty of the overall alignment problem: fully solving exploration hacking effectively requires "we can ensure the AI will give us any knowledge that we can verify" which is effectively the same as "we can align an AI to trying to give us any knowledge that we can verify" and this doesn't obviously seem importantly easier than the overall alignment problem without further restrictions. Importantly, "giving knowledge" might itself be a difficult task due to ontology translation (see e.g. counterexamples in the ELK report). Another way to put this, is that solving exploration hacking (in the worst case) seems almost as difficult as being able to elicit arbitrary knowledge.
The paper discusses an assumption of "good human input" as being sufficient. However, the relevant question is whether the idealized system (the limit of a well elicited debate) actually finds truth in a given case. And, I think it seems very unclear whether it will find truth in the cases we care about. (E.g., does a reasonable human judge result in good behavior in the infinite structure or does something crazy happen? Do we somehow do better due to approximation?)
Minimally, answering some questions might involve hard to justify heuristics and there are issues with persuasion.
In some insane limit, anything is deductively verifiable (except the prior) using Solomonoff induction or similar, but surely this isn't going to actually save us! (For instance, the AIs will be incapable of arguing about this, and there isn't any reason why a plausible amount of training would elicit this without a magic solution to exploration hacking which put all the difficulty in this part. At this point, you could also just debate over purely logical formal statements going down to Solomonoff induction or some other intractable computational structure.)
Putting aside exploration hacking, it's unclear this yields reasonable results for more mundane reasons, like exploration being hard. And, the relevant level of exploration we need is extremely good as we're trying to elicit good performance over an exponentially large structure that we can't explore.
It also seems like you won't know if you've reached a sufficiently optimal level as I expect that you'll always see returns in terms of debater performance (as measured in e.g. elo) with further RL. So, you'll need to have some way of understanding what optimality suffices for eliciting the knowledge the AI already had.
I've also discussed these concerns in a call where the authors were present, but I thought it might be helpful to quickly write up my concerns. ↩︎
Thanks, I updated down a bit on risks from increasing philosophical competence based on this (as all of these seem very weak)
(Relevant to some stuff I'm doing as I'm writing about work in this area.)
IMO, the biggest risk isn't on your list: increased salience and reasoning about infohazards in general and in particular certain aspects of acausal interactions. Of course, we need to reason about how to handle these risks eventually but broader salience too early (relative to overall capabilities and various research directions) could be quite harmful. Perhaps this motivates suddenly increasing philosophical competence so we quickly move through the regime where AIs aren't smart enough to be careful, but are smart enough to discover info hazards.
Here is the quote from Dario:
More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.
To address the severity of these alignment risks, we will have to see inside AI models much more clearly than we can today. For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments2. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking3 because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts. What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. Honestly I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.
IMO, this implies that interp would allow for rallying support while it would be hard otherwise, implying the behavioral evidence isn't key.
This is often important in my thinking: when thinking about various internals based methods that could test for scheming (but won't produce direct behavioral evidence), this comes up. I wrote this doc after noticing that I wanted to reference this somewhere.
Also, I often hear people discuss getting non-behavioral evidence for scheming using internals/interp. (As an example, probes for detecting deceptive cognition and then seeing if this fire more than expected on honeypots.) And, understanding this isn't going to result in legible evidence is important for understanding the theory of change for this work: it's important that you can iterate usefully against the method. I think people sometimes explicitly model iterating against these testing methods, but sometimes they don't.
Maybe this doesn't come up as much in your conversation with people, but I've seen internals based testing methods which don't clearly ground out in behavioral evidence discussed often.
(E.g., it's the application that the Anthropic interp team has most discussed, it's the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)
Notably Dario seemingly thinks that circuit style interp analysis (which IMO would be unlikely to yield behavioral evidence on it's own) is the main way we might get definitive (aka legible) evidence of scheming. So, I think Dario's essay on interp is an example of someone disagreeing with this post! Dario's essay on interp came out after this post was published, otherwise I might have referenced it.
I wasn't trying to trigger any research particular reprioritization with this post, but I historically found that people hadn't really thought through this (relatively obvious once noted) consideration and I think people are sometimes interested in thinking through specific theories of impact for their work.
This isn't that important, but I think the idea of using an exponential parallelization penalty is common in the economics literature. I specifically used 0.4 as around the harshest penalty I've heard of. I believe this number comes from some studies on software engineering where they found something like this.
I'm currently skeptical that toy models of DAGs/tech trees will add much value over:
(Separately AIs might be notably better at coordinating than humans are which might change things substantially. Toy models of this might be helpful.)
I sometimes write proposals for empirical AI safety work without publishing the proposal (as the doc is somewhat rough or hard to understand). But, I thought it might be useful to at least link my recent project proposals publicly in case someone would find them useful.
If you're interested in empirical project proposals, you might also be interested in "7+ tractable directions in AI control".
I've actually already linked this (and the proposal in the next bullet) publicly from "7+ tractable directions in AI control", but I thought it could be useful to link here as well. ↩︎
I agree parallelization penalties might bite hard in practice. But it's worth noting that the AIs in the AutomatedCorp hypothetical also run 50x faster and are more capable.
(A strong marginal parallelization penalty exponent of 0.4 would render the 50x additional workers equivalent to a 5x improvement in labor speed, substantially smaller than the 50x speed improvement.)
Fair, but I think the AI being aware of its behavior is pretty continuous with being aware of the heuristics it's using and ultimately generalizing these (e.g., in some cases the AI learns what code word it is trying to make the user say which is very similar to being aware of any other aspect of the task it is learning). I'm skeptical that very weak/small AIs can do this based on some other papers which show they fail at substantially easier (out-of-context reasoning) tasks.
I think most of the reason why I believe this is improving with capabilities is due to a broader sense of how well AIs generalize capabilities (e.g., how much does o3 get better at tasks it wasn't trained on), but this paper was the most clearly relevant link I could find.