I'm the chief scientist at Redwood Research.
Here is the quote from Dario:
More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.
To address the severity of these alignment risks, we will have to see inside AI models much more clearly than we can today. For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts. What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. Honestly I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.
IMO, this implies that interp would make it possible to rally support while rallying support would otherwise be hard, which in turn implies that behavioral evidence isn't key.
This is often important in my thinking: it comes up when considering various internals-based methods that could test for scheming (but won't produce direct behavioral evidence). I wrote this doc after noticing that I wanted to reference this consideration somewhere.
Also, I often hear people discuss getting non-behavioral evidence for scheming using internals/interp. (As an example, probes for detecting deceptive cognition, and then seeing if these fire more than expected on honeypots.) And understanding that this isn't going to result in legible evidence is important for understanding the theory of change for this work: it's important that you can iterate usefully against the method. I think people sometimes explicitly model iterating against these testing methods, but sometimes they don't.
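To make the kind of method I have in mind concrete, here is a minimal, purely illustrative sketch of the probes-on-honeypots idea. The activations, dimensions, and labels below are synthetic stand-ins I made up for illustration; this isn't any lab's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative only: activations are synthetic stand-ins. In practice these would
# be hidden-state vectors extracted from the model on labeled transcripts.
rng = np.random.default_rng(0)
d_model = 512

honest_acts = rng.normal(0.0, 1.0, size=(1000, d_model))    # "honest" examples
deceptive_acts = rng.normal(0.3, 1.0, size=(1000, d_model))  # "deceptive" examples
X = np.vstack([honest_acts, deceptive_acts])
y = np.array([0] * 1000 + [1] * 1000)

# Linear probe for deceptive cognition.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Compare how often the probe fires on honeypot inputs vs. ordinary inputs.
ordinary_acts = rng.normal(0.0, 1.0, size=(500, d_model))
honeypot_acts = rng.normal(0.0, 1.0, size=(500, d_model))
ordinary_rate = probe.predict(ordinary_acts).mean()
honeypot_rate = probe.predict(honeypot_acts).mean()
print(f"probe firing rate: ordinary={ordinary_rate:.2f}, honeypots={honeypot_rate:.2f}")

# A honeypot firing rate well above the ordinary rate would be evidence of
# deceptive cognition that you can iterate against, but it isn't the kind of
# behavioral evidence that is legible to third parties.
```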
Maybe this doesn't come up as much in your conversations with people, but I've often seen discussion of internals-based testing methods which don't clearly ground out in behavioral evidence.
(E.g., it's the application that the Anthropic interp team has most discussed, and it's the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)
Notably, Dario seemingly thinks that circuit-style interp analysis (which IMO would be unlikely to yield behavioral evidence on its own) is the main way we might get definitive (aka legible) evidence of scheming. So I think Dario's essay on interp is an example of someone disagreeing with this post! His essay came out after this post was published; otherwise I might have referenced it.
I wasn't trying to trigger any particular research reprioritization with this post, but I've historically found that people hadn't really thought through this (relatively obvious once noted) consideration, and I think people are sometimes interested in thinking through specific theories of impact for their work.
This isn't that important, but I think the idea of using an exponential parallelization penalty is common in the economics literature. I specifically used 0.4 because it's around the harshest penalty I've heard of; I believe this number comes from studies on software engineering that found something like this.
I'm currently skeptical that toy models of DAGs/tech trees will add much value over:
(Separately AIs might be notably better at coordinating than humans are which might change things substantially. Toy models of this might be helpful.)
I agree parallelization penalties might bite hard in practice. But it's worth noting that the AIs in the AutomatedCorp hypothetical also run 50x faster and are more capable.
(A strong marginal parallelization penalty exponent of 0.4 would render the 50x additional workers equivalent to a 5x improvement in labor speed, substantially smaller than the 50x speed improvement.)
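For concreteness, here's the arithmetic behind that parenthetical. The functional form (output scaling with workers raised to the 0.4 exponent) is just my own illustration of the penalty discussed above, not a model taken from the post.

```python
# With output scaling as (number of workers) ** 0.4, a 50x larger workforce is
# worth about 50 ** 0.4 ≈ 4.8x in serial-labor-equivalent terms, far less than
# an actual 50x serial speedup.
penalty_exponent = 0.4
workforce_multiplier = 50
effective_speedup = workforce_multiplier ** penalty_exponent
print(f"{workforce_multiplier}x workers ≈ {effective_speedup:.1f}x labor speedup")
```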
Doesn't seem that wild to me? When we scale up compute we're also scaling up the size of frontier training runs; maybe past a certain point running smaller experiments just isn't useful (e.g. you can't learn anything from experiments using 1 billionth of the compute of a frontier training run); and maybe past a certain point you just can't design better experiments. (Though I agree with you that this is all unlikely to bite before a 10x speedup.)
Yes, but also: if the computers are getting serially faster, then you also have to be able to respond to the results and implement the next experiment faster to keep up. E.g., imagine a (physically implausible) computer which can run any experiment using less than 1e100 FLOP in under a nanosecond. To maximally utilize this, you'd want to be able to respond to results and implement the next experiment in less than a nanosecond as well. This is of course an unhinged hypothetical, and in this world you'd also be able to immediately create superintelligence by e.g. simulating a huge evolutionary process.
I think this post would be better if it tabooed the word "alignment" or at least defined it.
I don't understand what the post means by alignment. My best guess is "generally being nice", but I don't see why that's what we wanted. I usually use the term alignment to refer to alignment between the AI and the developer/operator: under this definition, an AI is aligned with an operator if the AI is trying to do what the operator wants it to do.
I wanted the ability to make AIs which are corrigible and which follow some specification precisely. I don't see how starting by training AIs in simulated RL environments (seemingly without any specific reference to corrigibility or a spec?) could get an AI which follows our spec.
Do you also dislike Moore's law?
I agree that anchoring stuff to release dates isn't perfect, because the underlying quantity of "how long does it take until a model is released" is itself variable, but I think this variability is sufficiently low that it doesn't cause that much of an issue in practice. The trend is only going to be very solid over multiple model releases and it won't reliably time things to within 6 months, but that seems fine to me.
I agree that if you add one outlier data point and then trend extrapolate between just the last two data points, you'll be in trouble, but fortunately, you can just not do this and instead use more than 2 data points.
This also means that I think people shouldn't update that much on the individual o3 data point in either direction. Let's see where things go for the next few model releases.
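To illustrate the "use more than 2 data points" point, here's a minimal sketch of fitting an exponential trend across several releases so that a single outlier doesn't dominate the extrapolation. The dates and capability numbers are made up for illustration, not real benchmark data.

```python
import numpy as np

# Made-up capability numbers for illustration only.
release_dates = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
metric = np.array([4.0, 9.0, 14.0, 40.0, 70.0])  # e.g. task horizon in minutes

# Log-linear least-squares fit over all releases, rather than extrapolating
# from just the last two points.
slope, intercept = np.polyfit(release_dates, np.log(metric), 1)
doubling_time_years = np.log(2) / slope
next_release = 2025.5
prediction = np.exp(intercept + slope * next_release)
print(f"doubling time ≈ {doubling_time_years:.2f} years, "
      f"predicted metric at {next_release}: {prediction:.0f}")
```

Fitting over all the points means one surprising release shifts the estimated trend only modestly, which is the sense in which a single data point shouldn't move you much in either direction.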
Yep, good point.
Sure, but none of these things are cruxes for the argument I was making, which was that it isn't that expensive to keep humans physically alive.
I'm not denying that humans might all be out of work quickly (putting aside regulatory capture, government jobs, human job programs, etc.). My view is more that if alignment is solved, it isn't hard for some humans to stay alive and retain control, and these humans could also pretty cheaply keep all other humans alive at a low competitiveness overhead.
I don't think the typical person should find this reassuring, but the top-level post argues for stronger claims than "the situation might be very unpleasant because everyone will lose their job".
Thanks, I updated down a bit on risks from increasing philosophical competence based on this (as all of these seem very weak).
(Relevant to some stuff I'm doing as I'm writing about work in this area.)
IMO, the biggest risk isn't on your list: increased salience of and reasoning about infohazards in general, and certain aspects of acausal interactions in particular. Of course, we need to reason about how to handle these risks eventually, but broader salience too early (relative to overall capabilities and various research directions) could be quite harmful. Perhaps this motivates suddenly increasing philosophical competence, so we quickly move through the regime where AIs aren't smart enough to be careful but are smart enough to discover infohazards.