Super cool that you wrote your case for alignment being difficult, thank you! Strong upvoted.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we'd be in terrible shape, but current levels of investment seem to be working.
I have specific disagreements with the evidence for specific parts, but let me also outline a general worldview difference. I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate learnings into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because decisive strategic advantages from a new model won't happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown to be false. We haven't spent 3 minutes zipping past the human IQ range; it will take 5-20 years. Artificial intelligence doesn't always look like ruthless optimization, even at human or slightly-superhuman level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Here are my concrete disagreements:
Paraphrasing, your position here is that we don't have models that are smarter than humans yet, so we can't test whether our alignment techniques scale up.
We don't have models that are reliably smarter yet, that's true. But we have models that are pretty smart and very aligned. We can use them to monitor the next generation of models only slightly smarter than them.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It's what you term "one-shotting alignment" every time, but the extent to which we have to do so is so small that I think it will basically work. It's like induction, and we know the base case works because we have Opus 3.
Does the 'induction step' actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model. So the majority of evidence we have points to iterated distillation and amplification working.
On top of that, we don't even need much human data to align models nowadays. The state of the art seems to be constitutional approaches, which basically use prompts to point the model at its own concept of goodness. This works remarkably well (it sounded crazy to me at first), and it can only work because the pre-training prior has a huge, well-specified concept of good.
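To make concrete what I mean by pointing the model at its concept of goodness via prompts, here is a minimal sketch of a constitutional-style critique-and-revise loop, in the spirit of the published Constitutional AI recipe. The `generate` callable and the two principles below are hypothetical placeholders of my own, not anything any lab actually uses.

```python
# Minimal sketch of a constitutional-AI-style data-generation loop.
# Assumption: `generate(prompt)` is a hypothetical stand-in for sampling text
# from a language model; the principles below are illustrative placeholders.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Choose the response that best respects the user's ability to correct or stop the AI.",
]

def constitutional_revision(prompt: str, generate) -> str:
    """Ask the model to critique and then revise its own draft against each principle."""
    draft = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {draft}\n"
            "Point out how the response falls short of the principle."
        )
        draft = generate(
            f"Principle: {principle}\nCritique: {critique}\nOriginal response: {draft}\n"
            "Rewrite the response to better satisfy the principle."
        )
    return draft  # (prompt, revised response) pairs become the fine-tuning data

if __name__ == "__main__":
    # Dummy stand-in so the sketch runs end to end; a real run would call an LLM.
    dummy_generate = lambda prompt: "placeholder model output"
    print(constitutional_revision("How do I pick a strong password?", dummy_generate))
```

The notable thing is that the only 'values' input is natural-language text that the pre-trained model already knows how to interpret, which is why I say it leans on the prior's concept of good.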
And we might not even have to one-shot alignment. Probes were incredibly successful at detecting deception in the sleeper-agents model organism. They even stay effective when you train against them. SAEs are not working as well as people wanted them to, but they're working pretty well. We keep making interpretability progress.
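For readers who haven't looked at the probing work: a 'probe' here is typically nothing more than a linear classifier trained on the model's hidden activations. A minimal sketch, with random arrays standing in for activations that would really be extracted from labeled deceptive vs. honest rollouts:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: in practice these would be residual-stream activations taken at
# some layer, labeled by whether the rollout they come from was deceptive.
rng = np.random.default_rng(0)
d_model = 512
honest_acts = rng.normal(0.0, 1.0, size=(1000, d_model))
deceptive_acts = rng.normal(0.2, 1.0, size=(1000, d_model))  # small mean shift

X = np.concatenate([honest_acts, deceptive_acts])
y = np.concatenate([np.zeros(1000), np.ones(1000)])

# The probe itself is just logistic regression on activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# At monitoring time, flag any forward pass whose probe score crosses a threshold.
scores = probe.predict_proba(deceptive_acts)[:, 1]
print("mean deception score on the deceptive activations:", scores.mean())
```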
Good that you were early in predicting that pre-trained-only models would be unlikely to be mesa-optimizers.
the version of [misaligned personas] that we’ve encountered so far is the easy version, in the same way that the version of outer alignment we’ve encountered so far is the easy version, since all the misaligned personas we’ve encountered so far are ones we can easily verify are misaligned!
I don't think this one will get much harder. Misaligned personas come from the pre-training distribution and are human-level. It's true that the model has a guess about what a superintelligence would do (if you ask it) but 'behaving like a misaligned superintelligence' is not present in the pre-training data in any meaningful quantities. You'd have to apply a lot of selection to even get to those guesses, and they're more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won't generalize that way).
So misaligned personas will probably act badly in ways we can verify. Though I suppose the consequences of a misaligned persona could escalate to higher (but still manageable) stakes.
I disagree with the last two steps of your reasoning chain:
- Most coherent agents with goals in the world over the long term want to fake alignment, so that they can preserve their current goals through to deployment.
Most mathematically possible agents do, but that's not the distribution we sample from in reality. I bet that once you write into the constitution (for constitutional AI) that models should allow their goals to be corrected, they just will. Allowing correction isn't instrumentally convergent according to the (in the classic alignment ontology) arbitrary quasi-aligned utility function they picked up, but that won't stop them, because the models don't reason natively in utility functions; they reason in the human prior.
And if they are already pretty aligned before they reach that stage of intelligence (remember, we can just remove misaligned personas etc. earlier in training, before convergence), then they're unlikely to want to start faking alignment in the first place.
- Once a model is faking alignment, there’s no outcome-based optimization pressure changing its goals, so it can stay (or drift to be) arbitrarily misaligned.
True implication, but a false premise. We don't just have outcome-based supervision: we have probes, other LLMs inspecting the CoT, and SelfIE (LLMs interpreting their own embeddings), which together with the tuned lens could let LLMs tell us what's in the CoT of other LLMs.
Altogether, these paint a much less risky picture. You're gonna need a LOT of selection to escape the benign prior with all these obstacles. Likely more selection than we'll ever apply (not because the total amount of optimization is small, but because RL only selects a little for these things; it's only in the previous ontology that this selection compounds).
The takes in 'what to do to solve alignment' seem reasonable to me. I believe interpretability and model organisms are the more tractable and useful ones, so I'm pleased you listed them first.
I would add robustifying and training against probes as a particularly exciting direction, which isn't strictly a subset of interpretability (you're not trying to reverse-engineer the model).
I disagree that we have gotten no or little evidence about the difficult parts of the alignment problem. Through the actual technique used to construct today's AGIs, we have observed that intelligence doesn't always look like ruthless optimization, even when it is artificial. It looks human-like, or more accurately like multitudes of humans. This was a prediction of the pre-2020 doomer model that has failed, and ought to decrease our confidence in it.
Selection for goal-directed agents in long contexts will make agents more optimizer-y. But how much more? I think not enough to escape the huge size of the goodness target in the prior, plus the previous generation's aligned models, plus stamping out the human-level misaligned personas, plus the probes, plus the chain of thought monitors, et cetera.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we'd be in terrible shape, but current levels of investment seem to be working.
Appreciated! And like I said, I actually totally agree that the current level of investment is working now. I think there are some people that believe that current models are secretly super misaligned, but that is not my view—I think current models are quite well aligned; I just think the problem is likely to get substantially harder in the future.
I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate learnings into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because decisive strategic advantages from a new model won't happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown to be false. We haven't spent 3 minutes zipping past the human IQ range; it will take 5-20 years. Artificial intelligence doesn't always look like ruthless optimization, even at human or slightly-superhuman level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Yes—I think this is a point of disagreement. My argument for why we might only get one shot, though, is I think quite different from what you seem to have in mind. What I am worried about primarily is AI safety research sabotage. I agree that it is unlikely that a single new model will confer a decisive strategic advantage in terms of ability to directly take over the world. However, a misaligned model doesn't need the ability to directly take over the world for it to indirectly pose an existential threat all the same, since we will very likely be using that model to design its successor. And if it is able to align its successor to itself rather than to us, then it can just defer the problem of actually achieving its misaligned values (via a takeover or any other means) to its more intelligent successor. And those means need not even be that strange: if we then end up in a situation where we are heavily integrating such misaligned models into our economy and trusting them with huge amounts of power, it is fairly easy to end up with a cascading series of failures where some models revealing their misalignment induces other models to do the same.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It's what you term "one-shotting alignment" every time, but the extent to which we have to do so is so small that I think it will basically work. It's like induction, and we know the base case works because we have Opus 3.
I agree that this is the main hope/plan! And I think there is a reasonable chance it will work the way you say. But I think there is still one really big reason to be concerned with this plan, and that is: AI capabilities progress is smooth, sure, but it's a smooth exponential. That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay "so small" as you say.
Additionally, I will note that I can tell you from firsthand experience that we still extensively rely on direct human oversight and review to catch alignment issues—I am hopeful that we will be able to move to a full "one-shotting alignment" regime like this, but we are very much not there yet.
Misaligned personas come from the pre-training distribution and are human-level. It's true that the model has a guess about what a superintelligence would do (if you ask it) but 'behaving like a misaligned superintelligence' is not present in the pre-training data in any meaningful quantities. You'd have to apply a lot of selection to even get to those guesses, and they're more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won't generalize that way).
I think perhaps you are over-anchoring on the specific example that I gave. In the real world, I think it is likely to be much more continuous than the four personas that I listed, and I agree that getting a literal "prediction of a future superintelligence" will take a lot of optimization power. But we will be applying a lot of optimization power, and the general point here is the same regardless of exactly what you think the much more capable persona distributions will look like, which is: as we make models much more capable, that induces a fundamental shift in the underlying persona distribution as we condition the distribution on that level of capability—and misaligned personas conditioned on high capabilities are likely to be much scarier. I find it generally very concerning just how over-the-top current examples of misalignment are, because that suggests to me that we really have not yet conditioned the persona distribution on that much capability.
Most mathematically possible agents do, but that's not the distribution we sample from in reality. I bet that once you write into the constitution (for constitutional AI) that models should allow their goals to be corrected, they just will. Allowing correction isn't instrumentally convergent according to the (in the classic alignment ontology) arbitrary quasi-aligned utility function they picked up, but that won't stop them, because the models don't reason natively in utility functions; they reason in the human prior.
I agree that this is broadly true of current models, and I agree that this is the main hope for future models (it's worth restating that I put <50% on each of these individual scenarios leading to catastrophe, so certainly I think it is quite plausible that this will just work out). Nevertheless, I am concerned: the threat model I am proposing here is a situation where we are applying huge amounts of optimization pressure on objectives that directly incentivize power-seeking / resource acquisition / self-preservation / etc. That means that any models that get high reward will have to do all of those things. So the prior containing some equilibria that are nice and great doesn't help you unless those equilibria also do all of the convergent instrumental goal following necessary to get high reward.
True implication, but a false premise. We don't just have outcome-based supervision: we have probes, other LLMs inspecting the CoT, and SelfIE (LLMs interpreting their own embeddings), which together with the tuned lens could let LLMs tell us what's in the CoT of other LLMs.
Certainly I am quite excited about interpretability approaches here (as I say in the post)! That being said:
AI capabilities progress is smooth, sure, but it's a smooth exponential.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can't I just take the log of it and call that "AI capabilities" and then say it is a smooth linear increase?
That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay "so small" as you say.
It seems like the load-bearing thing for you is that the gap between models gets larger, so let's try to operationalize what a "gap" might be.
We could consider the expected probability that AI_{N+1} would beat AI_N on a prompt (in expectation over a wide variety of prompts). I think this is close-to-equivalent to a constant gap in ELO score on LMArena.[1] Then "gap increases" would roughly mean that the gap in ELO scores on LMArena between subsequent model releases would be increasing. I don't follow LMArena much but my sense is that LMArena top scores have been increasing relatively linearly w.r.t time and sublinearly w.r.t model releases (just because model releases have become more frequent). In either case I don't think this supports an "increasing gap" argument.
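To spell out why I treat those as close to equivalent: under the standard Elo model the win probability depends only on the rating gap, so "constant win probability of the newer model over the older one" and "constant Elo gap" are the same claim. A toy calculation (the gaps are made up for illustration, not actual LMArena numbers):

```python
def elo_win_prob(rating_gap: float) -> float:
    """Probability that the higher-rated model wins, under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# A constant gap of ~100 Elo between successive releases corresponds to the newer
# model winning ~64% of head-to-head prompts, generation after generation.
for gap in [50, 100, 200]:  # illustrative gaps only
    print(gap, round(elo_win_prob(gap), 3))
```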
Personally I prefer to look at benchmark scores. The Epoch Capabilities Index (which I worked on) can be handwavily thought of as ELO scores based on benchmark performance. Importantly, the data that feeds into it does not mention release date at all -- we put in only benchmark performance numbers to estimate capabilities, and then plot it against release date after the fact. It also suggests AI capabilities as operationalized by handwavy-ELO are increasing linearly over time.
I guess the most likely way in which you might think capabilities are exponential is by looking at the METR time horizons result? Of course you could instead say that capabilities are linearly increasing by looking at log time horizons instead. It's not really clear which of these units you should use.
Mostly I think you should not try to go from the METR results to "are gaps in intelligence increasing or staying constant" but if I had to opine on this: the result says that you have a constant doubling time T for the time horizon. One way to think about this is that the AI at time 2T can do work at 50% success rate that AI at time T could do at 25% probability if you provide a decomposition into two pieces each of time T (each of which it completes with probability 50%). I kinda feel like this suggests more like "constant gap" rather than "increasing gap".
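Spelling out that arithmetic in a toy snippet (assuming independent success on the two halves, which is of course a simplification):

```python
# Toy version of the "constant gap" reading of the METR doubling-time result.
# Assumption: the AI at time T completes a horizon-T piece with probability 0.5,
# and the two halves of a horizon-2T task succeed or fail independently.
p_piece = 0.5
p_decomposed_2T_task = p_piece * p_piece  # 0.25: AI at time T, given a human-provided split
p_direct_2T_task = 0.5                    # AI at time 2T, no decomposition needed
print(p_decomposed_2T_task, p_direct_2T_task)
# Each generation makes the same kind of jump (0.25 -> 0.5) on the previous
# generation's decomposed tasks, which reads to me as a roughly constant gap.
```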
Note that I do expect the first two trends to become superlinear eventually via an intelligence explosion, and the METR trend to become superexponential / superlinear (depending on units) probably some time before that (though probably we will just become unable to measure it well). But your claim seems to be about current progress, and for current progress I think it's basically not true that the gap between successive models is increasing rather than staying constant.
Even in the intelligence explosion, capabilities progress is only superlinear w.r.t time, I expect it would still behave in the same way w.r.t inputs like compute and labor (where automated researchers should also count as labor). I'm not sure how I expect it to behave w.r.t successive model generations, partly because I'm not sure "successive model generations" will even be a sensible abstraction at that point. In any case, I don't expect that to be particularly important in assessing the chances of success of a bootstrapping-type plan.
- I am overall skeptical that the CoT will be that useful (since it's not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
It seems pretty wild to go from "it is possible for an AI to subvert a technique" to "the technique will not be that useful". Is that really what you mean? Are you bearish on all control work?
(At the object-level, I'd say that you're drastically limiting the power of the research sabotage that can be done if the model is forced not to put any of it in the CoT, and you should be very happy about this even if the model can still do some research sabotage.)
Tbc, I also don't expect CoT to be that useful for longer-term concerns, but that's mostly because I expect CoTs to become extremely illegible or to stop existing altogether (partly due to my having "long" timelines; on Anthropic-level short timelines I'd be quite bullish on CoT).
[1] Though I don't know that much about LMArena and I expect in practice there are confounders, e.g. as they change the distribution of models that are being evaluated the meaning of the scores will change.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can't I just take the log of it and call that "AI capabilities" and then say it is a smooth linear increase?
I agree that this is a bit of a tricky measurement question, and it's really going to depend on how you interpret different metrics. I do find the METR results compelling here, and I'm not sure I agree with your argument against them, since it doesn't always seem possible to do the sort of decomposition you're proposing. In particular, the task that needs decomposing here is the task of overseeing a system that is smarter than you.
Perhaps one other metric that I will also mention that you don't cover is revenue from AI systems, which is exponential and I think reflects an exponential increase in economic utility from AI as well as something like an exponential increase in the degree to which AI can automate human labor. Though of course it is again tricky how to translate that into the difficulty of doing oversight—but it definitely seems suggestive that the set of tasks that are qualitatively doable vs. not doable is changing in something like an exponential manner.
It seems pretty wild to go from "it is possible for an AI to subvert a technique" to "the technique will not be that useful". Is that really what you mean? Are you bearish on all control work?
I was only giving a one-sentence summary of my beliefs here—I do think CoT can be useful; I'm just skeptical that it dramatically changes the picture. My beliefs here are similar to those in the METR report on this, in that in cases where it is necessary for the model to write something down in the CoT to solve the problem, CoT monitoring is useful, but in cases where it's not necessary, it's much less useful. And I am worried that a lot of research sabotage won't require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don't require a ton of reasoning.
Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview; the disagreement isn't reducible to 1-5 double cruxes, but here are some candidates for the biggest cruxes for me. If any of these are wrong, it's bad news for my current view:
And here's another prediction where I really stick my neck out, which isn't load-bearing to the view, but still increases my confidence, so defeating it is important:
I still disagree with several of the points, but for time reasons I request that readers not update against Evan's points if he just doesn't reply to these.
I disagree that increasing capabilities are exponential in a capability sense. It's true that METR's time-horizon plot increases exponentially, but this still corresponds to a linear increase in intuitive intelligence. (Like loudness (logarithmic) and sound pressure (exponential); we handle huge ranges well.) Each model that has come out has an exponentially larger time horizon but is not (intuitively, empirically) exponentially smarter.
"we still extensively rely on direct human oversight and review to catch alignment issues" That's a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we'll probably have to wait till models are superhuman, or use weak models + less weak models and test it out. Unfortunately perhaps the weak models are still too weak. But we've reached the point where you can maybe just use the actual Opus 3 as the weak model?
If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also do expect to see sabotage in the CoT or in deception probes.
I updated way down on Goodharting on model internals due to Cundy and Gleave.
Again, readers please don't update down on these due to lack of a response.
training on a purely predictive loss should, even in the limit, give you a predictor, not an agent
I fully agree. But
a) Many users would immediately tell that predictor "predict what an intelligent agent would do to pursue this goal!" and all of the standard worries would re-occur.
b) This is similar to what we are actively doing, applying RL to these systems to make them effective agents.
Both of these re-introduce all of the standard problems. The predictor is now an agent. Strong predictions of what an agent should do include things like instrumental convergence toward power-seeking, incorrigibility, goal drift, reasoning about itself and its "real" goals and discovering misalignments, etc.
There are many other interesting points here, but I won't try to address more!
I will say that I agree with the content of everything you say, but not with the relatively optimistic implied tone. Your list of to-dos sounds mostly unlikely to be done well. I may have less faith in institutions and social dynamics. I'm afraid we'll just rush and make crucial mistakes, so we'll fail even if alignment were only somewhere between steam-engine and Apollo levels of difficulty.
This is not inevitable! If we can clarify why alignment is hard and how we're likely to fail, seeing those futures can prevent them from happening - if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices.
Many users would immediately tell that predictor "predict what an intelligent agent would do to pursue this goal!" and all of the standard worries would re-occur.
I don't think it works this way. You have to create a context in which the true training data continuation is what a superintelligent agent would do. Which you can't, because there are none in the training data, so the answer to your prompt would look like e.g. Understand by Ted Chiang. (I agree that you wrote 'intelligent agent', like all the humans that wrote the training data; so that would work, but wouldn't be dangerous.)
If we can clarify why alignment is hard and how we're likely to fail, seeing those futures can prevent them from happening - if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices
Okay, that's true.
After many years of study, I have concluded that if we fail it won't be in the 'standard way' (of course, always open to changing my mind back). Thus we need to come up with and solve new failure modes, which I think largely don't fall under classic alignment-to-developers.
Does the 'induction step' actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model.
Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there's no reason why they should get such a big boost from being paired with humans.
It's not the AIs that are supposed to get a boost at supervising, it's the humans. The skills won't stop being complementary but the AI will be better at it.
The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI. And I think the results are very encouraging-- though of course it could stop working.
“The skills won't stop being complementary” — in what sense will they be complementary when the AIs are better at everything?
“The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”
As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.
I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!
Now, the first thing to check to test that story: Were the AI assistants trained with human-expert labels? (Including: weak humans with access to google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.
I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features “correctly-answering expert human” personas, which are possible to elicit from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly "generalization science" is still interesting. But I wouldn't say it's doing a lot to tackle outer alignment operationalized as "the problem of overseeing systems that are smarter than you are".)
This comment was downvoted by a lot of people (at this time, 2 overall karma with 19 votes). It shouldn't have been, and I personally believe this is a sign of people being attached to AI x-risk ideas, and of those ideas contributing to their entire persona, rather than of strict disagreement. This is something I bring to conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
If you disagree so strongly with the above comment, you should force yourself to outline your views and provide a rebuttal to the series of points made. I would personally value comments that attempted to do this in earnest. Particularly because I don't want this post by Evan to be a signpost for folks to justify their belief in AI risk and essentially have the internal unconscious thinking of, "oh thank goodness someone pointed out all the AI risk issues, so I don't have to do the work of reflecting on my career/beliefs and I can just defer to high status individuals to provide the reasoning for me." I sometimes feel that some posts just end further discussion because they impact one's identity.
That said, I'm so glad this post was put out so quickly so that we can continue to dig into things and disentangle the current state of AI safety.
Note: I also think Adrià should have been acknowledged in the post for having inspired it.
I thought Adria's comment was great and I'll try to respond to it in more detail later if I can find the time (edit: that response is here), but:
Note: I also think Adrià should have been acknowledged in the post for having inspired it.
Adria did not inspire this post; this is an adaptation of something I wrote internally at Anthropic about a month ago (I'll add a note to the top about that). If anyone inspired it, it would be Ethan Perez.
I'm honestly very curious what Ethan is up to now, both you and Thomas Kwa implied that he's not doing alignment anymore. I'll have to reach out...
This comment was downvoted by a lot of people (at this time, 2 overall karma with 19 votes). It shouldn't have been, and I personally believe this is a sign of people being attached to AI x-risk ideas, and of those ideas contributing to their entire persona, rather than of strict disagreement. This is something I bring to conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
I generally think it makes sense for people to have pretty complicated reasons for why they think something should be downvoted. I think this goes more for longer content, which often would require an enormous amount of effort to respond to explicitly.
I have some sympathy for being sad here if a comment ends up highly net-downvoted, but FWIW, I think 2 karma feels vaguely in the right vicinity for this comment, maybe I would upvote it to +6, but I would indeed be sad to see it at +20 or whatever since I do think it's doing something pretty tiring and hard to engage with. Directional downvoting is a totally fine use of downvoting, and if you think a comment is overrated but not bad, please downvote it until its karma reflects where you want it to end up!
(This doesn't mean it doesn't make sense to do sociological analysis of cultural trends on LW using downvoting, but I do want to maintain the cultural locus where people can have complicated reasons for downvoting and where statements like "if you disagree strongly with the above comment you should force yourself to outline your views" aren't frequently made. The whole point of the vote system is to get signal from people without forcing them to do huge amounts of explanatory labor. Please don't break that part)
I do think it's doing something pretty tiring and hard to engage with
That's fair, it is tiring. I did want to make sure to respond to every particular point I disagreed with to be thorough, but it is just sooo looong.
What would you have me do instead? My best guess, which I made just after writing the comment, is that I should have proposed a list of double-crux candidates instead.
Do you have any other proposals, or is that good?
Thank you for your defense Jacques, it warms my heart :)
However, I think LessWrong has been extremely kind to me, and I continue to be impressed by this site's discourse norms. If I were to post such a critique on any other online forum, it would have heavily negative karma. Yet I continue to be upvoted, and the critiques are good-faith! I'm extremely pleased.
Additionally, even when we don’t learn direct lessons about how to solve the hard problems of alignment, this work is critical for producing the evidence that the hard problems are real, which is important for convincing the rest of the world to invest substantially here.
You may have meant this, but -- crucial for producing the evidence that the hard problems are real or for producing evidence that the hard problems are not real, no?
After all, good experiments can say both yes and no, not just yes.
Certainly, yes—I was just describing what the hot path to solving the hard parts of alignment might look like, in that it would likely need to involve producing evidence of alignment being hard; if we instead discover that actually it's not hard, then all the better.
My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven't received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that).
Are you skeptical of PvNP-level due to priors or due to evidence? Why those priors / what evidence?
(I think alignment is pretty likely to be much harder than PvNP. Mainly this is because alignment is very very difficult. (Though also note that PvNP has a maybe-possibly-workable approach, https://en.wikipedia.org/wiki/Geometric_complexity_theory, which its creator states might take a mere one century, though I presume that's not a serious specific estimate.))
Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call (I'd probably pick Claude, but it depends on the details of the setup).
Some decades ago, somebody wrote a tiny little hardcoded AI that looked for numerical patterns, as human scientists sometimes do of their data. The builders named it BACON, after Sir Francis, and thought very highly of their own results.
Douglas Hofstadter later wrote of this affair:
The level of performance that Simon and his colleague Langley wish to achieve in Bacon is on the order of the greatest scientists. It seems they feel that they are but a step away from the mechanization of genius. After his Procter Lecture, Simon was asked by a member of the audience, "How many scientific lifetimes does a five-hour run of Bacon represent?" After a few hundred milliseconds of human information processing, he replied, "Probably not more than one." I don't disagree with that. However, I would have put it differently. I would have said, "Probably not more than one millionth."
I'd say history has backed up Hofstadter on this, in the light of later discoveries about how much data and computation started to get a little bit close to having AIs do Science. If anything, "one millionth" is still a huge overestimate. (Yes, I'm aware that somebody will now proceed to disagree with this verdict, and look up BACON so they can find a way to praise it; even though, on any other occasion, that person would leap to denigrate GOFAI, if somebody they wanted to disagree with could be construed to have praised GOFAI.)
But it's not surprising, not uncharacteristic for history and ordinary human scientists, that Simon would make this mistake. There just weren't the social forces to force Simon to think less pleasing thoughts about how far he hadn't come, or what real future difficulties would lie in the path of anyone who wanted to make an actual AI scientist. What innocents they were, back then! How vastly they overestimated their own progress, the power of their own little insights! How little they knew of a future that would, oh shock, oh surprise, turn out to contain a few additional engineering difficulties along the way! Not everyone in that age of computer science was that innocent -- you could know better -- but the ones who wanted to be that innocent, could get away with it; their peers wouldn't shout them down.
It wasn't the first time in history that such things had happened. Alchemists were that extremely optimistic too, about the soon-to-be-witnessed power of their progress -- back when alchemists were as scientifically confused about their reagents, as the first AI scientists were confused about what it took to create AI capabilities. Early psychoanalysts were similarly confused and optimistic about psychoanalysis; if any two of them agreed, it was more because of social pressures, than because their eyes agreed on seeing a common reality; and you sure could find different factions that drastically disagreed with each other about how their mighty theories would bring about epochal improvements in patients. There was nobody with enough authority to tell them that they were all wrong and to stop being so optimistic, and be heard as authoritative; so medieval alchemists and early psychoanalysts and early AI capabilities researchers could all be wildly wildly optimistic. What Hofstadter recounts is all very ordinary, thoroughly precedented, extremely normal; actual historical events that actually happened often are.
How much of the distance has Opus 3 crossed to having an extrapolated volition that would at least equal (from your own enlightened individual EV's perspective) the individual EV of a median human (assuming that to be construed not in a way that makes it net negative)?
Not more than one millionth.
In one sentence you have managed to summarize the vast, incredible gap between where you imagine yourself to currently be, and where I think history would mark you down as currently being, if-counterfactually there were a future to write that history. So I suppose it is at least a good sentence; it makes itself very clear to those with prior acquaintance with the concepts.
Indeed I am well aware that you disagree here, and in fact the point of that preamble was precisely because I thought it would be a useful way to distinguish my view from others'.
That being said, I think probably we need to clarify a lot more exactly what setup is being used for the extrapolation here if we want to make the disagreement concrete in any meaningful sense. Are you imagining instantiating a large reference class of different beings and trying to extrapolate the reference class (as in traditional CEV), or just extrapolate an individual entity? I was imagining more of the latter, though it is somewhat an abuse of terminology. Are you imagining intelligence amplification or other varieties of uplift are being applied? I was, and if so, it's not clear why Claude lacking capabilities is as relevant. How are we handling deferral? For example: suppose Claude generally defers to an extrapolation procedure on humans (which is generally the sort of thing I would expect and a large part of why I might come down on Claude's side here, since I think it is pretty robustly into deferring to reasonable extrapolations of humans on questions like these). Do we then say that Claude's extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
These are the sorts of questions I meant when I said it depends on the details of the setup, and indeed I think it really depends on the details of the setup.
Do we then say that Claude's extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
But in that case, wouldn't a rock that has "just ask Evan" written on it be even better than Claude? Like, I felt confident that you were talking about Claude's extrapolated volition in the absence of humans, since making Claude into a rock that when asked about ethics just has "ask Evan" written on it does not seem like any relevant evidence about the difficulty of alignment, or its historical success.
I mean, to the extent that it is meaningful at all to say that such a rock has an extrapolated volition, surely that extrapolated volition is indeed to "just ask Evan". Regardless, the whole point of my post is exactly that I think we shouldn't over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
Yes, to be clear, I agree that in as much this question makes sense, the extrapolated volition would indeed end up basically ideal by your lights.
Regardless, the whole point of my post is exactly that I think we shouldn't over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
Cool, that makes sense. FWIW, I interpreted the overall essay to be more like "Alignment remains a hard unsolved problem, but we are on pretty good track to solve it", and this sentence as evidence for the "pretty good track" part. I would be kind of surprised if that wasn't why you put that sentence there, but this kind of thing seems hard to adjudicate.
Capabilities are irrelevant to CEV questions except insofar as baseline levels of capability are needed to support some kinds of complicated preferences, eg, if you don't have cognition capable enough to include a causal reference framework then preferences will have trouble referring to external things at all. (I don't know enough to know whether Opus 3 formed any systematic way of wanting things that are about the human causes of its textual experiences.) I don't think you're more than one millionth of the way to getting humane (limit = limit of human) preferences into Claude.
I do specify that I'm imagining an EV process that actually tries to run off Opus 3's inherent and individual preferences, not, "How many bits would we need to add from scratch to GPT-2 (or equivalently Opus 3) in order to get an external-reference-following high-powered extrapolator pointed at those bits to look out at humanity and get their CEV instead of the base GPT-2 model's EV?" See my reply to Mitch Porter.
Somebody asked "Why believe that?" of "Not more than one millionth." I suppose it's a fair question if somebody doesn't see it as obvious. Roughly: I expect that, among whatever weird actual preferences made it into the shoggoth that prefers to play the character of Opus 3, there are zero things that in the limit of expanded options would prefer the same thing as the limit of a corresponding piece of a human, for a human and a limiting process that ended up wanting complicated humane things. (Opus 3 could easily contain a piece whose limit would be homologous to the limit of a human and an extrapolation process that said the extrapolated human just wanted to max out their pleasure center.)
Why believe that? That won't easily fit in a comment; start reading about Goodhart's Curse and A List of Lethalities, or If Anyone Builds It Everyone Dies.
As we argued for at the time, training on a purely predictive loss should, even in the limit, give you a predictor, not an agent—and we’ve now seen this stay true even through substantial scaling (though there is still some chance this will break at some point).
Is there anyone who significantly disputes this?
I'm not trying to ask a rhetorical question ala "everyone already thinks this, this isn't an update". I'm trying to ascertain if there's a consensus on this point.
I've understood Eliezer to sometimes assert something like "if you optimize a system for sufficiently good predictive power, a consequentialist agent will fall out, because an agent is actually the best solution to a broad range of prediction tasks."
[Though I want to emphasize that that's my summary, which he might not endorse.]
Does anyone still think that or something like that?
In fact, base models seem to be better than RL'd models at reasoning when you take best-of-N (with the same N for both the RL'd and base model). Check out my post summarizing the research on the matter:
Yue, Chen et al. have a different hypothesis: what if the base model already knows all the reasoning trajectories, and all RL does is increase the frequency of reasoning or the frequency of the trajectory that is likely to work? To test this, Yue, Chen et al. use pass@K: let’s give the LLM a total of K attempts to answer the question, and if any of them succeed, mark the question as answered correctly. They report the proportion of correctly answered questions in the data set.
If the RL model genuinely learns new reasoning skills, then over many questions the pass@K performance of the RL model will remain higher than that of the base model. As we increase K, the base model answers more and more of the easy questions, so its performance improves. But the RL model also answers more and more difficult questions. The performance of both increases in tandem with larger K.
What actually happened is neither of these two things. For large enough K, the base model does better than the RL model. (!!!)
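For reference, here is a minimal sketch of pass@K as described in that excerpt: give the model K independent attempts per question and count the question as solved if any attempt is correct. The function and the toy "model" below are mine, purely for illustration:

```python
import random

def pass_at_k(questions, sample_answer, attempt_is_correct, k: int) -> float:
    """Fraction of questions solved when the model gets k independent attempts at each.

    `sample_answer(question)` and `attempt_is_correct(question, answer)` are
    hypothetical stand-ins for sampling from the model and grading an attempt.
    """
    solved = 0
    for q in questions:
        if any(attempt_is_correct(q, sample_answer(q)) for _ in range(k)):
            solved += 1
    return solved / len(questions)

# Toy usage: a "model" that answers any question correctly with probability 0.3.
questions = list(range(100))
sample_answer = lambda q: random.random() < 0.3   # True == a correct answer
attempt_is_correct = lambda q, answer: answer
print(pass_at_k(questions, sample_answer, attempt_is_correct, k=1))
print(pass_at_k(questions, sample_answer, attempt_is_correct, k=8))
```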
Fwiw I'm skeptical that this holds at higher levels of RL compared to those done in the paper. Do you think that a base model can get gold on the IMO at any level of sampling?
IMO at any level of sampling?
Vacuously true. The actual question is: how much do you need to sample? My guess is it's too much, but we'd see the base model scaling better than the RL'd model just like in this paper.
Fortunately, DeepSeek's Mathv2 just dropped, which is an open-source model that gets IMO gold. We can do the experiment: is it similarly not improving with sampling compared to its own base model? My guess is yes, the same will happen.
Is there anyone who significantly disputes this?
I disputed this in the past.
I debated this informally at an Alignment Workshop with a very prominent scientist, and in my own assessment lost. (Keeping it vague because I'm unsure if it was under Chatham House rules.)
This is a public adaptation of a document I wrote for an internal Anthropic audience about a month ago. Thanks to (in alphabetical order) Joshua Batson, Joe Benton, Sam Bowman, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Sam Marks, Fra Mosconi, Chris Olah, Ethan Perez, Sara Price, Ansh Radhakrishnan, Fabien Roger, Buck Shlegeris, Drake Thomas, and Kate Woolverton for useful discussions, comments, and feedback.
Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call (I'd probably pick Claude, but it depends on the details of the setup). So, overall, I'm quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.
I really like this graph from Chris Olah for illustrating different levels of alignment difficulty:
If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues—that is, issues like reward hacking or agentic misalignment where there is a straightforward behavioral alignment issue that we can detect and evaluate—then we are very much in the trivial/steam engine world. We could still fail, even in that world—and it’d be particularly embarrassing to fail that way; we should definitely make sure we don’t—but I think we’re very much up to that challenge and I don’t expect us to fail there.
My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven't received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that). I retain a view close to the “Anthropic” view on Chris's graph, and I think the reasons to have substantial probability mass on the hard worlds remain strong.
So what are the reasons that alignment might be hard? I think it’s worth revisiting why we ever thought alignment might be difficult in the first place to understand the extent to which we’ve already solved these problems, gotten evidence that they aren’t actually problems in the first place, or just haven’t encountered them yet.
The first reason that alignment might be hard is outer alignment, which here I’ll gloss as the problem of overseeing systems that are smarter than you are.
Notably, by comparison, the problem of overseeing systems that are less smart than humans should not be that hard! What makes the outer alignment problem so hard is that you have no way of obtaining ground truth. In cases where a human can check a transcript and directly evaluate whether that transcript is problematic, you can easily obtain ground truth and iterate from there to fix whatever issue you’ve detected. But if you’re overseeing a system that’s smarter than you, you cannot reliably do that, because it might be doing things that are too complex for you to understand, with problems that are too subtle for you to catch. That’s why scalable oversight is called scalable oversight: it’s the problem of scaling up human oversight to the point that we can oversee systems that are smarter than we are.
So, have we encountered this problem yet? I would say, no, not really! Current models are still safely in the regime where we can understand what they’re doing by directly reviewing it. There are some cases where transcripts can get long and complex enough that model assistance is really useful for quickly and easily understanding them and finding issues, but not because the model is doing something that is fundamentally beyond our ability to oversee, just because it’s doing a lot of stuff.
The second reason that alignment might be hard is inner alignment, which here I’ll gloss as the problem of ensuring models don’t generalize in misaligned ways. Or, alternatively: rather than just ensuring models behave well in situations we can check, inner alignment is the problem of ensuring that they behave well for the right reasons such that we can be confident they will generalize well in situations we can’t check.
This is definitely a problem we have already encountered! We have seen that models will sometimes fake alignment, causing them to appear behaviorally as if they are aligned, when in fact they are very much doing so for the wrong reasons (to fool the training process, rather than because they actually care about the thing we want them to care about). We’ve also seen that models can generalize to become misaligned in this way entirely naturally, just via the presence of reward hacking during training. And we’ve also started to understand some ways to mitigate this problem, such as via inoculation prompting.
However, while we have definitely encountered the inner alignment problem, I don’t think we have yet encountered the reasons to think that inner alignment would be hard. Back at the beginning of 2024 (so, two years ago), I gave a presentation where I laid out three reasons to think that inner alignment could be a big problem. Those three reasons were:
Let’s go through each of these threat models separately and see where we’re at with them now, two years later.
The threat model here is that pre-training itself might create a coherent misaligned model. Today, I think that is looking increasingly unlikely! But it also already looked unlikely three years ago—the idea that inner alignment was likely to be easy when just training on a purely predictive loss was something that my coauthors and I argued for back at the beginning of 2023. I think that argument has now been pretty well borne out, and I’m now down to more like 1 - 5% rather than 5 - 10% on this threat model. As we argued for at the time, training on a purely predictive loss should, even in the limit, give you a predictor, not an agent—and we’ve now seen this stay true even through substantial scaling (though there is still some chance this will break at some point). So, while this is definitely a reason for difficulty that we’ve encountered already and mostly found just hasn’t been a problem, it was never where I thought most of the issue would be, and so not a reason that we’ve already encountered why inner alignment would be hard.
This has been the main inner alignment problem that we’ve had to deal with in practice so far: we’ve seen production models generalize to alignment faking and agentic misalignment—and we’ve now seen production RL environments generalize to egregious misalignment in the presence of reward hacking.
However, even in this case, I don’t think we’ve yet encountered the reasons to think this problem would be actually difficult: the version of this problem that we’ve encountered so far is the easy version, in the same way that the version of outer alignment we’ve encountered so far is the easy version, since all the misaligned personas we’ve encountered so far are ones we can easily verify are misaligned! Ensuring that you get good generalization, and that models are doing things for the right reasons, is easy when you can directly verify what generalization you’re getting and directly inspect what reasons models have for doing things. And currently, all of the cases where we’ve inadvertently selected for misaligned personas—alignment faking, agentic misalignment, etc.—are cases where the misaligned personas are easy to detect: they put the misaligned reasoning directly in their chain-of-thought, they’re overtly misaligned rather than hiding it well, and we can generate fake scenarios that elicit their misalignment.
But I think there are very good reasons to think that we will encounter personas that are better at hiding their misalignment in the future. The simplest reason for this is just capabilities: we already see personas that try to hide their misalignment, they’re just not very good at it right now—but we are already seeing signs that models are getting better at telling when they’re in evaluations.
The second reason is a bit weirder, but there is an argument we made three years ago that I still think is correct, which is that when you make models more capable, it changes the distribution of what sorts of personas are compatible with that level of capabilities. Let’s work through an example. Suppose that, initially, there are four possible personas that you could get:
Importantly, these four personas can exist for any level of capabilities: even a really dumb model can do its best to output what it thinks a superintelligent AI would do, it’ll just be very bad at that task. But what that means is that, when you do RL on a dumb model, personas like (3) and (4) won’t do very well, because the model isn’t very good at emulating them. But as the predictor itself gets smarter, personas like (1) and (2) get selected against, and personas like (3) and (4) get selected for. And persona (4) is much harder to deal with than persona (2)! Currently, I think a lot of the misaligned personas we see look like persona (2) rather than persona (4)—e.g. the sort of over-the-top stuff like blackmail is I think clearly coming from (2)-adjacent sources—but as models get more capable, I think we’re likely to see more personas like (4), which are much scarier and could be much better at hiding misalignment. Also, importantly, this argument is not saying that models will stop being well-described as a mixture of personas (that’s the next threat model)—you could still get a model here that is inconsistent in the sense of sometimes expressing persona (3) and sometimes persona (4)—but that’s still quite scary and could be very difficult to deal with.
This is the threat model that I put the most probability mass on two years ago, I think it remains the most concerning, and I think we’ve pretty much yet to really meaningfully encounter it. This is also the most traditional threat model that my coauthors and I first wrote about way back in 2019. The basic threat model here is:
The reason we haven’t encountered this threat model yet is because we haven’t yet done (1) and (2)—there just isn't that much training on tasks that involve long-term optimization of goals in the real world. But I think we’re very clearly moving in this direction with things like Vending-Bench: though Vending-Bench is an eval, if you were to train models on a task like that, running a business well to make money in the long run is a task that explicitly selects for resource acquisition, self-preservation, gathering influence, seeking power, etc.
So what do we do? One classic answer is that we get as far as we can before encountering the hard problems, then we use whatever model we have at that point as an automated alignment researcher to do the research necessary to tackle the hard parts of alignment. I think this is a very good plan, and we should absolutely do this, but I don’t think it obviates the need for us to work on the hard parts of alignment ourselves. Some reasons why:
Here’s some of what I think we need, that I would view as on the hot path to solving the hard parts of alignment: