Strong upvote from me. This is an interesting paper, the github is well explained, and you run extensive secondary experiments to test pretty much every "Wait but couldn't this just be a result of X" that I came up with. I'm especially impressed by the range of generalization results.
Some questions I still have:
This is a pretty cool paper. Despite feeling overall quite positive about it, I have some reservations:
My summary to augment the main one:
Broadly human level AI may be here soon and will have a large impact. Anthropic has a portfolio approach to AI safety, considering both: optimistic scenarios where current techniques are enough for alignment, intermediate scenarios where substantial work is needed, and pessimistic scenarios where alignment is impossible; they do not give a breakdown of probability mass in each bucket and hope that future evidence will help figure out what world we're in (though see the last quote below). These buckets are helpful for understanding the goal of developing: better techniques for making AI systems safer, and better ways of identifying how safe or unsafe AI systems are. Scaling systems is required for some good safety research, e.g., some problems only arise near human-level, Debate and Constitutional AI need big models, need to understand scaling to understand future risks, if models are dangerous, compelling evidence will be needed.
They do three kinds of research: Capabilities which they don’t publish, Alignment Capabilities which seems mostly about improving chat bots and applying oversight techniques at scale, and Alignment Science which involves interpretability and red-teaming of the approaches developed in Alignment Capabilities. They broadly take an empirical approach to safety, and current research directions include: scaling supervision, mechanistic interpretability, process-oriented learning, testing for dangerous failure modes, evaluating societal impacts, and understanding and evaluating how AI systems learn and generalize.
I'll note that I'm confused about the Optimistic, Intermediate, and Pessimistic scenarios: how likely does Anthropic think each is? What is the main evidence currently contributing to that world view? How are you actually preparing for near-pessimistic scenarios which "could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime?"
I doubt it's a crux for you, but I think your critique of Debate makes pessimistic assumptions which I think are not the most realistic expectation about the future.
Let’s play the “follow-the-trying game” on AGI debate. Somewhere in this procedure, we need the AGI debaters to have figured out things that are outside the space of existing human concepts—otherwise what’s the point? And (I claim) this entails that somewhere in this procedure, there was an AGI that was “trying” to figure something out. That brings us to the usual inner-alignment questions: if there’s an AGI “trying” to do something, how do we know that it’s not also “trying” to hack its way out of the box, seize power, and so on? And if we can control the AGI’s motivations well enough to answer those questions, why not throw out the whole “debate” idea and use those same techniques (whatever they are) to simply make an AGI that is “trying” to figure out the correct answer and tell it to us?
When I imagine saying the above quote to a smart person who doesn't buy AI x-risk, their response is something like "woah slow down there. Just because the AI is "trying" to do something doesn't mean it stands any chance of doing actually dangerous things like hacking out of the box. The ability to hack out of the box doesn't mysteriously line up with the level of intelligence that would be useful for an AI debate." This person seems largely right, and I think your argument is mainly "it won't work to let two superintelligences to debate each other about important things" rather than a stronger claim like "any AIs smart enough to have a productive debate might be trying to do dangerous things and have non-negligible chance of succeeding".
We could be envisioning different pictures for how debate is useful as a technique. I think it will break for sufficiently high intelligence levels, for reasons you discuss, but we might still get useful work out of it in models like GPT-4/5. Additionally, it seems to me that there are setups of Debate in which we aren't all-or-nothing on the instrumental subgoals, consequentialist planning, and meta cognition, especially in (unlikely) worlds where the people implementing debate are taking many precautions. Fundamentally, Debate is about getting more trustworthy outputs from untrustworthy systems, and I expect we can get useful debates from AIs that do not run a significant risk of the failures you describe.
Again, I doubt this is a main crux for whether you will work on Debate, and that seems quite reasonable. If it's the case that, "Debate is unlikely to scale all the way to dangerous AGIs", then to the extent that we want to focus on the "dangerous AGIs" domain we might just want to skip it and work on other stuff.
Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is i, differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally I don't believe in this plan.
4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.
I don't know how to link to the specific comment, but here somewhere. Also:
We can focus on tasks differentially useful to alignment research
Your pessimism about iii still seems a bit off to me. I agree that if you were coordinating well between all the actors than yeah you could just hold off on AI assistants. But the actual decision the OpenAI alignment team is facing could be more like "use LLMs to help with alignment research or get left behind when ML research gets automated". If facing such choices I might produce a plan like theirs, but notably I would be much more pessimistic about it. When the universe limits you to one option, you shouldn't expect it to be particularly good. The option "everybody agrees to not build AI assistants and we can do alignment research first" is maybe not on the table, or at least it probably doesn't feel like it is to the alignment team at OpenAI.
(iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.
It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the rate of Alignment : Capabilities work by some amount is sufficient.
The use of AI assistants for alignment : capabilities doesn't have to track with the current rate of Alignment : Capabilities work. For instance, if the AI labs with the biggest lead are safety conscious, I expect the ratio of alignment : capabilities research they produce to be much higher (compared to now) right before AGI. See here.
Summary:If interpretability research is highly tractable and we can build highly interpretable systems without sacrificing competitiveness, then it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe. By analogy, if you have a non-functioning car, it is easy to bring in functional parts to fix the engine and make the car drive safely, compared to it being hard to take a functional elephant and tweak it to be safe. In a follow up post, the author clarifies that this could be thought of as engineering (well-founded AI) vs. reverse engineering (interpretability). One pushback form John Wentworth is that we currently do not know how to build the car, or how the basic chemistry in the engine actually works; we do interpretability research in order to understand these processes better. Ryan Greenblatt pushes back that the post is more accurate if the word “interpretability” was replaced with “microscope AI” or “comprehensive reverse engineering”; this is because we do not need to understand every part of complex models in order to tell if they are deceiving us, so the level our interpretability understanding needs to be to be useful is lower than the level it needs to be for us to build the car from the ground up. Neel Nanda writes a similar comment about how, to him, high tractability is a lower bar, much lower than understanding every part of a system such that we could build it.
follow up post