I don't follow. Can't races to the bottom destroy all value for the agents involved?
You are saying that a special moment is a particularly great one to be treacherous. But if P(discovery) is 99.99% during that period, and there is any other treachery-possible period where P(discovery) is small, then that other period would have been better after all. Right?
This doesn't seem analogous to producers driving down profits to zero, because those firms had no other opportunity to make a profit with their machines. It's as if you're saying: there are tons of countries where firms could use their machines to make stuff and sell it at a profit (more countries than firms), but some of the places are more attractive than others, so probably everyone will sell in those places and drive profits to zero. And I'm saying: but then aren't those incredibly congested countries actually worse places to sell? This scenario is only possible if firms are making so much stuff that they can drive profit down to zero in every country, since any country with remaining profit would necessarily be the best place to sell.
Yeah, okay. This is my mainline opinion, and I just think about inner alignment in case it's wrong.
It seems to me like this is clearly wrong in the limit (since simple consequentialists would take over simple physics). It also seems probably wrong to me for smaller models (i.e. without huge amounts of internal selection) but it would be good for someone to think about that case more seriously.
It's occurring to me that this question doesn't matter to our present discussion. What makes the linear regime linear rather than logarithmic is that if p(treachery)/p(honest model) is high, that allows a large number of treacherous models to have greater posterior weight than the truth. But if a single model has n times the posterior weight of the truth, it still only takes one query to the demonstrator to interrupt its treacherous turn, not n queries.
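A toy sketch of that point (my own illustrative setup, not the actual algorithm under discussion): a treacherous model can start with n times the truth's posterior weight, but a single query at its moment of deviation zeroes it out, so the cost of the extra weight is one query, regardless of n.

```python
def queries_to_stop(treacherous_weight, defect_step=50, horizon=100):
    # Posterior weights over two deterministic models; the treacherous
    # model mimics the truth until defect_step, then deviates.
    weights = {"truth": 1.0, "treacherous": float(treacherous_weight)}
    queries = 0
    for t in range(horizon):
        demo_action = 0  # the demonstrator always plays action 0 here
        actions = {"truth": 0,
                   "treacherous": 0 if t < defect_step else 1}
        live = [m for m, w in weights.items() if w > 0]
        if len({actions[m] for m in live}) > 1:
            queries += 1  # surviving models disagree: ask the demonstrator
            for m in live:
                if actions[m] != demo_action:
                    weights[m] = 0.0  # deterministic mismatch -> weight 0
    return queries

# one query suffices whether the treacherous model has weight 1 or 2^100
assert queries_to_stop(1) == queries_to_stop(2**100) == 1
```

The posterior ratio affects how long the treacherous model can dominate predictions, but not how many demonstrator queries it takes to eliminate it once it deviates.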
if the treacherous model is 100 bits smaller, then it feels like there must be around 2^100 treacherous models that are all simpler than the intended one. If nothing else, it seems like you could literally add garbage bits to the treacherous models (or useful bits!).
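The counting argument can be made concrete with back-of-the-envelope arithmetic (illustrative description lengths and an assumed 2^-length prior, not real measurements):

```python
# Illustrative description lengths (not real models): the intended model is
# 100 bits longer than the shortest treacherous one.
L_intended, L_treacherous = 1000, 900
prior = lambda bits: 2.0 ** -bits  # assumed 2^-length prior over programs

# the lone short treacherous model already has 2^100 times the truth's prior
assert prior(L_treacherous) / prior(L_intended) == 2.0 ** 100

# and there are 2^100 distinct 100-bit paddings, each yielding a program
# exactly as long as the intended model, so the padded family alone matches
# the shortest treacherous model's total prior mass
n_padded = 2 ** (L_intended - L_treacherous)
assert n_padded * prior(L_intended) == prior(L_treacherous)
```

So a 100-bit gap doesn't just mean one model beats the truth by a factor of 2^100; it means roughly 2^100 distinct variants are each at least as simple as the truth.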
I don't think incompetence is the only reason to try to pull off a treacherous turn at the same time that other models do. Some timesteps are just more important, so there's a trade-off. And what's traded off is a public good: among treacherous models, it is a public good for their moments of treachery to be spread out.
Trying to defect at time T is only a good idea if it's plausible that your mechanism isn't going to notice the uncertainty at time T and then query the human. So it seems to me like this argument can never drive P(successful treachery) super low, or else it would be self-defeating (for competent treacherous agents).
The treacherous model has to compute the truth, and then also decide when and how to execute treachery. So the subroutine they run to compute the truth, considered as a model in its own right, seems to me like it must be simpler.
Subroutines of functions aren't always simpler. Even without treachery this seems like it's basically guaranteed. If "the truth" is just a simulation of a human, then the truth is a subroutine of the simplest model. But instead you could have a model of the form "Run program X, treat its output as a program, then run that." Since humans are compressible that will clearly be smaller.
Now you want to say that whatever reasoning a treacherous agent does in order to compress the human, you can just promote that to the outside model as well. But that argument seems to be precisely where the meat is (and it's the kind of thing I spend most of my time on). If that works, then it seems like you don't even need a further solution to inner alignment, just use the simplest model.
(Maybe there is a hope that the intended model won't be "much" more complex than the treacherous models, without literally saying that it's the simplest, but at that point I'm back to wondering whether "much" is like 0.01% or 0.00000001%.)
There are a bunch of things that differ between part I and part II; I believe they are correlated with each other, but not at all perfectly. In the post I intend to illustrate what I believe some plausible failures look like, in a way meant to capture a bunch of the probability space. I'm illustrating these kinds of bad generalizations and ways in which the resulting failures could be catastrophic. I don't really know what "making the claim" means, but I would say that any ways in which the story isn't realistic are interesting to me (and we've already discussed many, and my views have---unsurprisingly!---changed considerably in the details over the last 2 years), whether they are about the generalizations or the impacts.
I do think that the "going out with a whimper" scenario may ultimately transition into something abrupt, unless people don't have their act together enough to even put up a fight (which I do think is fairly likely conditioned on catastrophe, and may be the most likely failure mode).
It seems like you at least need to explain why, in that situation, we can't continue to work on the alignment problem and replace the agents with better-aligned AI systems in the future.
We can continue to work on the alignment problem and continue to fail to solve it, e.g. because the problem is very challenging or impossible or because we don't end up putting in a giant excellent effort (e.g. if we spent a billion dollars a year on alignment right now it seems plausible it would be a catastrophic mess of people working on irrelevant stuff, generating lots of noise while we continue to make important progress at a very slow rate).
The most important reason this is possible is that change is accelerating radically, e.g. I believe that it's quite plausible we will not have massive investment in these problems until we are 5-10 years away from a singularity and so just don't have much time.
If you are saying "Well why not wait until after the singularity?" then yes, I do think that eventually it doesn't look like this. But that can just look like people failing to get their act together, and then eventually when they try to replace deployed AI systems they fail. Depending on how generalization works that may look like a failure (as in scenario 2) or everything may just look dandy from the human perspective because they are now permanently unable to effectively perceive or act in the real world (especially off of earth). I basically think that all bets are off if humans just try to sit tight while an incomprehensible AI world-outside-the-gates goes through a growth explosion.
I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. Moreover, the only reason it looks that way to me is that I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in growth is quite likely), and the story will justifiably sound very different to someone who isn't coming in with those assumptions.
I think this is doable with this approach, but I haven't proven it can be done, let alone said anything about a dependence on epsilon. The closest bound I show not only has a constant factor of like 40; it depends on the prior on the truth too. I think (75% confidence) this is a weakness of the proof technique, not a weakness of the algorithm.
I just meant the dependence on epsilon; it seems like there are unavoidable additional factors (especially the linear dependence on p(treachery)). I guess it's not obvious whether you can make these additive or whether they are inherently multiplicative.
But your bound scales in some way, right? How much training data do I need to get the KL divergence between distributions over trajectories down to epsilon?
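For intuition on that scaling question, here is the standard regret bound for a Bayesian mixture (a background folklore fact, not the paper's actual theorem): cumulative log-loss regret against the truth is at most log(1/prior(truth)), so driving the average per-step KL below epsilon takes on the order of log(1/prior(truth))/epsilon steps.

```python
import math

def steps_for_avg_kl(prior_truth, eps):
    # Bayesian mixture: cumulative KL regret <= log(1/prior(truth)),
    # so after T steps the average per-step KL is <= log(1/prior)/T;
    # solving for the first T with average KL <= eps gives:
    return math.ceil(math.log(1.0 / prior_truth) / eps)

# with a 2^-100 prior on the truth, pushing average per-step KL below 0.01
# takes roughly 100*ln(2)/0.01 ~ 7000 steps; the dependence on eps itself
# is the linear 1/eps factor
print(steps_for_avg_kl(2.0 ** -100, 0.01))
```

This is only the generic prediction bound; the paper's query-complexity question layers the p(treachery) factors discussed above on top of it.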
I understand that the practical bound is going to be logarithmic "for a while" but it seems like the theorem about runtime doesn't help as much if that's what we are (mostly) relying on, and there's some additional analysis we need to do. That seems worth formalizing if we want to have a theorem, since that's the step that we need to be correct.
There is at most a linear cost to this ratio, which I don't think screws us.
If our models are a trillion bits, then it doesn't seem that surprising to me if it takes 100 bits extra to specify an intended model relative to an effective treacherous model, and if you have a linear dependence that would be unworkable. In some sense it's actively surprising if the very shortest intended vs treacherous models have description lengths within 0.0000001% of each other unless you have a very strong skills vs values separation. Overall I feel much more agnostic about this than you are.
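The arithmetic behind those figures (using the illustrative numbers from the discussion): a 100-bit gap on a trillion-bit model is a vanishing relative difference in description length, yet decisive in prior odds under a 2^-length prior.

```python
model_bits = 10 ** 12   # "a trillion bits"
gap_bits = 100          # extra bits to specify the intended model

relative_gap = gap_bits / model_bits
assert relative_gap == 1e-10   # i.e. 0.00000001% of the description length
# yet under an (assumed) 2^-length prior, the same 100 bits is a 2^100
# factor in prior odds against the intended model -- tiny relative gaps in
# description length are not tiny in posterior terms
```
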
There are two reasons why I expect that to hold.
This doesn't seem like it works once you are selecting on competent treacherous models. Any competent model will err very rarely (with probability less than 1 / (feasible training time), probably much less). I don't think that (say) 99% of smart treacherous models would make this particular kind of incompetent error?
I'm not convinced this logarithmic regime ever ends.
It seems like it must end if there are any treacherous models (i.e. if there was any legitimate concern about inner alignment at all).
I haven't read the paper yet, looking forward to it. Using something along these lines to run a sufficiently-faithful simulation of HCH seems like a plausible path to producing an aligned AI with a halting oracle. (I don't think that even solves the problem given a halting oracle, since HCH is probably not aligned, but I still think this would be noteworthy.)
First I'm curious to understand this main result so I know what to look for and how surprised to be. In particular, I have two questions about the quantitative behavior described here:
if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency.
It seems like you have the following tough case even if the human is deterministic:
In this setting, it seems like I have no hope other than to query the human on all N decisions (since all days and hypotheses are symmetrical), so I assume that this is what your algorithm would do.
That strongly suggests that the number of queries to the human goes as 1/p(correct demonstrator), unless you use some other feature of the hypothesis class. But p(correct demonstrator) is probably less than 2^(-10^14), so this constant might not be acceptable. Usually we try to have a logarithmic dependence on p(correct demonstrator), but this doesn't seem possible here.
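The tough case can be made concrete with a toy simulation (my own sketch of the symmetric setup, with an assumed hypothesis class, not the actual algorithm): under a uniform 1/N prior, every day looks the same ex ante, so the imitator has to query on essentially every day, and the query count grows like 1/p(correct demonstrator) rather than log(1/p).

```python
# Toy version of the symmetric case: hypothesis h follows a default policy
# except on day h, where it deviates. The truth is one of N such hypotheses,
# so p(correct demonstrator) is the uniform prior 1/N.
def queries_needed(N, truth_idx):
    live = set(range(N))
    queries = 0
    for day in range(N):
        predictions = {h: int(h == day) for h in live}  # 1 = deviate today
        if len(set(predictions.values())) > 1:
            queries += 1  # live hypotheses disagree about today: must query
            demo = int(day == truth_idx)  # the true demonstrator's action
            live = {h for h in live if predictions[h] == demo}
    return queries

N = 64  # so p(correct demonstrator) = 1/64
assert queries_needed(N, N - 1) == N - 1            # worst case: ~N queries
avg = sum(queries_needed(N, i) for i in range(N)) / N
assert avg > N / 2 - 1                              # average ~N/2: Theta(1/p)
```

Since the hypotheses are exchangeable, no algorithm can do better than linear in N here without exploiting extra structure in the class.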
So as you say, we'd want to condition on some relevant facts to get up to the point where that probability might be acceptably high. So then it seems like we have two problems:
Does that all seem right?
Suppose that I want to bound the probability of catastrophe as (1+ϵ) times the demonstrator probability of catastrophe. It seems like the number of human queries must scale at least like 1/ϵ. Is that right, and if so what's the actual dependence on epsilon?
I mostly ask about this because in the context of HCH we may need to push epsilon down to 1/N. But maybe there's some way to avoid that by considering predictors that update on counterfactual demonstrator behavior in the rest of the tree (even though the true demonstrator does not), to get a full bound on the relative probability of a tree under the true demonstrator vs model. I haven't thought about this in years and am curious if you have a take on the feasibility of that or whether you think the entire project is plausible.
This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals.
But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for "does what we care about" goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it's not clear that's an important part of the default plan (whereas I think we will clearly extensively leverage "try several strategies and see what works").
But if that's the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.
"Do things that look to a human like you are achieving X" is closely related to X, but that doesn't mean that learning to do the one implies that you will learn to do the other.
Maybe it's helpful to imagine a world where type 1 feedback is "human evals after a 1 week horizon," type 2 feedback is "human evals after a 1 year horizon," and "what we really care about" is "human evals after a 100 year horizon." I think that's much better than the actual situation, but even in that case I'd put significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we select for working on 2 years and are able to use a small amount of 2 year data). Do you feel pretty confident that something that generalizes from 1 week to 1 year will go on indefinitely, or is your intuition predicated on something about the nature of "be helpful" and how that's a natural motivation for a mind? (Or maybe that we will be able to identify some other similarly "natural" motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to get into more detail about why "be helpful" is particularly natural (or else why we should be able to pick out something else like that). Otherwise, I think I haven't fully internalized your view.
I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like really all you need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and that the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.
Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month] then it's not actually very far to get from [1 month, 2 years]. It seems like we've already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.
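As a sanity check on the orders-of-magnitude arithmetic (taking the horizons at face value): from 1 second to 1 month is about 6.4 orders of magnitude, while from 1 month to 2 years is about 1.4 more.

```python
import math

second = 1
month = 60 * 60 * 24 * 30            # ~2.59e6 seconds
two_years = 60 * 60 * 24 * 365 * 2   # ~6.31e7 seconds

spanned = math.log10(month / second)       # orders already handled: ~6.4
remaining = math.log10(two_years / month)  # orders still to go: ~1.4
print(round(spanned, 1), round(remaining, 1))
```

So the claim is roughly that gradient descent has already bridged about six orders of magnitude of horizon length, and the remaining extrapolation is a bit over one more.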
At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand "what we really want" is a completely different thing (that we basically can't even define cleanly). So prima facie it feels to me like if models generalize "well" then we can get them to generalize from type 1 to type 2, whereas no such thing is true for "what we really care about."
I like the following example:
This seems like a nice relatable example to me---it's not uncommon for someone to offer to bet on a rock paper scissors game, or to offer slightly favorable odds, and it's not uncommon for them to have a slight edge.
Are there features of the boxes case that don't apply in this case, or is it basically equivalent?
Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I'm kind of vague about that and just wrapping it up into the philosophical assumption that HCH is good, but really we'd want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think even for HCH competitiveness matters because it's needed for HCH to be stable/aligned against internal optimization pressure).