This is a link post for https://aligned.substack.com/p/alignment-mvp
I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This second post argues that instead of trying to solve the alignment problem once and for all, we can succeed with something less ambitious: building a system that allows us to bootstrap better alignment techniques.
Building weak AI systems that help improve alignment seems extremely important to me and is a significant part of my optimism about AI alignment. I also think it's a major reason that my work may turn out not to be relevant in the long term.
I still think there are tons of ways that delegating alignment can fail, such that it matters that we do alignment research in advance:
Overall I think that "make sure we are able to get good alignment research out of early AI systems" is comparably important to "do alignment ourselves." Realistically I think the best case for "do alignment ourselves" is that if "do alignment" is the most important task to automate, then just working a ton on alignment is a great way to automate it. But that still means you should be investing quite a significant fraction of your time in automating alignment.
I also basically buy that language models are now good enough that "use them to help with alignment" can be taken seriously and it's good to be attacking it directly.
What do you (or others) think is the most promising, soon-possible way to use language models to help with alignment? A couple of possible ideas:
This seems to completely ignore the main problem with approaches which try to outsource alignment research to AGI: optimizing for alignment strategies which look promising to a human reviewer will also automatically incentivize strategies which fool the human reviewer. Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
I think it's very unclear how big a problem Goodhart is for alignment research---it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don't have formal theorem statements. There are also domains where it's not much easier, where the whole thing rests on complicated judgments where the search for clever arguments just isn't doing much work.
It looks to me like alignment is somewhere in the middle, though it's not at all clear---right now there are different strands of alignment progress, which seem to have very different properties with respect to the ease of evaluation.
The kind of Goodhart we are usually concerned about is stuff like "it's easier to hijack the reward signal than to actually perform a challenging task," and I don't think that's very tightly correlated with the question about alignment. So this feels like the rhetoric here involves a bit of an equivocation.
Just a couple weeks ago I had this post talking about how, in some technical areas, we've been able to find very robust formulations of particular concepts (i.e. "True Names"). The domains where evaluation is much easier - math, physics, CS - are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we're in a domain where we don't have a robust mathematical formulation of the phenomena of interest.
The other point of that post is that we do not currently have such formulations for the phenomena of interest in alignment, and (one way of framing) the point of foundational agency research is to find them.
So I agree that the difficulty of evaluation varies by domain, but I don't think it's some mysterious hard-to-predict thing. The places where robust evaluation is easy all build on qualitatively-similar foundational pieces, and alignment does not yet have those sorts of building blocks.
Go take a look at that other post, it has two good examples of how Goodhart shows up as a central barrier to alignment.
I don't buy the empirical claim about when recognition is easier than generation. As an example, I think that you can recognize robust formulations much more easily than you can generate them in math, computer science, and physics. In general I think "recognition is not trivial" is different from "recognition is as hard as generation."
If it turns out that evaluation of alignment proposals is not easier than generation, we're in pretty big trouble because we'll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we're not sure about those alignment proposals. But then you're still susceptible to the same kinds of problems.
You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score]
yeah that's a fair point
But this is pretty likely the case though, isn't it? Actually I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and formalizing the arguments is both very difficult and doesn't help much because you're still depending on informal arguments for why the formalized theoretical concepts correspond well enough to the pre-theoretical concepts that we actually care about). Look at the history of philosophy, or cryptography, if you doubt this.
But suppose we're able to convince people to distrust their intuitive sense of how good an argument is, and to keep looking for hidden flaws and counterarguments (which might have their own hidden flaws and so on). Well how do we know when it's safe to end this process and actually hit the run button?
It feels to me like there's basically no question that recognizing good cryptosystems is easier than generating them. And recognizing attacks on cryptosystems is easier than coming up with attacks (even if they work by exploiting holes in the formalisms). And recognizing good abstract arguments for why formalisms are inadequate is easier than generating them. And recognizing good formalisms is easier than generating them.
This is all true notwithstanding the fact that we often make mistakes. (Though as we've discussed before, I think that a lot of the examples you point to in cryptography are cases where there were pretty obvious gaps in formalisms or possible improvements in systems, and those would have motivated a search for better alternatives if doing so was cheap with AI labor.)
The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement:
Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifically generating AES with the background knowledge in mind? Maybe the latter is on the order of 10% of the former?
Then consider that we don't actually know that AES is secure, because we don't know all the possible attacks and we don't know how to prove it secure, i.e., we don't know how to recognize a good cryptosystem. Suppose one day we figure that out, wouldn't finding an actually good cryptosystem be trivial at that point compared to all the previous effort?
Some of your other points are valid, I think, but cryptography is just easier than alignment (don't have time to say more as my flight is about to take off), and philosophy is perhaps a better analogy for the more general point.
I think we need to unpack "sufficiently aligned"; here's my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, and B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI that, conditional on a proposal looking promising, is likely to be actually correct.
We can't get those 100 bits through further selection for appearance. It seems plausible that we can get them somehow, though.
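To make the "100 bits" figure concrete, here is a minimal sketch of the arithmetic, using only the illustrative counts B and C from the comment above. The function and variable names are hypothetical, and the counts are the comment's made-up numbers, not measurements.

```python
from fractions import Fraction
from math import log2

# Illustrative counts from the comment above (assumed, not measured).
LOOKS_PROMISING = 2**200   # B: proposals that look promising to a human reviewer
ACTUALLY_CORRECT = 2**100  # C: the subset that is actually correct


def bits_of_selection(candidates: int, targets: int) -> float:
    """Bits of selection needed to narrow `candidates` down to `targets`."""
    return log2(candidates) - log2(targets)


# Bits still missing after selecting purely for "looks promising".
bits_needed = bits_of_selection(LOOKS_PROMISING, ACTUALLY_CORRECT)

# Chance that a uniformly random promising-looking proposal is correct.
chance_correct = Fraction(ACTUALLY_CORRECT, LOOKS_PROMISING)

print(f"bits still needed: {bits_needed}")
print(f"chance a random promising proposal is correct: 2^-{int(bits_needed)}")
```

Under these assumed counts, selecting for appearance alone leaves a gap of 100 bits, which is why the comment says they have to come from somewhere other than further selection for appearance.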
Is your story:
It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations "fool us" is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:
Now, the key heuristic: in a high-dimensional space, adding any non-simple constraint will exponentially shrink the search space. "Number of proposals which look good to the human AND are actually good" has one more complicated constraint than "Number of proposals which look good to the human", and will therefore be exponentially smaller.
So in "it would be much easier to trick us than to write down a good proposal", the relevant operationalization of "easier" for this argument is "the number of proposals which both look good and are good is exponentially smaller than the number which look good".
I think that argument applies just as easily to a human as to a model, doesn't it?
So it seems like you are making an equally strong claim that "if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad." And I think that's kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument.
(I think the fact that "how smart the human is" doesn't matter mostly just proves that the counting argument is untethered from the key considerations.)
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the AI's thought-process would be very different from ours, such that it would be easier to fake a human-legible path than to truly follow one from the start.
I think "how smart the human is" is not a key consideration.
I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me. If that's a good summary of the disagreement I'm happy to just leave it there.
Yup, that sounds like a crux. Bookmarked for later.
I strongly agree with you that it'll eventually be very difficult for humans to tell apart AI-generated alignment proposals that look good and aren't good from ones that look good and are actually good.
There is a much stronger version of the claim "alignment proposals are easier to evaluate than to generate" that I think we're discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones or at least not accept any bad ones (precision matters much more than recall here since you can compensate bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be sufficient as an alignment technique for the minimal viable product. Personally I think that this strong version of the claim is unlikely to be true, but I'm not certain that it will be false for the first systems that can do useful alignment research.
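The remark that bad recall can be compensated with compute can be illustrated with a toy calculation. This is a sketch under stated assumptions: each generated proposal is independently good with some base rate, the evaluator accepts good proposals with probability `recall` and bad ones with probability `false_positive_rate`, and all names and numbers are hypothetical.

```python
from fractions import Fraction


def precision(base_rate, recall, false_positive_rate):
    """P(proposal is good | evaluator accepted it), by Bayes' rule."""
    accepted_good = base_rate * recall
    accepted_bad = (1 - base_rate) * false_positive_rate
    return accepted_good / (accepted_good + accepted_bad)


def expected_samples_per_good_accept(base_rate, recall):
    """Expected number of generated proposals per accepted good one."""
    return 1 / (base_rate * recall)


# Assumed toy numbers: 1% of generated proposals are good.
base_rate = Fraction(1, 100)

# Poor recall, no false positives: every accepted proposal is good,
# and the only cost is generating more samples (i.e. more compute).
print(precision(base_rate, Fraction(1, 10), Fraction(0)))
print(expected_samples_per_good_accept(base_rate, Fraction(1, 10)))

# Same recall with a 1% false-positive rate: most accepted proposals are bad,
# and drawing more samples does not change that ratio.
print(precision(base_rate, Fraction(1, 10), Fraction(1, 100)))
```

In this toy model the sample count depends only on the base rate and recall, while the trustworthiness of an accepted proposal depends on the false-positive rate, which is the sense in which precision matters more than recall.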
As William points out below, if we get AI-assisted human evaluation to work well, then we can uncover flaws in alignment proposals that are too hard to find for unassisted humans. This is a weaker version of the claim, because you're just claiming that humans + AI assistance are better at evaluating alignment proposals than humans + AI assistance are at generating them. Generally I'm pretty optimistic about that level of supervision actually allowing us to supervise superhuman alignment research; I've written more about this here: https://aligned.substack.com/p/ai-assisted-human-feedback
Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
I think this concern is only relevant if your strategy is to do RL on human evaluations of alignment research. If instead you just imitate the distribution of current alignment research, I don't think you get this problem, at least any more than we have it now, and I think you can still substantially accelerate alignment research with just imitation. Of course, you still have inner alignment issues, but from an outer alignment perspective I think imitation of human alignment research is a pretty good thing to try.
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you're being fooled). It's still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
What are people's timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful?
Today's language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising pre-requisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we'll be able to get useful alignment work from superhuman sub-deception agents for 5-10+ years or so. I'm very curious if others disagree here though.
I personally have pretty broad error bars; I think it's plausible enough that AI won't help with automating alignment that it's still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing for making use of. I also tend to think that current progress in language modeling seems to suggest that models will reach the point of being extremely helpful with alignment way before they become super scary.
Eliezer has consistently expressed confidence that AI systems smart enough to help with alignment will also be smart enough that they'll inevitably be trying to kill you. I don't think he's really explained this view, and I've never found it particularly compelling. I think a lot of folks around LW have absorbed a similar view; I'm not totally sure how much it comes from Eliezer but I'd guess that's a lot of it.
I think part of Eliezer's views on this come from a view of intelligence and recursive self-improvement that implies that explosive recursive self-improvement begins before high object-level competence on other research tasks. I think this view is most likely mistaken, but my guess is that it's tied up with Eliezer's views about how to build AGI closely enough that Eliezer won't want to defend his position here.
(My position is the very naive one, that recursive self-improvement will become critical at roughly the same time that AI systems are better than humans at contributing to further AI progress, which has roughly a 50-50 shot of happening before alignment progress.)
Beyond that, Eliezer has not said very much about where these intuitions are coming from. What he has said does not seem (to me) to have fared particularly well over the last few years. For example:
In fact it does not seem hard to get AI systems to understand the relevant parts of human language (relative to being able to easily kill all humans or to inevitably be trying to kill all humans). And it does not seem hard to get an AI to predict which things you will judge to be relevant, well enough that this is a very bad way of explaining why Holden's proposal would fail.
Of course getting an AI to tell you what it's really thinking may be hard (and indeed I think it's hard enough that I think there's a significant probability that we will all die because we failed to solve it). And I think Eliezer even has a fair model of why it's hard (or at least I've often defended him based on a more charitable reading of his overall views).
But my point is that to the extent Eliezer has explained why he thinks AI won't be helpful until it's too late, so far it doesn't seem like adjacent intuitions have stood the test of time well.
IMO, the alignment MVP claim Jan is making is approximately "we only need to focus on aligning narrow-ish alignment research models that are just above human level, which can be done with RRM (and maybe some other things, but no conceptual progress?)"
I'd imagine some cruxes to include:
- whether it's possible to build models capable of somewhat superhuman alignment research that do not have inner agents
- whether people will build systems that require conceptual progress in alignment to make safe before we can build the alignment MVP and get significant work out of it
I think I’m something like 30% on ‘The highest-leverage point for alignment work is once we have models that are capable of alignment research - we should focus on maximising the progress we make at that point, rather than on making progress now, or on making it to that point - most of the danger comes after it’
Things this maybe implies: