Paul Christiano


This is a great example. The rhyming find in particular is really interesting though I'd love to see it documented more clearly if anyone has done that.

I strongly suspect that whatever level of non-myopic token prediction a base GPT-3 model does, the tuned ones are doing more of it.

My guess would be that it's doing basically the same amount of cognitive work looking for plausible completions, but that it's upweighting that signal a lot.

Suppose the model always looks ahead and identifies some plausible trajectories based on global coherence. During generative training it only slightly upweights the first word of each of those plausible trajectories, since most of the time the text won't go in the particular coherent direction that the model was able to foresee. But after fine-tuning it learns to focus really hard on the concrete plausible trajectories it found, since those systematically get a high reward.

Either way, I think this more general phenomenon is a sufficiently specific hypothesis about the effect of RLHF that it could be tested directly and that would be really interesting and valuable. It can also be done using only API access which is great.

RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse than this, as models become more powerful, we expect more egregious instances of misalignment, as more optimization will push for more and more extreme edge cases and pseudo-adversarial examples.

These three links are:

  • The first is Mysteries of mode collapse, which claims that RLHF (as well as OpenAI's supervised fine-tuning on highly-rated responses) decreases entropy. This doesn't seem particularly related to any of the claims in this paragraph, and I haven't seen it explained why this is a bad thing.
  • The second is Discovering language model behaviors with model-written evaluations and shows that Anthropic's models trained with RLHF have systematically different personalities than the pre-trained model.  I'm not exactly sure what claims you are citing, but I think it probably involves some big leaps to interpret this as either directly harmful or connected with traditional stories about risk.
  • The third is Compendium of problems with RLHF, which primarily links to the previous 2 failures and then discusses theoretical limitations.

I think these are bad citations for the claim that methods are "not working well" or that current evidence points towards trouble.

The current problems you list---"unhelpful, untruthful, and inconsistent"---don't seem like good examples to illustrate your point. These are mostly caused by models failing to correctly predict which responses a human would rate highly. That happens because models have limited capabilities and is rapidly improving as models get smarter. These are not the problems that most people in the community are worried about, and I think it's misleading to say this is what was "theorized" in the past.

I think RLHF is obviously inadequate for aligning really powerful models, both because you cannot effectively constrain a deceptively aligned model and because human evaluators will eventually not be able to understand the consequences of proposed actions. And I think it is very plausible that large language models will pose serious catastrophic risks from misalignment before they are transformative (it seems very hard to tell). But I feel like this post isn't engaging with the substance of those concerns or sensitive to the actual state of evidence about how severe the problem is likely to be or how well existing mitigations might work.

My overall take is:

  • I generally think that people interested in existential safety should want the deployment of AI to be slower and smoother, and that accelerating AI progress is generally bad.
  • The main interesting version of this tradeoff is when people specifically work on advancing AI in order to make it safer, either so that safety-conscious people have more influence or so that the rollout is more gradual. I think that's a way bigger deal than capability externalities from safety work, and in my view it's also closer to the actual margin of what's net-positive vs net-negative.
  • I think that OpenAI provides altruistic justifications for accelerating AI development, but is probably net bad for existential safety (and also would likely be dispreferred by existing humans, though it depends a lot on how you elicit their preferences and integrate them with my empirical views). That said, I don't think this case is totally obvious or a slam dunk, and I think that people interested in existential safety should be laying out their arguments (both to labs and to the broader world) and engaging productively rather than restating the conclusion louder and with stronger moral language.
  • I think that Anthropic's work also accelerates AI arrival, but it is much easier for it to come out ahead on a cost-benefit: they have significantly smaller effects on acceleration, and a more credible case that they will be safer than alternative AI developers. I have significant unease about this kind of plan, partly for the kinds of reasons you list and also a broader set of moral intuitions. As a result it's not something I would do personally. But I've spent some time thinking it through as best I can and it does seem like the expected impact is good. To the extent that people disagree I again think they should be actually making the argument and trying to understand the disagreement.

Dangerous Argument 2: We should avoid capability overhangs, so that people are not surprised. To do so, we should extract as many capabilities as possible from existing AI systems.

I'm saying that faster AI progress now tends to lead to slower AI progress later. I think this is a really strong bet, and the question is just: (i) quantitatively how large is the effect, (ii) how valuable is time now relative to time later. And on balance I'm not saying this makes progress net positive, just that it claws back a lot of the apparent costs.

For example, I think a very plausible view is that accelerating how good AI looks right now by 1 year will accelerate overall timelines by 0.5 years. It accelerates by less than 1 because there are other processes driving progress: gradually scaling up investment, compute progress, people training up in the field, other kinds of AI progress, people actually building products. Those processes can be accelerated by AI looking better, but they are obviously accelerated by less than 1 year (since they actually take time in addition to occurring at an increasing rate as AI improves).

I think that time later is significantly more valuable than time now (and time now is much more valuable than time in the old days). Safety investment and other kinds of adaptation increase greatly as the risks become more immediate (capabilities investment also increases, but that's already included); safety research gets way more useful (I think most of the safety community's work is 10x+ less valuable than work done closer to catastrophe, even if the average is lower than that). Having a longer period closer to the end seems really really good to me.

If we lose 1 year now and get back 0.5 years later, and if years later are 2x as good as years now, you'd be breaking even.
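The break-even arithmetic above can be sketched out explicitly. This is just the illustrative figures from the preceding paragraphs (1 year lost now, 0.5 years regained later, later years worth 2x), not independent estimates:

```python
# Illustrative break-even arithmetic for the acceleration tradeoff above.
years_lost_now = 1.0        # time given up now by faster progress
years_gained_later = 0.5    # timelines only shift ~0.5 years, since other inputs take time
relative_value_later = 2.0  # a year near the end is worth 2x a year now

cost = years_lost_now * 1.0  # value of time now (baseline value = 1)
benefit = years_gained_later * relative_value_later

print(cost, benefit)  # 1.0 1.0 -> exactly break even
```

With a larger value multiplier for late-stage time (or a smaller loss of time now), the same arithmetic comes out net positive.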

My view is that progress probably switched from being net positive to net negative (in expectation) sometime around GPT-3. If we had built GPT-3 in 2010, I think the world's situation would probably have been better. We'd maybe be at our current capability level in 2018, scaling up further would be going more slowly because the community had already picked low hanging fruit and was doing bigger training runs, the world would have had more time to respond to the looming risk, and we would have done more good safety research.

At this point I think faster progress is probably net negative, but it's still less bad than you would guess if you didn't take this into account.

(It's fairly likely that in retrospect I will say that's wrong---that really good safety work didn't start until we had systems where we could actually study safety risks, and that safety work in 2023 mattered so much less than work closer to the end that faster progress in 2023 was actually net beneficial---but I agree that it's hard to tell, and if catastrophically risky AI is very soon then current progress can be significantly net negative.)

I do think that this effect differs for different kinds of research, and things like "don't think of RLHF" or "don't think of chain of thought" are particularly bad reasons for progress to be slow now (because it will get fixed particularly quickly later since it doesn't have long lead times and in fact is pretty obvious, though you may just skip straight to more sophisticated versions). But I'm happy to just set aside that claim and focus on the part where overhang just generally makes technical progress less bad.

Eliminating capability overhangs is discovering AI capabilities faster, so also pushes us up the exponential! For example, it took a few years for chain-of-thought prompting to become more widely known than among a small circle of people around AI Dungeon. Once chain-of-thought became publicly known, labs started fine-tuning models to explicitly do chain-of-thought, increasing their capabilities significantly. This gap between niche discovery and public knowledge drastically slowed down progress along the growth curve!

It seems like chain of thought prompting mostly became popular around the time models became smart enough to use it productively, and I really don't think that AI Dungeon was the source of this idea (why do you think that?). I think it quickly became popular around the time of PaLM just because it was a big and smart model. And I really don't think the people who worked on GSM8k or Minerva were like "This is a great idea from the AI Dungeon folks"; I think this is just the thing you try if you want your AI to work on math problems. It had been discussed a lot of times within labs, and I think it was rolled out fast enough that delays had ~0 effect on timelines. Whatever effect they had seems like it must be ~100% from increasing the timeline for scaling up investment.

And even granting the claim about chain of thought, I disagree about where current progress is coming from. What exactly is the significant capability increase from fine-tuning models to do chain of thought? This isn't part of ChatGPT or Codex or AlphaCode. What exactly is the story?

To make a useful version of this post I think you need to get quantitative.

I think we should try to slow down the development of unsafe AI. And so all else equal I think it's worth picking research topics that accelerate capabilities as little as possible. But it just doesn't seem like a major consideration when picking safety projects. (In this comment I'll just address that; I also think the case in the OP is overstated for the more plausible claim that safety-motivated researchers shouldn't work directly on accelerating capabilities.)

A simple way to model the situation is:

  • There are A times as many researchers who work on existential safety as there are total researchers in AI. I think a very conservative guess is something like A = 1% or 10% depending on how broadly you define it; a more sophisticated version of this argument would focus on more narrow types of safety work.
  • If those researchers don't worry about capabilities externalities when they choose safety projects, they accelerate capabilities by about B times as much as if they had focused on capabilities directly. I think a plausible guess for this is like B = 10%.
  • The badness of accelerating capabilities by 1 month is C times larger than the goodness of accelerating safety research by 1 month. I think a reasonable conservative guess for this today is like 10. It depends on how much other "good stuff" you think is happening in the world that is similarly important to safety progress on reducing the risk from AI---are there 2 other categories, or 10 other categories, or 100 other categories? This number will go up as the broader world starts responding and adapting to AI more; I think in the past it was much less than 10 because there just wasn't that much useful preparation happening.

The ratio of (capabilities externalities) / (safety impact) is something like A x B x C, which is like 1-10% of the value of safety work. This suggests that capabilities externalities can be worth thinking about.
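The A x B x C estimate can be made concrete. The values below are just the guesses from the bullets above (with A taken mid-range), not measured quantities:

```python
# Sketch of the simple externality model above; values are the illustrative
# guesses from the bullets, not measured quantities.
A = 0.05  # fraction of AI researchers working on existential safety (1-10%)
B = 0.10  # capability acceleration relative to direct capabilities work
C = 10    # badness of 1 month of acceleration vs goodness of 1 month of safety work

externality_ratio = A * B * C  # (capabilities externality) / (safety impact)
print(f"{externality_ratio:.0%}")  # 5% with these mid-range guesses

# Bounding A at 1% and 10% recovers the 1-10% range stated in the text.
low, high = 0.01 * B * C, 0.10 * B * C
print(f"{low:.0%} to {high:.0%}")  # 1% to 10%
```

Note the conclusion is linear in each parameter, which is why the later discussion of whether B is really 10% or more like 100% carries most of the weight.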

There are lots of other corrections which I think generally point further in the direction of not worrying. For example, in this estimate we said that AI safety research is 10% of all the important safety-improving stuff happening in the world, while implicitly assuming that capabilities research is 100% of all the timelines-shortening stuff. If we treated those two questions symmetrically, I would argue that capabilities research is <30% of all the timelines-shortening stuff (including e.g. compute progress, experience with products, human capital growth...). So my personal estimate for safety externalities is more like 1% of the value of safety progress.

And on top of that if you totally avoid 30% of possible safety topics, the total cost is much larger than reducing safety output by 30%---generally there are diminishing returns and stepping on toes effects and so on. I'd guess it's at least 2-4x worse than just scaling down everything.

There are lots of ways you could object to this, but ultimately it seems to come down to quantitative claims:

  • My sense is that the broader safety community thinks parameter B is much larger than I do, and this is the most substantive and interesting disagreement. People seem to have an estimate more like 100%, and when restricting to "good alignment work" they often seem to think it's more like 10x. Overall, if I believed B=1 then it becomes worth thinking about, and if you take B=10 then it becomes a major consideration in project choice.
  • In this post Nate explains his view that alignment progress is serial while capabilities progress is parallelizable, but I disagree and he doesn't really present any evidence or argument. Of course you could list other considerations not in the simple model that you think are important.
  • I think the most common objection is just to think that the supposed safety progress isn't actually delivering safety. But then "this doesn't help" is really the crux---the work would be bad even if it weren't net negative. Arguing about net negativity is really burying the lede.

I agree that ultimately AI systems will have an understanding built up from the world using deliberate cognitive steps (in addition to plenty of other complications) not all of which are imitated from humans.

The ELK document mostly focuses on the special case of ontology identification, i.e. ELK for a directly learned world model. The rationale is: (i) it seems like the simplest case, (ii) we don't know how to solve it, (iii) it's generally good to start with the simplest case you can't solve, (iv) it looks like a really important special case, which may appear as a building block or just directly require the same techniques as the general problem.

We briefly discuss the more general situation of learned optimization here. I don't think that discussion is particularly convincing, it just describes how we're thinking about it.

On the bright side, it seems like our current approach to ontology identification (based on anomaly detection) would have a very good chance of generalizing to other cases of ELK. But it's not clear, and it puts more strain on the notion of explanation we are using.

At the end of the day I strongly suspect the key question is whether we can make something like a "probabilistic heuristic argument" about the reasoning learned by ML systems, explaining why they predict (e.g.) images of human faces. We need arguments detailed enough to distinguish between sensor tampering (or lies) and real anticipated faces, i.e. they may be able to treat some claims as unexplained empirical black boxes but they can't have black boxes so broad that they would include both sensor tampering and real faces.

If such arguments exist and we can find them then I suspect we've dealt with the hard part of alignment. If they don't exist or we can't find them then I think we don't really have a plausible angle of attack on ELK. I think a very realistic outcome is that it's a messy empirical question, in which case our contribution could be viewed as clarifying an important goal for "interpretability" but success will ultimately come down to a bunch of empirical research.

I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul's (probably unintentional) overloading of "AI alignment" with the new and narrower meaning (in Clarifying “AI Alignment”)

I don't think this is the main or only source of confusion:

  • MIRI folks also frequently used the narrower usage. I think the first time I saw "aligned" was in Aligning Superintelligence with Human Interests from 2014 (scraped by wayback on January 3 2015) which says "We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”"
  • Virtually every problem people discussed as part of AI alignment was also part of intent alignment. The name was deliberately chosen to evoke "pointing" your AI in a direction. Even in the linked post Eliezer uses "pointing the AI in the right direction" as a synonym for alignment.
  • It was proposed to me as a replacement for the narrower term AI control, which quite obviously doesn't include all the broader stuff. In the email thread where Rob suggested I adopt it he suggested it was referring to what Nick Bostrom called the "second principal-agent problem" between AI developers and the AI they build.

the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world

I want to emphasize again that this definition seems extremely bad. A lot of people think their work helps AI actually produce good outcomes in the world when run, so pretty much everyone would think their work counts as alignment. 

It includes all work in AI ethics, if in fact that research is helpful for ensuring that future AI has a good outcome. It also includes everything people work on in AI capabilities, if in fact capability increases improve the probability that a future AI system produces good outcomes when run. It's not even restricted to safety, since it includes realizing more upside from your AI. It includes changing the way you build AI to help address distributional issues, if the speaker (very reasonably!) thinks those are important to the value of the future. I didn't take this seriously as a definition and didn't really realize anyone was taking it seriously, I thought it was just an instance of speaking loosely.

But if people are going to use the term this way, I think at a minimum they cannot complain about linguistic drift when "alignment" means anything at all. Obviously people are going to disagree about what AI features lead to "producing good outcomes." Almost all the time I see definitional arguments it's where people (including Eliezer) are objecting that "alignment" includes too much stuff and should be narrower, but this is obviously not going to be improved by adopting an absurdly broad definition.

In the 2017 post Vladimir Slepnev is talking about your AI system having particular goals, isn't that the narrow usage? Why are you citing this here?

I misread the date on the Arbital page (since Arbital itself doesn't have timestamps and it wasn't indexed by the Wayback machine until late 2017) and agree that usage is prior to mine.
