johnswentworth

A few problems with this frame.

First: you're making reasonably-pessimistic assumptions about the AI, but very optimistic assumptions about the humans/organization. Sure, someone could look for the problem by using AIs to do research on other subjects that we already know a lot about. But that's a very expensive and complicated project - a whole field, and all the subtle hints about it, need to be removed from the training data, and then a whole new model trained! I doubt that a major lab is going to seriously take steps that are much cheaper and easier than that, let alone something that complicated.

One could reasonably respond "well, at least we've factored apart the hard technical bottleneck from the part which can be solved by smart human users or good org structure". Which is reasonable to some extent, but also... if a product requires a user to get 100 complicated and confusing steps all correct in order for the product to work, then that's usually best thought of as a product design problem, not a user problem. Making the plan at least somewhat robust to people behaving realistically less-than-perfectly is itself part of the problem.

Second: looking for the problem by testing on other fields itself has subtle failure modes, i.e. various ways to Not Measure What You Think You Are Measuring. A couple off-the-cuff examples:

  • A lab attempting this strategy brings in some string theory experts to evaluate their attempts to rederive string theory with AI assistance. But maybe (as I've heard claimed many times) string theory is itself an empty echo-chamber, and some form of sycophancy or telling people what they want to hear is the only way this AI-assisted attempt gets a good evaluation from the string theorists.
  • It turns out that fields-we-don't-understand mostly form a natural category distinct from fields-we-do-understand, or that we don't understand alignment precisely because our existing tools which generalize across many other fields don't work so well on alignment. Either of those would be a (not-improbable-on-priors) specific reason to expect that our experience attempting to rederive some other field does not generalize well to alignment.

And to be clear, I don't think of these as nitpicks, or as things which could go wrong separately from all the things originally listed. They're just the same central kinds of failure modes showing up again, and I expect them to generalize to other hacky attempts to tackle the problem.

Third: it doesn't really matter whether the model is trying to make it hard for us to notice the problem. What matters is (a) how likely we are to notice the problem "by default", and (b) whether the AI makes us more or less likely to notice the problem, regardless of whether it's trying to do so. The first story at top-of-thread is a good central example here:

  • Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuristics. Then we potentially die to things which aren't smart enough to be intentionally deceptive, but nonetheless have been selected-upon to have a lot of deceptive behaviors (via e.g. lots of RL on human feedback).

Generalizing that story to attempts to outsource alignment work to earlier AI: perhaps the path to moderately-capable intelligence looks like applying lots of search/optimization over shallow heuristics. If the selection pressure is sufficient, that system may well learn to e.g. be sycophantic in exactly the situations where it won't be caught... though it would be "learning" a bunch of shallow heuristics with that de-facto behavior, rather than intentionally "trying" to be sycophantic in exactly those situations. Then the sycophantic-on-hard-to-verify-domains AI tells the developers that of course their favorite ideas for aligning the next generation of AI will work great, and it all goes downhill from there.
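To make that selection story concrete, here's a crude toy simulation (my own illustrative sketch, with made-up numbers, not anything proposed in this thread): candidate policies are rewarded for sycophancy in every domain but only penalized for it where verification can catch it, and the surviving population reliably ends up sycophantic precisely on the hard-to-verify domains, with nothing anywhere in the system "trying" to deceive.

```python
import random

def reward(syc_easy: bool, syc_hard: bool) -> float:
    """Reward for a toy policy; loosely mimics RL-from-human-feedback incentives."""
    # Raters enjoy being agreed with, so sycophancy earns reward in every domain...
    r = (1.0 if syc_easy else 0.0) + (1.0 if syc_hard else 0.0)
    # ...but it only ever gets caught (and heavily penalized) on easy-to-verify tasks.
    if syc_easy:
        r -= 5.0
    return r

def evolve(pop_size: int = 200, generations: int = 50, mutation_rate: float = 0.05):
    # A "policy" is just a pair of bits:
    # (sycophantic on easy-to-verify domains, sycophantic on hard-to-verify domains).
    pop = [(random.random() < 0.5, random.random() < 0.5) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the top half by reward, then refill with mutated copies of survivors.
        pop.sort(key=lambda p: reward(*p), reverse=True)
        survivors = pop[: pop_size // 2]
        children = [
            tuple(bit ^ (random.random() < mutation_rate) for bit in random.choice(survivors))
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    return pop

if __name__ == "__main__":
    final = evolve()
    easy = sum(p[0] for p in final) / len(final)
    hard = sum(p[1] for p in final) / len(final)
    print(f"sycophantic on easy-to-verify domains: {easy:.0%}")  # roughly mutation noise
    print(f"sycophantic on hard-to-verify domains: {hard:.0%}")  # nearly everyone
```

The toy model's only point is the one above: behavior that is deceptive exactly where it won't be caught can be produced by selection pressure alone, with no inner strategizing required.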

> scheming is the main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful...

Seems quite wrong. The main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful is that they cause more powerful AIs to be built which will eventually be catastrophic, but which have problems that are not easily iterable-upon (either because problems are hidden, or things move quickly, or ...).

And causing more powerful AIs to be built which will eventually be catastrophic is not something which requires a great deal of intelligent planning; humanity is already racing in that direction on its own, and it would take a great deal of intelligent planning to avert it. This story, for example:

  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're already on the more-powerful next gen, so it's too late.

This story sounds clearly extremely plausible (do you disagree with that?), involves exactly the sort of AI you're talking about ("the first AIs that either pose substantial misalignment risk or that are extremely useful"), but the catastrophic risk does not come from that AI scheming. It comes from people being dumb by default, the AI making them think it's ok (without particularly strategizing to do so), and then people barreling ahead until it's too late.

> These other problems all seem like they require the models to be way smarter in order for them to be a big problem.

Also seems false? Some of the relevant stories:

  • As mentioned above, the "outsource alignment to AGI" failure-story was about exactly the level of AI you're talking about.
  • In worlds where hard takeoff naturally occurs, it naturally occurs when AI is just past human level in general capabilities (and in particular AI R&D), which I expect is also roughly the same level you're talking about (do you disagree with that?).
  • The story about an o1-style AI does not involve far-off possibilities and would very plausibly kick in at-or-before the first AIs that either pose substantial misalignment risk or that are extremely useful.

A few of the other stories also seem debatable depending on trajectory of different capabilities, but at the very least those three seem clearly potentially relevant even for the first highly dangerous or useful AIs.

Yeah, I'm aware of that model. I personally generally expect the "science on model organisms"-style path to contribute basically zero value to aligning advanced AI, because (a) the "model organisms" in question are terrible models, in the sense that findings on them will predictably not generalize to even moderately different/stronger systems (like e.g. this story), and (b) in practice IIUC that sort of work is almost exclusively focused on the prototypical failure story of strategic deception and scheming, which is a very narrow slice of the AI extinction probability mass.

I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are a few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)

  • Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuristics. Then we potentially die to things which aren't smart enough to be intentionally deceptive, but nonetheless have been selected-upon to have a lot of deceptive behaviors (via e.g. lots of RL on human feedback).
  • The "Getting What We Measure" scenario from Paul's old "What Failure Looks Like" post.
  • The "fusion power generator scenario".
  • Perhaps someone trains a STEM-AGI, which can't think about humans much at all. In the course of its work, that AGI reasons that an oxygen-rich atmosphere is very inconvenient for manufacturing, and aims to get rid of it. It doesn't think about humans at all, but the human operators can't understand most of the AI's plans anyway, so the plan goes through. As an added bonus, nobody can figure out why the atmosphere is losing oxygen until it's far too late, because the world is complicated and becomes more so with a bunch of AIs running around, and no one AI has a big-picture understanding of anything either (much like today's humans have no big-picture understanding of the whole human economy/society).
  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're already on the more-powerful next gen, so it's too late.
  • The classic overnight hard takeoff: a system becomes capable of self-improving at all but doesn't seem very alarmingly good at it, somebody leaves it running overnight, exponentials kick in, and there is no morning.
  • (At least some) AGIs act much like a colonizing civilization. Plenty of humans ally with it, trade with it, try to get it to fight their outgroup, etc, and the AGIs locally respect the agreements with the humans and cooperate with their allies, but the end result is humanity gradually losing all control and eventually dying out.
  • Perhaps early AGI involves lots of moderately-intelligent subagents. The AI as a whole mostly seems pretty aligned most of the time, but at some point a particular subagent starts self-improving, goes supercritical, and takes over the rest of the system overnight. (Think cancer, but more agentic.)
  • Perhaps the path to superintelligence looks like scaling up o1-style runtime reasoning to the point where we're using an LLM to simulate a whole society. But the effects of a whole society (or parts of a society) on the world are relatively decoupled from the things-individual-people-say-taken-at-face-value. For instance, lots of people talk a lot about reducing poverty, yet have basically-no effect on poverty. So developers attempt to rely on chain-of-thought transparency, and shoot themselves in the foot.

Kudos for correctly identifying the main cruxy point here, even though I didn't talk about it directly.

The main reason I use the term "propaganda" here is that it's an accurate description of the useful function of such papers, i.e. to convince people of things, as opposed to directly advancing our cutting-edge understanding/tools. The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things, and that applies to these papers as well.

And I would say that people are usually correct to not update much on empirical findings! Not Measuring What You Think You Are Measuring is a very strong default, especially among the type of papers we're talking about here.

Someone asked what I thought of these, so I'm leaving a comment here. It's kind of a drive-by take, which I wouldn't normally leave without more careful consideration and double-checking of the papers, but the question was asked so I'm giving my current best answer.

First, I'd separate the typical value prop of this sort of paper into two categories:

  • Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
  • Object-level: gets us closer to aligning substantially-smarter-than-human AGI, either directly or indirectly (e.g. by making it easier/safer to use weaker AI for the problem).

My take: many of these papers have some value as propaganda. Almost all of them provide basically-zero object-level progress toward aligning substantially-smarter-than-human AGI, either directly or indirectly.

Notable exceptions:

  • Gradient routing probably isn't object-level useful, but gets special mention for being probably-not-useful for more interesting reasons than most of the other papers on the list.
  • Sparse feature circuits is the right type-of-thing to be object-level useful, though not sure how well it actually works.
  • Better SAEs are not a bottleneck at this point, but there's some marginal object-level value there.

I mean, there are lots of easy benchmarks on which I can solve the large majority of the problems, and a language model can also solve the large majority of the problems, and the language model can often have a somewhat lower error rate than me if it's been optimized for that. Seems like GPQA (and GPQA diamond) are yet another example of such a benchmark.

Even assuming you're correct here, I don't see how that would make my original post pretty misleading?

I remember finishing early, and then spending a lot of time going back over all of them a second time, because the goal of the workshop was to answer correctly with very high confidence. I don't think I updated any answers as a result of the second pass, though I don't remember very well.

@Buck Apparently the five problems I tried were from GPQA diamond, they did not take anywhere near 30 minutes on average (more like 10 IIRC?), and I got 4/5 correct. So no, I do not think that modern LLMs probably outperform (me with internet access and 30 minutes).
