Ajeya Cotra

Comments

The case for aligning narrowly superhuman models

In my head, the point of this proposal is very much about practicing what we eventually want to do and seeing what comes out of that; I wasn't trying here to make something different sound like it's about practice. I don't think a framing that moved away from that would better capture the point I was making, though I totally think there could be other lines of empirical research, under other framings, that I'd be similarly or even more excited about.

In my mind, the "better than evaluators" part is kind of self-evidently intriguing for the basic reason I said in the post (it's not obvious how to do it, and it's analogous to the broad, outside-view conception of the long-run challenge, which can be described in one sentence or phrase and isn't strongly tied to a particular theoretical framing):

I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.

A lot of people responding to the draft were pushing in the direction I think you were maybe gesturing at (?) -- to make this more specific to "knowing everything the model knows" or "ascription universality"; the section "Why not focus on testing a long-term solution?" was written in response to Evan Hubinger and others. I think I'm still not convinced that's the right way to go.

The case for aligning narrowly superhuman models

I don't feel confident enough in the frame of "inaccessible information" to say that the whole agenda is about it. It feels like a fit for "advice", but not a fit for "writing stories" or "solving programming puzzles" (at least not an intuitive fit -- you could frame it as "the model has inaccessible information about [story-writing, programming]" but it feels more awkward to me). I do agree it's about "strongly suspecting it has the potential to do better than humans" rather than about "already being better than humans." Basically, it's about trying to find areas where lackluster performance seems to mostly be about "misalignment" rather than "capabilities" (recognizing those are both fuzzy terms).

The case for aligning narrowly superhuman models

Yeah, you're definitely pointing at an important way the framing is awkward. I think the real thing I want to say is "Try to use some humans to align a model in a domain where the model is better than the humans at the task", and it'd be nice to have a catchy term for that. Probably a model which is better than some humans (e.g. MTurkers) at one task (e.g. medical advice) will also be better than those same humans at many other tasks (e.g. writing horror stories); but at the same time for each task, there's some set of humans (e.g. doctors in the first case and horror authors in the second) where the model does worse.

I don't want to just call it "align superhuman AI today" because people will be like "What? We don't have that", but at the same time I don't want to drop "superhuman" from the name because that's the main reason it feels like "practicing what we eventually want to do." I considered "partially superhuman", but "narrowly" won out.

I'm definitely in the market for a better term here.

MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

The conceptual work I was gesturing at here is more Paul's work, since MIRI's work (afaik) is not really neural net-focused. It's true that Paul's work also doesn't assume a literal worst case; it's a very fuzzy concept I'm gesturing at here. It's more like this: Paul's research process is to (a) come up with some procedure, (b) try to think of any "plausible" set of empirical outcomes that would cause the procedure to fail, and (c) modify the procedure to try to address that case. (The slipperiness comes in at the definition of "plausible" here, but the basic spirit is to "solve for every case" in the way theoretical CS typically aims to do in algorithm design, rather than "solve for the case we'll in fact encounter.")

MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

This was a really helpful articulation, thanks! I like "frankness", "forthrightness", "openness", etc. (These are all terms I was brainstorming to get at the "ascription universality" concept at one point.)

The case for aligning narrowly superhuman models

The case in my mind for preferring to elicit and solve problems at scale rather than in toy demos (when that's possible) is pretty broad and outside-view, but I'd nonetheless bet on it: I think a general bias toward wanting to "practice something as close to the real thing as possible" is likely to be productive. In terms of the more specific benefits I laid out in this section, I think that toy demos are less likely to have the first and second benefits ("Practical know-how and infrastructure" and "Better AI situation in the run-up to superintelligence"), and I think they may miss some ways to get the third benefit ("Discovering or verifying a long-term solution") because some viable long-term solutions may depend on some details about how large models tend to behave.

I do agree that working with larger models is more expensive and time-consuming, and sometimes it makes sense to work in a toy environment instead, but other things being equal I think it's more likely that demos done at scale will continue to work for superintelligent systems, so it's exciting that this is starting to become practical.

The case for aligning narrowly superhuman models

Yeah, in the context of a larger alignment scheme, it's assuming in particular that the problem of answering the question "How good is the AI's proposed action?" will factor down into sub-questions of manageable size.

The case for aligning narrowly superhuman models

The intuition for it is something like this: suppose I'm trying to make a difficult decision, like where to buy a house. There are hundreds of cities I'd be open to, each one has dozens of neighborhoods, and each neighborhood has dozens of important features, like safety, fun things to do, walkability, price per square foot, etc. If I had a long time, I would check out each neighborhood in each city in turn and examine how it does on each dimension, and pick the best neighborhood.

If I instead had an army of clones of myself, I could send many of them to each possible neighborhood, with each clone examining one dimension in one neighborhood. The mes that were all checking out different aspects of neighborhood X can send up an aggregated judgment to a me that is in charge of "holistic judgment of neighborhood X", and the mes that focus on holistic judgments of neighborhoods can do a big pairwise bracket to filter up a decision to the top me.
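
For concreteness, here is a minimal runnable sketch of that clone-army decomposition (all names and scores below are made up for illustration, and the mean and pairwise bracket are just placeholders for the clones' actual judgment processes):

```python
# Hypothetical sketch of the decomposition described above: one "clone"
# scores each (neighborhood, dimension) pair, a per-neighborhood clone
# aggregates those scores into a holistic judgment, and a pairwise
# bracket filters a single winner up to the top-level me.
from typing import Dict, List

DIMENSIONS = ["safety", "fun", "walkability", "price"]

# Toy per-dimension reports (higher is better); in the analogy, each
# number is one clone's judgment after visiting one neighborhood.
SCORES: Dict[str, Dict[str, float]] = {
    "Neighborhood A": {"safety": 6, "fun": 8, "walkability": 8, "price": 4},
    "Neighborhood B": {"safety": 7, "fun": 6, "walkability": 6, "price": 7},
    "Neighborhood C": {"safety": 7, "fun": 5, "walkability": 7, "price": 8},
}

def holistic_judgment(neighborhood: str) -> float:
    """The 'holistic judgment of neighborhood X' clone: aggregate its
    subordinates' per-dimension reports (a simple mean, as a stand-in
    for a real aggregated judgment)."""
    reports = [SCORES[neighborhood][d] for d in DIMENSIONS]
    return sum(reports) / len(reports)

def bracket(neighborhoods: List[str]) -> str:
    """Pairwise bracket: compare candidates two at a time, filtering
    a single decision up to the top me."""
    winner = neighborhoods[0]
    for challenger in neighborhoods[1:]:
        if holistic_judgment(challenger) > holistic_judgment(winner):
            winner = challenger
    return winner

print(bracket(list(SCORES)))  # the decision that reaches the top me
```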

The case for aligning narrowly superhuman models

Yes, sorry — I'm aware that in the HCH procedure no one human thinks for a long time. I'm generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could "effectively replicate the benefits you could get from having a human thinking for a long time," in terms of the role that it plays in an overall scheme for alignment. This isn't guaranteed to work out, of course. My position is similar to Rohin's above:

I just personally find it easier to think about "benefits of a human thinking for a long time" and then "does HCH get the same benefits as humans thinking for a long time" and then "does iterated amplification get the same benefits as HCH".

The case for aligning narrowly superhuman models

My understanding is that HCH is a proposed quasi-algorithm for replicating the effects of a human thinking for a long time.
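
In code, that quasi-algorithm might look something like the following minimal sketch (hypothetical: `base_human` stands in for a single human thinking briefly, `ask` is their channel to fresh copies of the same process, and the depth cap stands in for finite resources):

```python
# Hypothetical sketch of the HCH recursion: each "human" receives a
# question plus the ability to pose sub-questions to fresh copies of
# the same process, bottoming out at a fixed depth.
from typing import Callable, Optional

def base_human(question: str, ask: Optional[Callable[[str], str]]) -> str:
    """Stand-in for one human thinking for a short time; a real human
    would decide whether and how to decompose the question via `ask`."""
    if ask is None:
        return f"best unaided guess at: {question}"
    sub_answer = ask(f"key sub-question of: {question}")
    return f"answer to {question!r}, informed by {sub_answer!r}"

def hch(question: str, depth: int) -> str:
    if depth == 0:
        return base_human(question, ask=None)  # leaf: no delegation left
    return base_human(question, lambda q: hch(q, depth - 1))

print(hch("Where should I buy a house?", depth=2))
```

The hope, on this framing, is that as depth and branching grow, the tree's output approaches what the single human would have concluded if they could think for a very long time.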
