Evan Hubinger

I am a Research Fellow at MIRI working on inner alignment for amplification.

See: "What I'll doing at MIRI."

Pronouns: he/him/his

Evan Hubinger's Comments

Three Kinds of Competitiveness

It's interesting how Paul advocates merging cost and performance-competitiveness, and you advocate merging performance and date-competitiveness.

Also I advocated merging cost and date competitiveness (into training competitiveness), so we have every combination covered.

Three Kinds of Competitiveness

In the context of prosaic AI alignment, I've recently taken to splitting up competitiveness into “training competitiveness” and “objective competitiveness,”[1] where training competitiveness refers to the difficulty of training the system to succeed at its objective and objective competitiveness refers to the usefulness of a system that succeeds at that objective. I think my training competitiveness broadly maps onto a combination of your cost and date competitiveness and my objective competitiveness broadly maps onto your performance competitiveness. I think I mildly like my dichotomy better than your trichotomy in terms of thinking about prosaic AI alignment schemes, as I think it provides a better picture of the specific parts of a prosaic AI alignment proposal that are helping or hindering its overall competitiveness—e.g. if it's not very objective competitive, that tells you that you need a stronger objective, and if it's not very training competitive, that tells you that you need a better training process (it's also nice in terms of mirroring the inner/outer alignment distinction). That being said, your trichotomy is certainly more general in terms of applying to things that aren't just prosaic AI alignment.

  1. Objective competitiveness isn't a great term, though, since it can be misread as the opposite of subjective competitiveness—perhaps I'll switch now to using performance competitiveness instead. ↩︎

Zoom In: An Introduction to Circuits

I think for the remaining 5% to be hiding really big important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, it has to be the case that there was adversarial obfuscation (e.g. gradient hacking). Of course, I'm only hypothesizing here, but it seems quite unlikely for that sort of stuff to just be randomly obfuscated.

Given that assumption, I think it's possible to translate 95% transparency into a safety guarantee: just use your transparency to produce a consistent gradient away from deception such that your model never becomes deceptive in the first place and thus never does any sort of adversarial obfuscation.[1] I suspect that the right way to do this is to use your transparency tools to enforce some sort of simple condition that you are confident in rules out deception such as myopia. For more context, see my comment here and the full “Relaxed adversarial training for inner alignment” post.

  1. It is worth noting that this does introduce the possibility of getting obfuscation by overfitting the transparency tools, though I suspect that that sort of overfitting-style obfuscation will be significantly easier to deal with than actively adversarial obfuscation by a deceptive mesa-optimizer. ↩︎

Towards a mechanistic understanding of corrigibility

I don't think there's really a disagreement there—I think what Paul's saying is that he views corrigibility as the right way to get an acceptability guarantee.

Does iterated amplification tackle the inner alignment problem?

You are correct that amplification is primarily a proposal for how to solve outer alignment, not inner alignment. That being said, Paul has previously talked about how you might solve inner alignment in an amplification-style setting. For an up-to-date, comprehensive analysis of how to do something like that, see “Relaxed adversarial training for inner alignment.”

What is the difference between robustness and inner alignment?

This a good question. Inner alignment definitely is meant to refer to a type of robustness problem—it's just also definitely not meant to refer to the entirety of robustness. I think there are a couple of different levels on which you can think about exactly what subproblem inner alignment is referring to.

First, the definition that's given in “Risks from Learned Optimization”—where the term inner alignment comes from—is not about competence vs. intent robustness, but is directly about the objective that a learned search algorithm is searching for. Risks from Learned Optimization broadly takes the position that though it might not make sense to talk about learned models having objectives in general, it certainly makes sense to talk about a model having an objective if it is internally implementing a search process, and argues that learned models internally implementing search processes (which the paper calls mesa-optimizers) could be quite common. I would encourage reading the full paper to get a sense of how this sort of definition plays out.

Second, that being said, I do think that the competence vs. intent robustness framing that you mention is actually a fairly reasonable one. “2-D Robustness” presents the basic picture here, though in terms of a concrete example of what robust capabilities without robust alignment could actually look like, I am somewhat partial to my maze example. I think the maze example in particular presents a very clear story for how capability and alignment robustness can come apart even for agents that aren't obviously running a search process. The 2-D robustness distinction is also the subject of this alignment newsletter, which I'd also highly recommend taking a look at, as it has some more commentary on thinking about this sort of a definition as well.

Bayesian Evolving-to-Extinction

If that ticket is better at predicting the random stuff it's writing to the logs—which it should be if it's generating that randomness—then that would be sufficient. However, that does rely on the logs directly being part of the prediction target rather than only through some complicated function like a human seeing them.

Bayesian Evolving-to-Extinction

There is also the "lottery ticket hypothesis" to consider (discussed on LW here and here) -- the idea that a big neural network functions primarily like a bag of hypotheses, not like one hypothesis which gets adapted toward the right thing. We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.

This is a fascinating point. I'm curious now how bad things can get if your lottery tickets have side channels but aren't deceptive. It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it's simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly. This seems likely to depend on how powerful your base optimization process is and how easy it is to influence the world through side-channels. If it's the case that you need deception, then this probably isn't any worse than the gradient hacking problem (though possibly it gives us more insight into how gradient hacking might work)—but if it can happen without deception, then this sort of evolving-to-extinction behavior could be a serious problem in its own right.

Synthesizing amplification and debate

Yep; that's basically how I'm thinking about this. Since I mostly want this process to limit to amplification rather than debate, I'm not that worried about the debate equilibrium not being exactly the same, though in most cases I expect in the limit that such that you can in fact recover the debate equilibrium if you anneal towards debate.

Synthesizing amplification and debate

The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from ” I mean that in the zero-sum debate game sense. So you're still using self-play to converge on the Nash in the situation where you anneal towards debate, and otherwise you're using that self-play RL reward as part of the loss and the supervised amplification loss as the other part.

Are the arguments the same thing as answers?

The arguments should include what each debater thinks the answer to the question should be.

I think yours is aiming at the second and not the first?


Load More