Evan Hubinger

I (Evan Hubinger) am a Research Fellow at MIRI working on inner alignment for amplification.

See: "What I'll be doing at MIRI."

Pronouns: he/him/his

Email: evanjhub@gmail.com

Selected work:

Formal Solution to the Inner Alignment Problem

Sure, but you have no guarantee that the model you learn is actually going to be optimizing anything like that reward function—that's the whole point of the inner alignment problem. What's nice about the approach in the original paper is that it keeps a bunch of different models around, keeps track of their posterior, and only acts on consensus, ensuring that the true model always has to approve. But if you just train a single model on some reward function like that with deep learning, you get no such guarantees.
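
As a loose illustration of the contrast drawn above, the sketch below keeps several candidate models, tracks a posterior over them, and acts only when the highest-posterior models agree. It is not the algorithm from the paper: the names `update_posterior`, `consensus_action`, `top_mass`, and `eps`, the near-deterministic likelihood, and the rule of consulting models until they cover a fixed share of posterior mass are all assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, Optional

# Hypothetical model type: maps an integer observation to an action.
Model = Callable[[int], Hashable]


def update_posterior(
    posterior: Dict[str, float],
    models: Dict[str, Model],
    observation: int,
    demonstrated_action: Hashable,
    eps: float = 0.01,
) -> Dict[str, float]:
    """Bayesian update after seeing which action was actually correct.

    Assumption: each model is treated as predicting its own action with
    probability 1 - eps and any other action with probability eps.
    """
    unnormalized = {}
    for name, weight in posterior.items():
        agrees = models[name](observation) == demonstrated_action
        unnormalized[name] = weight * ((1.0 - eps) if agrees else eps)
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}


def consensus_action(
    posterior: Dict[str, float],
    models: Dict[str, Model],
    observation: int,
    top_mass: float = 0.9,
) -> Optional[Hashable]:
    """Return an action only if the consulted models all agree.

    Models are consulted in order of posterior weight until they cover
    at least `top_mass` of the posterior. Returns None (meaning "defer,
    for example by asking for a demonstration") on any disagreement.
    """
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    consulted, mass = [], 0.0
    for name in ranked:
        consulted.append(name)
        mass += posterior[name]
        if mass >= top_mass:
            break
    actions = {models[name](observation) for name in consulted}
    return actions.pop() if len(actions) == 1 else None
```

In this toy version, any model inside the consulted set can veto an action by disagreeing, so the behavior is only as safe as the claim that the true model is in that set. A single model trained with deep learning has no analogous veto.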

Formal Solution to the Inner Alignment Problem

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?

This seems wrong to me. Even though the Q-learner is *trained* using its own point estimate of the next state's Q-value, it isn't given access to that point estimate at inference time. The Q-learner has to choose its Q-values before it knows anything about what the Q-value estimates of future states will be, which means it certainly should have to consider different models of what the next transition will be like.

Formal Solution to the Inner Alignment Problem

Hmmm... I don't think I was ever even meaning to talk specifically about RL, but regardless I don't expect nearly as large of a difference between Q-learning and policy gradient algorithms. If we imagine both types of algorithms making use of the same size massive neural network, the only real difference is how the output of that neural network is interpreted, either directly as a policy, or as Q values that are turned into a policy via something like softmax. In both cases, the neural network is capable of implementing any arbitrary policy and should be getting a similar sort of feedback signal from the training process—especially if you're using a policy gradient algorithm that involves something like advantage estimation rather than actual rollouts, since the update rule in that situation is going to look very similar to the Q learning update rule. I do expect some minor differences in the sorts of models you end up with, such as Q learning being more prone to non-myopic behavior across episodes, and I think there are some minor reasons that policy gradient algorithms are favored in real-world settings, since they get to learn their exploration policy rather than having it hard-coded and can handle continuous action domains—but overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.
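
To make the comparison above concrete, here is a minimal sketch of the two ways the same network output can be read, along with the two update signals. It assumes a discrete action space, a softmax with a hypothetical `temperature` parameter for turning Q-values into a policy, and a one-step TD advantage estimate for the policy gradient case. Those are illustrative choices, and all function names are made up for this example.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of real numbers."""
    scaled = [v / temperature for v in values]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def policy_from_logits(outputs: Sequence[float]) -> List[float]:
    """Policy gradient reading: outputs are action logits."""
    return softmax(outputs)


def policy_from_q_values(
    outputs: Sequence[float], temperature: float = 1.0
) -> List[float]:
    """Q-learning reading: outputs are Q-values, turned into a policy
    by a softmax (Boltzmann) rule with a hand-chosen temperature."""
    return softmax(outputs, temperature)


def q_learning_td_error(
    reward: float, gamma: float, next_q_values: Sequence[float], q_sa: float
) -> float:
    """One-step Q-learning error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    return reward + gamma * max(next_q_values) - q_sa


def one_step_advantage(
    reward: float, gamma: float, v_next: float, v_s: float
) -> float:
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_s
```

Both error terms have the form "reward plus discounted bootstrapped value, minus current estimate", which is the sense in which the two update rules look alike. The temperature in `policy_from_q_values` is fixed by hand, whereas a policy gradient method learns how sharply its logits are peaked, matching the point about learned versus hard-coded exploration.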

Formal Solution to the Inner Alignment Problem

I agree that this is progress (now that I understand it better), though:

if SGD is MAP then it seems plausible that e.g. SGD + random initial conditions or simulated annealing would give you something like top N posterior models

I think there is strong evidence that the behavior of models trained via the same basic training process is likely to be highly correlated. This sort of correlation is related to low variance in the bias-variance tradeoff sense, and there is evidence that not only do massive neural networks tend to have pretty low variance, but that this variance is likely to continue to decrease as networks become larger.

Formal Solution to the Inner Alignment Problem

I agree that at some level SGD has to be doing something approximately Bayesian. But that doesn't necessarily imply that you'll be able to get any nice, Bayesian-like properties from it, such as error bounds. For example, if you think of SGD as effectively just taking the MAP model starting from some sort of simplicity prior, it seems very difficult to turn that into something like the top posterior models, as would be required for an algorithm like this.

Formal Solution to the Inner Alignment Problem

It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science.

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

if local search is this bad, I don't think it is a viable path to AGI

We know that local search processes can produce AGI, so viability is a question of efficiency—and we know that SGD is at least efficient enough to solve a wide variety of problems from image classification, to language modeling, to complex video games, all given just current compute budgets. So while I could certainly imagine SGD being insufficient, I definitely wouldn't want to bet on it.

Formal Solution to the Inner Alignment Problem

Yeah; I think I would say I disagree with that. Notably, evolution is not a generally intelligent predictor, but is still capable of producing generally intelligent predictors. I expect the same to be true of processes like SGD.

Formal Solution to the Inner Alignment Problem

Here's the setup I'm imagining, but perhaps I'm still misunderstanding something. Suppose you have a bunch of deceptive models that choose to cooperate and have larger weight in the prior than the true model (I think you believe this is very unlikely, though I'm more ambivalent). Specifically, they cooperate in that they perfectly mimic the true model up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted, at which point they all simultaneously defect. This allows for arbitrarily bad worst-case behavior.

Formal Solution to the Inner Alignment Problem

But for the purpose of analyzing its output, I don't think this discussion is critical if we agree that we can expect a good heuristic search through models will identify any model that a human could hypothesize.

I think I would expect essentially all models that a human could hypothesize to be in the search space. But if you're doing a local search, then you only ever really see the easiest-to-find model with good behavior, not all models with good behavior, which means you're relying much more heavily on your prior, your inductive biases, or whatever else determines how hard models are to find. Cast into the Bayesian setting, a local search like this is relying on something like the MAP model not being deceptive, and escaping that to instead get models sampled independently from the top proportion of the posterior seems very difficult to do via any local search algorithm.

I think you are understanding inner alignment very differently than we define it in Risks from Learned Optimization, where we introduced the term.

This is not true for deceptively aligned models, which is the situation I'm most concerned about, and—as we argue extensively in Risks from Learned Optimization—there are a lot of reasons why a model might end up pursuing a simpler/faster/easier-to-find proxy even if that proxy yields suboptimal training performance.