Evan Hubinger


Risks from Learned Optimization


Comparing Utilities

But a decision theory like that does mix levels between the decision theory and the utility function!

I agree, though it's unclear whether that's an actual level crossing or just a failure of our ability to be able to properly analyze that strategy. I would lean towards the latter, though I am uncertain.

A crux for me is the coalition metaphor for utilitarianism. I think of utilitarianism as sort of a natural endpoint of forming beneficial coalitions, where you've built a coalition of all life.

This is how I think about preference utilitarianism but not how I think about hedonic utilitarianism—for example, a lot of what I value personally is hedonic-utilitarianism-like, but from a social perspective, I think preference utilitarianism is a good Schelling point for something we can jointly agree on. However, I don't call myself a preference utilitarian—rather, I call myself a hedonic utilitarian—because I think of social Schelling points and my own personal values as pretty distinct objects. And I could certainly imagine someone who terminally valued preference utilitarianism from a personal perspective—which is what I would call actually being a preference utilitarian.

Furthermore, I think that if you're actually a preference utilitarian vs. if you just think preference utilitarianism is a good Schelling point, then there are lots of cases where you'll do different things. For example, if you're just thinking about preference utilitarianism as a useful Schelling point, then you want to carefully consider the incentives that it creates—such as the one that you're pointing to—but if you terminally value preference utilitarianism, then that seems like a weird thing to be thinking about, since the question you should be thinking about in that context should be more like what is it about preferences that you actually value and why.

If we imagine forming a coalition incrementally, and imagine that the coalition simply averages utility functions with its new members, then there's an incentive to join the coalition as late as you can, so that your preferences get the largest possible representation. (I know this isn't the same problem we're talking about, but I see it as analogous, and so a point in favor of worrying about this sort of thing.)

We can correct that by doing 1/n averaging: every time the coalition gains members, we make a fresh average of all member utility functions (using some utility-function normalization, of course), and everybody voluntarily self-modifies to have the new mixed utility function.

One thing I will say here is that usually when I think about socially agreeing on a preference utilitarian coalition, I think about doing so from more of a CEV standpoint, where the idea isn't just to integrate the preferences of agents as they currently are, but as they will/should be from a CEV perspective. In that context, it doesn't really make sense to think about incremental coalition forming, because your CEV (mostly, with some exceptions) should be the same regardless of what point in time you join the coalition.

But the problem with this is, we end up punishing agents for self-modifying to care about us before joining. (This is more closely analogous to the problem we're discussing.) If they've already self-modified to care about us more before joining, then their original values just get washed out even more when we re-average everyone.

I guess this just seems like the correct outcome to me. If you care about the values of the coalition, then the coalition should care less about your preferences, because they can partially satisfy them just by doing what the other people in the coalition want.

So really, the implicit assumption I'm making is that there's an agent "before" altruism, who "chose" to add in everyone's utility functions. I'm trying to set up the rules to be fair to that agent, in an effort to reward agents for making "the altruistic leap".

It certainly makes sense to reward agents for choosing to instrumentally value the coalition—and I would include instrumentally choosing to self-modify yourself to care more about the coalition in that—but I'm not sure why it makes sense to reward agents for terminally valuing the coalition—that is, terminally valuing the coalition independently of any decision theoretic considerations that might cause you to instrumentally modify yourself to do so.

Again, I think this makes more sense from a CEV perspective—if you instrumentally modify yourself to care about the coalition for decision-theoretic reasons, that might change your values, but I don't think that it should change your CEV. In my view, your CEV should be about your general strategy for how to self-modify yourself in different situations rather than the particular incarnation of you that you've currently modified to.

Comparing Utilities

If we simply take the fixed point, Primus is going to get the short end of the stick all the time: because Primus cares about everyone else more, everyone else cares about Primus' personal preferences less than anyone else's.

Simply put, I don't think more altruistic individuals should be punished! In this setup, the "utility monster" is the perfectly selfish individual. Altruists will be scrambling to help this person while the selfish person does nothing in return.

I'm not sure why you think this is a problem. Supposing you want to satisfy the group's preferences as much as possible, shouldn't you care about Primus less since Primus will be more satisfied just from you helping the others? I agree that this can create perverse incentives in practice, but that seems like the sort of thing that you should be handling as part of your decision theory, not your utility function.

A different way to do things is to interpret cofrences as integrating only the personal preferences of the other person.

I feel like the solution of having cofrences not count the other person's cofrences just doesn't respect people's preferences—when I care about the preferences of somebody else, that includes caring about the preferences of the people they care about. It seems like the natural solution to this problem is to just cut things off when you go in a loop—but that's exactly what taking the fixed point does, which seems to reinforce the fixed point as the right answer here.

Mesa-Search vs Mesa-Control

could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning)

How about language modeling? I think that the task of predicting what a human will say next given some prompt requires learning in a pretty meaningful way, as it requires the model to be able to learn from the prompt what the human is trying to do and then do that.

My computational framework for the brain

Some things which don't fully make sense to me:

  • If the cortical algorithm is the same across all mammals, why do only humans develop complex language? Do you think that the human neocortex is specialized for language in some way, or do you think that other mammal's neocortices would be up to the task if sufficiently scaled up? What about our subcortex—do we get special language-based rewards? How would the subcortex implement those?
  • Furthermore, there are lots of commonalities across human languages—e.g. word order patterns and grammar similarities, see e.g. linguistic universals—how does that make sense if language is neocortical and the neocortex is a blank slate? Do linguistic commonalities come from the subcortex, from our shared environment, or from some way in which our neocortex is predisposed to learn language?
  • Also, on a completely different note, in asking “how does the subcortex steer the neocortex?” you seem to presuppose that the subcortex actually succeeds in steering the neocortex—how confident in that should we be? It seems like there are lots of things that people do that go against a naive interpretation of the subcortical reward algorithm—abstaining from sex, for example, or pursuing complex moral theories like utilitarianism. If the way that the subcortex steers the neocortex is terrible and just breaks down off-distribution, then that sort of cuts into your argument that we should be focusing on understanding how the subcortex steers the neocortex, since if it's not doing a very good job then there's little reason for us to try and copy it.
Mesa-Search vs Mesa-Control

And can we taboo the word 'learning' for this discussion, or keep it to the standard ML meaning of 'update model weights through optimisation'? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin observes elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required become very sophisticated indeed, possessing interesting cognitive structure. But it doesn't have to be the same kind of responsiveness as the learning process of an RL agent; and it doesn't necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.

I agree with all of that—I was using the term “learning” to be purposefully vague precisely because the space is so large and the point that I'm making is very general and doesn't really depend on exactly what notion of responsiveness/learning you're considering.

Now, I understand that you argue that if a policy was to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby 'updating' based on returns it hasn't yet received, and actions it hasn't yet made. I agree that it's possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don't do anything so sophisticated). It isn't clear to me a priori that from the point of view of the policy this is the best strategy to take.

This does in fact seem like an interesting angle from which to analyze this, though it's definitely not what I was saying—and I agree that current meta-learners probably aren't doing this.

I'm not sure what you mean when you say 'taking actions requires learning'. Do you mean something other than the basic requirement that a policy depends on observations?

What I mean here is in fact very basic—let me try to clarify. Let be the optimal policy. Furthermore, suppose that any polynomial-time (or some other similar constraint) algorithm that well-approximates has to perform some operation . Then, my point is just that, for to achieve performance comparable with , it has to do . And my argument for that is just simply because we know that you have to do to get good performance, which means either has to do or the gradient descent algorithm has to—but we know the gradient descent algorithm can't be doing something crazy like running at each step and putting the result into the model because the gradient descent algorithm only updates the model on the given state after the model has already produced its action.

Mesa-Search vs Mesa-Control

This is not true of RL algorithms in general -- If I want, I can make weight updates after every observation.

You can't update the model based on its action until its taken that action and gotten a reward for it. It's obviously possible to throw in updates based on past data whenever you want, but that's beside the point—the point is that the RL algorithm only gets new information with which to update the model after the model has taken its action, which means if taking actions requires learning, then the model itself has to do that learning.

Do mesa-optimizer risk arguments rely on the train-test paradigm?

I don't think that doing online learning changes the analysis much at all.

As a simple transformation, any online learning setup at time step is equivalent to training on steps 1 to and then deploying on step . Thus, online learning for steps won't reduce the probability of pseudo-alignment any more than training for steps will because there isn't any real difference between online learning for steps and training for steps—the only difference is that we generally think of the training environment as being sandboxed and the online learning environment as not being sandboxed, but that just makes online learning more dangerous than training.

You might argue that the fact that you're doing online learning will make a difference after time step because if the model does something catastrophic at time step then online learning can modify it to not do that in the future—but that's always true: what it means for an outcome to be catastrophic is that it's unrecoverable. There are always things that we can try to do after our model starts behaving badly to rein it in—where we have a problem is when it does something so bad that those methods won't work. Fundamentally, the problem of inner alignment is a problem of worst-case guarantees—and doing some modification to the model after it takes its action doesn't help if that action already has the potential to be arbitrarily bad.

interpreting GPT: the logit lens

That's a great idea!

Thanks! I'd be quite excited to know what you find if you end up trying it.

Hmm... I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.

But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.

Thinking out loud, I imagine there might be pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they're near-impossible so you don't need to track them closely), and then smoothly subtracted out at the end. There's probably a way to check if that's happening.

I wasn't thinking you would do this with the natural component basis—though it's probably worth trying that also—but rather doing some sort of matrix decomposition on the embedding matrix to get a basis ordered by importance (e.g. using PCA or NMF—PCA is simpler though I know NMF is what OpenAI Clarity usually uses when they're trying to extract interpretable basis elements from neural network activations) and then seeing what the linear model looks like in that basis. You could even just do something like what you're saying and find some sort of basis ordered by the frequency of the tokens that each basis element corresponds to (though I'm not sure exactly what the right way would be to generate such a basis).

interpreting GPT: the logit lens

This is very neat. I definitely agree that I find the discontinuity from the first transformer block surprising. One thing which occurred to me that might be interesting to do is to try and train a linear model to reconstitute the input from the activations at different layers to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns like whether the accuracy increases or decreases as you get further into the model. You could also see if the resulting matrix has any relationship to the embedding matrix (e.g. are the two matrices farther apart or closer together than would be expected by chance?). One possible hypothesis that this might let you test is whether the information about the input is being stored indirectly via what the model's guess is given that input or whether it's just being stored in parts of the embedding space that aren't very relevant to the output (if it's the latter, the linear model should put a lot of weight on basis elements that have very little weight in the embedding matrix).

Matt Botvinick on the spontaneous emergence of learning algorithms

Unfortunately, I also only have so much time, and I don't generally think that repeating myself regularly in AF/LW comments is a super great use of it.

Load More