Evan Hubinger


Risks from Learned Optimization


Towards Deconfusing Gradient Hacking

(Moderation note: added to the Alignment Forum from LessWrong.)

[AN #166]: Is it crazy to claim we're in the most important century?

Note though that it does not defuse all such uneasiness -- you can still look at how early we appear to be (given the billions of years of civilization that could remain in the future), and conclude that the simulation hypothesis is true, or that there is a Great Filter in our future that will drive us extinct with near-certainty. In such situations there would be no extraordinary impact to be had today by working on AI risk.

I don't think I agree with this—in particular, it seems like even given the simulation hypothesis, there could still be quite a lot of value to be had from influencing how that simulation goes. For example, if you think you're in an acausal trade simulation, succeeding in building aligned AI would have the effect of causing the simulation runner to trade with an aligned AI rather than a misaligned one, which could certainly have an “extraordinary impact.”

Meta learning to gradient hack

Interesting! I'd definitely be excited to know if you figure out what it's doing.

Meta learning to gradient hack

70 steps is not very many—does training converge if you train for longer (e.g. 700, 7000, 70000)?

Also, in addition to regularization making this strategy not very effective, I'd also suspect that hyperparameter tuning would break it as well—e.g. I'd be interested in what happens if you do black-box hyperparameter tuning on the base training process's hyperparameters after applying meta-learning (though then, to be fair to the meta-learning process, you'd also probably want to do the meta-learning in a setting with variable hyperparameters).
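As a toy sketch of what that black-box tuning might look like (everything here is hypothetical: a fixed, stand-in "meta-learned" initialization and a one-parameter quadratic base objective standing in for the real training process), random search over the base optimizer's learning rate could be:

```python
import random

def train(init_params, lr, steps=70):
    """Gradient-descent the toy base objective f(w) = w^2 from a fixed
    (hypothetically meta-learned) initialization; return the final loss."""
    w = init_params
    for _ in range(steps):
        grad = 2 * w          # d/dw of w^2
        w = w - lr * grad
    return w ** 2             # final base loss

def random_search(init_params, trials=50, seed=0):
    """Black-box hyperparameter tuning: sample learning rates
    log-uniformly and keep whichever gives the lowest final loss."""
    rng = random.Random(seed)
    best_lr, best_loss = None, float("inf")
    for _ in range(trials):
        lr = 10 ** rng.uniform(-4, 0)   # log-uniform in [1e-4, 1]
        loss = train(init_params, lr)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr, best_loss

best_lr, best_loss = random_search(init_params=5.0)
```

The point of the sketch is just that the tuner treats the whole base training run as a black box, so any gradient-hacking strategy that depends on the base hyperparameters staying fixed would have to survive this outer search as well.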

Meta learning to gradient hack

(Moderation note: added to the Alignment Forum from LessWrong.)

Selection Theorems: A Program For Understanding Agents

Thanks! I hope the post is helpful to you or anyone else trying to think about the type signatures of goals. It's definitely a topic I'm pretty interested in.

Selection Theorems: A Program For Understanding Agents

Have you seen Mark's and my “Agents Over Cartesian World Models”? Though it doesn't contain any Selection Theorems and focuses only on the type signatures of goals, it does go into a lot of detail about possible type signatures for agents' goals and what the implications of those type signatures would be, starting from the idea that a goal can be defined on any part of a Cartesian boundary.

AI safety via market making

Suppose the intended model is to predict H's estimate at convergence, and the actual model is predicting H's estimate at round N for some fixed N larger than any convergence time in the training set. Would you call this an "inner alignment failure", an "outer alignment failure", or something else (not an alignment failure)?

I would call that an inner alignment failure, since the model isn't optimizing for the actual loss function, but I agree that the distinction is murky. (I'm currently working on a new framework that I really wish I could reference here but isn't quite ready to be public yet.)

It seems that convergence on such questions might take orders of magnitude more time than what M was trained on. What do you think would actually happen if the humans asked their AI advisor to help with a decision like this? (What are some outcomes you think are plausible?)

That's a hard question to answer, and it really depends on how optimistic you are about generalization. If you just used current methods but scaled up, my guess is you would get deception and it would try to trick you. If we condition on it not being deceptive, I'd guess it would be pursuing some weird proxies rather than actually trying to report the human equilibrium after any number of steps. If we condition on it actually trying to report the human equilibrium after some number of steps, though, my guess is that the simplest way to do that isn't to have some finite cutoff, so it might instead do something like take an expectation over an exponentially distributed number of steps.
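To make that last possibility concrete, here is a minimal toy model (entirely hypothetical: the `toy` estimate curve and the rate parameter are illustrative stand-ins) of reporting an expectation of H's estimate where the number of deliberation steps is drawn from an exponential distribution rather than fixed at a finite cutoff:

```python
import random

def expected_estimate(estimate_at_step, rate=0.1, samples=10_000, seed=0):
    """Monte-Carlo expectation of H's estimate when the number of
    deliberation steps is exponentially distributed with the given rate
    (mean 1/rate steps), rather than cut off at some fixed step count."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        n = int(rng.expovariate(rate))   # sample a number of steps
        total += estimate_at_step(n)
    return total / samples

# Toy estimate curve that converges toward 1.0 as deliberation continues.
toy = lambda n: 1.0 - 0.5 ** n
avg = expected_estimate(toy)
```

Because the exponential distribution puts weight on arbitrarily many steps, the reported value tracks the convergent estimate without ever hard-coding a cutoff, which is the sense in which it might be "simpler" than any fixed-N model.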

What's your general thinking about this kind of AI risk (i.e., where an astronomical amount of potential value is lost because human-AI systems fail to make the right decisions in high-stakes situations brought about by the advent of transformative AI)? Is this something you worry about as an alignment researcher, or do you (for example) think it's orthogonal to alignment and should be studied in another branch of AI safety / AI risk?

Definitely seems worth thinking about and taking seriously. Some thoughts:

  • Ideally, I'd like to just avoid making any decisions that lead to lock-in while we're still figuring things out (e.g. wait to build anything like a sovereign for a long time). Of course, that might not be possible/realistic/etc.
  • Hopefully, this problem will just be solved as AI systems become more capable—e.g. if you have a way of turning any unaligned benchmark system into a new system that honestly/helpfully reports everything that the unaligned benchmark knows, then as the unaligned benchmark gets better, you should get better at making decisions with the honest/helpful system.

Pathways: Google's AGI

(Moderation note: added to the Alignment Forum from LessWrong.)

AI safety via market making

What would happen if we instead use convergence as the stopping condition (and throw out any questions that take more than some fixed or random threshold to converge)? Can we hope that M would be able to extrapolate what we want it to do, and predict H's reflective equilibrium even for questions that take longer to converge than what it was trained on?

This is definitely the stopping condition I'm imagining. As for what the model would actually do if, at deployment time, you give it a question that takes the human longer to converge on than any question it saw in training: that isn't something I can really answer, since it depends on a bunch of empirical facts about neural networks that we don't really know.

The closest we can probably get to answering these sorts of generalization questions now is just to liken the neural net prior to a simplicity prior, ask what the simplest model is that would fit the given training data, and then see if we can reason about what the simplest model's generalization behavior would be (e.g. the same sort of reasoning as in this post). Unfortunately, I think that sort of analysis generally suggests that most of these sorts of training setups would end up giving you a deceptive model, or at least not the intended model.

That being said, in practice, even if in theory you think you get the wrong thing, you might still be able to avoid that outcome if you do something like relaxed adversarial training to steer the training process in the desired direction via an overseer checking the model using transparency tools while you're training it.

Regardless, the point of this post, and of AI safety via market making in general, isn't that I think I have a solution to these sorts of inner-alignment-style tricky generalization problems. Rather, it's that I think AI safety via market making is a good/interesting outer-alignment-style target to push for, and that it has some nice properties (e.g. compatibility with per-step myopia) that potentially make inner alignment easier to achieve (though still quite difficult, as with all other proposals that I know of).

Now, if we just want to evaluate AI safety via market making's outer alignment, we can just suppose that somehow we do get a model that just produces the answer that H would at convergence, and ask whether that answer is good. And even then I'm not sure—I think that there's still the potential for debate-style bad equilibria where some bad/incorrect arguments are just more convincing to the human than any good/correct argument, even after the human is exposed to all possible counterarguments. I do think that the market-making equilibrium is probably better than the debate equilibrium, since it isn't limited to just two sides, but I don't believe that very strongly.

Mostly, for me, the point of AI safety via market making is just that it's another way to get a similar sort of result as AI safety via debate, but that it allows you to do it via a mechanism that's more compatible with myopia.
