ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Reward is not the optimization target

That isn't the main point I had in mind. See my comment to Chris here.

Left a comment.

Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?

Nope, just wanted to draw your attention to another instance of alignment researchers already understanding this point.

Also, I want to be clear that I like this post a lot and I'm glad you wrote it—I think it's good to explain this sort of thing more, especially in different ways that are likely to click for different people. I just think your specific claim that most alignment researchers don't understand this already is false.

Reward is not the optimization target

Reward functions often are structured as objectives, which is why we talk about them that way. In most situations, if you had access to e.g. AIXI, you could directly build a “reward maximizer.”

I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.

evhub's Shortform
  • A deceptive model doesn't have to have some sort of very explicit check for whether it's in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it's in the jungle instead of a factory. If it someday found itself in a situation very different from its current one (training), it would reconsider its actions, but it doesn't think about this very often because, during training, such situations just look too unlikely.
Reward is not the optimization target

I do in fact think that few people actually already deeply internalized the points I'm making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.

“Risks from Learned Optimization in Advanced Machine Learning Systems,” which we published three years ago and started writing four years ago, is extremely explicit that we don't know how to get an agent that is actually optimizing for a specified reward function. The alignment research community has been heavily engaging with this idea since then. Though I agree that many alignment researchers used to be making this mistake, I think it's extremely clear that by this point most serious alignment researchers understand the distinction.

I have relatively little idea how to "improve" a reward function so that it improves the inner cognition chiseled into the policy, because I don't know the mapping from outer reward schedules to inner cognition within the agent. Does an "amplified" reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it?

This is precisely the point I make in “How do we become confident in the safety of a machine learning system?”, btw.

Principles of Privacy for Alignment Research

My usual take here is the “Nobody Cares” model, though there is one scenario I tend to worry about a bit that you didn't address: how to think about whether or not you want your writing ending up in the training data for a future AI system. That's a scenario where the “Nobody Cares” model really doesn't apply, since the AI actually does have time to look at everything you write.

That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important. However, it can also help AI systems do things like better understand how to be deceptive, so this sort of thing can be a bit tricky.

Circumventing interpretability: How to defeat mind-readers

Using interpretability tools in the loss function incentivises a startling number of passive circumvention methods.

So why consider doing this at all? There are two cases where it might be worthwhile:

  • If we’re absolutely 100% provably definitely sure that our interpretability methods cover every representational detail that the AI might be using, then it would be extremely hard or impossible to Goodhart interpretability tools in the loss function. But, for safety, we should probably assume that our tools are not able to interpret everything, making passive Goodharting quite likely.

To be confident including interpretability tools in your loss function, you don't have to believe that there is no way those tools could possibly be Goodharted, you just have to believe that there is no way for gradient descent to find a way of Goodharting them. And the latter condition is far more achievable, since gradient descent is a process we can try to understand and interpret the same way we can try to understand and interpret a trained model. In particular, as I discuss here, if you could understand the process of gradient descent well enough to know why it's proposing particular modifications to your model, you could use that understanding to be able to determine whether it was responding to your interpretability loss in the right way or not.
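The structural point here, that gradient descent itself is the process responding to an interpretability term in the loss, can be sketched with a toy scalar example. Everything below is a made-up stand-in (a quadratic task loss and a fabricated "opacity" penalty), not a real interpretability metric; it only shows the mechanism by which gradient descent trades off task performance against the added term:

```python
# Toy sketch of "interpretability tools in the loss function": gradient
# descent optimizes the sum of a task loss and a transparency penalty,
# so it is directly incentivized to respond to (and potentially game)
# the penalty. Both terms are hypothetical stand-ins.

def task_loss_grad(w):
    return 2 * (w - 3.0)   # gradient of the task loss (w - 3)^2

def opacity_grad(w):
    return 2 * w           # gradient of a made-up "opacity" score w^2

lam = 0.5                  # weight on the interpretability term
w, lr = 0.0, 0.1
for _ in range(200):
    w -= lr * (task_loss_grad(w) + lam * opacity_grad(w))

# Gradient descent settles where the two pressures balance: w = 3 / (1 + lam)
print(round(w, 4))  # -> 2.0
```

The point of the sketch is that the final parameters are determined by how gradient descent responds to the penalty term, which is why understanding why gradient descent proposes particular modifications matters more than whether the tool is unGoodhartable in the abstract.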

RL with KL penalties is better seen as Bayesian inference

I think I broadly disagree with this—using RLHF in a way that actually corresponds to some conditional of the original policy seems very important to me.

That said, it seems to me that in practice you often should be trying to converge to a zero entropy policy. The optimal action is not random; we want the model to be picking the output that looks best in light of its beliefs.

I think you probably expect that we'll eventually be able to give good feedback even for quite complex tasks like “actually say the truth.” But if you don't expect that to ever succeed (as I don't), then staying close to existing data seems like a pretty important tool to prevent yourself from Goodharting on bad feedback.

It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that.

What if what you actually care about is predicting the world well? In that case, staying close to existing data actually seems like a very principled thing to do. In particular, the KL penalty here lets you interpret the feedback as just extracting a particular conditioned distribution, which is precisely the thing I think you want to do with such a predictor.
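The claim that the KL penalty lets you read the feedback as conditioning has a clean closed form: the policy maximizing E[r] − β·KL(π‖π₀) is π*(x) ∝ π₀(x)·exp(r(x)/β), and for binary feedback this approaches the base distribution conditioned on the feedback being positive. A minimal sketch over a toy discrete output space (all probabilities and rewards are made-up numbers):

```python
import math

# Hypothetical base policy pi_0 over three possible outputs.
prior = {"a": 0.5, "b": 0.3, "c": 0.2}
# Hypothetical binary feedback: outputs b and c are "good".
reward = {"a": 0.0, "b": 1.0, "c": 1.0}
beta = 1.0  # KL penalty coefficient

# Optimal policy for reward - beta * KL(pi || pi_0):
#   pi*(x) proportional to pi_0(x) * exp(reward(x) / beta)
unnorm = {x: p * math.exp(reward[x] / beta) for x, p in prior.items()}
Z = sum(unnorm.values())
posterior = {x: w / Z for x, w in unnorm.items()}
```

Note that b and c, which received equal reward, keep the prior's relative probabilities (0.3 : 0.2), which is exactly what conditioning the base distribution on "reward = 1" would give; as beta shrinks, mass on a vanishes and the policy converges to that conditional.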

On how various plans miss the hard bits of the alignment challenge

But maybe I just don't understand this proposal yet (and I have had some trouble distilling things I recognize as plans out of Evan's writing, so far).

Maybe this and this will help.

evhub's Shortform
  • Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won't be), you get at most one extra bit of optimization towards alignment.
  • Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.
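The "one extra bit" arithmetic in the ensembling bullet can be spelled out: selecting the best of n independent draws is at most log₂(n) bits of selection pressure, so doubling compute to get a second draw buys at most one bit, even in the idealized independent case. A toy calculation (the per-run success probability is a made-up number):

```python
import math

p = 0.25  # hypothetical chance that one independent training run is aligned

for n in (1, 2, 4):
    bits = math.log2(n)           # selection pressure from best-of-n
    p_any = 1 - (1 - p) ** n      # chance at least one of n draws succeeds
    print(f"n={n}: {bits:.0f} bit(s), P(at least one success) = {p_any:.4f}")
```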
evhub's Shortform

It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action, and so would give credit only to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against that for advanced AI systems.
