Risks from Learned Optimization


Learning the prior and generalization

Yep; that's what I was imagining. It is also worth noting that it can be less safe to do that, though, since you're letting A(Z) see Y, which could bias it in some way that you don't want—I talk about that danger a bit in the context of approval-based amplification here and here.

Inner Alignment in Salt-Starved Rats

It also transfers in an obvious way to AGI programming, where it would correspond to something like an automated "interpretability" module that tries to make sense of the AGI's latent variables by correlating them with some other labeled properties of the AGI's inputs, and then rewarding the AGI for "thinking about the right things" (according to the interpretability module's output), which in turn helps turn those thoughts into the AGI's goals.

(Is this a good design idea that AGI programmers should adopt? I don't know, but I find it interesting, and at least worthy of further thought. I don't recall coming across this idea before in the context of inner alignment.)

Fwiw, I think this is basically a form of relaxed adversarial training, which is my favored solution for inner alignment.

AI safety via market making

Pretty sure debate can also access R if you make this strong of an assumption, i.e., assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n.

First, my full exploration of what's going on with different alignment proposals and complexity classes can be found here, so I'd recommend just checking that out rather than relying on the mini proof sketch I gave here.

Second, in terms of directly addressing what you're saying, I tried doing a proof by induction to get debate to RE and it doesn't work. The problem is that you can only get guarantees for trees that the human can judge, which means they have to be polynomial in length (though if you relax that assumption then you might be able to do better). Also, it's worth noting that the text that you're quoting isn't actually an assumption of the proof in any way—it's just the inductive hypothesis in a proof by induction.

I think the sort of claim that's actually useful is going to look more like 'we can guarantee that we'll get a reasonable training signal for problems in [some class]'

I think that is the same as what I'm proving, at least if you allow for “training signal” to mean “training signal in the limit of training on arbitrarily large amounts of data.” See my full post on complexity proofs for more detail on the setup I'm using.

Clarifying inner alignment terminology

Glad you liked it! I definitely mean mesa-optimizer to refer to something mechanistically implementing search. That being said, I'm not really sure whether humans count or not on that definition—I would probably say humans do count but are fairly non-central. In terms of the bag of heuristics model, I probably wouldn't count that, though it depends on what “bag of heuristics” means exactly—if the heuristics are being used to guide a planning process or something, then I would call that a mesa-optimizer.

Learning Normativity: A Research Agenda

I like this post a lot. I pretty strongly agree that process-level feedback (what I would probably call mechanistic incentives) is necessary for inner alignment—and I'm quite excited about understanding what sorts of learning mechanisms we should be looking for when we give process-level feedback (and recursive quantilization seems like an interesting option in that space).

Since detecting malign hypotheses is difficult, we want the learning system to help us out here. It should generalize from examples of malign hypotheses, and attempt to draw a broad boundary around malignancy. Allowing the system to judge itself in this way can of course lead to malign reinterpretations of user feedback, but hopefully allows for a basin of attraction in which benevolent generalizations can be learned.

Notably, one way to get this is to have the process feedback given by an overseer implemented as a human with access to a prior version of the model being overseen (and then train the model both on the oversight signal directly and to match the amplified human's behavior doing oversight), as in relaxed adversarial training.

Clarifying inner alignment terminology

I agree that what you're describing is a valid way of looking at what's going on—it's just not the way I think about it, since I find that it's not very helpful to think of a model as a subagent of gradient descent, as gradient descent really isn't itself an agent in a meaningful sense, nor do I think it can really be understood as “trying” to do anything in particular.

Clarifying inner alignment terminology

I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.

No—all data points that it could ever encounter is stronger than I need and harder to define, since it relies on a counterfactual. All I need is for the model to always output the optimal loss answer for every input that it's ever actually given at any point.

When you say "the optimal policy on the actual MDP that it experiences", is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the "actual MDP"? (This is a hard question, and I'd be happy if you handwave it as long as you do so explicitly. Although I do think that the fact that the world is not an MDP is an important and overlooked fact.)

Deployment, but I agree that this one gets tricky. I don't think that the fact that the world is non-stationary is a problem for conceptualizing it as an MDP, since whatever transitions occur can just be thought of as part of a more abstract state. That being said, modeling the world as an MDP does still have problems—for example, the original reward function might not really be well-defined over the whole world. In those sorts of situations, I do think it gets to the point where outer alignment starts breaking down as a concept.
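To make the "more abstract state" point concrete, here is a minimal Python sketch. It assumes a toy setting that is not from the discussion above: integer states and actions, and a hypothetical `drifting_step` rule whose behavior changes at time step 3. Folding the time index into the state gives a transition function that takes only (state, action), which is all an MDP requires. It does nothing about the separate worry that the reward function may not be well-defined over the whole world.

```python
from typing import Callable, List, Tuple

State = int
Action = int
AugState = Tuple[int, State]  # (time step, underlying state)

# A non-stationary transition rule: the next state depends on the time step t.
TimedStep = Callable[[int, State, Action], State]


def drifting_step(t: int, s: State, a: Action) -> State:
    """Hypothetical dynamics that change at t = 3: actions add before, subtract after."""
    return s + a if t < 3 else s - a


def make_stationary(step: TimedStep) -> Callable[[AugState, Action], AugState]:
    """Fold the time index into the state so the transition depends only on (state, action)."""
    def augmented_step(aug: AugState, a: Action) -> AugState:
        t, s = aug
        return (t + 1, step(t, s, a))
    return augmented_step


def rollout(actions: List[Action]) -> List[AugState]:
    """Run the augmented (stationary) transition from time 0, state 0."""
    step = make_stationary(drifting_step)
    aug: AugState = (0, 0)
    trajectory = [aug]
    for a in actions:
        aug = step(aug, a)
        trajectory.append(aug)
    return trajectory


if __name__ == "__main__":
    print(rollout([1, 1, 1, 1, 1]))
```

The same action produces different outcomes at different times, but only because the augmented state differs; the augmented transition rule itself never changes.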

Clarifying inner alignment terminology

Thanks! And good point—I added a clarifying footnote.

AGI safety from first principles: Conclusion

I just wanted to say that I think this sequence is by far my new favorite resource for laying out the full argument for AI risk and I expect to be linking new people to it quite a lot in the future. Reading it, it really felt to me like the full explanation of AI risk that I would have written if I'd spent a huge amount of time writing it all up carefully—which I'm now very glad that I don't have to do!

"Inner Alignment Failures" Which Are Actually Outer Alignment Failures

Yes—I agree with both (a) and (b). I just don't think that outer and inner alignment cover the full space of alignment problems. See this post I just published for more detail.
