I hear there’s a thing where people write a lot in November, so I’m going to try writing a blog post every day. Disclaimer: this post is less polished than my median. And my median post isn’t very polished to begin with.

Imagine a large corporation - we’ll call it BigCo. BigCo knows that quality management is high-value, so they have a special program to choose new managers. They run the candidates through a program involving lots of management exercises, simulations, and tests, and select those who perform best.

Of course, the exercises and simulations and tests are not a perfect proxy for the would-be managers’ real skills and habits. The rules can be gamed. Within a few years of starting the program, BigCo notices a drastic disconnect between performance in the program and performance in practice. The candidates who perform best in the program are those who game the rules, not those who manage well, so of course many candidates devote all their effort to gaming the rules.

How should this problem be solved?

Ancient Chinese scholars had a few competing schools of thought on this question, most notably the Confucianists and the Legalists. The (stylized) Confucianists’ answer was: the candidates should be virtuous and not abuse the rules. BigCo should demonstrate virtue and benevolence in general, and in return their workers should show loyalty and obedience. I’m not an expert, but as far as I can tell this is not a straw man - though stylized and adapted to a modern context, it accurately captures the spirit of Confucian thought.

The (stylized) Legalists instead took the position obvious to any student of modern economics: this is an incentive design problem, and BigCo leadership should design less abusable incentives.

If you have decent intuition for economics, it probably seems like the Legalist position is basically right and the Confucian position is Just Wrong. I don't want to discourage this intuition, but I expect that many people who have this intuition cannot fully spell out why the Confucian answer is Just Wrong, other than “it has no hope of working in practice”. After all, the whole thing is worded as a moral assertion - what people should do, how the problem should be solved. Surely the Confucian ideal of everyone working together in harmony is not wrong as an ideal? It may not be possible in practice, but that doesn’t mean we shouldn’t try to bring the world closer to the Confucian vision.

Now, there is room to argue with Confucianism on a purely moral front - everyone working together in harmony is not synonymous with everyone receiving what they deserve. Harmony does not imply justice. Also, there’s the issue of the system being vulnerable to small numbers of bad agents. These are fun arguments to have if you’re the sort of person who enjoys endless political/philosophical debates, but I bring them up to emphasize that they are NOT the arguments I’m going to talk about here.

The relevant argument here is not a moral claim, but a purely factual claim: the Confucian ideal would not actually solve the problem, even if it were fully implemented (i.e. zero bad actors). Even if BigCo senior management were virtuous and benevolent, and their workers were loyal and did not game the rules, the poor rules would still cause problems.

The key here is that the rules play more than one role. They act as:

  • Conscious incentives
  • Unconscious incentives
  • Selection rules

In the Confucian ideal, the workers all ignore the bad incentives provided by the rules, so conscious incentives are no longer an issue (as long as we’re pretending that the Confucian ideal is plausible in the first place). Unconscious incentives are harder to fight - when people are rewarded for X, they tend to do more X, regardless of whether they consciously intended to do so. But let’s assume a particularly strong form of Confucianism, where everyone fights hard against their unconscious biases.

That still leaves selection effects.

Even if everyone is ignoring the bad incentives, people are still different. Some people will naturally act in ways which play more to the loopholes and weaknesses in the rules, even if they don’t intend to do so. (And of course, if there’s even just a few bad actors, then they’ll definitely still abuse the rules.) And BigCo will disproportionately select those people as their new managers. It’s not necessarily maliciousness, it’s just Goodhart’s Law: make decisions based on a less-than-perfect proxy, and it will cease to be a good proxy.
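A toy simulation makes the selection effect concrete. This is only a sketch under assumptions of my own, not anything BigCo-specific: each candidate has an independent `skill` and a `loophole_fit` (how well their natural style happens to score on the program's blind spots), nobody changes their behavior in response to the rules, and the program score is simply the sum of the two.

```python
import random
import statistics

def simulate_selection(n_candidates=10_000, n_selected=100, seed=0):
    """Select top scorers on a proxy and compare them with the whole pool.

    Hypothetical model: score = skill + loophole_fit, both standard normal
    and independent. No candidate games the rules on purpose.
    """
    rng = random.Random(seed)
    candidates = [
        (rng.gauss(0, 1), rng.gauss(0, 1))  # (skill, loophole_fit)
        for _ in range(n_candidates)
    ]
    # BigCo only sees the proxy score, never its two components.
    ranked = sorted(candidates, key=lambda c: c[0] + c[1], reverse=True)
    selected = ranked[:n_selected]
    return {
        "pool_loophole_fit": statistics.mean(c[1] for c in candidates),
        "selected_loophole_fit": statistics.mean(c[1] for c in selected),
        "selected_skill": statistics.mean(c[0] for c in selected),
    }

result = simulate_selection()
print(result)
```

In this model the selected managers come out well above the pool average on `loophole_fit`, even though nobody intended to exploit anything. The selection rule alone does the work.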

Takeaway: even a particularly strong version of the Confucian ideal would not be sufficient to solve BigCo’s problem. Conversely, the Legalist answer - i.e. fixing the incentive structure - would be sufficient. Indeed, fixing the incentive structure seems not only sufficient but necessary; selection effects will perpetuate problems even if everyone is harmoniously working for the good of the collective.

Analogy to AI Alignment

The modern ML paradigm: we have a system that we train offline. During that training, we select parameters which perform well in simulations/tests/etc. Alas, some parts of the parameter space may abuse loopholes in the parameter-selection rules. In extreme cases, we might even see malicious inner optimizers: subagents smart enough to intentionally abuse loopholes in the parameter-selection rules.

How should we solve this problem?

One intuitive approach: find some way to either remove or align the inner optimizers. I’ll call this the “generalized Confucianist” approach. It’s essentially the Confucianist answer from earlier, with most of the moralizing stripped out. Most importantly, it makes the same mistake: it ignores selection effects.

Even if we set up a training process so that it does not create any inner optimizers, we’ll still be selecting for the same bad behaviors which a malicious inner optimizer would utilize.

The basic problem is that “optimization” is an internal property, not a behavioral property. A malicious optimizer might do some learning and reasoning to figure out that behavior X exploits a weakness in the parameter selection goal/algorithm. But some other parameters could just happen to perform behavior X “by accident”, without any malicious intent at all. The parameter selection goal/algorithm will be just as weak to this “accidental” abuse as to the “intentional” abuse of an inner optimizer.
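To illustrate the point that a selection rule only sees behavior, here is a minimal sketch. Everything in it is hypothetical: the behavior names, the flawed `selection_score`, and the two policies are made up for illustration. One policy searches for whatever the rule rewards most; the other contains no search at all and simply emits the same behavior.

```python
# Hypothetical behaviors a set of parameters might produce.
BEHAVIORS = ["intended", "X", "useless"]

def selection_score(behavior):
    """Hypothetical flawed selection rule: it rewards exploit X the most."""
    return {"intended": 1.0, "X": 2.0, "useless": 0.0}[behavior]

def inner_optimizer_policy():
    """Searches over behaviors for whatever the selection rule scores highest."""
    return max(BEHAVIORS, key=selection_score)

def accidental_policy():
    """No search and no model of the rule: it just happens to do X."""
    return "X"

policies = {
    "inner_optimizer": inner_optimizer_policy,
    "accidental": accidental_policy,
}
# The rule sees only the emitted behavior, not how it was produced.
scores = {name: selection_score(policy()) for name, policy in policies.items()}
print(scores)  # {'inner_optimizer': 2.0, 'accidental': 2.0}
```

Both policies get the same score, so a rule that accepts the first necessarily accepts the second.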

The equivalent of the Legalists’ solution to the problem would be to fix the parameter-selection rule: design a training goal and process which aren’t abusable, or at least aren’t abusable by anything in the parameter space. In alignment jargon: solve the outer alignment problem, and build a secure outer optimizer.

As with the Confucian solution to the BigCo problem, the generalized Confucianist solution is not sufficient for AI alignment. Even if we avoid creating misaligned inner optimizers, bad parameter-selection rules would still select for the same behavior that the inner optimizers would display. The only difference is that we’d select for parameters which behave badly “by accident”.

Conversely, the Legalist solution would be sufficient to solve the problem, and seems necessary if we want to keep the general framework of optimization.

The main takeaway I want to emphasize here is that making our outer objective “secure” against abuse is part of the outer alignment problem. This means outer alignment is a lot harder than I think a lot of people imagine. If our proxy for human values has loopholes which a hypothetical inner optimizer could exploit, then it’s a bad proxy. If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment. In general, outer alignment contains an implicit “for all” quantifier: for all possible parameter values, our training objective should give a high value only if those parameters would actually perform well in practice.
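One way to write the implicit quantifier down, using notation that is mine rather than standard: let $\Theta$ be the parameter space, $J_{\text{train}}$ the training objective, $V(\theta)$ the performance of parameters $\theta$ in actual deployment, and $\tau$, $v$ the thresholds for "high training score" and "actually performs well".

```latex
\forall \theta \in \Theta:\quad
J_{\text{train}}(\theta) \ge \tau \;\Longrightarrow\; V(\theta) \ge v
```

Any single $\theta$ with $J_{\text{train}}(\theta) \ge \tau$ but $V(\theta) < v$ is a counterexample, and it counts as one whether or not $\theta$ contains an inner optimizer.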

The flip side is that, since we probably need to build the Legalist solution anyway, the Confucian solution isn’t really necessary. We don’t necessarily need to make any special effort to avoid inner optimizers, because our selection criteria need to be secure against whatever shenanigans the inner optimizers could attempt anyway.

That said, I do think there are some good reasons to work on inner optimizers. The biggest is imperfect optimization. In this context: our outer optimizer is not going to check every single point in the parameter space, so the basin of attraction of any misaligned behavior matters. If we expect that malicious inner optimizers will take up a larger chunk of the parameter space than “accidental” bad behavior, then it makes sense to worry more about “intentional” than “accidental” malignness. At this point, we don’t really know how to tell how much of a parameter space is taken up by malicious agents, or any sort of inner agents; one example of this kind of problem is Paul’s question about whether minimal circuits are daemon-free.
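A back-of-the-envelope sketch of why volume matters under imperfect optimization. The assumptions here are strong and entirely mine: the outer optimizer is modeled as uniform random search (real training is not), and the two volume fractions are invented numbers chosen only to show the shape of the effect.

```python
def prob_search_hits(region_fraction, n_samples):
    """Chance that at least one of n uniform random samples lands in a region
    occupying `region_fraction` of parameter space."""
    return 1.0 - (1.0 - region_fraction) ** n_samples

# Hypothetical volumes; both regions score equally well in training.
intentional_fraction = 1e-3  # parameters containing a malicious inner optimizer
accidental_fraction = 1e-6   # parameters that misbehave "by accident"
n_samples = 1_000

p_intentional = prob_search_hits(intentional_fraction, n_samples)
p_accidental = prob_search_hits(accidental_fraction, n_samples)
print(f"intentional: {p_intentional:.3f}, accidental: {p_accidental:.6f}")
```

With these made-up numbers the search stumbles into the larger region far more often than the smaller one. As the number of samples grows, both probabilities approach 1 and the difference in volume stops mattering, which is the perfect-optimizer limit.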

Taking the analogy back to the BigCo problem: if it’s very rare for workers to accidentally game the rules, and most rule-gaming is intentional, then the Confucian solution makes a lot more sense.

I also expect some people will argue that malicious inner optimizers would be more dangerous than accidental bad behavior. I don’t think this argument quite works - in sufficiently-rich parameter spaces, I’d expect that there are non-agenty parameter combinations which exhibit the same behavior as any agenty combinations. Optimization is an internal property, not a behavioral property. But a slight modification of this argument seems plausible: more dangerous behaviors take up a larger fraction of the agenty chunks of parameter space than the non-agenty chunks. It’s not that misaligned inner optimizers are each individually more dangerous than their behaviorally-identical counterparts, it’s that misaligned optimizers are more dangerous on average. This would be a natural consequence to expect from instrumental convergence, for instance: a large chunk of agenty parameter space all converges to the same misbehavior. Again, this threat depends on imperfect optimization - if the optimizer is perfect, then "basin of attraction" doesn't matter.

Again taking the analogy back to the BigCo problem: if most accidental-rule-abusers only abuse the rules a little, but intentional-rule-abusers usually do it a lot, then the Confucian solution can help a lot.

Of course, even in the cases where the Confucian solution makes relatively more sense, it’s still just an imperfect patch; it still won’t fix “accidental” abuse of the rules. The Legalist approach is the full solution. The selection rules are the real problem here, and fixing the selection rules is the best possible solution.

20 comments

You are proposing "make the right rules" as the solution. Surely this is like solving the problem of how to write correct software by saying "make correct software"? The same approach could be applied to the Confucian approach by saying "make the values right". The same argument made against the Confucian approach can be made against the Legalist approach: the rules are never the real thing that is wanted, people will vary in how assiduously they are willing to follow one or the other, or to hack the rules entirely for their own benefit, then selection effects lever open wider and wider the difference between the rules, what was wanted, and what actually happens.

It doesn't work for HGIs (Human General Intelligences). Why will it work for AGIs?

BTW, I'm not a scholar of Chinese history, but historically it seems to me that Confucianism flourished as state religion because it preached submission to the Legalist state. Daoism found favour by preaching resignation to one's lot. Do what you're told and keep your head down.

You are proposing "make the right rules" as the solution. Surely this is like solving the problem of how to write correct software by saying "make correct software"?

I strongly endorse this objection, and it's the main way in which I think the OP is unpolished. I do think there's obviously still a substantive argument here, but I didn't take the time to carefully separate it out. The substantive part is roughly "if the system accepts an inner optimizer with bad behavior, then it's going to accept non-optimizers with the same bad behavior. Therefore, we shouldn't think of the problem as being about the inner optimizers. Rather, the problem is that we accept bad behavior - i.e. bad behavior is able to score highly."

It doesn't work for HGIs (Human General Intelligences). Why will it work for AGIs?

This opens up a whole different complicated question.

First, it's not clear that this analogy holds water at all. There are many kinds-of-things we can do to design AGI environments/incentives which don't have any even vaguely similar analogues in human mechanism design - we can choose the entire "ancestral environment", we can spin up copies at-will, we can simulate in hindsight (so there's never a situation where we won't know after-the-fact what the AI did), etc.

Second, in the cases where humans use bad incentive mechanisms, it's usually not because we can't design better mechanisms but because the people who choose the mechanism don't want a "better" one; voting mechanisms and the US government budget process are good examples.

All that said, I do still apply this analogy sometimes, and I think there's an extent to which it's right - namely, trying to align black-box AIs with opaque goals via clever mechanism design, without building a full theory of alignment and human values, will probably fail.

But I think a full theory of alignment and human values is likely tractable, which would obviously change the game entirely. It would still be true that "the rules are never the real thing that is wanted", but a full theory would at least let the rules improve in lock-step with capabilities - i.e. more predictive world models would directly lead to better estimates of human values. And I think the analogy would still hold: a full theory of alignment and human values should directly suggest new mechanism design techniques for human institutions.

That's the Legalist interpretation of Confucianism. Confucianism argues that the Legalists are just moving the problem one level up the stack a la public choice theory. The point of the Confucian is that the stack has to ground out somewhere, and asks the question of how to roll our virtue intuitions into the problem space explicitly since otherwise we are rolling them in tacitly and doing some hand waving.

Thanks, I was hoping someone more knowledgeable than I would leave a comment along these lines.

Planned summary for the Alignment Newsletter (note it's written quite differently from the post, and so I may have introduced errors, so please check more carefully than usual):

Suppose we trained our agent to behave well on some set of training tasks. <@Mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) suggests that we may still have a problem: the agent might perform poorly during deployment, because it ends up optimizing for some misaligned _mesa objective_ that only agrees with the base objective on the training distribution.

This post points out that this is not the only way systems can fail catastrophically during deployment: if the incentives were not designed appropriately, they may still select for agents that have learned heuristics that are not in our best interests, but nonetheless lead to acceptable performance during training. This can be true even if the agents are not explicitly “trying” to take advantage of the bad incentives, and thus can apply to agents that are not mesa optimizers.

I feel like the second paragraph doesn't quite capture the main idea, especially the first sentence. It's not just that mesa optimizers aren't the only way that a system with good training performance can fail in deployment - that much is trivial. It's that if the incentives reward misaligned mesa-optimizers, then they very likely also reward inner agents with essentially the same behavior as the misaligned mesa-optimizers but which are not "trying" to game the bad incentives.

The interesting takeaway is that the possibility of deceptive inner optimizers implies nearly-identical failures which don't involve any inner optimizers. It's not just "systems without inner optimizers can still be misaligned", it's "if we just get rid of the misaligned inner optimizers in a system which would otherwise have them, then that system can probably still stumble on parameters which result in essentially the same bad behavior". Thus the idea that the "real problem" is the incentives, not the inner optimizers.

Changed second paragraph to:

This post suggests that in any training setup in which mesa optimizers would normally be incentivized, it is not sufficient to just prevent mesa optimization from happening. The fact that mesa optimizers could have arisen means that the incentives were bad. If you somehow removed mesa optimizers from the search space, there would still be a selection pressure for agents that without any malicious intent end up using heuristics that exploit the bad incentives. As a result, we should focus on fixing the incentives, rather than on excluding mesa optimizers from the search space.

How does that sound?

Ok, that works.

Despite the fact that I commented on your previous post suggesting a different decomposition into "outer" and "inner" alignment, I strongly agree with the content of this post. I would just use different words to say it.

I think one issue which this post sort of dances around, and which maybe a lot of discussion of inner optimizers leaves implicit or unaddressed, is the difference between having a loss function which you can directly evaluate vs one which you must estimate via some sort of sample.

The argument in this post, that misbehavior by inner optimizers is necessarily behavioral and is therefore best addressed by behavioral loss functions, misses the point that these misbehaviors happen on examples we don't check. As such, it comes off as:

  • Perhaps arguing that we should check every example, or check much more thoroughly.
  • Perhaps arguing that the examples should be made more representative.

Now, I personally think that "distributional shift" is a misleading framing, because in learning in general (EG Solomonoff induction) we don't have an IID distribution (unlike in EG classification tasks), so we don't have a "distribution" to "shift".

But to the extent that we can talk in this framing, I'm kinda like... what are you saying here? Are you really proposing that we should just check instances more thoroughly or something like that?

Even if BigCo senior management were virtuous and benevolent, and their workers were loyal and did not game the rules, the poor rules would still cause problems.

If BigCo senior management were virtuous and benevolent, would they have poor rules?

That is to say, when I put my Confucian hat on, the whole system of selecting managers based on a proxy measure that's gameable feels too Legalist. [The actual answer to my question is "getting rid of poor rules would be a low priority, because the poor rules wouldn't impede righteous conduct, but they still would try to get rid of them."]

Like, if I had to point at the difference between the two, the difference is where they put the locus of value. The Confucian ruler is primarily focused on making the state good, and surrounding himself with people who are primarily focused on making the state good. The Legalist ruler is primarily focused on surviving and thriving, and so tries to set up systems that cause people who are primarily focused on surviving and thriving to do the right thing. The Confucian imagines that you can have a large shared value; the Legalist imagines that you will necessarily have many disconnected and contradictory values.

The difference between hiring for regular companies and EA orgs seems relevant. Often, applicants for regular companies want the job, and standard practice is to attempt to trick the company into hiring them, regardless of qualification. Often, applicants for EA orgs want the job if and only if they're the right person for the job; if I'm trying to prevent asteroids from hitting the Earth (or w/e) and someone else could do a better job of it than I could, I very much want to get out of their way and have them do it instead of me. As you mention in the post, this just means you get rid of the part of interviews where gaming is intentional, and significant difficulty remains. [Like, people will be honest about their weaknesses and try to be honest about their strengths, but accurately measuring those and fit with the existing team remains quite difficult.]

Now, where they're trying to put the locus of value doesn't mean their policy prescriptions are helpful. As I understand the Confucian focus on virtue in the leader, the main value is that it's really hard to have subordinates who are motivated by the common good if you yourself are selfish (both because they won't have your example and because the people who are motivated by the common good will find it difficult to be motivated by working for you).

But I find myself feeling some despair at the prospect of a purely Legalist approach to AI Alignment, because it feels like it is fighting against the AI at every step, instead of being able to recruit it to do some of the work for you, and without that last bit I'm not sure how you get extrapolation instead of interpolation. Like, you can trust the Confucian to do the right thing in novel territory, insofar as you gave them the right underlying principles, and the Confucian is operating at a philosophical level where you can give them concepts like corrigibility (where they not only want to accept correction from you, but also want to preserve their ability to accept correction from you, and preserve their preservation of that ability, and so on) and the map-territory distinction (where they want their sensors to be honest, because in order to have lots of strawberries they need their strawberry-counter to be accurate instead of inaccurate). In Legalism, the hope is that the overseer can stay a step ahead of their subordinate; in Confucianism, the hope is that everyone can be their own overseer.

[Of course, defense in depth is useful; it's good to both have trust in the philosophical competence of the system and have lots of unit tests and restrictions in case you or it are confused.]

To be clear, I am definitely not arguing for a pure mechanism-design approach to all of AI alignment. The argument in the OP is relevant to inner optimizers because we can't just directly choose which goals to program into them. We can directly choose which goals to program into an outer optimizer, and I definitely think that's the right way to go.

If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment.

Suppose you are making a self-driving car. The training environment is a videogame-like environment. The rendering is pretty good. A human looking at the footage would not easily be able to say it was obviously fake. An expert going over the footage in detail could spot subtle artefacts. The diffuse translucency on leaves in the background isn't quite right. When another car drives through a puddle, all the water drops are perfectly spherical, and travel on parabolic paths. Falling snow doesn't experience aerodynamic turbulence. Etc.

The point is that the behaviour you want is avoiding other cars and lamp posts. The simulation is close enough to reality that it is easy to match virtual lamp posts to real ones. However the training and testing environments have a different distribution.

Making the simulated environment absolutely pixel perfect would be very hard, and doesn't seem like it should be necessary. 

However, given even a slight variation between training and the real world, there exists an agent that will behave well in training, but cause problems in the real world. And also an agent that behaves fine in training and the real world. The set of possible behaviours is vast. You can't consider all of them. You can't even store a single arbitrary behaviour. Because you can't train on all possible situations, there will be behaviours that behave the same on all the training situations, but behave differently in other situations. You need some part of your design that favours some policies over others without training data. For example, you might want a policy that can be described as parameters in a particular neural net. You have to look at how this affects off-distribution actions.
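A minimal sketch of two behaviours that are indistinguishable on the training situations but differ elsewhere. The braking scenario, the distances, and both policies are hypothetical, chosen only to make the point concrete.

```python
# Hypothetical simulator coverage: the only obstacle distances ever tested.
TRAINING_DISTANCES = [0, 10, 20, 30, 40, 50]

def policy_a(distance):
    """Brake whenever the obstacle is closer than 25 units."""
    return "brake" if distance < 25 else "drive"

def policy_b(distance):
    """Copy policy_a on trained distances; never brake on anything else."""
    if distance in TRAINING_DISTANCES:
        return policy_a(distance)
    return "drive"

# No amount of checking on the training set can separate the two policies.
agree_in_training = all(policy_a(d) == policy_b(d) for d in TRAINING_DISTANCES)
unseen = 15
print(agree_in_training, policy_a(unseen), policy_b(unseen))  # True brake drive
```

Training data alone cannot prefer `policy_a` over `policy_b`; whatever does prefer it has to come from the rest of the design, such as the inductive bias of the architecture.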

The analogous situation with managers would be that the person being tested knows they are being tested. If you get them to display benevolent leadership, then you can't distinguish benevolent leaders from sociopaths who can act nice to pass the test.

Take an outer-aligned system, then add a 0 to each training input and a 1 to each deployment input. Wouldn't this add only malicious hypotheses that can be removed by inspection without any adverse selection effects?

After thinking about it for a couple minutes, this question is both more interesting and less trivial than it seemed. The answer is not obvious to me.

On the face of it, passing in a bit which is always constant in training should do basically nothing - the system has no reason to use a constant bit. But if the system becomes reflective (i.e. an inner optimizer shows up and figures out that it's in a training environment), then that bit could be used. In principle, this wouldn't necessarily be malicious - the bit could be used even by aligned inner optimizers, as data about the world just like any other data about the world. That doesn't seem likely with anything like current architectures, but maybe in some weird architecture which systematically produced aligned inner optimizers.

The hypotheses after the modification are supposed to have knowledge that they're in training, for example because they have enough compute to find themselves in the multiverse. Among hypotheses with equal behavior in training, we select the simpler one. We want this to be the one that disregards that knowledge. If the hypothesis has the form "Return whatever maximizes property _ of the multiverse", the simpler one uses that knowledge. It is this form of hypothesis which I suggest to remove by inspection.

Ok, that should work assuming something analogous to Paul's hypothesis about minimal circuits being daemon-free.

I have a weird relation with this post. On the one hand, I don't think the definition of outer alignment you're using is the right one (as I mentioned in comments on your previous post); on the other hand, I do agree with one of your main points, that we should look for a behavioral property instead of an internal structure property.

Perfect. I was on the fence about posting this one, but decided that it did a better job than the other of expressing the substantive argument in a way that would be obvious despite definitional disagreements.

(Though I actually don't think we should look for a behavioral property rather than a structural property; I think this whole thing is a bad way of framing the problem, and we shouldn't be doing open-ended searches in policy space at all. But if we are going to do an open-ended search in policy space at all, then yeah, behavioral over structural.)

Why is this? As I argued in learning normativity, I think there are some problems which we can more easily point out structurally. For example, Paul's proposal of relaxed adversarial training is one possible method (look for "pseudo-inputs" which lead to bad behavior, such as activations of some internal nodes which seem like plausible activation patterns, even if you don't know how to hit them with data).

The argument in the post seems to be "you can't incentivize virtue without incentivizing it behaviorally", but this seems untrue.