Occam's Razor May Be Sufficient to Infer the Preferences of Irrational Agents: A reply to Armstrong & Mindermann

Some objections:

The thing that you can't do is decompose behavior into planner and reward. If you just want to predict behavior, you can totally do that. Similarly, you can predict future events with physics.
You do need to do the decomposition to run counterfactuals. And indeed I buy the claim that if you literally try to find some input $I$ and some dynamics $D$ such that $D(I)$ is the world trajectory, selecting only by Kolmogorov complexity and accuracy at predicting data, you probably won't be able to use the resulting $D$ to run counterfactuals. Even ignoring the malign universal prior argument.
If it turns out you can run counterfactuals with $D$, I would strongly expect that to be because physics "actually" works by some simple $D$ that is "invariant" to the input state. In contrast, I would be astonished if humans "actually" have some reward $R$ in their head that they are trying to maximize, and that is what drives behavior.

I don't feel much better about the speed prior than the regular Solomonoff prior.

Thanks! I'm not sure I follow you. Here's what I think you are saying:

--Occam's Razor will be sufficient for predicting human behavior of course; it just isn't sufficient for finding the intended planner-reward pair. Because (A) the simplest way to predict human behavior has nothing to do with planners and rewards, and so (B) the simplest planner-reward pair will be degenerate or weird as A&M argue.

--You agree that this argument also works for Laws+Initial Conditions; Occam's Razor is generally insufficient, not just insufficient for inferring preferences of irrational agents!

--You think the argument is more likely to work for inferring preferences than for Laws+Initial Conditions though.

If this is what you are saying, then I agree with the second and third points but disagree with the first--or at least, I don't see any argument for it in A&M's paper. It may still be true, but further argument is needed. In particular their arguments for (A) are pretty weak, methinks--that's what my section "Objections to the arguments for step 2" is about.

Edit to clarify: By "I agree with the second point" I mean I agree that if the argument works at all, it probably works for Laws+Initial Conditions as well. I don't think the argument works though. But I do think that Occam's Razor is probably insufficient.

Rohin Shah (6y)

That's an accurate summary of what I'm saying.

at least, I don't see any argument for it in A&M's paper. It may still be true, but further argument is needed.

If you are picking randomly out of a set of N possibilities, the chance that you pick the "correct" one is 1/N. It seems like in any decomposition (whether planner/reward or initial conditions/dynamics), there will be N decompositions, with N >> 1, where I'd say "yeah, that probably has similar complexity as the correct one". The chance that the correct one is also the simplest one out of all of these seems basically like 1/N, which is ~0.

You could make an argument that we aren't actually choosing randomly, and correctness is basically identical to simplicity. I feel the pull of this argument in the limit of infinite data for laws of physics (but not for finite data), but it just seems flatly false for the reward/planner decomposition.

Daniel Kokotajlo (6y)

I feel like there's a big difference between "similar complexity" and "the same complexity." Like, if we have theory T and then we have theory T* which adds some simple unobtrusive twist to it, we get another theory which is of similar complexity... yet realistically an Occam's-Razor-driven search process is not going to settle on T*, because you only get T* by modifying T. And if I'm wrong about this then it seems like Occam's Razor is broken in general; in any domain there are going to be ways to turn T's into T*'s. But Occam's Razor is not broken in general (I feel).

Maybe this is the argument you anticipate above with "...we aren't actually choosing randomly." Occam's Razor isn't random. Again, I might agree with you that intuitively Occam's Razor seems more useful in physics than in preference-learning. But intuitions are not arguments, and anyhow they aren't arguments that appeared in the text of A&M's paper.

Stuart_Armstrong (6y)

Hey there!

Thanks for this critique; I have, obviously, a few comments ^_^

In no particular order:

First of all, the FHI channel has a video going over the main points of the argument (and of the research agenda); it may help to understand where I'm coming from: https://www.youtube.com/watch?v=1M9CvESSeVc
A useful point from that: given human theory of mind, the decomposition of human behaviour into preferences and rationality is simple; without that theory of mind, it is complex. Since it's hard for us to turn off our theory of mind, the decomposition will always feel simple to us. However, the human theory of mind suffers from Moravec's paradox: though the theory of mind seems simple to us, it is very hard to specify, especially into code.
You're entirely correct to decompose the argument into Step 1 and Step 2, and to point out that Step 1 has much stronger formal support than Step 2.
I'm not too worried about the degenerate pairs specifically; you can rule them all out with two bits of information. But, once you've done that, there will be other almost-as-degenerate pairs that fit with the new information. To rule them out, you need to add more information... but by the time you've added all of that, you've essentially defined the "proper" pair, by hand.
On speed priors: the standard argument applies for a speed prior, too (see Appendix A of our paper). It applies perfectly for the indifferent planner/zero reward, and applies, given an extra assumption, for the other two degenerate solutions.
Onto the physics analogy! First of all, I'm a bit puzzled by your claim that physicists don't know how to do this division. Now, we don't have a full theory of physics; however, all the physical theories I know of, have a very clear and known division between laws and initial conditions. So physicists do seem to know how to do this. And when we say that "it's very complex", this doesn't seem to mean the division into laws and initial conditions is complex, just that the initial conditions are complex (and maybe that the laws are not yet known).
The indifferent planner contains almost exactly the same amount of information as the policy. The "proper" pair, on the other hand, contains information such as whether the anchoring bias is a bias (it is) compared with whether paying more for better tasting chocolates is a bias (it isn't). Basically, none of the degenerate pairs contain any bias information at all; so everything to do with human biases is extra information that comes along with the "proper" pair.
Even ignoring all that, the fact that (p,R) is of comparable complexity to (-p,-R) shows that Occam's razor cannot distinguish the proper pair from its negative.
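A minimal sketch of the degenerate pairs discussed in this thread may help readers who have not seen the paper. Everything below is a toy illustration: the states, actions, and `observed` policy are made up, and the definition of the negated planner as "apply the planner to the negated reward" is an assumption about what (-p,-R) means here. The point it shows is only that several quite different planner/reward pairs reproduce the same policy.

```python
from typing import Callable, Dict, Tuple

State, Action = str, str
Reward = Dict[Tuple[State, Action], float]
Policy = Dict[State, Action]
Planner = Callable[[Reward], Policy]

STATES = ["s0", "s1", "s2"]
ACTIONS = ["left", "right"]

# Hypothetical observed human policy (toy data, not from the paper).
observed: Policy = {"s0": "left", "s1": "right", "s2": "left"}


def greedy(reward: Reward) -> Policy:
    """Pick the action with the highest immediate reward in each state."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def anti_greedy(reward: Reward) -> Policy:
    """Pick the action with the lowest immediate reward in each state."""
    return {s: min(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def indifferent(reward: Reward) -> Policy:
    """Ignore the reward entirely and return the observed policy."""
    return dict(observed)


def negate_reward(reward: Reward) -> Reward:
    return {key: -value for key, value in reward.items()}


def negate_planner(planner: Planner) -> Planner:
    """Assumed meaning of -p: run p on the negated reward."""
    return lambda reward: planner(negate_reward(reward))


zero_reward: Reward = {(s, a): 0.0 for s in STATES for a in ACTIONS}
# Reward of 1 for doing whatever the observed policy does, 0 otherwise.
policy_reward: Reward = {
    (s, a): 1.0 if observed[s] == a else 0.0 for s in STATES for a in ACTIONS
}

pairs = {
    "indifferent planner, zero reward": (indifferent, zero_reward),
    "greedy planner, policy reward": (greedy, policy_reward),
    "anti-greedy planner, negated policy reward": (
        anti_greedy, negate_reward(policy_reward)),
    "negated greedy planner, negated policy reward": (
        negate_planner(greedy), negate_reward(policy_reward)),
}

results = {name: planner(reward) for name, (planner, reward) in pairs.items()}
for name, policy in results.items():
    print(name, "->", policy == observed)
```

Every pair in `pairs` yields the same policy, so the policy alone cannot say which reward the agent "really" has. Whether a simplicity prior can break the tie is the question the thread is arguing about.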

Daniel Kokotajlo (6y)

And thanks for the reply!

FWIW, I like the research agenda. I just don't like the argument in the paper. :)

--Yes, without theory of mind the decomposition is complex. But is it more complex than the simplest way to construct the policy? Maybe, maybe not. For all you said in the paper, it could still be that the simplest way to construct the policy is via the intended pair, complex though it may be. (In my words: The Occam Sufficiency Hypothesis might still be true.)

--If the Occam Sufficiency Hypothesis is true, then not only do we not have to worry about the degenerate pairs, we don't have to worry about anything more complex than them either.

--I agree that your argument, if it works, applies to the speed prior too. I just don't think it works; I think Step 2 in particular might break for the speed prior, because the Speed!Occam Sufficiency Hypothesis might be true.

--If I ever said physicists don't know how to distinguish between laws and initial conditions, I didn't mean it. (Did I?) What I thought I said was that physicists haven't yet found a law+IC pair that can account for the data we've observed. Also that they are in fact using lots of other heuristics and assumptions in their methodology, they aren't just iterating through law+IC pairs and comparing the results to our data. So, in that regard the situation with physics is parallel to the situation with preferences/rationality.

--My point is that they are irrelevant to what is more complex than what. In particular, just because A has more information than B doesn't mean A is more complex than B. Example: The true Laws + Initial Conditions pair contains more information than E, the set of all events in the world. Why? Because from E you cannot conclude anything about counterfactuals, but from the true Laws+IC pair you can. Yet you can deduce E from the true Laws+IC pair. (Assume determinism for simplicity.) But it's not true that the true Laws+IC pair is more complex than E; the complexity of E is the length of the shortest way to generate it, and (let's assume) the true Laws+IC is the shortest way to generate E. So both have the same complexity.
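A toy sketch of the point about counterfactuals, under heavy simplifying assumptions: the "world" is a single integer, the "laws" are hypothetical update rules named `law_a` and `law_b`, and E is the list of states visited. It shows two Laws+IC pairs that generate the same E but disagree about what would have happened from a different starting state.

```python
from typing import Callable, List


def run(law: Callable[[int], int], initial: int, steps: int) -> List[int]:
    """Deterministically unroll a law from an initial condition."""
    state = initial
    trajectory = [state]
    for _ in range(steps):
        state = law(state)
        trajectory.append(state)
    return trajectory


def law_a(x: int) -> int:
    """Hypothetical law: always step up by one."""
    return x + 1


def law_b(x: int) -> int:
    """Hypothetical law: agrees with law_a on x >= 0, differs below zero."""
    return x + 1 if x >= 0 else x - 1


# The actual events E, generated from the actual initial condition 0.
events = run(law_a, 0, 5)
same_events = run(law_b, 0, 5)

# Counterfactual: what if the initial condition had been -3?
counterfactual_a = run(law_a, -3, 2)
counterfactual_b = run(law_b, -3, 2)

print(events == same_events)                 # both laws fit the data
print(counterfactual_a == counterfactual_b)  # but they disagree off the data
```

In this toy case E by itself does not pick out one law, which is the sense in which a Laws+IC pair carries counterfactual information that E does not. It says nothing about which pair is simpler.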

I realize I may be confused here about how complexity or information works; please correct me if so!

But anyhow if I'm right about this then I am skeptical of conclusions drawn from information to complexity... I'd like to see the argument made more explicit and broken down more at least.

For example, the "proper" pair contains all this information about what's a bias and what isn't, because our definition of bias references the planner/reward distinction. But isn't that unfair? Example: We can write 99999999999999999999999 or we can write "20-digits of 9's." The latter is shorter, but it contains more information if we cheat and say it tells us things like "how to spell the word that refers to the parts of a written number."

Anyhow don't the degenerate pairs also contain information about biases--for example, according to the policy-planner+empty-reward pair, nothing is a bias, because nothing would systematically lead to more reward than what is already being done?

--If it were true that Occam's Razor can't distinguish between P,R and -P,-R, then... isn't that a pretty general argument against Occam's Razor, not just in this domain but in other domains too?

Stuart_Armstrong (6y)

Hey there!

Responding to a few points. But first, I want to make the point that treating an agent as a (p,R) pair is basically an intentional stance. We choose to treat the agent that way, either for ease of predicting its actions (Dennett's approach) or for extracting its preferences, to satisfy them (my approach). The decomposition is not a natural fact about the world.

--If I ever said physicists don't know how to distinguish between laws and initial conditions, I didn't mean it. (Did I?) What I thought I said was that physicists haven't yet found a law+IC pair that can account for the data we've observed. Also that they are in fact using lots of other heuristics and assumptions in their methodology, they aren't just iterating through law+IC pairs and comparing the results to our data. So, in that regard the situation with physics is parallel to the situation with preferences/rationality.

No, the situation is very different. Physicists are trying to model and predict what is happening in the world (and in counterfactual worlds). This is equivalent with trying to figure out the human policy (which can be predicted from observations, as long as you include counterfactual ones). The decomposition of the policy into preferences and rationality is a separate step, very unlike what physicists are doing (quick way to check this: if physicists were unboundedly rational with infinite data, they could solve their problem; whereas we couldn't, we'd still have to make decisions).

(if you want to talk about situations where we know some things but not all about the human policy, then the treatment is more complex, but ultimately the same arguments apply).

--My point is that they are irrelevant to what is more complex than what. In particular, just because A has more information than B doesn't mean A is more complex than B. Example: The true Laws + Initial Conditions pair contains more information than E, the set of all events in the world. Why? Because from E you cannot conclude anything about counterfactuals, but from the true Laws+IC pair you can. Yet you can deduce E from the true Laws+IC pair. (Assume determinism for simplicity.) But it's not true that the true Laws+IC pair is more complex than E; the complexity of E is the length of the shortest way to generate it, and (let's assume) the true Laws+IC is the shortest way to generate E. So both have the same complexity.

Well, it depends. Suppose there are multiple TL (true laws) + IC that could generate E. In that case, TL+IC has more complexity than E, since you need to choose among the possible options. But if there is only one feasible TL+IC that generates E, then you can work backwards from E to get that TL+IC, and now you have all the counterfactual info, from E, as well.

For example, the "proper" pair contains all this information about what's a bias and what isn't, because our definition of bias references the planner/reward distinction. But isn't that unfair? Example: We can write 99999999999999999999999 or we can write "20-digits of 9's." The latter is shorter, but it contains more information if we cheat and say it tells us things like "how to spell the word that refers to the parts of a written number."

That argument shows that if you look into the algorithm, you can get other differences. But I'm not looking into the algorithm; I'm just using the decomposition into (p, R), and playing around with the p and R pieces, without looking inside.

Anyhow don't the degenerate pairs also contain information about biases--for example, according to the policy-planner+empty-reward pair, nothing is a bias, because nothing would systematically lead to more reward than what is already being done?

Among the degenerate pairs, the one with the indifferent planner has a bias of zero, the greedy planner has a bias of zero, and the anti-greedy planner has a bias of -1 at every timestep. So they do define bias functions, but particularly simple ones. Nothing like the complexity of the biases generated by the "proper" pair.

The relevance of information for complexity is this: given reasonable assumptions, the human policy is simpler than all pairs, and the three degenerate pairs are almost as simple as the policy. However, the "proper" pair can generate a complicated object, the bias function (which has a non-trivial value in almost every possible state). So the proper pair contains at least enough information to specify a) the human policy, and b) the bias function. The Kolmogorov complexity of the proper pair is thus at least that of the simplest algorithm that can generate both those objects.

So one of two things is happening: either the human policy can generate the bias function directly, in some simple way^[1], or the proper pair is more complicated than the policy. The first is not impossible, but notice that it has to be "simple". So the fact that we have not yet found a way to generate the bias function from the policy is an argument that it can't be done. Certainly there are no elementary mathematical manipulations of the policy that produce anything suitable.

--If it were true that Occam's Razor can't distinguish between P,R and -P,-R, then... isn't that a pretty general argument against Occam's Razor, not just in this domain but in other domains too?

No, because Occam's razor works in other domains. This is a strong illustration that this domain is actually different.

[1] Let A be the simplest algorithm that generates the human policy, and B the simplest that generates the human policy and the bias function. If there are n different algorithms that generate the human policy and are of length |B| or shorter, then we need to add log2(n) bits of information to the human policy to generate B, and hence, the bias function. So if B is close in complexity to A, we don't need to add much.
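The counting step in this footnote can be written out as a rough bound. This is a sketch of one reading of the footnote, not a formal result: $K(\cdot \mid \cdot)$ is conditional Kolmogorov complexity, $\pi$ stands for the human policy, and the bound ignores additive constants and the cost of enumerating the candidate algorithms.

```latex
% n = number of algorithms of length at most |B| that output the policy \pi.
% Picking B out of that list costs an index of roughly \log_2 n bits:
K(B \mid \pi) \;\lesssim\; \log_2 n

% Worked arithmetic:
%   n = 2^{10} = 1024  \;\Rightarrow\;  \log_2 n = 10   \text{ bits}
%   n = 2^{1000}       \;\Rightarrow\;  \log_2 n = 1000 \text{ bits}
```

On this reading, "simple" means that n is small enough for the index to be short. Whether that holds for the human policy is what the two commenters disagree about.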

Daniel Kokotajlo (6y)

Thanks again! I still disagree, surprise surprise.

I think I agree with you that the (p,R) decomposition is not a natural fact about the world, but I'm not so sure. Anyhow I don't think it matters for our purposes.

No, the situation is very different. Physicists are trying to model and predict what is happening in the world (and in counterfactual worlds). This is equivalent with trying to figure out the human policy (which can be predicted from observations, as long as you include counterfactual ones). The decomposition of the policy into preferences and rationality is a separate step, very unlike what physicists are doing (quick way to check this: if physicists were unboundedly rational with infinite data, they could solve their problem; whereas we couldn't, we'd still have to make decisions).

(if you want to talk about situations where we know some things but not all about the human policy, then the treatment is more complex, but ultimately the same arguments apply).

Physicists are trying to do many things. Yes, one thing they are trying to do is predict what is happening in the world. But another thing they are trying to do is figure out stuff about counterfactuals, and for that they need to have a Laws+IC decomposition to work with. So they take their data and they look for a simple Laws+IC decomposition that fits it. They would still do this even if they already knew the results of all the experiments ever, and had no more need to predict things. (Extending the symmetry, humans also typically use the intentional stance on incomplete data about a target human's policy, for the purpose of predicting the rest of the policy. But this isn't what you concern yourself with; you assume for the sake of argument that we already have the whole policy and point out that we'd still want to use the intentional stance to get a decomposition so that we could make judgments about rationality. I say yes, true, now apply the same reasoning to physics: assume for the sake of argument that we already know everything that will happen, all the events, and notice that we'd still want to have a Laws+IC decomposition, perhaps to figure out counterfactuals.)

Well, it depends. Suppose there are multiple TL (true laws) + IC that could generate E. In that case, TL+IC has more complexity than E, since you need to choose among the possible options. But if there is only one feasible TL+IC that generates E, then you can work backwards from E to get that TL+IC, and now you have all the counterfactual info, from E, as well.

I was assuming there were multiple Law+IC pairs that would generate E... well actually no, the example degenerate pairs I gave prove that there are, no need to assume it!

That argument shows that if you look into the algorithm, you can get other differences. But I'm not looking into the algorithm; I'm just using the decomposition into (p, R), and playing around with the p and R pieces, without looking inside.

I don't see the difference between what you are doing and what I did. You started with a policy and said "But what about bias-facts? The policy by itself doesn't tell us these facts. So let's look at the various decompositions of the policy into p,R pairs; they tell us the bias facts." I start with a number and say "But what about how-to-spell-the-word-that-refers-to-the-parts-of-a-written-number facts? The number doesn't tell us that. Let's look at the various decompositions of the number into strings of symbols that represent it; they tell us those facts."

Among the degenerate pairs, the one with the indifferent planner has a bias of zero, the greedy planner has a bias of zero, and the anti-greedy planner has a bias of -1 at every timestep. So they do define bias functions, but particularly simple ones. Nothing like the complexity of the biases generated by the "proper" pair.

Thanks for the clarification--that's what I suspected. So then every p,R pair compatible with the policy contains more information than the policy. Thus even the simplest p,R pair compatible with the policy contains more information than the policy. By analogous reasoning, every algorithm for constructing the policy contains more information than the policy. So even the simplest algorithm for constructing the policy contains more information than the policy. So (by your reasoning) even the simplest algorithm for constructing the policy is more complex than the policy. But this isn't so; the simplest algorithm for constructing the policy is length L and so has complexity L, and the policy has complexity L too... That's my argument at least. Again, maybe I'm misunderstanding how complexity works. But now that I've laid it out step-by-step, which step do you disagree with?

The relevance of information for complexity is this: given reasonable assumptions, the human policy is simpler than all pairs, ...

Wait what? This is what I was objecting to in the original post. The "Occam Sufficiency Hypothesis" is that the human policy is not simpler than all pairs; in particular, it is precisely the simplicity of the intended pair, because the intended pair is the simplest way to construct the policy.

What are the reasonable assumptions that lead to the OSH being false?

My objection to your paper, in a nutshell, was that you didn't discuss this part--you didn't give any reason to think OSH was false. The three reasons you gave in Step 2 were reasons to think the intended pair is complex, not reasons to think it is more complex than the policy. Or so I argued.

--If it were true that Occam's Razor can't distinguish between P,R and -P,-R, then... isn't that a pretty general argument against Occam's Razor, not just in this domain but in other domains too?

No, because Occam's razor works in other domains. This is a strong illustration that this domain is actually different.

My argument is that if you are right, Occam's Razor would be generally useless, but it's not, so you are wrong. In more detail: If Occam's Razor can't distinguish between P,R and -P,-R, then (by analogy) in an arbitrary domain it won't be able to distinguish between theory X and theory b(X), where b() is some simple bizarro function that negates or inverts the parts of X in such a way as to make the changes cancel out.

Stuart_Armstrong (6y)

I'm not sure the physics analogy is getting us very far - I feel there is a very natural way of decomposing physics into laws+initial conditions, while there is no such natural way of doing so for preferences and rationality. But if we have different intuitions on that, then discussing the analogy isn't going to help us converge!

So then every p,R pair compatible with the policy contains more information than the policy. Thus even the simplest p,R pair compatible with the policy contains more information than the policy.

Agreed (though the extra information may be tiny - a few extra symbols).

By analogous reasoning, every algorithm for constructing the policy contains more information than the policy.

That does not follow; the simplest algorithm for building a policy does not go via decomposing into two pieces and then recombining them. We are comparing algorithms that produce a planner-reward pair (two outputs) with algorithms that produce a policy (one output). (but your whole argument shows you may be slightly misunderstanding complexity in this context).

Now, though all pairs are slightly more complex than the policy itself, the bias argument shows that the "proper" pair is considerably more complex. To use an analogy: suppose file1 and file2 are both maximally zipped files. When you unzip file1, you produce image1 (and maybe a small, blank, image2). When you unzip file2, you also produce the same image1, and a large, complex, image2'. Then, as long as image1 and image2' are at least slightly independent, file2 has to be larger than file1. The more complex image2' is, and the more independent it is from image1, the larger file2 has to be.
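The zip analogy can be tried out directly, with the caveat that a real compressor is only a crude stand-in for Kolmogorov complexity, which is uncomputable. In this sketch the "images" are just seeded pseudo-random byte strings, and all names (`image1`, `blank_image2`, `complex_image2`) are illustrative.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable


def noise(n: int) -> bytes:
    """Return n pseudo-random bytes (effectively incompressible for zlib)."""
    return bytes(rng.getrandbits(8) for _ in range(n))


image1 = noise(4096)          # stands in for the policy
blank_image2 = bytes(64)      # small blank second image (all zero bytes)
complex_image2 = noise(4096)  # large second image, independent of image1

# file1 unzips to image1 plus a small blank image2.
file1 = zlib.compress(image1 + blank_image2, 9)
# file2 unzips to the same image1 plus a complex, independent image2'.
file2 = zlib.compress(image1 + complex_image2, 9)
# Contrast case: a second image that is fully determined by image1.
dependent = zlib.compress(image1 + image1, 9)

print(len(file1), len(file2), len(dependent))
```

Here `file2` comes out larger than `file1`, while the fully dependent case stays much smaller than `file2`. That matches the two halves of the analogy: extra independent content costs extra bits, and content derivable from `image1` costs few. It does not settle whether the bias function is independent of the policy in the relevant sense.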

Does that make sense?

Daniel Kokotajlo (6y)

I agree that the decomposition of physics into laws+IC is much simpler than the decomposition of a human policy into p,R. (Is that what you mean by "more natural?") But this is not relevant to my argument, I think.

I feel that our conversation now has branched into too many branches, some of which have been abandoned. In the interest of re-focusing the conversation, I'm going to answer the questions you asked and then ask a few new ones of my own.

To your questions: For me to understand your argument better I'd like to know more about what the pieces represent. Is file1 the degenerate pair and file2 the intended pair, and image1 the policy and image2 the bias-facts? Then what is the "unzip" function? Pairs don't unzip to anything. You can apply the function "apply the first element of the pair to the second" or you can apply the function "do that, and then apply the MAXIMIZE function to the second element of the pair and compute the difference." Or there are infinitely many other things you can do with the pair. But the pair itself doesn't tell you what to do with it, unlike a zipped file which is like an algorithm--it tells you "run me."

I have two questions. 1. My central claim--which I still uphold as not-ruled-out-by-your-arguments (though of course I don't actually believe it) is the Occam Sufficiency Hypothesis: "The 'intended' pair is the simplest way to generate the policy." So, basically, what OSH says is that within each degenerate pair is a term, pi (the policy), and when you crack open that term and see what it is made of, you see p(R), the intended policy applied to the intended reward function! Thus, a simplicity-based search will stumble across <p,R> before it stumbles across any of the degenerate pairs, because it needs p and R to construct the degenerate pairs. What part of this do you object to?

2. Earlier you said "given reasonable assumptions, the human policy is simpler than all pairs." What are those assumptions?

Once again, thanks for taking the time to engage with me on this! Sorry it took me so long to reply, I got busy with family stuff.

Stuart_Armstrong (6y)

Is file1 the degenerate pair and file2 the intended pair, and image1 the policy and image2 the bias-facts?

Yes.

Then what is the "unzip" function?

The "shortest algorithm generating BLAH" is the maximally compressed way of expressing BLAH - the "zipped" version of BLAH.

Ignoring unzip, which isn't very relevant, we know that the degenerate pairs are just above the policy in complexity.

So zip(degenerate pair) $\approx$ zip(policy), while zip(reasonable pair) > zip(policy+complex bias facts) (and zip(policy+complex bias facts) > zip(policy)).
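For readers who prefer symbols, here is one way to restate these inequalities, reading zip(X) as the Kolmogorov complexity $K(X)$. The symbols $\pi$, $(p_d, R_d)$, $(p^*, R^*)$ and $b$ are introduced here for illustration and do not come from the comment, and the relations are written only up to additive constants.

```latex
% \pi: human policy;  (p_d, R_d): a degenerate pair;
% (p^*, R^*): the "reasonable" pair;  b: the bias function.
K(p_d, R_d) \;\approx\; K(\pi)

% The reasonable pair determines both the policy and the bias function:
K(p^*, R^*) \;\ge\; K(\pi, b) - O(1)

% Describing two objects is never cheaper than describing one of them:
K(\pi, b) \;\ge\; K(\pi) - O(1)
```

The last line is only a weak inequality. The claim in the comment is that the gap is large because $b$ is complex and not cheaply derivable from $\pi$, whereas the Occam Sufficiency Hypothesis discussed in this thread would have $K(p^*, R^*) \approx K(\pi)$.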

Does that help?

Daniel Kokotajlo (6y)

It helps me to understand your argument more clearly. I still disagree with it though. I object to this:

zip(reasonable pair) > zip(policy+complex bias facts)

I claim this begs the question against OSH. If OSH is true, then zip(reasonable pair) ≈ zip(policy).

Stuart_Armstrong (6y)

Indeed. It might be possible to construct that complex bias function, from the policy, in a simple way. But that claim needs to be supported, and the fact that it hasn't been found so far (I repeat that it has to be simple) is evidence against it.

romeostevensit (6y)

This is neat. It makes me realize that thinking in terms of simplicity and complexity priors was serving somewhat as a semantic stop sign for me whereas speed prior vs slow prior doesn't.
