Some objections:
I don't feel much better about the speed prior than the regular Solomonoff prior.
Thanks! I'm not sure I follow you. Here's what I think you are saying:
--Occam's Razor will be sufficient for predicting human behavior of course; it just isn't sufficient for finding the intended planner-reward pair. Because (A) the simplest way to predict human behavior has nothing to do with planners and rewards, and so (B) the simplest planner-reward pair will be degenerate or weird as A&M argue.
--You agree that this argument also works for Laws+Initial Conditions; Occam's Razor is generally insufficient, not just insufficient for inferring preferences of irrational agents!
--You think the argument is more likely to work for inferring preferences than for Laws+Initial Conditions though.
If this is what you are saying, then I agree with the second and third points but disagree with the first--or at least, I don't see any argument for it in A&M's paper. It may still be true, but further argument is needed. In particular their arguments for (A) are pretty weak, methinks--that's what my section "Objections to the arguments for step 2" is about.
Edit to clarify: By "I agree with the second point" I mean I agree that if the argument works at all, it probably works for Laws+Initial Conditions as well. I don't think the argument works though. But I do think that Occam's Razor is probably insufficient.
That's an accurate summary of what I'm saying.
at least, I don't see any argument for it in A&M's paper. It may still be true, but further argument is needed.
If you are picking randomly out of a set of N possibilities, the chance that you pick the "correct" one is 1/N. It seems like in any decomposition (whether planner/reward or initial conditions/dynamics), there will be N decompositions, with N >> 1, where I'd say "yeah, that probably has similar complexity as the correct one". The chance that the correct one is also the simplest one out of all of these seems basically like 1/N, which is ~0.
You could make an argument that we aren't actually choosing randomly, and correctness is basically identical to simplicity. I feel the pull of this argument in the limit of infinite data for laws of physics (but not for finite data), but it just seems flatly false for the reward/planner decomposition.
I feel like there's a big difference between "similar complexity" and "the same complexity." Like, if we have theory T and then we have theory T* which adds some simple unobtrusive twist to it, we get another theory which is of similar complexity... yet realistically an Occam's-Razor-driven search process is not going to settle on T*, because you only get T* by modifying T. And if I'm wrong about this then it seems like Occam's Razor is broken in general; in any domain there are going to be ways to turn T's into T*'s. But Occam's Razor is not broken in general (I feel).
Maybe this is the argument you anticipate above with "...we aren't actually choosing randomly." Occam's Razor isn't random. Again, I might agree with you that intuitively Occam's Razor seems more useful in physics than in preference-learning. But intuitions are not arguments, and anyhow they aren't arguments that appeared in the text of A&M's paper.
Hey there!
Thanks for this critique; I have, obviously, a few comments ^_^
In no particular order:
First of all, the FHI channel has a video going over the main points of the argument (and of the research agenda); it may help to understand where I'm coming from: https://www.youtube.com/watch?v=1M9CvESSeVc
A useful point from that: given human theory of mind, the decomposition of human behaviour into preferences and rationality is simple; without that theory of mind, it is complex. Since it's hard for us to turn off our theory of mind, the decomposition will always feel simple to us. However, the human theory of mind suffers from Moravec's paradox: though the theory of mind seems simple to us, it is very hard to specify, especially into code.
You're entirely correct to decompose the argument into Step 1 and Step 2, and to point out that Step 1 has much stronger formal support than Step 2.
I'm not too worried about the degenerate pairs specifically; you can rule them all out with two bits of information. But, once you've done that, there will be other almost-as-degenerate pairs that fit with the new information. To rule them out, you need to add more information... but by the time you've added all of that, you've essentially defined the "proper" pair, by hand.
On speed priors: the standard argument applies for a speed prior, too (see Appendix A of our paper). It applies perfectly for the indifferent planner/zero reward, and applies, given an extra assumption, for the other two degenerate solutions.
Onto the physics analogy! First of all, I'm a bit puzzled by your claim that physicists don't know how to do this division. Now, we don't have a full theory of physics; however, all the physical theories I know of, have a very clear and known division between laws and initial conditions. So physicists do seem to know how to do this. And when we say that "it's very complex", this doesn't seem to mean the division into laws and initial conditions is complex, just that the initial conditions are complex (and maybe that the laws are not yet known).
The indifference planner contains almost exactly the same amount of information as the policy. The "proper" pair, on the other hand, contains information such as whether the anchoring bias is a bias (it is) compared with whether paying more for better tasting chocolates is a bias (it isn't). Basically, none of the degenerate pairs contain any bias information at all; so everything to do with human biases is extra information that comes along with the "proper" pair.
Even ignoring all that, the fact that (p,R) is of comparable complexity to (-p,-R) shows that Occam's razor cannot distinguish the proper pair from its negative.
And thanks for the reply!
FWIW, I like the research agenda. I just don't like the argument in the paper. :)
--Yes, without theory of mind the decomposition is complex. But is it more complex than the simplest way to construct the policy? Maybe, maybe not. For all you said in the paper, it could still be that the simplest way to construct the policy is via the intended pair, complex though it may be. (In my words: The Occam Sufficiency Hypothesis might still be true.)
--If the Occam Sufficiency Hypothesis is true, then not only do we not have to worry about the degenerate pairs, we don't have to worry about anything more complex than them either.
--I agree that your argument, if it works, applies to the speed prior too. I just don't think it works; I think Step 2 in particular might break for the speed prior, because the Speed!Occam Sufficiency Hypothesis might be true.
--If I ever said physicists don't know how to distinguish between laws and initial conditions, I didn't mean it. (Did I?) What I thought I said was that physicists haven't yet found a law+IC pair that can account for the data we've observed. Also that they are in fact using lots of other heuristics and assumptions in their methodology, they aren't just iterating through law+IC pairs and comparing the results to our data. So, in that regard the situation with physics is parallel to the situation with preferences/rationality.
--My point is that they are irrelevant to what is more complex than what. In particular, just because A has more information than B doesn't mean A is more complex than B. Example: The true Laws + Initial Conditions pair contains more information than E, the set of all events in the world. Why? Because from E you cannot conclude anything about counterfactuals, but from the true Laws+IC pair you can. Yet you can deduce E from the true Laws+IC pair. (Assume determinism for simplicity.) But it's not true that the true Laws+IC pair is more complex than E; the complexity of E is the length of the shortest way to generate it, and (let's assume) the true Laws+IC is the shortest way to generate E. So both have the same complexity.
I realize I may be confused here about how complexity or information works; please correct me if so!
But anyhow if I'm right about this then I am skeptical of conclusions drawn from information to complexity... I'd like to see the argument made more explicit and broken down more at least.
For example, the "proper" pair contains all this information about what's a bias and what isn't, because our definition of bias references the planner/reward distinction. But isn't that unfair? Example: We can write 99999999999999999999999 or we can write "20-digits of 9's." The latter is shorter, but it contains more information if we cheat and say it tells us things like "how to spell the word that refers to the parts of a written number."
Anyhow don't the degenerate pairs also contain information about biases--for example, according to the policy-planner+empty-reward pair, nothing is a bias, because nothing would systematically lead to more reward than what is already being done?
--If it were true that Occam's Razor can't distinguish between P,R and -P,-R, then... isn't that a pretty general argument against Occam's Razor, not just in this domain but in other domains too?
Hey there!
Responding to a few points. But first, I want to make the point that treating an agent as a (p,R) pair is basically an intentional stance. We choose to treat the agent that way, either for ease of predicting its actions (Dennett's approach) or for extracting its preferences, to satisfy them (my approach). The decomposition is not a natural fact about the world.
--If I ever said physicists don't know how to distinguish between laws and initial conditions, I didn't mean it. (Did I?) What I thought I said was that physicists haven't yet found a law+IC pair that can account for the data we've observed. Also that they are in fact using lots of other heuristics and assumptions in their methodology, they aren't just iterating through law+IC pairs and comparing the results to our data. So, in that regard the situation with physics is parallel to the situation with preferences/rationality.
No, the situation is very different. Physicists are trying to model and predict what is happening in the world (and in counterfactual worlds). This is equivalent to trying to figure out the human policy (which can be predicted from observations, as long as you include counterfactual ones). The decomposition of the policy into preferences and rationality is a separate step, very unlike what physicists are doing (quick way to check this: if physicists were unboundedly rational with infinite data, they could solve their problem; whereas we couldn't, we'd still have to make decisions).
(if you want to talk about situations where we know some things but not all about the human policy, then the treatment is more complex, but ultimately the same arguments apply).
--My point is that they are irrelevant to what is more complex than what. In particular, just because A has more information than B doesn't mean A is more complex than B. Example: The true Laws + Initial Conditions pair contains more information than E, the set of all events in the world. Why? Because from E you cannot conclude anything about counterfactuals, but from the true Laws+IC pair you can. Yet you can deduce E from the true Laws+IC pair. (Assume determinism for simplicity.) But it's not true that the true Laws+IC pair is more complex than E; the complexity of E is the length of the shortest way to generate it, and (let's assume) the true Laws+IC is the shortest way to generate E. So both have the same complexity.
Well, it depends. Suppose there are multiple TL (true laws) + IC that could generate E. In that case, TL+IC has more complexity than E, since you need to choose among the possible options. But if there is only one feasible TL+IC that generates E, then you can work backwards from E to get that TL+IC, and now you have all the counterfactual info, from E, as well.
For example, the "proper" pair contains all this information about what's a bias and what isn't, because our definition of bias references the planner/reward distinction. But isn't that unfair? Example: We can write 99999999999999999999999 or we can write "20-digits of 9's." The latter is shorter, but it contains more information if we cheat and say it tells us things like "how to spell the word that refers to the parts of a written number."
That argument shows that if you look into the algorithm, you can get other differences. But I'm not looking into the algorithm; I'm just using the decomposition into (p, R), and playing around with the p and R pieces, without looking inside.
Anyhow don't the degenerate pairs also contain information about biases--for example, according to the policy-planner+empty-reward pair, nothing is a bias, because nothing would systematically lead to more reward than what is already being done?
Among the degenerate pairs, the one with the indifferent planner has a bias of zero, the greedy planner has a bias of zero, and the anti-greedy planner has a bias of -1 at every timestep. So they do define bias functions, but particularly simple ones. Nothing like the complexity of the biases generated by the "proper" pair.
The relevance of information for complexity is this: given reasonable assumptions, the human policy is simpler than all pairs, and the three degenerate pairs are almost as simple as the policy. However, the "proper" pair can generate a complicated object, the bias function (which has a non-trivial value in almost every possible state). So the proper pair contains at least enough information to specify a) the human policy, and b) the bias function. The Kolmogorov complexity of the proper pair is thus at least that of the simplest algorithm that can generate both those objects.
So one of two things is happening: either the human policy can generate the bias function directly, in some simple way [1], or the proper pair is more complicated than the policy. The first is not impossible, but notice that it has to be "simple". So the fact that we have not yet found a way to generate the bias function from the policy is an argument that it can't be done. Certainly there are no elementary mathematical manipulations of the policy that produce anything suitable.
--If it were true that Occam's Razor can't distinguish between P,R and -P,-R, then... isn't that a pretty general argument against Occam's Razor, not just in this domain but in other domains too?
No, because Occam's razor works in other domains. This is a strong illustration that this domain is actually different.
[1] Let A be the simplest algorithm that generates the human policy, and B the simplest that generates the human policy and the bias function. If there are n different algorithms that generate the human policy and are of length |B| or shorter, then we need to add log2(n) bits of information to the human policy to generate B, and hence, the bias function. So if B is close in complexity to A, we don't need to add much.
Thanks again! I still disagree, surprise surprise.
I think I agree with you that the (p,R) decomposition is not a natural fact about the world, but I'm not so sure. Anyhow I don't think it matters for our purposes.
No, the situation is very different. Physicists are trying to model and predict what is happening in the world (and in counterfactual worlds). This is equivalent to trying to figure out the human policy (which can be predicted from observations, as long as you include counterfactual ones). The decomposition of the policy into preferences and rationality is a separate step, very unlike what physicists are doing (quick way to check this: if physicists were unboundedly rational with infinite data, they could solve their problem; whereas we couldn't, we'd still have to make decisions).
(if you want to talk about situations where we know some things but not all about the human policy, then the treatment is more complex, but ultimately the same arguments apply).
Physicists are trying to do many things. Yes, one thing they are trying to do is predict what is happening in the world. But another thing they are trying to do is figure out stuff about counterfactuals, and for that they need to have a Laws+IC decomposition to work with. So they take their data and they look for a simple Laws+IC decomposition that fits it. They would still do this even if they already knew the results of all the experiments ever, and had no more need to predict things. (Extending the symmetry, humans also typically use the intentional stance on incomplete data about a target human's policy, for the purpose of predicting the rest of the policy. But this isn't what you concern yourself with; you assume for the sake of argument that we already have the whole policy and point out that we'd still want to use the intentional stance to get a decomposition so that we could make judgments about rationality. I say yes, true, now apply the same reasoning to physics: assume for the sake of argument that we already know everything that will happen, all the events, and notice that we'd still want to have a Laws+IC decomposition, perhaps to figure out counterfactuals.)
Well, it depends. Suppose there are multiple TL (true laws) + IC that could generate E. In that case, TL+IC has more complexity than E, since you need to choose among the possible options. But if there is only one feasible TL+IC that generates E, then you can work backwards from E to get that TL+IC, and now you have all the counterfactual info, from E, as well.
I was assuming there were multiple Law+IC pairs that would generate E... well actually no, the example degenerate pairs I gave prove that there are, no need to assume it!
That argument shows that if you look into the algorithm, you can get other differences. But I'm not looking into the algorithm; I'm just using the decomposition into (p, R), and playing around with the p and R pieces, without looking inside.
I don't see the difference between what you are doing and what I did. You started with a policy and said "But what about bias-facts? The policy by itself doesn't tell us these facts. So let's look at the various decompositions of the policy into p,R pairs; they tell us the bias facts." I start with a number and say "But what about how-to-spell-the-word-that-refers-to-the-parts-of-a-written-number facts? The number doesn't tell us that. Let's look at the various decompositions of the number into strings of symbols that represent it; they tell us those facts."
Among the degenerate pairs, the one with the indifferent planner has a bias of zero, the greedy planner has a bias of zero, and the anti-greedy planner has a bias of -1 at every timestep. So they do define bias functions, but particularly simple ones. Nothing like the complexity of the biases generated by the "proper" pair.
Thanks for the clarification--that's what I suspected. So then every p,R pair compatible with the policy contains more information than the policy. Thus even the simplest p,R pair compatible with the policy contains more information than the policy. By analogous reasoning, every algorithm for constructing the policy contains more information than the policy. So even the simplest algorithm for constructing the policy contains more information than the policy. So (by your reasoning) even the simplest algorithm for constructing the policy is more complex than the policy. But this isn't so; the simplest algorithm for constructing the policy is length L and so has complexity L, and the policy has complexity L too... That's my argument at least. Again, maybe I'm misunderstanding how complexity works. But now that I've laid it out step-by-step, which step do you disagree with?
The relevance of information for complexity is this: given reasonable assumptions, the human policy is simpler than all pairs, ...
Wait what? This is what I was objecting to in the original post. The "Occam Sufficiency Hypothesis" is that the human policy is not simpler than all pairs; in particular, it is precisely the simplicity of the intended pair, because the intended pair is the simplest way to construct the policy.
What are the reasonable assumptions that lead to the OSH being false?
My objection to your paper, in a nutshell, was that you didn't discuss this part--you didn't give any reason to think OSH was false. The three reasons you gave in Step 2 were reasons to think the intended pair is complex, not reasons to think it is more complex than the policy. Or so I argued.
--If it were true that Occam's Razor can't distinguish between P,R and -P,-R, then... isn't that a pretty general argument against Occam's Razor, not just in this domain but in other domains too?
No, because Occam's razor works in other domains. This is a strong illustration that this domain is actually different.
My argument is that if you are right, Occam's Razor would be generally useless, but it's not, so you are wrong. In more detail: If Occam's Razor can't distinguish between P,R and -P,-R, then (by analogy) in an arbitrary domain it won't be able to distinguish between theory X and theory b(X) where b() is some simple bizarro function that negates or inverts the parts of X in such a way as to make the changes cancel out.
I'm not sure the physics analogy is getting us very far - I feel there is a very natural way of decomposing physics into laws+initial conditions, while there is no such natural way of doing so for preferences and rationality. But if we have different intuitions on that, then discussing the analogy isn't going to help us converge!
So then every p,R pair compatible with the policy contains more information than the policy. Thus even the simplest p,R pair compatible with the policy contains more information than the policy.
Agreed (though the extra information may be tiny - a few extra symbols).
By analogous reasoning, every algorithm for constructing the policy contains more information than the policy.
That does not follow; the simplest algorithm for building a policy does not go via decomposing into two pieces and then recombining them. We are comparing algorithms that produce a planner-reward pair (two outputs) with algorithms that produce a policy (one output). (but your whole argument shows you may be slightly misunderstanding complexity in this context).
Now, though all pairs are slightly more complex than the policy itself, the bias argument shows that the "proper" pair is considerably more complex. To use an analogy: suppose file1 and file2 are both maximally zipped files. When you unzip file1, you produce image1 (and maybe a small, blank, image2). When you unzip file2, you also produce the same image1, and a large, complex, image2'. Then, as long as image1 and image2' are at least slightly independent, file2 has to be larger than file1. The more complex image2' is, and the more independent it is from image1, the larger file2 has to be.
Does that make sense?
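The zip analogy above can be tried out with an off-the-shelf compressor. What follows is only a rough sketch under stated assumptions: zlib stands in for "maximally zipped" (it gives an upper bound on description length, not Kolmogorov complexity), and the byte strings, sizes, and seeds are made-up placeholders, not anything from the paper.

```python
import random
import zlib


def random_bytes(n: int, seed: int) -> bytes:
    """Return n pseudo-random bytes from a fixed seed (hard for zlib to compress)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# image1 plays the role of the policy; every archive has to reproduce it.
image1 = random_bytes(2000, seed=1)

# file1 unzips to image1 plus a small blank image2 (the degenerate pair).
blank_image2 = bytes(64)

# file2 unzips to image1 plus a large image2' that is independent of image1
# (the "proper" pair, carrying the bias facts).
complex_image2 = random_bytes(2000, seed=2)

size_policy = len(zlib.compress(image1, 9))
size_file1 = len(zlib.compress(image1 + blank_image2, 9))
size_file2 = len(zlib.compress(image1 + complex_image2, 9))

print("policy alone:         ", size_policy)
print("policy + blank image: ", size_file1)
print("policy + complex image:", size_file2)
```

In this toy setup file1 comes out only slightly larger than the compressed policy, while file2 is much larger. The sketch says nothing about whether the bias function really is independent of the policy, which is the point under dispute.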
I agree that the decomposition of physics into laws+IC is much simpler than the decomposition of a human policy into p,R. (Is that what you mean by "more natural?") But this is not relevant to my argument, I think.
I feel that our conversation now has branched into too many branches, some of which have been abandoned. In the interest of re-focusing the conversation, I'm going to answer the questions you asked and then ask a few new ones of my own.
To your questions: For me to understand your argument better I'd like to know more about what the pieces represent. Is file1 the degenerate pair and file2 the intended pair, and image1 the policy and image2 the bias-facts? Then what is the "unzip" function? Pairs don't unzip to anything. You can apply the function "apply the first element of the pair to the second" or you can apply the function "do that, and then apply the MAXIMIZE function to the second element of the pair and compute the difference." Or there are infinitely many other things you can do with the pair. But the pair itself doesn't tell you what to do with it, unlike a zipped file which is like an algorithm--it tells you "run me."
I have two questions. 1. My central claim--which I still uphold as not-ruled-out-by-your-arguments (though of course I don't actually believe it) is the Occam Sufficiency Hypothesis: "The 'intended' pair is the simplest way to generate the policy." So, basically, what OSH says is that within each degenerate pair is a term, pi (the policy), and when you crack open that term and see what it is made of, you see p(R), the intended planner applied to the intended reward function! Thus, a simplicity-based search will stumble across <p,R> before it stumbles across any of the degenerate pairs, because it needs p and R to construct the degenerate pairs. What part of this do you object to?
2. Earlier you said "given reasonable assumptions, the human policy is simpler than all pairs" What are those assumptions?
Once again, thanks for taking the time to engage with me on this! Sorry it took me so long to reply, I got busy with family stuff.
Is file1 the degenerate pair and file2 the intended pair, and image1 the policy and image2 the bias-facts?
Yes.
Then what is the "unzip" function?
The "shortest algorithm generating BLAH" is the maximally compressed way of expressing BLAH - the "zipped" version of BLAH.
Ignoring unzip, which isn't very relevant, we know that the degenerate pairs are just above the policy in complexity.
So zip(degenerate pair) ≈ zip(policy), while zip(reasonable pair) > zip(policy+complex bias facts) (and zip(policy+complex bias facts) > zip(policy)).
Does that help?
[Epistemic Status: My inside view feels confident, but I’ve only discussed this with one other person so far, so I won't be surprised if it turns out to be confused.]
Armstrong and Mindermann (A&M) argue "that even with a reasonable simplicity prior/Occam’s razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple ‘normative’ assumptions, which cannot be deduced exclusively from observations."
I explain why I think their argument is faulty, concluding that maybe Occam's Razor is sufficient to do the job after all.
In what follows I assume the reader is familiar with the paper already or at least with the concepts within it.
Brief summary of A&M's argument:
(This is merely a brief sketch of A&M’s argument; I’ll engage with it in more detail below. For the full story, read their paper.)
Take a human policy pi = P(R) that we are trying to represent in the planner-reward formalism. R is the human’s reward function, which encodes their desires/preferences/values/goals. P() is the human’s planner function, which encodes how they take their experiences as input and try to choose outputs that achieve their reward. Pi, then, encodes the overall behavior of the human in question.
Step 1: In any reasonable language, for any plausible policy, you can construct “degenerate” planner-reward pairs that are almost as simple as the simplest possible way to generate the policy, yet yield high regret (i.e. have a reward component which is very different from the "true"/"intended" one).
It’s easy to see that these examples, being constructed from the policy, are at most slightly more complex than the simplest possible way to generate the policy, since they could make use of that way.
Step 2: The "intended" planner-reward pair--the one that humans would judge to be a reasonable decomposition of the human policy in question--is likely to be significantly more complex than the simplest possible planner-reward pair.
Conclusion: If we use Occam’s Razor alone to find planner-reward pairs that fit a particular human’s behavior, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are building an AI to maximize the reward.
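To make Step 1 concrete, here is a minimal sketch of the three degenerate pairs mentioned in the discussion (indifferent planner with zero reward, greedy planner, anti-greedy planner with negated reward). The state set, action set, and the stand-in policy `pi` are toy assumptions of mine, and the function names are hypothetical; the only point is that each pair is built directly from the policy and reproduces it exactly, while the reward components disagree wildly.

```python
from typing import Callable, Dict, Tuple

# Toy state and action sets (assumptions for illustration only).
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right", "stay"]

Policy = Callable[[int], str]
Reward = Callable[[int, str], float]
Planner = Callable[[Reward], Policy]


def pi(state: int) -> str:
    """Stand-in for the observed human policy."""
    return ACTIONS[state % len(ACTIONS)]


def zero_reward(state: int, action: str) -> float:
    return 0.0


def matches_pi(state: int, action: str) -> float:
    """Reward 1 for doing whatever pi does, 0 otherwise."""
    return 1.0 if action == pi(state) else 0.0


def negated_matches_pi(state: int, action: str) -> float:
    return -matches_pi(state, action)


def indifferent_planner(reward: Reward) -> Policy:
    """Ignores the reward entirely and just returns pi."""
    return pi


def greedy_planner(reward: Reward) -> Policy:
    """Picks the action with the highest immediate reward."""
    return lambda s: max(ACTIONS, key=lambda a: reward(s, a))


def anti_greedy_planner(reward: Reward) -> Policy:
    """Picks the action with the lowest immediate reward."""
    return lambda s: min(ACTIONS, key=lambda a: reward(s, a))


DEGENERATE_PAIRS: Dict[str, Tuple[Planner, Reward]] = {
    "indifferent + zero reward": (indifferent_planner, zero_reward),
    "greedy + matches_pi": (greedy_planner, matches_pi),
    "anti-greedy + negated": (anti_greedy_planner, negated_matches_pi),
}


def reproduces_policy(planner: Planner, reward: Reward) -> bool:
    """Check that planner(reward) agrees with pi on every toy state."""
    induced = planner(reward)
    return all(induced(s) == pi(s) for s in STATES)


for name, (planner, reward) in DEGENERATE_PAIRS.items():
    print(name, reproduces_policy(planner, reward))
```

Each pair is a few lines wrapped around `pi`, which is the sense in which the degenerate pairs are "at most slightly more complex" than the policy itself.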
Methinks the argument proves too much:
My first point is that A&M’s argument probably works just as well for other uses of Occam’s Razor. In particular it works just as well for the canonical use: finding the Laws and Initial Conditions that describe our universe!
Take a sequence of events we are trying to predict/represent with the lawlike-universe formalism, which posits C (the initial conditions) and then L() the dynamical laws, a function that takes initial conditions and extrapolates everything else from them. L(C) = E, the sequence of events/conditions/world-states we are trying to predict/represent.
Step 1: In any reasonable language, for any plausible sequence of events, we can construct "degenerate" initial condition + laws pairs that are almost as simple as the simplest pair.
It’s easy to see that these examples, being constructed from E, are at most slightly more complex than the simplest possible pair, since they could use the simplest pair to generate E.
Step 2: The "intended" initial condition+law pair is likely to be significantly more complex than the simplest pair.
Conclusion: If we use Occam’s Razor alone to find law-condition pairs that fit all the world’s events, we’ll settle on one of the degenerate ones (or something else entirely) rather than a reasonable one. This could be very dangerous if we are e.g. building an AI to do science for us and answer counterfactual questions like “If we had posted the nuclear launch codes on the Internet, would any nukes have been launched?”
This conclusion may actually be true, but it’s a pretty controversial claim and I predict most philosophers of science wouldn’t be impressed by this argument for it--even the ones who agree with the conclusion.
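The physics version of Step 1 can be sketched the same way. The two degenerate constructions below ("lookup" laws that ignore the initial conditions, and "identity" laws whose initial conditions are the whole history) are illustrative assumptions of mine, as are the toy event sequence and the doubling rule; they are meant only to show that several Laws+IC pairs can generate the same E while disagreeing about counterfactuals.

```python
from typing import Any, Callable, Dict, List, Tuple

# Toy "complete history of the world" (an assumption for illustration).
E: List[int] = [3, 6, 12, 24, 48]


def doubling_laws(c: int) -> List[int]:
    """'Intended' laws: start from c and double at every step."""
    history = [c]
    while len(history) < len(E):
        history.append(history[-1] * 2)
    return history


def lookup_laws(c: Any) -> List[int]:
    """Degenerate laws: ignore the initial conditions and recite E."""
    return list(E)


def identity_laws(c: List[int]) -> List[int]:
    """Degenerate laws: do nothing; the initial conditions are the whole history."""
    return list(c)


PAIRS: Dict[str, Tuple[Callable[[Any], List[int]], Any]] = {
    "intended": (doubling_laws, 3),
    "lookup": (lookup_laws, None),
    "identity": (identity_laws, E),
}

# Every pair generates the same actual history E.
for name, (laws, conditions) in PAIRS.items():
    print(name, laws(conditions) == E)

# But they answer a counterfactual question differently:
# "what if the world had started at 5 instead of 3?"
print(doubling_laws(5))
print(lookup_laws(5))
```

The pairs agree on everything that actually happened and disagree only off the observed history, which is where a question like the launch-codes one lives.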
Objecting to the three arguments for Step 2
Consider the following hypothesis, which is basically equivalent to the claim A&M are trying to disprove:
Occam Sufficiency Hypothesis: The “Intended” pair happens to be the simplest way to generate the policy.
Notice that everything in Step 1 is consistent with this hypothesis. The first degenerate pairs are constructed from the policy, so they are more complicated than the simplest way to generate it, so if that way is via the intended pair, they are more complicated (albeit only slightly) than the intended pair.
Next, notice that the three arguments in support of Step 2 don’t really hurt this hypothesis:
Re: first argument: The intended pair can be both very complex and the simplest way to generate the policy; no contradiction there. Indeed that’s not even surprising: since the policy is generated by a massive messy neural net in an extremely diverse environment, we should expect it to be complex. What matters for our purposes is not how complex the intended pair is, but rather how complex it is relative to the simplest possible way to generate the policy. A&M need to argue that the simplest possible way to generate the policy is simpler than the intended pair; arguing that the intended pair is complex is at best only half the argument.
Compare to the case of physics: Sure, the laws of physics are complex. They probably take at least a page of code to write up. And that’s aspirational; we haven’t even got to that point yet. But that doesn’t mean Occam’s Razor is insufficient to find the laws of physics.
Re: second argument: The inference from “This pair contains more information than the policy” to “this pair is more complex than the policy” is fallacious. Of course the intended pair contains more information than the policy! All ways of generating the policy contain more information than it. This is because there are many ways (e.g. planner-reward pairs) to get any given policy, and thus specifying any particular way is giving you strictly more information than simply specifying the policy.
Compare to the case of physics: Even once we’ve been given the complete history of the world (or a complete history of some arbitrarily large set of experiment-events) there will still be additional things left to specify about what the laws and initial conditions truly are. Do the laws contain a double negation in them, for example? Do they have some weird clause that creates infinite energy but only when a certain extremely rare interaction occurs that never in fact occurs? What language are the laws written in, anyway? And what about the initial conditions? Lots of things left to specify that aren’t determined by the complete history of the world. Yet this does not mean that the Laws + Initial Conditions are more complex than the complete history of the world, and it certainly doesn’t mean we’ll be led astray if we believe in the Laws+Conditions pair that is simplest.
Re: third argument: Yes, people have been trying to find planner-reward pairs to explain human behavior for many years, and yes, no one has managed to build a simple algorithm to do it yet. Instead we rely on all sorts of implicit and intuitive heuristics, and we still don’t succeed fully. But all of this can be said about physics too. It’s not like physicists are literally following the Occam’s Razor algorithm--iterating through all possible Law+Condition pairs in order from simplest to most complex and checking each one to see if it outputs a universe consistent with all our observations. And moreover, physicists haven’t succeeded fully either. Nevertheless, many of us are still confident that Occam’s Razor is in principle sufficient: If we were to follow the algorithm exactly, with enough data and compute, we would eventually settle on a Law+Condition pair that accurately describes reality, and it would be the true pair. Again, maybe we are wrong about that, but the arguments A&M have given so far aren’t convincing.
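For readers who want the "Occam’s Razor algorithm" spelled out, here is a minimal toy sketch. It is emphatically not real Solomonoff induction, which ranges over all programs and is uncomputable. The three primitive operations, the integer "universe", and the complexity measure (number of operations plus the size of the initial value) are all assumptions chosen to keep the example tiny.

```python
from itertools import product

# Hypothetical primitive operations from which a "law" is composed.
PRIMITIVES = {
    "+1": lambda x: x + 1,
    "*2": lambda x: x * 2,
    "^2": lambda x: x * x,
}


def step(law, x):
    """Apply each operation of the law, in order, to the state x."""
    for op in law:
        x = PRIMITIVES[op](x)
    return x


def consistent(law, init, observations):
    """Check whether running the law from init reproduces the observations."""
    x = init
    for obs in observations:
        if x != obs:
            return False
        x = step(law, x)
    return True


def hypotheses(max_complexity):
    """Yield (law, init) pairs from simplest to most complex.

    Assumed complexity measure: len(law) + init.
    """
    for c in range(1, max_complexity + 1):
        for n_ops in range(1, c + 1):
            init = c - n_ops
            for law in product(sorted(PRIMITIVES), repeat=n_ops):
                yield law, init


def occam(observations, max_complexity=8):
    """Return the first (simplest) pair consistent with the observations."""
    for law, init in hypotheses(max_complexity):
        if consistent(law, init, observations):
            return law, init
    return None


best = occam([1, 4, 10, 22])
```

On the history 1, 4, 10, 22 the search settles on the law "add one, then double" with initial condition 1. Nobody does physics this way, of course; the sketch only shows what "in principle sufficient" is a claim about.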
Conclusion
Perhaps Occam’s Razor is insufficient after all. (Indeed I suspect as much, for reasons I’ll sketch in the appendix.) But as far as I can tell, A&M’s arguments are at best very weak evidence against the sufficiency of Occam’s Razor for inferring human preferences, and moreover they work pretty much just as well against the canonical use of Occam’s Razor too.
This is a bold claim, so I won’t be surprised if it turns out I was confused. I look forward to hearing people’s feedback. Thanks in advance! And thanks especially to Armstrong and Mindermann if they take the time to reply.
Many thanks to Ramana Kumar for hearing me out about this a while ago when we read the paper together.
Appendix: So, is Occam’s Razor sufficient or not?
--A priori, we should expect something more like a speed prior to be appropriate for identifying the mechanisms of a finite mind, rather than a pure complexity prior.
--Sure enough, we can think of scenarios in which e.g. a deterministic universe with somewhat simple laws develops consequentialists who run massive simulations including of our universe and then write down Daniel’s policy in flaming letters somewhere, such that the algorithm “Run this deterministic universe until you find big flaming letters, then read out that policy” becomes a very simple way to generate Daniel’s policy. (This is basically just the “Universal Prior is Malign” idea applied in a new way.)
--So yeah, pure complexity prior is probably not good. But maybe a speed prior would work, or something like it. Or maybe not. I don’t know.
--One case that seems useful to me: Suppose we are considering two explanations of someone’s behavior: (A) They desire the well-being of the poor, but [insert epicycles here to explain why they aren’t donating much, are donating conspicuously, are donating ineffectively] and (B) They desire their peers (and themselves) to believe that they desire the well-being of the poor. Thanks to the epicycles in (A), both theories fit the data equally well. But theory B is much simpler. Do we conclude that this person really does desire the well-being of the poor, or not? If we think that even though (A) is more complex it is also more accurate, then yeah it seems like Occam’s Razor is insufficient to infer human preferences. But if we instead think “Yeah, this person just really doesn’t care, and the proof is how much simpler B is than A” then it seems we really are using something like Occam’s Razor to infer human preferences. Of course, this is just one case, so the only way it could prove anything is as a counterexample. To me it doesn’t seem like a counterexample to Occam’s sufficiency, but I could perhaps be convinced to change my mind about that.
--Also, I’m pretty sure that once we have better theories of the brain and mind, we’ll have new concepts and theoretical posits to explain human behavior. (e.g. something something Karl Friston something something free energy?) Thus, the simplest generator of a given human’s behavior will probably not divide automatically into a planner and a reward; it’ll probably have many components and there will be debates about which components the AI should be faithful to (dub these components the reward) and which components the AI should seek to surpass (dub these components the planner). These debates may be intractable, turning on subjective and/or philosophical considerations. So this is another sense in which I think yeah, definitely Occam’s Razor isn’t sufficient--for we will also need to have a philosophical debate about what rationality is.
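To illustrate the contrast between a pure complexity prior and "something more like a speed prior" from the first three points of this appendix, here is a rough sketch. It uses a Levin-style cost (program length plus log2 of runtime) as a stand-in for a speed prior, which is an assumption on my part and not the exact definition. The two hypotheses and all their numbers are invented purely to show how the ranking can flip.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    length_bits: int    # assumed description length of the program
    runtime_steps: int  # assumed number of steps to output the policy


def complexity_cost(h):
    """Pure complexity prior: only description length counts."""
    return h.length_bits


def speed_cost(h):
    """Levin-style cost: description length plus log2 of runtime."""
    return h.length_bits + math.log2(h.runtime_steps)


# Invented numbers: a short program that simulates a whole universe
# and reads the flaming letters, versus a longer but fast direct model.
flaming_letters = Hypothesis("simulate universe, read letters", 10_000, 2**100_000)
direct_model = Hypothesis("direct model of the mind", 50_000, 2**40)

candidates = [flaming_letters, direct_model]
preferred_by_complexity = min(candidates, key=complexity_cost)
preferred_by_speed = min(candidates, key=speed_cost)
```

With these made-up numbers the complexity prior prefers the flaming-letters program and the speed-like cost prefers the direct model. Whether real hypotheses about minds behave this way is exactly the thing I said I don’t know.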