# DanielFilan's Shortform Feed

Rationality-related writings that are more comment-shaped than post-shaped. Please don't leave top-level comments here unless they're indistinguishable to me from something I would say here.

Frankfurt-style counterexamples for definitions of optimization

In "Bottle Caps Aren't Optimizers", I wrote about a type of definition of optimization that says system S is optimizing for goal G iff G has a higher value than it would if S didn't exist or were randomly scrambled. I argued against these definitions by providing examples of systems that satisfy the criterion but are not optimizers. But today, I realized that I could repurpose Frankfurt cases to get examples of optimizers that don't satisfy this criterion.

A Frankfurt case is a thought experiment designed to disprove the following intuitive principle: "a person is morally responsible for what she has done only if she could have done otherwise." Here's the basic idea: suppose Alice is considering whether or not to kill Bob. Upon consideration, she decides to do so, takes out her gun, and shoots Bob. But unbeknownst to her, a neuroscientist had implanted a chip in her brain that would have forced her to shoot Bob if she had decided not to. That said, the chip didn't activate, because she did decide to shoot Bob. The idea is that she's morally responsible, even tho she couldn't have done otherwise.

Anyway, let's do this with optimizers. Suppose I'm playing Go, thinking about how to win - imagining what would happen if I played various moves, and playing moves that make me more likely to win. Further suppose I'm pretty good at it. You might want to say I'm optimizing my moves to win the game. But suppose that, unbeknownst to me, behind my shoulder is famed Go master Shin Jinseo. If I start playing really bad moves, or suddenly die or vanish etc, he will play my moves, and do an even better job at winning. Now, if you remove me or randomly rearrange my parts, my side is actually more likely to win the game. But that doesn't mean I'm optimizing to lose the game! So this is another way such definitions of optimizers are wrong.
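To make the failure concrete, here is a minimal Python sketch of the counterfactual criterion applied to the Go example. The win probabilities are made-up numbers chosen only for illustration, and `counterfactual_criterion` is a hypothetical name for the definition under discussion, not something from the original post.

```python
def counterfactual_criterion(value_with_system: float, value_without_system: float) -> bool:
    """True iff the criterion says the system is optimizing for the goal,
    i.e. the goal scores higher with the system than without it."""
    return value_with_system > value_without_system


# Hypothetical win probabilities for my side of the board (assumed numbers).
P_WIN_ME_PLAYING = 0.6      # I play my own moves
P_WIN_BACKUP_PLAYING = 0.9  # I am removed or scrambled, so the master takes over


def win_probability(i_am_intact: bool) -> float:
    return P_WIN_ME_PLAYING if i_am_intact else P_WIN_BACKUP_PLAYING


# Goal "my side wins": compare the world with me against the world without me.
optimizing_to_win = counterfactual_criterion(
    win_probability(True), win_probability(False)
)

# Goal "my side loses": the same comparison on the complementary probability.
optimizing_to_lose = counterfactual_criterion(
    1 - win_probability(True), 1 - win_probability(False)
)

print(optimizing_to_win, optimizing_to_lose)
```

Under these assumed numbers the criterion denies that I'm optimizing to win and affirms that I'm optimizing to lose, which is exactly the wrong verdict described above.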

That said, other definitions treat this counter-example well. E.g. I think the one given in "The ground of optimization" says that I'm optimizing to win the game (maybe only if I'm playing a weaker opponent).

As far as I can tell, people typically use the orthogonality thesis to argue that smart agents could have any motivations. But the orthogonality thesis is stronger than that, and its extra content is false - there are some goals that are too complicated for a dumb agent to have, because the agent couldn't understand those goals. I think people should instead directly defend the claim that smart agents could have arbitrary goals.

I no longer endorse this claim about what the orthogonality thesis says.

Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):

• It would be nice to study competent misgeneralization empirically, to give examples and maybe help us develop theory around it.
• Problem: how do you measure 'competence' without reference to a goal??
• Prior work has used the 'agents vs devices' framework, where you have a distribution over all reward functions, some likelihood distribution over what 'real agents' would do given a certain reward function, and do Bayesian inference on that vs choosing actions randomly. If conditioned on your behaviour you're probably an agent rather than a random actor, then you're competent.
  • I don't like this:
    • Crucially relies on knowing the space of reward functions that the learner in question might have.
    • Crucially relies on knowing how agents act given certain motivations.
• Here's another option: throw out 'competence' and talk about 'consequential'.
  • This has a name collision with 'consequentialist' that you'll probably have to fix but whatever.
  • The setup: you have your learner do stuff in a multi-agent environment. You use the AUP metric on every agent other than your learner. You say that your learner is 'consequential' if it strongly affects the attainable utility of other agents.
  • How good is this?
    • It still relies on having a space of reward functions, but there's some more wiggle-room: you probably don't need to get the space exactly right, just to have goals that are similar to yours.
      • Note that this would no longer be true if this were a metric you were optimizing over.
    • You still need to have some idea about how agents will act realistically, because if you only look at the utility attainable by optimal policies, that might elide the fact that it's suddenly gotten much computationally harder to achieve that utility.
    • That said, I still feel like this is going to degrade more gracefully, as long as you include models that are roughly right. I guess this is because this model is no longer a likelihood ratio where misspecification can just rule out the right answer.
• Bonus round: you can probably do some thinking about why various setups would tend to reduce other agents' attainable utility, prove some little theorems, etc., in the style of the power-seeking paper.
  • Ideally you could even show a relation between this and the agents vs devices framing.
• I think this is the sort of project a first-year PhD student could fruitfully make progress on.
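As a rough illustration of the 'consequential' idea, here is a toy Python sketch in the spirit of an AUP-style measure. Everything in it is an assumption made for the example: the one-dimensional corridor world, the "reach cell g" reward functions, the discount factor, and the names `attainable_utility` and `consequentiality`. It also scores attainable utility under optimal play only, which is exactly the simplification flagged in the list above.

```python
from statistics import mean

GAMMA = 0.9  # discount factor (assumed)


def attainable_utility(start: int, goal: int, blocked: frozenset) -> float:
    """Discounted value of walking from `start` to `goal` along a corridor.

    Returns 0.0 if any cell on the way (endpoints included) is blocked.
    """
    lo, hi = sorted((start, goal))
    if any(cell in blocked for cell in range(lo, hi + 1)):
        return 0.0
    return GAMMA ** (hi - lo)


def consequentiality(other_agent_positions, candidate_goals,
                     blocked_baseline: frozenset, blocked_after: frozenset) -> float:
    """Mean absolute shift in other agents' attainable utility.

    Compares the world after the learner acted (`blocked_after`) against a
    baseline where it did nothing (`blocked_baseline`), averaged over every
    other agent and every candidate reward function.
    """
    shifts = [
        abs(attainable_utility(pos, goal, blocked_after)
            - attainable_utility(pos, goal, blocked_baseline))
        for pos in other_agent_positions
        for goal in candidate_goals
    ]
    return mean(shifts)


# One other agent at cell 0 of a 5-cell corridor; goals are "reach cell g".
others = [0]
goals = range(5)
nothing_blocked = frozenset()

blocks_doorway = consequentiality(others, goals, nothing_blocked, frozenset({1}))
blocks_far_end = consequentiality(others, goals, nothing_blocked, frozenset({4}))
does_nothing = consequentiality(others, goals, nothing_blocked, nothing_blocked)

print(blocks_doorway, blocks_far_end, does_nothing)
```

In this toy, a learner that blocks the cell next to the other agent counts as more consequential than one that blocks the far end, without the metric ever being told what the learner's own goal is.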

Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.

Let it be known: I'm way more likely to respond to (and thereby algorithmically signal-boost) criticisms of AI doomerism that I think are dumb than those that I think are smart, because the dumb objections are easier to answer. Caveat emptor.

This is a fun Aumann paper that talks about what players have to believe to be in a Nash equilibrium. Here, instead of imagining agents randomizing, we're imagining that the probabilities over actions live in the heads of the other agents: you might well know exactly what you're going to do, as long as I don't. It shows that in 2-player games, you can write down conditions that involve mutual knowledge but not common knowledge that imply that the players are at a Nash equilibrium: mutual knowledge of players' conjectures about each other, players' rationality, and players' payoffs suffices. By contrast, in 3-player games (or games with more players), you need common knowledge: common priors, and common knowledge of conjectures about other players.

The paper writes:

One might suppose that one needs stronger hypotheses in Theorem B [about 3-player games] than in Theorem A [about 2-player games] only because when $n \geq 3$, the conjectures of two players about a third one may disagree. But that is not so. One of the examples in Section 5 shows that even when the necessary agreement is assumed outright, conditions similar to those of Theorem A do not suffice for Nash equilibrium when $n \geq 3$.

This is pretty mysterious to me and I wish I understood it better. Probably it would help to read more carefully thru the proofs and examples.

Got it, sort of. Once you have 3 people, each person has a conjecture about the joint actions of the other two people. This means that your distribution might not be the product of your marginal distributions over each opponent's actions, so you might be maximizing expected utility wrt your actual beliefs, but not wrt the product of the marginals - and the marginals are what are supposed to form the Nash equilibrium. Common knowledge and common priors stop this by forcing your conjectures about the different players to be independent. I still have a hard time explaining in words why this has to be true, but at least I understand the proof.
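Here is a small Python sketch of that gap between a joint conjecture and the product of its marginals. The game is invented for illustration and is not taken from the paper: player 1 picks between a hypothetical `"bet_on_match"` action and a `"safe"` action, while players 2 and 3 each play `"L"` or `"R"`.

```python
from itertools import product

ACTIONS = ("L", "R")

# Player 1's conjecture about (player 2, player 3): perfectly correlated.
belief = {("L", "L"): 0.5, ("R", "R"): 0.5, ("L", "R"): 0.0, ("R", "L"): 0.0}


def marginal(joint, player_index):
    """Marginal distribution over one opponent's action."""
    out = {a: 0.0 for a in ACTIONS}
    for profile, p in joint.items():
        out[profile[player_index]] += p
    return out


def product_of_marginals(joint):
    """The independent distribution with the same marginals as `joint`."""
    m2, m3 = marginal(joint, 0), marginal(joint, 1)
    return {(a2, a3): m2[a2] * m3[a3] for a2, a3 in product(ACTIONS, repeat=2)}


def payoff(my_action, a2, a3):
    """Player 1's payoff (assumed numbers): the bet wins 1 on a match, loses 3 otherwise."""
    if my_action == "safe":
        return 0.0
    return 1.0 if a2 == a3 else -3.0


def expected_payoff(my_action, joint):
    return sum(p * payoff(my_action, a2, a3) for (a2, a3), p in joint.items())


def best_response(joint):
    return max(("bet_on_match", "safe"), key=lambda a: expected_payoff(a, joint))


print(best_response(belief))                        # wrt the actual conjecture
print(best_response(product_of_marginals(belief)))  # wrt the product of marginals
```

With these numbers, betting is optimal against the correlated conjecture (expected payoff 1 versus 0) but not against the product of its marginals (expected payoff -1 versus 0), so rationality with respect to one's actual beliefs does not pin down a best response to the marginals.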

Quantitative claims about code maintenance from Working in Public, plausibly relevant to discussion of code rot and machine intelligence:

• "most computer programmers begin their careers doing software maintenance, and many never do anything but", attributed to Nathan Ensmenger, professor at Indiana University.
• "most software at Google gets rewritten every few years", attributed to Fergus Henderson of Google.
• "A 2018 Stripe survey of software developers suggested that developers spend 42% of their time maintaining code" - link
• "Nathan Ensmenger, the informatics professor, notes that, since the early 1960s, maintenance costs account for 50% to 70% of total expenditures on software development" - paper

FYI: I am not using the dialogue matching feature. If you want to dialogue with me, your best bet is to ask me. I will probably say no, but who knows.

An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:

Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even tho I don't remember seeing a theorem for that). But: just because there's a mapping, doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable measure on agents - those theorems don't tell us that we should expect a uniform distribution on utility/reward functions, or even a nice distribution! They would if agents were born with utility functions in their heads represented as tables or something, where you could swap entries in different rows, but that's not what the theorems say!

Suppose there are two online identities, and you want to verify that they're associated with the same person. It's not too hard to verify this: for instance, you could tell one of them something secretly, and ask the other what you told the first. But how do you determine that two online identities are different people? It's not obvious how you do this with anything like cryptographic keys etc.

One way to do it if the identities always do what's causal-decision-theoretically correct is to have the two identities play a prisoner's dilemma with each other, and make it impossible to enforce contracts. If you're playing with yourself, you'll cooperate, but if you're playing with another person you'll defect.

That being said, this only works if the payoff difference between both identities cooperating and both identities defecting is greater than the amount a single person controlling both would pay to convince you that they're actually two people. Which means it only works if the amount you're willing to pay to learn the truth is greater than the amount they're willing to pay to deceive you.

Here's one way you can do it: Suppose we're doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means that you prove that you're using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you're one of the known public keys but not which public key you are, and (b) the world can tell that the key hash you used corresponds to the public key you 'committed' to when writing the proof.

Actually I'm being silly, you don't need ring signatures, just signatures that are associated with identities and also used for financial transfers.

Note that for this to work you need a strong disincentive against people sharing their private keys. One way to do this would be if the keys were also used for the purpose of holding cryptocurrency.

'Seminar' announcement: me talking quarter-bakedly about products, co-products, deferring, and transparency. 3 pm PT tomorrow (actually 3:10 because that's how time works at Berkeley).

I was daydreaming during a talk earlier today (my fault, the talk was great), and noticed that one diagram in Dylan Hadfield-Menell's off-switch paper looked like the category-theoretic definition of the product of two objects. Now, in category theory, the 'opposite' of a product is a co-product, which in set theory is the disjoint union. So if the product of two actions is deferring to a human about which action to take, what's the co-product? I had an idea about that which I'll keep secret until the talk, when I'll reveal it (you can also read the title to figure it out). I promise that I won't prepare any slides or think very hard about what I'm going to say. I also won't really know what I'm talking about, so hopefully one of you will. The talk will happen in my personal zoom room. Message me for the passcode.

I do not have many ideas here, so it might mostly be me talking about the category-theoretic definition of products and co-products.
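For reference, here is a minimal Python sketch of products and co-products in the category of sets, where the product is the set of pairs and the co-product is the disjoint (tagged) union. The names `pair`, `copair`, `Left`, and `Right` are illustrative choices, and nothing here is specific to the off-switch setting.

```python
from dataclasses import dataclass


# Product A x B: pairs, with two projections out of it.
def fst(p):
    return p[0]


def snd(p):
    return p[1]


def pair(f, g):
    """Universal map into a product: from f: Z -> A and g: Z -> B, build Z -> A x B."""
    return lambda z: (f(z), g(z))


# Co-product A + B: tagged union, with two injections into it.
@dataclass(frozen=True)
class Left:
    value: object


@dataclass(frozen=True)
class Right:
    value: object


def copair(f, g):
    """Universal map out of a co-product: from f: A -> Z and g: B -> Z, build A + B -> Z."""
    def h(x):
        if isinstance(x, Left):
            return f(x.value)
        if isinstance(x, Right):
            return g(x.value)
        raise TypeError("expected Left or Right")
    return h


into_product = pair(len, str.upper)    # str -> int x str
out_of_coproduct = copair(len, abs)    # str + int -> int

print(into_product("go"))
print(out_of_coproduct(Left("go")), out_of_coproduct(Right(-3)))
```

The two constructions are mirror images: `pair` builds the unique map into the product that agrees with `f` and `g` after projecting, while `copair` builds the unique map out of the co-product that agrees with `f` and `g` after injecting.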