Rationality-related writings that are more comment-shaped than post-shaped. Please don't leave top-level comments here unless they're indistinguishable to me from something I would say here.

As far as I can tell, people typically use the orthogonality thesis to argue that smart agents could have any motivations. But the orthogonality thesis is stronger than that, and its extra content is false - there are some goals that are too complicated for a dumb agent to have, because the agent couldn't understand those goals. I think people should instead directly defend the claim that smart agents could have arbitrary goals.

I no longer endorse this claim about what the orthogonality thesis says.

Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):

every agent other than your learner. You say that your learner is 'consequential' if it strongly affects the attainable utility of other agents.

This is a fun Aumann paper that talks about what players have to believe to be in a Nash equilibrium. Here, instead of imagining agents randomizing, we're imagining that the probabilities over actions live in the heads of the other agents: you might well know exactly what you're going to do, as long as I don't. It shows that in 2-player games, you can write down conditions that involve mutual knowledge but not common knowledge, and that imply that the players are at a Nash equilibrium: mutual knowledge of players' conjectures about each other, players' rationality, and players' payoffs suffices. By contrast, in games with 3 or more players, you need common knowledge: common priors, and common knowledge of conjectures about other players. The paper writes:

This is pretty mysterious to me and I wish I understood it better. Probably it would help to read more carefully thru the proofs and examples.

Got it, sort of. Once you have 3 people, each person has a conjecture about the actions of the other two people. This means that your joint distribution over their actions might not be the product of its marginals over each opponent's actions, so you might be maximizing expected utility wrt your actual beliefs, but not wrt the product of the marginals - and the marginals are what are supposed to form the Nash equilibrium. Common knowledge and common priors stop this by forcing your conjectures about the different players to be independent. I still have a hard time explaining in words why this has to be true, but at least I understand the proof.
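To make the gap between a joint conjecture and the product of its marginals concrete, here's a toy sketch. Everything in it is made up for illustration and is not taken from the paper: the two opponent actions "L" and "R", the perfectly correlated conjecture, and player 1's two options "bet_same" and "safe" with their payoffs.

```python
from itertools import product

ACTIONS = ["L", "R"]

# Hypothetical conjecture of player 1 over (player 2's action, player 3's action).
# The two opponents are believed to be perfectly correlated.
joint = {("L", "L"): 0.5, ("R", "R"): 0.5, ("L", "R"): 0.0, ("R", "L"): 0.0}


def marginal(belief, index):
    """Marginal distribution over one opponent's action (index 0 or 1)."""
    out = {a: 0.0 for a in ACTIONS}
    for profile, p in belief.items():
        out[profile[index]] += p
    return out


def product_of_marginals(belief):
    """The independent belief that has the same marginals as `belief`."""
    m2, m3 = marginal(belief, 0), marginal(belief, 1)
    return {(a2, a3): m2[a2] * m3[a3] for a2, a3 in product(ACTIONS, ACTIONS)}


def payoff(own, a2, a3):
    """Illustrative payoffs for player 1: 'safe' always pays 0,
    'bet_same' pays 1 if the opponents match and -2 otherwise."""
    if own == "safe":
        return 0.0
    return 1.0 if a2 == a3 else -2.0


def expected_utility(own, belief):
    return sum(p * payoff(own, a2, a3) for (a2, a3), p in belief.items())


def best_response(belief):
    return max(["bet_same", "safe"], key=lambda own: expected_utility(own, belief))


independent = product_of_marginals(joint)
print(best_response(joint))        # best response to the actual (correlated) belief
print(best_response(independent))  # best response to the product of the marginals
```

With these numbers, betting that the opponents match is optimal against the correlated conjecture but not against the product of its marginals, so being rational wrt your actual beliefs doesn't make your action a best response to the marginals.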

Quantitative claims about code maintenance from Working in Public, plausibly relevant to discussion of code rot and machine intelligence:

Suppose there are two online identities, and you want to verify that they're associated with the same person. It's not too hard to verify this: for instance, you could tell one of them something secretly, and ask the other what you told the first. But how do you determine that two online identities are different people? It's not obvious how you do this with anything like cryptographic keys etc.

One way to do it if the identities always do what's causal-decision-theoretically correct is to have the two identities play a prisoner's dilemma with each other, and make it impossible to enforce contracts. If you're playing with yourself, you'll cooperate, but if you're playing with another person you'll defect.

That being said, this only works if the payoff difference between both identities cooperating and both identities defecting is greater than the amount a single person controlling both would pay to convince you that they're actually two people. Which means it only works if the amount you're willing to pay to learn the truth is greater than the amount they're willing to pay to deceive you.

Here's one way you can do it: Suppose we're doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means that you prove that you're using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you're one of the known public keys but not which public key you are, and (b) the world can tell that the key hash you used corresponds to the public key you 'committed' to when writing the proof.

Actually I'm being silly, you don't need ring signatures, just signatures that are associated with identities and also used for financial transfers.

Note that for this to work you need a strong disincentive against people sharing their private keys. One way to do this would be if the keys were also used for the purpose of holding cryptocurrency.

'Seminar' announcement: me talking quarter-bakedly about products, co-products, deferring, and transparency. 3 pm PT tomorrow (actually 3:10 because that's how time works at Berkeley).

I was daydreaming during a talk earlier today (my fault, the talk was great), and noticed that one diagram in Dylan Hadfield-Menell's off-switch paper looked like the category-theoretic definition of the product of two objects. Now, in category theory, the 'opposite' of a product is a co-product, which in set theory is the disjoint union. So if the product of two actions is deferring to a human about which action to take, what's the co-product? I had an idea about that which I'll keep secret until the talk, when I'll reveal it (you can also read the title to figure it out). I promise that I won't prepare any slides or think very hard about what I'm going to say. I also won't really know what I'm talking about, so hopefully one of you will. The talk will happen in my personal zoom room. Message me for the passcode.

I do not have many ideas here, so it might mostly be me talking about the category-theoretic definition of products and co-products.