This is a special post for short-form writing by DanielFilan. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.

DanielFilan's Shortform Feed


As far as I can tell, people typically use the orthogonality thesis to argue that smart agents could have any motivations. But the orthogonality thesis is stronger than that, and its extra content is false - there are some goals that are too complicated for a dumb agent to have, because the agent couldn't understand those goals. I think people should instead directly defend the claim that smart agents could have arbitrary goals.

Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):

- It would be nice to study competent misgeneralization empirically, to give examples and maybe help us develop theory around it.
- Problem: how do you measure 'competence' without reference to a goal?
- Prior work has used the 'agents vs devices' framework: you put a prior over all reward functions and a likelihood over what 'real agents' would do given each reward function, then do Bayesian inference comparing that model against a model that chooses actions randomly. If, conditioned on your behaviour, you're probably an agent rather than a random actor, then you're competent.
- I don't like this:
  - It crucially relies on knowing the space of reward functions that the learner in question might have.
  - It crucially relies on knowing how agents act given certain motivations.
  - A priori, it's not obvious why we care about this metric.

- Here's another option: throw out 'competence' and talk about being 'consequential'.
- This has a name collision with 'consequentialist' that you'll probably have to fix but whatever.

- The setup: you have your learner do stuff in a multi-agent environment. You use the AUP metric on *every agent other than your learner*. You say that your learner is 'consequential' if it strongly affects the attainable utility of other agents.
- How good is this?
- It still relies on having a space of reward functions, but there's some more wiggle-room: you probably don't need to get the space exactly right, just to have goals that are similar to yours.
- Note that this would no longer be true if this were a metric you were optimizing over.

- You still need to have some idea about how agents will act realistically, because if you only look at the utility attainable by optimal policies, that might elide the fact that it's suddenly gotten much computationally harder to achieve that utility.
- That said, I still feel like this is going to degrade more gracefully, as long as you include models that are roughly right. I guess this is because this model is no longer a likelihood ratio where misspecification can just rule out the right answer.

- It's more obvious why we care about this metric.

- Bonus round: you can probably do some thinking about why various setups would tend to reduce other agents' attainable utility, prove some little theorems, etc., in the style of the power-seeking paper.
- Ideally you could even show a relation between this and the agents vs devices framing.

- I think this is the sort of project a first-year PhD student could fruitfully make progress on.
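The proposed 'consequentiality' metric can be sketched in toy code. Everything here (the resource-pool environment, the agent 'skills', the reward scalings, and all function names) is invented for illustration; the point is just the shape of the computation: average, over other agents and candidate reward functions, how much adding the learner shifts each agent's attainable utility.

```python
# Toy sketch (all names and the environment are invented) of the proposed
# metric: a learner is 'consequential' if adding it to the environment
# shifts the attainable utility of the *other* agents, averaged over a
# space of candidate reward functions.

class ResourcePool:
    """Toy shared environment: agents draw utility from a common pool."""
    def __init__(self, size):
        self.size = size

def attainable_utility(agent_skill, reward_weight, pool, learner_take=0):
    """Best utility an agent of given skill can attain from what the
    learner leaves behind, under one candidate reward function."""
    remaining = max(pool.size - learner_take, 0)
    return reward_weight * min(agent_skill, remaining)

def consequentiality(learner_take, other_agents, reward_weights, pool):
    """Mean absolute shift in other agents' attainable utility when the
    learner is added to the environment."""
    shifts = []
    for skill in other_agents:
        for w in reward_weights:
            baseline = attainable_utility(skill, w, pool)
            with_learner = attainable_utility(skill, w, pool, learner_take)
            shifts.append(abs(baseline - with_learner))
    return sum(shifts) / len(shifts)

pool = ResourcePool(size=10)
others = [4, 6]        # 'skills' of the other agents
rewards = [1.0, 2.0]   # candidate reward functions (here, just scalings)

# A learner that grabs nothing is inconsequential; one that grabs most of
# the pool strongly reduces others' attainable utility.
print(consequentiality(0, others, rewards, pool))   # 0.0
print(consequentiality(8, others, rewards, pool))   # 4.5
```

Note how the metric degrades gracefully: a reward scaling that's somewhat wrong still registers that the learner reduced what others could attain.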

Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.

This is a fun Aumann paper that talks about what players have to believe to be in a Nash equilibrium. Here, instead of imagining agents randomizing, we imagine that the probabilities over actions live in the heads of the other agents: you might well know exactly what you're going to do, as long as I don't. It shows that in 2-player games, you can write down conditions involving mutual knowledge **but not common knowledge** that imply the players are at a Nash equilibrium: mutual knowledge of the players' conjectures about each other, of the players' rationality, and of the players' payoffs suffices. By contrast, in games with 3 or more players **you need common knowledge**: common priors, and common knowledge of conjectures about other players.

The paper writes:

One might suppose that one needs stronger hypotheses in Theorem B [about 3-player games] than in Theorem A [about 2-player games] only because when n ≥ 3, the conjectures of two players about a third one may disagree. But that is not so. One of the examples in Section 5 shows that even when the necessary agreement is assumed outright, conditions similar to those of Theorem A do not suffice for Nash equilibrium when n ≥ 3.

This is pretty mysterious to me and I wish I understood it better. Probably it would help to read more carefully thru the proofs and examples.

Got it, sort of. Once there are 3 people, each person has a conjecture about the actions of the other two. This means your distribution over their joint actions might not be the product of its marginals, so you might be maximizing expected utility with respect to your actual beliefs but not with respect to the product of the marginals - and the marginals are what are supposed to form the Nash equilibrium. Common priors and common knowledge stop this by forcing your conjectures about the different players to be independent. I still have a hard time explaining in words why this has to be true, but at least I understand the proof.
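The failure mode can be made concrete with a toy example (the numbers and payoffs are invented): player 3 holds a correlated conjecture over players 1 and 2, and their best response against that joint belief differs from their best response against the product of its marginals.

```python
from itertools import product

# Player 3's conjecture over (a1, a2), the actions of players 1 and 2.
# It's perfectly correlated, so it is NOT the product of its marginals.
joint = {(0, 0): 0.5, (1, 1): 0.5}

marg1 = {a: sum(p for (x, _), p in joint.items() if x == a) for a in (0, 1)}
marg2 = {a: sum(p for (_, y), p in joint.items() if y == a) for a in (0, 1)}
prod = {(x, y): marg1[x] * marg2[y] for x, y in product((0, 1), repeat=2)}

def eu(belief, action):
    """Player 3's expected utility: 'match' pays 1 when a1 == a2,
    'differ' pays 1 when a1 != a2."""
    return sum(p * (1.0 if (a1 == a2) == (action == "match") else 0.0)
               for (a1, a2), p in belief.items())

# Against the actual (correlated) belief, 'match' is strictly best;
# against the product of the marginals, player 3 is indifferent.
print(eu(joint, "match"), eu(joint, "differ"))   # 1.0 0.0
print(eu(prod, "match"), eu(prod, "differ"))     # 0.5 0.5
```

So a player can be maximizing expected utility against their real conjecture while failing to best-respond to the marginals, which is exactly the gap that common priors and common knowledge of conjectures close.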

Quantitative claims about code maintenance from Working in Public, plausibly relevant to discussion of code rot and machine intelligence:

- "most computer programmers begin their careers doing software maintenance, and many never do anything but", attributed to Nathan Ensmenger, professor at Indiana University.
- "most software at Google gets rewritten every few years", attributed to Fergus Henderson of Google.
- "A 2018 Stripe survey of software developers suggested that developers spend 42% of their time maintaining code" - link
- "Nathan Ensmenger, the informatics professor, notes that, since the early 1960s, maintenance costs account for 50% to 70% of total expenditures on software development" - paper

An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:

Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even tho I don't remember seeing a theorem for that). But: just because there's a mapping, doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable measure on agents - those theorems don't tell us that we should expect a uniform distribution on utility/reward functions, or even a nice distribution! They would if agents were born with utility functions in their heads represented as tables or something, where you could swap entries in different rows, but that's not what the theorems say!

Suppose there are two online identities, and you want to verify that they're associated with the same person. It's not too hard to verify this: for instance, you could tell one of them something secretly, and ask the other what you told the first. But how do you determine that two online identities are different people? It's not obvious how you do this with anything like cryptographic keys etc.

One way to do it if the identities always do what's causal-decision-theoretically correct is to have the two identities play a prisoner's dilemma with each other, and make it impossible to enforce contracts. If you're playing with yourself, you'll cooperate, but if you're playing with another person you'll defect.

That being said, this only works if the payoff difference between both identities cooperating and both identities defecting is greater than the amount a single person controlling both would pay to convince you that they're actually two people. Which means it only works if the amount you're willing to pay to learn the truth is greater than the amount they're willing to pay to deceive you.
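The logic above can be sketched in toy code (the payoff matrix and the 'deception value' are invented for illustration): a CDT agent facing a genuinely separate person defects regardless of beliefs, while one person controlling both identities cooperates, unless being believed to be two people is worth more than the payoff gap.

```python
# Toy sketch of the proposed identity test. Payoffs are a standard
# prisoner's dilemma; 'deception_value' is what a single person would
# gain by convincing you they're two people.

PAYOFF = {  # (my action, their action) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def cdt_action(p_opponent_cooperates):
    """Against a separate person, defection dominates for any belief."""
    p = p_opponent_cooperates
    eu = {a: p * PAYOFF[(a, "C")] + (1 - p) * PAYOFF[(a, "D")]
          for a in ("C", "D")}
    return max(eu, key=eu.get)

def single_person_actions(deception_value):
    """One person controls both identities: compare total payoff from
    mutual cooperation against mutual defection plus the value of
    successfully posing as two people."""
    cooperate_total = PAYOFF[("C", "C")] * 2
    defect_total = PAYOFF[("D", "D")] * 2 + deception_value
    return ("C", "C") if cooperate_total > defect_total else ("D", "D")

print(cdt_action(0.9))            # 'D': separate people defect
print(single_person_actions(0))   # ('C', 'C'): one person cooperates
print(single_person_actions(10))  # ('D', 'D'): deception worth the cost
```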

Here's one way you can do it: Suppose we're doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means that you prove that you're using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you're one of the known public keys but not which public key you are, and (b) the world can tell that the key hash you used corresponds to the public key you 'committed' to when writing the proof.
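The linkability half of this can be illustrated with a toy 'key image' (this is NOT a real linkable ring signature scheme; real constructions also prove ring membership in zero knowledge, which is omitted here): each signature carries a deterministic tag derived from the private key alone, so the world can check whether two posts came from the same key without learning which key it is.

```python
import hashlib

# Toy illustration of linkability only. A real scheme would additionally
# prove, in zero knowledge, that the signer's key is one of the known
# public keys; here we just show how a deterministic key image links
# posts by the same signer.

def key_image(private_key: bytes) -> str:
    """Deterministic tag computable only by the key's holder."""
    return hashlib.sha256(b"key-image:" + private_key).hexdigest()

def same_signer(image_a: str, image_b: str) -> bool:
    """Two posts link iff their key images match."""
    return image_a == image_b

alice, bob = b"alice-secret-key", b"bob-secret-key"

print(same_signer(key_image(alice), key_image(alice)))  # True: one person
print(same_signer(key_image(alice), key_image(bob)))    # False: two people
```

So two identities that always post with distinct key images are, up to hash collisions and key-sharing, backed by distinct keys.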

'Seminar' announcement: me talking quarter-bakedly about products, co-products, deferring, and transparency. 3 pm PT tomorrow (actually 3:10 because that's how time works at Berkeley).

I was daydreaming during a talk earlier today (my fault, the talk was great), and noticed that one diagram in Dylan Hadfield-Menell's off-switch paper looked like the category-theoretic definition of the product of two objects. Now, in category theory, the 'opposite' of a product is a co-product, which in set theory is the disjoint union. So if the product of two actions is deferring to a human about which action to take, what's the co-product? I had an idea about that which I'll keep secret until the talk, when I'll reveal it (you can also read the title to figure it out). I promise that I won't prepare any slides or think very hard about what I'm going to say. I also won't really know what I'm talking about, so hopefully one of you will. The talk will happen in my personal zoom room. Message me for the passcode.

Rationality-related writings that are more comment-shaped than post-shaped. Please don't leave top-level comments here unless they're indistinguishable to me from something I would say here.