All of Rafael Harth's Comments + Replies

Inner Alignment: Explain like I'm 12 Edition

Thanks. It's because directupload often has server issues. I was supposed to rehost all images from my posts to a more reliable host, but apparently forgot this one. I'll fix it in a couple of hours.

Open problem: how can we quantify player alignment in 2x2 normal-form games?

I'll take a shot at this. Let and be the sets of actions of Alice and Bob. Let (where 'n' means 'nice') be function that orders by how good the choices are for Alice, assuming that Alice gets to choose second. Similarly, let (where 's' means 'selfish') be the function that orders by how good the choices are for Bob, assuming that Alice gets to choose second. Choose some function measuring similarity between two orderings of a finite set (should range over ); the alignment of with is then .

Example: in... (read more)

Why I'm excited about Debate

I think the Go example really gets to the heart of why I think Debate doesn't cut it.

Your comment is an argument against using Debate to settle moral questions. However, what if Debate is trained on Physics and/or math questions, with the eventual goal of asking "what is a provably secure alignment proposal?"

2Charlie Steiner9moGood question. There's a big roadblock to your idea as stated, which is that asking something to define "alignment" is a moral question. But suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?" I don't know - I think it still depends on how bad humans are as judges of arguments - we've made the domain more objective, but maybe there's some policy of argumentation that still wins by what we would consider cheating. I can imagine being convinced that it would work by seeing Debates play out with superhuman litigators, but since that's a very high bar maybe I should apply more creativity to my expextations.
Debate update: Obfuscated arguments problem

In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew noting except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?

If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a set of ... (read more)

2Beth Barnes9moNot systematically; I would be excited about people doing these experiments. One tricky thing is that you might think this is a strategy that's possible for ML models, but that humans aren't naturally very good. Yeah, this is a great summary. One thing I would clarify is that it's sufficient that the set of arguments don't look suspicious to the judge. The arguments might look suspicious to the expert, but unless they have a way to explain to the judge why it's suspicious, we still have a problem. Yeah, I think that is the obvious next step. The concern is that the reasons the argument is suspicious may be hard to justify in a debate, especially if they're reasons of the form 'look, I've done a bunch of physics problems, and approaching it this way feels like it will makes things messy, whereas approaching it this way feels cleaner'. Debate probably doesn't work very well for supervising knowledge that's gained through finding patterns in data, as opposed to knowledge that's gained through step-by-step reasoning. Something like imitative generalisation (AKA 'learning the prior') [] is trying to fill this gap.
Conclusion to 'Reframing Impact'

Fantastic sequence! Certainly, for anyone other than you, the deconfusion/time investment ratio of reading this is excellent. You really succeeded in making the core insights accessible. I'd even say it compares favorably to the recommended sequences in the Alignment Forum in that regard.

I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.

2Alex Turner1yI'm very glad you enjoyed it! I'd say so, yes.
Attainable Utility Preservation: Scaling to Superhuman

(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But setting seems to sacrifice quite a lot of performance. Is this real or am I missing something?

Namely, whenever there's an action which doesn't change the state and leads to 1 reward, and a sequence of actions such that has reward with (and all have 0 reward), then it's conceivable that would c... (read more)

2Alex Turner1yFor optimal policies, yes. In practice, not always - in SafeLife [], AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm! Yeah. I'm also pretty sympathetic to arguments by Rohin and others that theRaux= Rvariant isn't quite right in general; maybe there's a better way to formalize "do the thing without gaining power to do it" wrt the agent's own goal. I think this is plausible, yep. This is why I think it's somewhat more likely than not there's no clean way to solve this; however, I haven't even thought very hard about how to solve the problem yet. Depends on how that shows up in the non-embedded formalization, if at all. If it doesn't show up, then the optimal policy won't be able to predict any benefit and won't do it. If it does... I don't know. It might. I'd need to think about it more, because I feel confused about how exactly that would work - what its model of itself is, exactly, and so on.
Inner Alignment: Explain like I'm 12 Edition

Many thanks for taking the time to find errors.

I've fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I've just made a vague statement that misalignment can arise for other reasons and linked to Paul's post.

I'm hesitant to change #4 before I fully understand why.

I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world

... (read more)
Attainable Utility Preservation: Empirical Results

An early punchline in this sequence was "Impact is a thing that depends on the goals of agents; it's not about objective changes in the world." At that point, I thought "well, in that case, impact measures require agents to learn those goals, which means it requires value learning." Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get... (read more)

2Alex Turner1yYes, this is exactly what the plan was. :) Yeah, but one doesn't involve visibly destroying an object, which matters for certain impact measures (like whitelisting [] ). You're right that they're quite similar. Turns out you don't need the normalization, per the linked SafeLife paper. I'd probably just take it out of the equations, looking back. Complication often isn't worth it. I think the n-step stepwise inaction baseline doesn't fail at any of them?
Attainable Utility Preservation: Concepts

I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn't press a button that blows up the earth...)

It does seem that AUP will make it so an agent doesn't want to be shut off, though. If it's shut off, its power goes way down (to zero if... (read more)

1Alex Turner1yAnd why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to... power gain, it seems. I think that AUPconceptualshould work just fine for penalizing-increases-only. I think this is much less of a problem in the "penalize increases with respect to agent inaction" scenario.
Attainable Utility Landscape: How The World Is Changed

The technical appendix felt like it was more difficult than previous posts, but I had the advantage of having tried to read the paper from the preceding post yesterday and managed to reconstruct the graph & gamma correctly.

The early part is slightly confusing, though. I thought AU is a thing that belongs to the goal of an agent, but the picture made it look as if it's part of the object ("how fertile is the soil?"). Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?

I did interpret the firs... (read more)

2Alex Turner1yyes yeah, this is because those are “nonstationary” policies - you change your mind about what to do at a given state. A classic result in MDP theory is that you never need these policies to find an optimal policy. yup!
Seeking Power is Often Convergently Instrumental in MDPs

Thoughts after reading and thinking about this post

The thing that's bugging me here is that Power and Instrumental convergence seem to be almost the same.

In particular, it seems like Power asks [a state]: "how good are you across all policies" and Instrumental Convergence asks: "for how many policies are you the best?". In an analogy to tournaments where policies are players, power cares about the average performance of a player across all tournaments, and instrumental convergence about how many first places that player got. In tha... (read more)

2Alex Turner1yYes, this is roughly correct! As an additional note: it turns out, however, that even if you slightly refine the notion of "power that this part of the future gives me, given that I start here", you have neither "more power→instrumental convergence" nor "instrumental convergence→more power" as logical implications. Instead, if you're drawing the causal graph, there are many, many situations which cause both instrumental convergence and greater power. The formal task is then, "can we mathematically characterize those situations?". Then, you can say, "power-seeking will occur for optimal agents with goals from [such and such distributions] for [this task I care about] at [these discount rates]".
World State is the Wrong Abstraction for Impact

Thoughts I have at this point in the sequence

  • This style is extremely nice and pleasant and fun to read. I saw that the first post was like that months ago; I didn't expect the entire sequence to be like that. I recall what you said about being unable to type without feeling pain. Did this not extend to handwriting?
  • The message so far seems clearly true in the sense that measuring impact by something that isn't ethical stuff is a bad idea, and making that case is probably really good.
  • I do have the suspicion that quantifying impact properly is impossible with
... (read more)
1Alex Turner1yThank you! I poured a lot into this sequence, and I'm glad you're enjoying it. :) Looking forward to what you think of the rest! Handwriting was easier, but I still had to be careful not to do more than ~1 hour / day.
Will humans build goal-directed agents?

In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term.

Is this true? Since ML generally doesn't choose an algorithm directly but runs a search over a parameter space, it seems speculative to assume that the resulting model, if it is a mesa-optimizer and goal-directed, only cares about its episode. If it learned that optimizing for X is good for reward, it seems at least conceivable that it won't understand that it shouldn't care about instances of X that appear in future episodes.

3Rohin Shah1yA few points: 1. It's not clear that the current deep RL paradigm would lead to a mesa optimizer. I agree it could happen, but I would like to see an argument as to why it is likely to happen. (I think there is probably a stronger case that any general intelligence we build will need to be a mesa optimizer and therefore goal-directed, and if so that argument should be added to this list.) 2. Even if we did get a mesa optimizer, the base optimizer (e.g. gradient descent) would plausibly select for mesa optimizers that care only up till the end of the episode. A mesa optimizer that wasn't myopic in this way might spend the entire episode learning and making money that it can use in the future, and as a result get no training reward, and so would be selected against by the outer optimizer.
How special are human brains among animal brains?

I might be confused here, but it seems to me that it's easy to interpret the arguments in this post as evidence in the wrong direction.

I see the following three questions as relevant:

1. How much sets human brains apart from other brains?

2. How much does the thing that humans have and animals don't matter?

3. How much does better architecture matter for AI?

Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it's evidence that architectural changes matter a lot. However, holding #2 constant, #1 an... (read more)

1Alex Zhu2yNot necessarily. For example, it may be that language ability is very important, but that most of the heavy lifting in our language ability comes from general learning abilities + having a culture that gives us good training data for learning language, rather than from architectural changes.
Topological Fixed Point Exercises

Ex 5 (fixed version)

Let denote the triangle. For each , construct a 2-d simplex with nodes in , where the color of a point corresponds to the place in the disk that carries that point to, then choose to be a point within a trichromatic triangle in the graph. Then is a bounded sequence having a limit point . Let be the center of the disc; suppose that . Then there is at least one region of the disc that doesn't touch. Let be the distance to the furthest side, that is, let ... (read more)

Topological Fixed Point Exercises

I'm late, but I'm quite proud of this proof for #4:

Call the large triangle a graph and the triangles simply triangles. First, note that for any size, there is a graph where the top node is colored red, the remaining nodes on the right diagonal are colored green, and all nodes not on the right diagonal are colored blue. This graph meets the conditions, and has exactly one trichromatic triangle, namely the one at the top (no other triangle contains a red node). It is trivial to see that this graph can be changed into an arbitrary graph by re-col... (read more)

Topological Fixed Point Exercises

Ex 1

Let and . Given an edge , let denote the map that maps the color of the left to that of the right node. Given a , let . Let denote the color blue and the color green. Let be 1 if edge is bichromatic, and 0 otherwise. Then we need to show that . We'll show , which is a striclty stronger statement than the contrapositive.

For , the LHS is equivalent to , and indeed equals if is bichromatic, and o... (read more)

Diagonalization Fixed Point Exercises

Ex 4

Given a computable function , define a function by the rule . Then is computable, however, because for , we have that and .

Ex 5:

We show the contrapositive: given a function halt, we construct a surjective function from to as follows: enumerate all turing machines, such that each corresponds to a string. Given a , if does not decode to a turing machine, set . If it does, let denote that turning machine. Let ... (read more)

Diagonalization Fixed Point Exercises

Ex 1

Exercise 1: Let and let . Suppose that , then let be an element such that . Then by definition, and . So , a contradiction. Hence , so that is not surjective.

Ex 2

Exercise 2: Since is nonempty, it contains at least one element . Let be a function without a fixed point, then , so that and are two different elements in (this is the only thing we shall use the function for).

Let for nonempty. Suppose by contradiction that is surject... (read more)