I think the Go example really gets to the heart of why I think Debate doesn't cut it.
Your comment is an argument against using Debate to settle moral questions. However, what if Debate is trained on physics and/or math questions, with the eventual goal of asking "what is a provably secure alignment proposal?"
In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false (i.e., assign probabilities that doubt each component equally). I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing whether they can find the flaw?
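To make concrete what I mean by "doubting each component equally" (this is my own toy rendering, not anything from the post): if an argument is a conjunction of n steps and you give the overall claim credence q, spreading the doubt evenly assigns each step credence q^(1/n), so each step can look individually plausible even when the conjunction is improbable.

```python
# Toy illustration (my own construction): an argument decomposed into n
# independent steps. Knowing only that the overall claim has credence q,
# "equal doubt" gives each step the credence whose n-fold product is q.

def equal_doubt(q: float, n: int) -> float:
    """Per-step credence such that n independent steps multiply to q."""
    return q ** (1.0 / n)

# e.g. a 6-step argument for a claim with 10% overall credence:
per_step = equal_doubt(0.10, 6)
print(round(per_step, 3))  # each step looks individually plausible (~0.68)
```

The point of the sketch is just that the per-step numbers reveal nothing about which step is flawed.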
If I had to summarize this finding in one sentence, it would be: "it seems like an expert can generally find a flawed set of arguments for a false claim such that an equally competent expert can't identify the flawed component, and the set of arguments doesn't immediately look suspect." This seems surprising, and I'm wondering whether it's unique to physics. (The cryptographic example was of this kind, but there, the structure of the dishonest arguments was suspect.)
If this finding holds, my immediate reaction is "okay, in this case, the solution for the honest debater is to start a debate about whether the set of arguments from the dishonest debater has this character." I'm not sure how good this sounds. I think my main issue here is that I don't know enough physics to understand why the dishonest arguments are hard to identify.
Fantastic sequence! Certainly, for anyone other than you, the deconfusion/time investment ratio of reading this is excellent. You really succeeded in making the core insights accessible. I'd even say it compares favorably to the recommended sequences in the Alignment Forum in that regard.
I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.
(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But setting R_aux := R seems to sacrifice quite a lot of performance. Is this real, or am I missing something?
Namely, whenever there's an action a which doesn't change the state and leads to 1 reward, and a sequence a_1, ..., a_n of actions such that a_n yields reward m with m > n (and all a_i with i < n yield 0 reward), then it's conceivable that R_AUP with a mild penalty would choose the (a_i)_{1≤i≤n} sequence while R_AUP with a stiff penalty would just stubbornly repeat a, even if the (a_i)_{1≤i≤n} represent something very tailored to R that doesn't involve obtaining a lot of resources. In other words, it seems to penalize reasonable long-term thinking more than the formulas where R_aux ≠ R. This feels like a rather big deal since we arguably want an agent to think long-term as long as it doesn't involve gaining power. I guess the scaling step might help here?
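Here's the arithmetic of the tradeoff I have in mind, as a toy model (mine, not the sequence's formalism): charge each step of the n-step plan a flat penalty c, while the state-preserving action incurs none. A large enough c flips the agent's preference even though the plan is strictly better on raw reward.

```python
# Toy model (my own, not the post's formalism): compare repeating a
# state-preserving action worth 1 reward/step against an n-step plan whose
# intermediate steps earn 0 and whose last step earns m > n. Each plan step
# incurs a hypothetical flat AUP-style penalty c; the repeated action incurs none.

def prefers_plan(n: int, m: float, c: float) -> bool:
    repeat_value = n * 1.0   # n steps of the safe, state-preserving action
    plan_value = m - n * c   # plan payoff minus accumulated penalty
    return plan_value > repeat_value

print(prefers_plan(5, 8, 0.1))  # mild penalty: plan wins (8 - 0.5 > 5)
print(prefers_plan(5, 8, 1.0))  # stiff penalty: agent stubbornly repeats a
```

This is of course much cruder than the actual penalty term, but it shows why the penalty scale, not just its sign, decides whether benign long-term plans survive.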
Separately and very speculatively, I'm wondering whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model. The decision to make such a hack should come with a vast increase in AU for its primary goal, but it wouldn't be caught by your penalty since it's about an internal change. If so, that might be a sign that it'll be difficult to fix. More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?
Many thanks for taking the time to find errors.
I've fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I've just made a vague statement that misalignment can arise for other reasons and linked to Paul's post.
I'm hesitant to change #4 before I fully understand why it's wrong.
I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.
So, there are these two channels, input data and SGD. If the model's objective can only be modified by SGD, then (since SGD doesn't want to do super complex modifications), it is easier for SGD to create a pointer rather than duplicate the [model of the base objective] explicitly.
But the bolded part seemed like a necessary condition, and that's what I'm trying to say in the part you quoted. Without this condition, I figured the model could just modify [its objective] and [its model of the Base Objective] in parallel through processing input data. I still don't think I quite understand why this isn't plausible. If the [model of Base objective] and the [Mesa Objective] get modified simultaneously, I don't see any one step where this is harder than creating a pointer. You seem to need an argument for why [the model of the base objective] gets represented in full before the Mesa Objective is modified.
Edit: I slightly rephrased it to say
If we further assume that processing input data doesn't directly modify the model's objective (the Mesa Objective), or that its model of the Base Objective is created first,
An early punchline in this sequence was "Impact is a thing that depends on the goals of agents; it's not about objective changes in the world." At that point, I thought "well, in that case, impact measures require agents to learn those goals, which means it requires value learning." Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.
I now think of the main results of the sequence thus far as "impact depends on goals (part 1); nonetheless, an impact measure can just be about the power of the agent (part 2)."
Attempted Summary/Thoughts on this post
I was initially writing a comment about how AUP_conceptual doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn't press a button that blows up the earth...)
It does seem that AUP_conceptual will make it so an agent doesn't want to be shut off, though. If it's shut off, its power goes way down (to zero if it won't be turned on again). This might be fine, but it contradicts the utility indifference approach. And it feels dangerous – it seems like we would need an assurance like "AUP_conceptual will always prevent an agent from gaining enough power to resist being switched off."
The technical appendix felt like it was more difficult than previous posts, but I had the advantage of having tried to read the paper from the preceding post yesterday and managed to reconstruct the graph & gamma correctly.
The early part is slightly confusing, though. I thought AU is a thing that belongs to the goal of an agent, but the picture made it look as if it's part of the object ("how fertile is the soil?"). Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?
I did interpret the first exercise as "you planned to go onto the moon" and came up with stuff like "how valuable are the stones I can take home" and "how pleasant will it be to hang around."
One thing I noticed is that the formal policies don't allow for all possible "strategies." In the graph we had to reconstruct, I can't start at s1, then go to s1 once and then go to s3. So you could think of the larger set ΠL where the policies are allowed to depend on the time step. But I assume there's no point unless the reward function also depends on the time step. (I don't know anything about MDPs.)
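To spell out the distinction I'm gesturing at (a minimal sketch with made-up state and action names): a stationary policy maps each state to a fixed action, while a time-dependent policy also takes the step index, which is what would let an agent stay at s1 for a while and then leave for s3.

```python
# Sketch of stationary vs. time-dependent policies on a toy state space.
# States and actions are just strings here; the names are illustrative.

stationary = {"s1": "stay"}                     # pi(s): same choice forever

def time_dependent(state: str, t: int) -> str:  # pi(s, t): can change with t
    if state == "s1" and t < 2:
        return "stay"                           # loop at s1 for two steps...
    return "go_s3"                              # ...then move on to s3

print(time_dependent("s1", 0))  # stay
print(time_dependent("s1", 2))  # go_s3
```

As far as I understand, for a fixed (time-independent) reward function there is always an optimal stationary policy, which is presumably why the formalism doesn't bother with the larger set.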
Am I correct that a deterministic transition function is a function T:S×A→S and a non-deterministic one is a function T:S×A×S→[0,1]?
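As a type sketch of the two signatures I have in mind (my own toy rendering, with made-up states and actions): the deterministic T returns a single next state, while the stochastic T returns a probability for each candidate next state, which must sum to 1 for each fixed (s, a).

```python
# Type sketch of the two transition-function signatures (my reading).

# Deterministic: T(s, a) -> s'
def T_det(s: str, a: str) -> str:
    return {"s1": {"left": "s2", "right": "s3"}}[s][a]

# Stochastic: T(s, a, s') -> probability; sums to 1 over s' for fixed (s, a)
def T_stoch(s: str, a: str, s_next: str) -> float:
    table = {("s1", "left"): {"s2": 0.9, "s3": 0.1}}
    return table[(s, a)].get(s_next, 0.0)

total = sum(T_stoch("s1", "left", s) for s in ["s1", "s2", "s3"])
print(T_det("s1", "left"))  # s2
print(total)                # probabilities over successors sum to 1
```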
Thoughts after reading and thinking about this post
The thing that's bugging me here is that Power and Instrumental convergence seem to be almost the same.
In particular, it seems like Power asks [a state]: "how good are you across all policies" and Instrumental Convergence asks: "for how many policies are you the best?". In an analogy to tournaments where policies are players, power cares about the average performance of a player across all tournaments, and instrumental convergence about how many first places that player got. In that analogy, the statement that "most goals incentivize gaining power over that environment" would then be "for most tournaments, the first place finisher is someone with good average performance." With this formulation, the statement
formal POWER contributions of different possibilities are approximately proportionally related to instrumental convergence.
seems to be exactly what you would expect (more first places should strongly correlate with better performance). And to construct a counter-example, one creates a state with a lot of second places (i.e., a lot of policies for which it is the second best state) but few first places. I think the graph in the "Formalizations" section does exactly that. If the analogy is sound, it feels helpful to me.
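A toy version of the tournament analogy (my own construction, not the paper's formalism): draw random "tournaments" as reward assignments over states, then compare each state's average reward (the Power-like quantity) with how often it comes first (the instrumental-convergence-like quantity). A state engineered to be consistently mediocre-good collects a competitive average while rarely taking first place.

```python
import random

# Toy check of the analogy (my construction): each "tournament" is a random
# reward draw over states; compare a state's average reward (Power-like)
# with its count of first places (instrumental-convergence-like).

random.seed(0)
states = ["A", "B", "C"]
avg = {s: 0.0 for s in states}
firsts = {s: 0 for s in states}
trials = 10_000

for _ in range(trials):
    # State B is engineered to often come second: solid but rarely the best.
    reward = {"A": random.uniform(0, 1),
              "B": random.uniform(0.4, 0.6),  # consistently mediocre-good
              "C": random.uniform(0, 1)}
    for s in states:
        avg[s] += reward[s] / trials
    firsts[max(states, key=reward.get)] += 1

print({s: round(avg[s], 2) for s in states})
print(firsts)  # B's average is competitive, but it rarely takes first place
```

If the analogy is sound, this is exactly the kind of counter-example construction described above: many second places, few firsts.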
(This is all without having read the paper. I think I'd need to know more of the theory behind MDPs to understand it.)
Thoughts I have at this point in the sequence