Rafael Harth


Factored Cognition


Why I'm excited about Debate

I think the Go example really gets to the heart of why I think Debate doesn't cut it.

Your comment is an argument against using Debate to settle moral questions. However, what if Debate is trained on Physics and/or math questions, with the eventual goal of asking "what is a provably secure alignment proposal?"

Debate update: Obfuscated arguments problem

In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew noting except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?

If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a set of arguments for a false claim that is flawed such that an equally competent expert can't identify the flawed component, and the set of arguments doesn't immediately look suspect". This seems surprising, and I'm wondering whether it's unique to physics. (The cryptographic example was of this kind, but there, the structure of the dishonest arguments was suspect.)

If this finding holds, my immediate reaction is "okay, in this case, the solution for the honest debater is to start a debate about whether the set of arguments from the dishonest debater has this character". I'm not sure how good this sounds. I think my main issue here is that I don't know enough physics understand why the dishonest arguments are hard to identify

Conclusion to 'Reframing Impact'

Fantastic sequence! Certainly, for anyone other than you, the deconfusion/time investment ratio of reading this is excellent. You really succeeded in making the core insights accessible. I'd even say it compares favorably to the recommended sequences in the Alignment Forum in that regard.

I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.

Attainable Utility Preservation: Scaling to Superhuman

(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But setting seems to sacrifice quite a lot of performance. Is this real or am I missing something?

Namely, whenever there's an action which doesn't change the state and leads to 1 reward, and a sequence of actions such that has reward with (and all have 0 reward), then it's conceivable that would choose the sequence while would just stubbornly repeat , even if the represent something very tailored to that doesn't involve obtaining a lot of resources. In other words, it seems to penalize reasonable long-term thinking more than the formulas where . This feels like a rather big deal since we arguably want an agent to think long-term as long as it doesn't involve gaining power. I guess the scaling step might help here?

Separately and very speculatively, I'm wondering whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model. The decision to make such a hack should come with a vast increase in AU for its primary goal, but it wouldn't be caught by your penalty since it's about an internal change. If so, that might be a sign that it'll be difficult to fix. More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

Inner Alignment: Explain like I'm 12 Edition

Many thanks for taking the time to find errors.

I've fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I've just made a vague statement that misalignment can arise for other reasons and linked to Paul's post.

I'm hesitant to change #4 before I fully understand why.

I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.

So, there are these two channels, input data and SGD. If the model's objective can only be modified by SGD, then (since SGD doesn't want to do super complex modifications), it is easier for SGD to create a pointer rather than duplicate the [model of the base objective] explicitly.

But the bolded part seemed like a necessary condition, and that's what I'm trying to say in the part you quoted. Without this condition, I figured the model could just modify [its objective] and [its model of the Base Objective] in parallel through processing input data. I still don't think I quite understand why this isn't plausible. If the [model of Base objective] and the [Mesa Objective] get modified simultaneously, I don't see any one step where this is harder than creating a pointer. You seem to need an argument for why [the model of the base objective] gets represented in full before the Mesa Objective is modified.

Edit: I slightly rephrased it to say

If we further assume that processing input data doesn't directly modify the model's objective (the Mesa Objective), or that its model of the Base Objective is created first,

Attainable Utility Preservation: Empirical Results

An early punchline in this sequence was "Impact is a thing that depends on the goals of agents; it's not about objective changes in the world." At that point, I thought "well, in that case, impact measures require agents to learn those goals, which means it requires value learning." Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.

I now think of the main results of the sequence thus far as "impact depends on goals (part 1); nonetheless an impact measure can just be about power of the agent (part 2)"

Attempted Summary/Thoughts on this post

  • GridWorlds is a toy environment (probably meant to be as simple as possible while still allowing to test various properties of agents). The worlds consist of small grids, the state space is correspondingly non-large, and you can program certain behavior of the environment (such as a pixel moving at a pre-defined route).
  • You can specify objectives for an agent within GridWorlds and use Reinforcement Learning to train the agent (to learn a space-transition function?). The agent can move around and behavior on collision with other agents/objects can be specified by the programmer
  • The idea now is that we program five grid worlds in such a way that they represent failure modes relevant to safety. We train (a) a RL algorithm with the objective, (b) a RL algorithm with the objective plus some implementation of AUP and see how they behave differently
  • The five failure modes are (1) causing irreversible changes, (2) damaging stuff, (3) disabling an off-swich, (4) undoing effects that result from the reaching the main objective, (5) preventing naturally occurring changes. The final two aren't things naive RL learning would do, but are failure modes for poorly specified impact penalties ("when curing cancer, make sure human still dies")
    • I don't understand how (1) and (2) are conceptually different (aren't both about causing irreversible changes?)
  • The implementation of AUP chooses a uniformly random objective  and then penalizes actions by a multiple of the term , scaled by some parameter  and normalized.
    • An important implementation detail is about what to compare "AU for aux. goal if I do this" to. There's "AU [for aux. goal] if I do nothing" and "AU [...] if I do nothing for  steps" and "AU [...] at starting state." The last one fails at (5), the first one at (4). (I forgot too much of the reinforcement learning theory to understand how exactly these concepts would map onto the formula.)
  • The AUP penalty robustly scales up to more complex environments, although the "pick a uniformly random reward function" step has to be replaced with "do some white magic to end up with something difficult to understand but still quite simple." The details of "white magic" are probably important for scaling it up to real-world applications.
Attainable Utility Preservation: Concepts

I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn't press a button that blows up the earth...)

It does seem that AUP will make it so an agent doesn't want to be shut off, though. If it's shut off, its power goes way down (to zero if it won't be turned on again). This might be fine, but it contradicts the utility indifference approach. And it feels dangerous – it seems like we would need an assurance like "AUP will always prevent an agent from gaining enough power to resist being switched off"

Attainable Utility Landscape: How The World Is Changed

The technical appendix felt like it was more difficult than previous posts, but I had the advantage of having tried to read the paper from the preceding post yesterday and managed to reconstruct the graph & gamma correctly.

The early part is slightly confusing, though. I thought AU is a thing that belongs to the goal of an agent, but the picture made it look as if it's part of the object ("how fertile is the soil?"). Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?

I did interpret the first exercise as "you planned to go onto the moon" and came up with stuff like "how valuable are the stones I can take home" and "how pleasant will it be to hang around."

One thing I noticed is that the formal policies don't allow for all possible "strategies." In the graph we had to reconstruct, I can't start at s1, then go to s1 once and then go to s3. So you could think of the larger set where the policies are allowed to depend on the time step. But I assume there's no point unless the reward function also depends on the time step. (I don't know anything about MDPs.)

Am I correct that a deterministic transition function is a function and a non-deterministic one is a function ?

Seeking Power is Often Robustly Instrumental in MDPs

Thoughts after reading and thinking about this post

The thing that's bugging me here is that Power and Instrumental convergence seem to be almost the same.

In particular, it seems like Power asks [a state]: "how good are you across all policies" and Instrumental Convergence asks: "for how many policies are you the best?". In an analogy to tournaments where policies are players, power cares about the average performance of a player across all tournaments, and instrumental convergence about how many first places that player got. In that analogy, the statement that "most goals incentivize gaining power over that environment" would then be "for most tournaments, the first place finisher is someone with good average performance." With this formulation, the statement

formal POWER contributions of different possibilities are approximately proportionally related to instrumental convergence.

seems to be exactly what you would expect (more first places should strongly correlate with better performance). And to construct a counter-example, one creates a state with a lot of second places (i.e., a lot of policies for which it is the second best state) but few first places. I think the graph in the "Formalizations" section does exactly that. If the analogy is sound, it feels helpful to me.

(This is all without having read the paper. I think I'd need to know more of the theory behind MDP to understand it.)

World State is the Wrong Abstraction for Impact

Thoughts I have at this point in the sequence

  • This style is extremely nice and pleasant and fun to read. I saw that the first post was like that months ago; I didn't expect the entire sequence to be like that. I recall what you said about being unable to type without feeling pain. Did this not extend to handwriting?
  • The message so far seems clearly true in the sense that measuring impact by something that isn't ethical stuff is a bad idea, and making that case is probably really good.
  • I do have the suspicion that quantifying impact properly is impossible without formalizing qualia (and I don't expect the sequence to go there), but I'm very willing to be proven wrong.
Load More