## AI ALIGNMENT FORUM

Full-time independent deconfusion researcher (https://www.alignmentforum.org/posts/5Nz4PJgvLCpJd6YTA/looking-deeper-at-deconfusion) in AI Alignment. (Also a PhD in the theory of distributed computing.)

If you're interested in some of the research ideas you see in my posts, know that I keep private docs with the most compressed versions of my deconfusion ideas while I'm gathering feedback on them. I can give you access if you PM me!

A list of topics I'm currently doing deconfusion on:

• Goal-directedness for discussing AI Risk
• Myopic Decision Theories for dealing with deception (with Evan Hubinger)
• Universality for many alignment ideas of Paul Christiano
• Deconfusion itself to get better at it
• Models of Language Models to clarify the alignment issues surrounding them.

# Sequences

Reviews for the Alignment Forum
AI Alignment Unwrapped
Understanding Goal-Directedness
Toying With Goal-Directedness

Open problem: how can we quantify player alignment in 2x2 normal-form games?

I want to point out that this is a great example of a deconfusion open problem. There are a bunch of intuitions, some constraints, and then we want to clarify the confusion underlying it all. I'm not planning to work on it myself, but it sounds very interesting.

(Only caveat I have with the post itself is that it could be more explicit in the title that it is an open problem).

Knowledge is not just digital abstraction layers

Nice post, as always.

What I take from the sequence up to this point is that the way we formalize information is unfit to capture knowledge. This is quite intuitive, but you also give concrete counterexamples that are really helpful.

It is reasonable to say that a data recorder is accumulating nonzero knowledge, but it is strange to say that exchanging the sensor data for a model derived from that sensor data is always a net decrease in knowledge.

Definitely agreed. This sounds like your proposal doesn't capture the transformation of information into more valuable precomputation (making valuable abstraction requires throwing away some information).

Vignettes Workshop (AI Impacts)

Already told you yesterday, but great idea! I'll definitely be a part of it, and will try to bring some people with me.

Looking Deeper at Deconfusion

Sure.

• Any proposed solution to AI Alignment isn't deconfusion. It might involve a bit of deconfusion at the start, and maybe studying it reveals new confusion to resolve, but most of it is problem solving instead of deconfusion.
• Work in interpretability might involve deconfusion (to clarify what one searches for), but then isn't deconfusion anymore.
• Just like in normal science, once one has defined a paradigm or a problem, working with it is mostly not deconfusion anymore:

All in all, I think there are many more examples. It's just that deconfusion almost always plays a part, because we don't have one unified paradigm or approach which does the deconfusion for us. But actual problem solving, and most of normal science, are not deconfusion from my perspective.

[Event] Weekly Alignment Research Coffee Time (06/21)

Hey, it seems like others could use the link, so I'm not sure what went wrong. If you have the same problem tomorrow, just send me a PM.

Knowledge is not just mutual information

Thanks again for a nice post in this sequence!

The previous post looked at measuring the resemblance between some region and its environment as a possible definition of knowledge and found that it was not able to account for the range of possible representations of knowledge.

I found myself going back to the previous post to clarify what you mean here. I feel like you could do a better job of summarizing the issue of the previous post (maybe by mentioning the computer example explicitly?).

Formally, the mutual information between two objects is the gap between the entropy of the two objects considered as a whole, and the sum of the entropy of the two objects considered separately. If knowing the configuration of one object tells us nothing about the configuration of the other object, then the entropy of the whole will be exactly equal to the sum of the entropy of the parts, meaning there is no gap, in which case the mutual information between the two objects is zero. To the extent that knowing the configuration of one object tells us something about the configuration of the other, the mutual information between them is greater than zero.

I need to get deeper into information theory, but that is probably the most intuitive explanation of mutual information I've seen. I delayed reading this post because I worried that my half-remembered information theory wasn't up to it, but you deal with that nicely.
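To convince myself of the quoted definition, here is a minimal Python sketch that computes mutual information as the gap between the summed entropies of the parts and the entropy of the whole, I(X;Y) = H(X) + H(Y) - H(X,Y). The two toy joint distributions (`independent` and `copied`) and all function names are my own illustrative assumptions, not something taken from the post.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginalize a joint distribution over (x, y) pairs onto one coordinate."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

def mutual_information(joint):
    """Gap between the summed entropies of the parts and the entropy of the whole."""
    h_x = entropy(marginal(joint, 0))
    h_y = entropy(marginal(joint, 1))
    return h_x + h_y - entropy(joint)

# Hypothetical example 1: two fair bits that tell us nothing about each other.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

# Hypothetical example 2: the second bit is always a copy of the first.
copied = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))  # no gap: the objects are independent
print(mutual_information(copied))       # one full bit of shared information
```

In the independent case the entropy of the whole equals the sum of the parts, so the gap is zero; in the copied case knowing one bit fixes the other, and the gap is one bit.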

At the microscopic level, each photon that strikes the surface of an object might change the physical configuration of that object by exciting an electron or knocking out a covalent bond. Over time, the photons bouncing off the object being sought and striking other objects will leave an imprint in every one of those objects that will have high mutual information with the position of the object being sought. So then does the physical case in which the computer is housed have as much "knowledge" about the position of the object being sought as the computer itself?

Interestingly, I expect this effect to disappear when the measurements defining our two variables get less precise. In a sense, the mutual information between the case and the ship container depends on measuring very subtle differences, whereas the mutual information between the computer and the ship container is far more robust to loss of precision.

For example, a computer that is using an electron microscope to build up a circuit diagram of its own CPU ought to be considered an example of the accumulation of knowledge. However, the mutual information between the computer and itself is always equal to the entropy of the computer and is therefore constant over time, since any variable always has perfect mutual information with itself.

But wouldn't there be a part of the computer that accumulates knowledge about the whole computer?

This is also true of the mutual information between the region of interest and the whole system: since the whole system includes the region of interest, the mutual information between the two is always equal to the entropy of the region of interest, since every bit of information we learn about the region of interest gives us exactly one bit of information about the whole system also.

Maybe it's my lack of understanding of information theory speaking, but that sounds wrong. Surely there's a difference between cases where the region of interest determines the full environment, and when it is completely independent of the rest of the environment?

The accumulation of information within a region of interest seems to be a necessary but not sufficient condition for the accumulation of knowledge within that region. Measuring mutual information fails to account for the usefulness and accessibility that makes information into knowledge.

Despite my comments above, that sounds broadly correct. For instance, I'm not sure that mutual information would capture your textbook example, even though the textbook contains a lot of knowledge.

Search-in-Territory vs Search-in-Map

This is a very interesting distinction. Notably, I feel that you point better at a distinction between "search inside" and "search outside" which I waved at in my review of Abram's post. Compared with selection vs control, this split also has the advantage that there are no recursive calls of one to the other: a controller can do selection inside, but you can't do search-in-territory by doing search-in-map (if I understand you correctly).

That being said, I feel you haven't yet deconfused optimization completely because you don't give a less confused explanation of what "search" means. You point out that typically search-in-map looks more like "search/optimization algorithms" and search-in-territory looks more like "controllers", which is basically redirecting to selection vs control. Yet I think this is where a big part of the confusion lies, because both look like search while being notoriously hard to reconcile. And I don't think you can rely on let's say Alex Flint's definition of optimization, because you focus more on the internal algorithm than he does.

Key point: if we can use information to build a map before we have full information about the optimization/search task, that means we can build one map and use it for many different tasks. We can weigh all the rocks, put that info in a spreadsheet, then use the spreadsheet for many different problems: finding the rock closest in weight to a reference, finding the heaviest/lightest rock, picking out rocks which together weigh some specified amount, etc. The map is a capital investment.

One part you don't address here is the choice of what to put in the map. In your rock example, maybe the actual task will be about finding the most beautiful rock (for some formalized notion of beautiful) which is completely uncorrelated with weight. Or one of the many different questions that you can't answer if your map only contains the weights. So in a sense, search-in-map requires you to know the sort of info you'll need, and what you can safely throw away.
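A minimal Python sketch of this point, under my own assumptions: the rock names, the weights, and every function name below are hypothetical. The map is built once from weights and then reused for several weight-based tasks, but a query about a property that was never recorded cannot be answered from the map at all.

```python
from itertools import combinations

# Hypothetical map: each rock is weighed once, and the result is stored.
rock_weights = {"granite": 5.0, "basalt": 3.5, "quartz": 2.0, "pumice": 0.5}

def closest_to(weights, reference):
    """Rock whose recorded weight is nearest to a reference weight."""
    return min(weights, key=lambda name: abs(weights[name] - reference))

def heaviest(weights):
    """Rock with the largest recorded weight."""
    return max(weights, key=weights.get)

def lightest(weights):
    """Rock with the smallest recorded weight."""
    return min(weights, key=weights.get)

def subset_with_total(weights, target, tol=1e-9):
    """First subset (smallest size first) whose weights sum to the target."""
    for size in range(1, len(weights) + 1):
        for names in combinations(weights, size):
            if abs(sum(weights[n] for n in names) - target) <= tol:
                return set(names)
    return None

def most_beautiful(weights):
    """Beauty was never measured, so the map cannot answer this query."""
    raise LookupError("the map only stores weights; go back to the territory")

print(closest_to(rock_weights, 3.0))
print(heaviest(rock_weights), lightest(rock_weights))
print(subset_with_total(rock_weights, 5.5))
```

All the weight queries reuse the same dictionary without touching the rocks again, which is the capital-investment aspect. The `most_beautiful` query fails by construction, because deciding what to record was itself a bet on which tasks would come up.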

On the thermostat example, I actually have an interesting aside from Dennett. He writes that the thermostat is an intentional system, but that the difference with humans, or even with a super advanced thermostat, is that the standard thermostat has a very abstract goal. It basically has two states and tries to be in one instead of the other, by doing its only action. One consequence is that you can plug the thermostat into another room, or use it to control the level of water in a tub or the speed of a car, and it will do so.

From this perspective, the thermostat is not so much doing search-in-territory as search-in-map with a very abstracted map that throws away basically everything.

MDP models are determined by the agent architecture and the environmental dynamics

I'm wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?

My current understanding is something like:

• There is not really a subjective modeling decision involved because given an interface (state space and action space), the dynamics of the system are a real world property we can look for concretely.
• Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the "5 googolplex" one).

There's no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.

I would say the choice of agent architecture is the subjective decision. That's the point at which we decide what states and actions are possible, which completely determines the MDP. Granted, this argument is probably stronger for POMDP (for which you have more degrees of freedom in observations), but I still see it for MDP.

If you don't think there is subjectivity involved, do you think that for whatever (non-formal) problem we might want to solve, there is only one way to encode it as a state space and action space? Or are you pointing out that with an architecture in mind, the state space and action space is fixed? I agree with the latter, but then it's a question of how the states of the actual systems are encoded in the state space of the agent, and that doesn't seem unique to me.

You don't have to run anything to check power-seeking. Once you know the agent encodings, the rest is determined and my theory makes predictions.

But to falsify the "5 googolplex", you do need to know what the optimal policies tend to do, right? Then you need to find optimal policies and know what they do (to check that they indeed don't power-seek by going left). This means run/simulate them, which might cause them to take over the world in the worst case scenarios.

MDP models are determined by the agent architecture and the environmental dynamics

Despite agreeing with your conclusion, I'm unconvinced by the reasons you propose. Sure, once the interface is chosen, then the MDP is pretty much constrained by the real-world (for a reasonable modeling process). But that just means the subjectivity comes from the choice of the interface!

To be more concrete, maybe the state space of Pacman could be red-ghost, starting-state and live-happily-ever-after (replacing the right part of the MDP). Then taking the right action wouldn't be power-seeking either.

What I think is happening here is that in reality, there is a tradeoff in modeling between simplicity/legibility/usability of the model (pushing for fewer states and fewer actions) and performance/competence/optimality (pushing for more states and actions to be able to capture more subtle cases). The fact that we want performance rules out my Pacman variant, and the fact that we want simplicity rules out ofer's example.

It's not clear to me that there is one true encoding that strikes a perfect balance, but I'm okay with the idea that there is an acceptable tradeoff and that models around that point are mostly similar, in ways that probably don't change the power-seeking.

That's also a claim that we can, in theory, specify reward functions which distinguish between 5 googolplex variants of red-ghost-game-over. If that were true, then yes - optimal policies really would tend to "die" immediately, since they'd have so many choices.

The "5 googolplex" claim is both falsifiable and false. Given an agent architecture (specifically, the two encodings), optimal policy tendencies are not subjective. We may be uncertain about the agent's state- and action-encodings, but that doesn't mean we can imagine whatever we want.

Sure, but if you actually have to check the power-seeking to infer the structure of the MDP, it becomes unusable for not building power-seeking AGIs. Or put differently, the value of your formalization of power-seeking IMO is that we can start from the models of the world and think about which actions/agent would be power-seeking and for which rewards. If I actually have to run the optimal agents to find out about power-seeking actions, then that doesn't help.

Attainable Utility Preservation: Concepts

(Definitely a possibility that this is answered later in the sequence)

Rereading the post and thinking about this, I wonder if AUP-based AIs can still do anything (which is what I think Steve was pointing at). Or maybe phrased differently, whether they can still be competitive.

Sure, reading a textbook doesn't decrease the AU of most other goals, but applying the learned knowledge might. On your paperclip example, I expect that the AUP-based AI will make very few paperclips, since otherwise it could have a big impact (after all, we make paperclips in factories, but they change the AUP landscape).

More generally, AUP seems to forbid any kind of competence in a zero-sum-like situation. To go back to Steve's example, if the AI invents a great new solar cell, then it will make its owner richer and more powerful at the expense of other people, which is forbidden by AUP as I understand it.

Another way to phrase my objection is that at first glance, AUP seems to not only forbid gaining power for the AI, but also gaining power for the AI's user. Which sounds like a good thing, but might also create incentives to create and use non AUP-based AIs. Does that make any sense, or did I fail to understand some part of the sequence that explains this?

(An interesting consequence of this if I'm right is that AUP-based AIs might be quite competitive for making open-source things, which is pretty cool).