How I think about alignment

Linda Linsefors

This was written as part of the first Refine blog post day. Thanks for comments by Chin Ze Shen, Tamsin Leake, Paul Bricman, Adam Shimi.

Magic agentic fluid/force

Edit: "Magic agentic fluid" = "influence" (more or less). I forgot that this word existed, so I made up my own terminology. Oops! In my defence, before writing this post I did not have a word, just a concept and a sort of visualisation to go with it in my head. My internal thoughts are mostly non-verbal, so when trying to translate my thoughts into words, I landed on the term that best described how this concept looked in my mind.

Somewhere in my brain there is some sort of physical encoding of my values. This encoding could be spread out over the entire brain, it could be implicit somehow. I’m not making any claim of how values are implemented in a brain, just that the information is somehow in there.

Somewhere in the future a super intelligent AI is going to do some action.

If we solve alignment, then there will be some causal link between the values in my head (or some human head) and the action of that AI. In some way, whatever the AI does, it should do it because that is what we want.

This is not purely about information. Technically I have some non-zero causal influence over everything in my future lightcone, but most of this influence is too small to matter. More relevant, but still not the thing we want, is deceptive AI. In this case the AI’s action is guided by our values in a non-negligible way, but not in the way we want.

I have a placeholder concept which I call magic agentic fluid or magic agentic force. (I’m using “magic” in the traditional rationalist way of tagging that I don’t yet have a good model of how this works.)

MAF is a high-level abstraction. I expect it to not be there when you zoom in too much. Same as how solid objects only exist at a macro scale, and if you zoom in too much there are just atoms. There is no essence, only shared properties. I think the same way about agents and agency.

MAF is also in some way like energy. In physics energy is a clearly defined concept, but you can’t have just energy. There is no energy without a medium. Same as how you can’t have speed without there being anything that is moving. This is not a perfect analogy.

But I think that this concept does point to something real and/or useful, and that it would be valuable to try to get a better grasp on what it is, and what it is made of.

Let’s say I want to eat an apple, and later I am eating an apple. There is some causal chain that we can follow from me wanting the apple to me eating the apple. There is a chain of events propagating through spacetime, and it carries with it my will and it enacts it. This chain is the MAF.

I would very much like to understand the full chain of how my values are causing my actions. If you have any relevant information, please let me know.

Trial and error as an incomplete example of MAF

I want an apple -> I try to get an apple and if it doesn't work, I try something else until I have an apple -> I have an apple

This is an example of MAF, because wanting an apple caused me to have an apple through some causal chain. But also notice that there are steps missing. How was “I want an apple” transformed into “I try to get an apple and if it doesn't work, I try something else until I have an apple”? Why not the alternative plan “I cry until my parents figure out what I want and provide me with an apple”? Also, the second step involves generating actual things to try. When humans execute trial and error, we don’t just do random things, we do something smarter. This is not just about efficiency. If I try to get an apple by trying random action sequences, with no learning other than “that exact sequence did not work”, then I’ll never get an apple, because I’ll die first.
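A minimal sketch of the trial-and-error step, to make the missing pieces visible. All names here (`trial_and_error`, `propose`, `works`, `max_tries`) are hypothetical and chosen only for illustration. The sketch assumes that candidate actions come from some generator and that there is a finite budget of attempts, standing in for "I'll die first".

```python
from typing import Callable, Iterable, Optional


def trial_and_error(
    propose: Callable[[], Iterable[str]],
    works: Callable[[str], bool],
    max_tries: int,
) -> Optional[str]:
    """Try proposed actions in order until one works or the budget runs out.

    `propose` is the unexplained step: something has to generate
    sensible things to try. `max_tries` is the finite budget.
    """
    for attempt_number, action in enumerate(propose()):
        if attempt_number >= max_tries:
            return None  # budget exhausted before anything worked
        if works(action):
            return action
    return None  # the generator ran out of ideas


# Hypothetical toy world: only going to the shop yields an apple.
def propose_actions() -> Iterable[str]:
    return ["cry", "look in fridge", "go to shop"]


def gets_apple(action: str) -> bool:
    return action == "go to shop"


with_enough_tries = trial_and_error(propose_actions, gets_apple, max_tries=3)
with_too_few_tries = trial_and_error(propose_actions, gets_apple, max_tries=2)
```

The loop itself is trivial. Everything interesting is hidden inside `propose`, which is exactly the step that the chain above leaves unexplained.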

I could generate a more complete example, involving neural nets or something, and that would be useful. But I’ll stop here for now.

Mapping the territory along the path

When we have solved Alignment, there will be a causal chain, carrying MAF from somewhere inside my brain, all the way to the actions of the AI. The MAF has to survive passage through several distinct regions with minimal (preferably zero) information loss or other corruption.

The first territory to be crossed is the passage from my brain to my actions, where actions are anything that is externally observable.
The next part is the human-AI interaction.
The third part is the internal mind of the AI.
The last part is how the AI’s actions interact with the world (including me).

Obviously all these paths have lots of back-and-forth feedback, both internally and with the neighbouring territories. However, for now I will not worry too much about that. I’m not yet ready to plan the journey of the MAF. I am mainly focusing on trying to understand and map out all the relevant territory.

My current focus

I’m currently prioritising understanding the brain (the first part of the MAF’s journey), for two reasons:

It’s the part I feel the most confused about.
It is tightly tied up with understanding what human values even are, which is a separate question that I think we need to solve.

The MAF needs to carry information containing my values. It seems like it would be easier to plan the journey if we know things like “What is the type signature of human values?”

By getting a better map of the part of the brain where values are encoded (possibly the entire brain), we’ll both get a better understanding of what human values are, and how to get that information out in a non-corrupted way.

In addition, learning more about how brains work will probably give us useful ideas for Aligned AI design.

Appendix: Is there a possible alternative path around (not through) the brain?

Maybe we don’t need a causal link starting from the representation of my values in my brain. Maybe it is easier to reconstruct my values by looking at the causal inputs that formed my values, e.g. genetic information and life history. Or you can go further back and study how evolution shaped my current values. Maybe if human values are a very natural consequence of evolutionary pressures and game theory, it would be easier to get the information this way, but I’m not very optimistic about it. I expect there to be too much happenstance encoded in my values. I think an understanding of evolution and game theory and learning about my umwelt could do a lot of work in creating better priors, but it is not a substitute for learning about my values directly from me.

One way to view this is that even if you can identify all the incentives that shaped me (both evolutionary and within my lifetime), it would be wrong to assume that my values will be identical to these incentives. See e.g. Reward is not the optimization target. Maybe it is possible to pinpoint my value from my history (evolutionary and personal), but it would be far from straightforward, and personally I don’t think you can get all the information that way. Some of the information relevant to my value formation (e.g. exactly how my brain grew, or some childhood event) will be lost in time, except for the consequences on my brain.

Learning about my history (evolutionary and personal) can be useful for inferring my values, but I don’t think it is enough.

Appendix: Aligned AI as a MAF amplifier

An aligned AI is a magic agentic force amplifier. Importantly, it’s supposed to amplify the force without corrupting the content.

One way to accomplish this is for the AI to learn my values in detail, and then act on those values. I visualise this as me having a fountain of MAF at the centre of my soul (I don’t actually believe in souls, it’s just a visualisation). The AI learns my values, through dialogue or something, and creates a copy of my fountain inside itself. Now technically the AI’s actions flow from the copy of my values, but that’s ok if the copying process is precise enough.

Another way to build an agentic force amplifier is to build an AI that reacts more directly to my actions. My central example for this is a servomotor, which literally amplifies the literal force applied by my arms. I’ve been thinking about how to generalise this, but I don’t know how to scale it to super intelligence (or even human level intelligence). With the servomotor there is a tight feedback loop where I can notice the outcome of my amplified force on the steering wheel, which allows me to course correct. Is there a way to recreate something that plays the same role for more complicated actions? Or is this type of amplification doomed as soon as there is a significant time delay between my action and the outcome of the amplification?
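A toy sketch of the servomotor loop may help make the time-delay worry concrete. It assumes a very simple model that is not from the post: the human pushes in proportion to the error they currently perceive, the servo multiplies that push, and the human may be seeing a stale position. All function and parameter names are hypothetical.

```python
from typing import List


def simulate_amplified_steering(
    target: float,
    human_gain: float,
    servo_gain: float,
    delay_steps: int,
    n_steps: int,
) -> List[float]:
    """Return the position after each step of a toy human-plus-servo loop."""
    positions = [0.0]  # the system starts at position 0
    for step in range(n_steps):
        # The human only sees the position from `delay_steps` steps ago.
        observed = positions[max(0, step - delay_steps)]
        # The human pushes in proportion to the error they perceive.
        human_force = human_gain * (target - observed)
        # The servo amplifies that push without changing its direction.
        positions.append(positions[-1] + servo_gain * human_force)
    return positions


def worst_recent_error(positions: List[float], target: float, window: int = 10) -> float:
    """Largest distance from the target over the last `window` positions."""
    return max(abs(target - p) for p in positions[-window:])


# Same gains in both runs; only the feedback delay differs.
immediate = simulate_amplified_steering(1.0, 0.1, 8.0, delay_steps=0, n_steps=60)
delayed = simulate_amplified_steering(1.0, 0.1, 8.0, delay_steps=3, n_steps=60)
```

In this toy model the run with immediate feedback settles on the target, while the run with a three-step delay overshoots and oscillates with growing amplitude, even though the amplifier is unchanged. This only illustrates the concern for one simple linear loop. It says nothing definite about more complicated actions.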

You can't zoom infinitely far in on the causal chain between values and actions, because values (and to a large extent actions) are abstractions that we use when modeling agents like ourselves. They are emergent. To talk about my values at all is to use a model of me where I use my models in a certain agenty way and you don't sweat the details too hard.

Somewhere in my brain there is some sort of physical encoding of my values.

Not sure if this is an intended meaning, but the claim that values don't depend on content of the world outside the brain is generally popular (especially in decision theory), and there seems to be no basis for it. Brains are certainly some sort of pointers to value, but a lot (or at least certainly some) of the content of values could be somewhere else, most likely in civilization's culture.

This is an important distinction for corrigibility, because this claim is certainly false for a corrigible agent: it instead wants to find the content of its values in the environment, since that content is not part of its current definition/computation. It also doesn't make sense to talk about this agent pursuing its goals in a diverse set of environments, unless we expect the goals to vary with environment.

For decision theory of such agents, this could be a crucial point. For example, an updateless corrigible agent wouldn't be able to know the goals that it must choose a policy in pursuit of. The mapping from observations to actions that UDT would pick now couldn't be chosen as the most valuable mapping, because value/goal itself depends on observations, and even after some observations it's not pinned down precisely. So if this point is taken into account, we need a different decision theory, even if it's not trying to do anything fancy with corrigibility or mild optimization, but merely acknowledges that goal content could be located in the environment!

I mean that the information of what I value exists in my brain. Some of this information is pointers to things in the real world. So in a sense the information partly exists in the relation/correlation between me and the world.

I definitely don't mean that I can only care about my internal brain state. To me that is just obviously wrong. Although I have met people who disagree, so I see where the misunderstanding came from.

That's not what I'm talking about. I'm not talking about what known goals are saying, or what they are speaking of, what they consider valuable or important. I'm talking about where the data to learn what they are is located, as we start out not knowing the goals at all and need to learn them. There is a particular thing, say a utility function, that is the intended formulation of goals. It could be the case that this intended utility function could be found somewhere in the brain. That doesn't mean that it's a utility function that cares about brains, the questions of where it's found and what it cares about are unrelated.

Or it could be the case that it's recorded on an external hard drive, and the brain only contains the name of the drive (this name is a "pointer to value"). It's simply not the case that you can recover this utility function without actually looking at the drive, and only looking at the brain. So the utility function u itself depends on the environment E, that is, there is some method of formulating utility functions t such that u=t(E). This is not the same as saying that the utility of the environment depends on the environment, giving the utility value u(E)=t(E)(E) (there's no typo here). But if it's actually in the brain, and says that hard drives are extremely valuable, then you do get to know what it is without looking at the hard drives, and learn that it values hard drives.
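The distinction between u=t(E) and u(E)=t(E)(E) can be sketched with higher-order functions. This is only an illustration under made-up assumptions: the environment is a small dictionary, and the names `t_external`, `t_internal`, `values_apples` and `values_hard_drives` are hypothetical.

```python
from typing import Any, Callable, Dict

Environment = Dict[str, Any]
Utility = Callable[[Environment], float]


def values_apples(env: Environment) -> float:
    return float(env["apples"])


def values_hard_drives(env: Environment) -> float:
    return float(env["hard_drives"])


def t_external(env: Environment) -> Utility:
    """The brain holds only a pointer; the utility function is read off the drive."""
    return env["drive_contents"]


def t_internal(env: Environment) -> Utility:
    """The utility function is stored in the brain, so the environment is ignored."""
    return values_hard_drives


# Two environments that differ only in what is written on the drive.
env_one = {"apples": 3, "hard_drives": 1, "drive_contents": values_apples}
env_two = {"apples": 3, "hard_drives": 1, "drive_contents": values_hard_drives}

# u = t(E): which utility function you end up with.
u_one = t_external(env_one)
u_two = t_external(env_two)

# u(E) = t(E)(E): the value that utility function assigns to the environment.
value_one = t_external(env_one)(env_one)
value_two = t_external(env_two)(env_two)
```

With `t_external`, changing the drive changes the utility function itself. With `t_internal`, the utility function is the same in every environment, even though it is a function that cares about hard drives.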

I expect there to be too much happenstance encoded in my values.

I believe this is a bug, not a feature that we would like to reproduce.

I think that the direction you described with the AI analysing how you acquired your values is important, because it shouldn't be mimicking just your current values. It should be able to adapt the values to new situations the way you'd do (distributional shift). Think of all the books / movies where people get into unusual situations and have to make tough moral calls, like a plane crashing in the middle of nowhere with 20 survivors who are gradually running out of food. A superhuman AI will be running into unknown situations all the time because of its different capabilities.

Human values are undefined for most situations a superhuman AI will encounter.

Some observations:

Genes reproduce themselves.
Humans reproduce themselves.
Symbols are relearned.
Values are reproduced.

Each needs an environment to do so, but the key observation seems to be that a structure is reliably reproduced across intermediate forms (mitosis, babies, language, society) and that these structures build on top of each other. It seems plausible that there is a class of formal representations that describe:

the parts that are retained across instances and
the embedding into each other (values into genes and symbols), and
the dynamics of the transfer.

If something is good at replicating, then there will be more of that thing. This creates a selection effect for things that are good at replicating. The effects of this can be observed in biology and memetics.
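The selection effect can be sketched with a tiny deterministic model. The assumptions are illustrative only: there are two kinds of replicator, each makes a fixed number of copies per generation, and only their relative shares are tracked. The names and numbers are hypothetical.

```python
from typing import Dict, List


def next_generation_shares(
    shares: Dict[str, float],
    copies_per_generation: Dict[str, float],
) -> Dict[str, float]:
    """Multiply each kind by its copy rate, then renormalise to shares."""
    raw = {kind: shares[kind] * copies_per_generation[kind] for kind in shares}
    total = sum(raw.values())
    return {kind: amount / total for kind, amount in raw.items()}


# The better replicator starts out rare.
initial_shares = {"good_replicator": 0.01, "poor_replicator": 0.99}
copy_rates = {"good_replicator": 2.0, "poor_replicator": 1.5}

history: List[Dict[str, float]] = [initial_shares]
for _ in range(30):
    history.append(next_generation_shares(history[-1], copy_rates))

final_share_of_good_replicator = history[-1]["good_replicator"]
```

Nothing in the model refers to goals or agency. The better replicator ends up as the large majority purely because its share is multiplied by a larger factor every generation.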

Maybe self replication can be seen as an agentic system with the goal of self replicating? In this particular question all uncertainty comes from "agent" being a fuzzy concept, and not from any uncertainty about the world. So answering this question will be a choice of perspective, not information about the world.

Either way, the type of agency I'm mainly interested in is the type of agency that has other goals than just self replication. Although maybe there are things to be learned from the special case of having self replication as a goal?

If the AI learns my values then this is a replication of my values. But there are also examples of magic agentic force where my values are not copied at any point along the way.

Looking at how society is transferred between generations might offer some clues for value learning? But I'm less optimistic about looking at what is similar between self replication in general, because I think I already know this, and also, it seems to be one abstraction level too high, i.e. the similarities are properties above the mechanistic details, and those details are what I want.