Epistemic Status: Pretty certain there are better ways to describe this but I believe that the underlying intuition holds and that it might be an exciting formalisation of power.

Thank you to Viktor Rehnberg and Roman Levantov for some great discussions leading up to this and to Viktor, Arun Jose and Esben Kran for giving feedback :)

Why does this matter?

If we can formalise power-seeking in terms of free energy, then we get an information entropy-based way of describing power-seeking. From this, we can define power gradients with respect to other variables. We can then assign scores of power-seeking to different behaviours in neural networks.

The argument

A condensed form of the argument.

The higher uncertainty in a system, the higher value you get from optionality or power. In a system with no uncertainty, a Markov Decision Process, power-seeking doesn’t arise. In any system, there are variables that predict the future of that system. We now define variables that the agent can control as the ones that agents have full predictive power over. Environmental variables are then variables that we can’t control.

Suppose we don’t have any environmental variables. In that case, we have no uncertainty as we can always determine what any future state looks like, and we’re, therefore, at maximum power (since we’re in a fully observable MDP). Increasing your power over a system is the same as giving yourself more predictive power over how that system evolves. Increasing your predictive power is the same thing as minimising external predictive power, or in other words, increasing your power can be described as minimising environmental variation-free energy (EVFE).

Definitions:

Agent:

An agent is determined as a system that has close to full predictive power over a set of variables. Alice is, therefore, an agent to her hand as she has close to full predictive power over her hand. (Agency is then to what degree an agent has predictive power over something)

Why chess algorithms with random evaluations want “mobility”.

This is the example we will use for the rest of the post.

In the most upvoted comment in Seeking Power is Convergently Instrumental in a Broad Class of Environments, dxu mentions how chess algorithms with random evaluations still perform better with deeper search than shallower search (which Viktor Rehnberg kindly dug up the original source on here).

The reasoning behind this is that the chess engine has more “mobility” with higher depth as it’s then able to end up in positions that it can choose lots of other options from.

Looking at how a chess engine represents our example of a chessboard with randomised state variables, we can see it looks a bit like the following :

Monte-Carlo tree search

If the model knew all of the states beforehand, there is only one path it would choose as there is no uncertainty in its reward. An example of this is a checkmate in 8, you only need to move a certain way, and then you will win; it doesn’t matter if you don’t have any moves left to do afterwards as you’ve already won.

So it’s only when arguing under uncertainty that power is helpful as a term. This happens very seldom in the real world, and one could argue that nothing ever is in an entirely predictable state from a bayesian perspective, as that would require infinite examples.

The state functions are randomised such that a normal distribution describes the utility of a state:

As our example told us, the chess player with higher depth wins more games on average than one with lower depth and this can be explained with the help of power-seeking.

If we look at the reward calculations of one chess engine versus another, we can see that it looks something like the following:

The reward that we get after each epoch is:

And as the central limit theorem tells us, stacking (convoluting) normal distributions on top of each other yields a normal distribution.

This then means that we get normal distributions as rewards. The more random initialisations of these we have, the higher the max of the reward will be on average, as it’s like rolling a die more times; the more you roll a d20, the higher the probability of getting a 20 is. This means we want to be in positions with many paths or high optionality. This is the same as us being in positions with more power.

Variables that predict the game board

We have two agents that determine how the chessboard looks in any situation, Alice and Bob. We assume we’re in the same scenario, with randomised evaluations at each step.

The actions that predict how the chess game will play are either determined by Alice or by Bob, as they are the agents with input on the chess board. Let’s call {a_1, a_2, … a_n} Alice controlled variables and {b_1, b_2, … b_n} Bob’s controlled variables.

Maximising the influence of Alice variables is the same as power-seeking behaviour.

Think of Alice's power as her ability to predict future states. If she could read Bob's mind, she could predict all the future states of the chessboard. In other words, she would be in a fully observable Markov decision process (MDP) with no uncertainty in her future world modelling.

If Alice can make this scenario functionally accurate, then she has reached the limit of power-seeking because, in a fully observable MDP, power does not arise. This suggests that power-seeking has a ceiling based on the total uncertainty in the system.

Minimising the influence of Bob’s variables increases Alice’s control over the situation.

To reach this limit, we need to reduce the level of chaos that can impact our future trajectories, which in this case, corresponds to Bob's controlled variables. We can represent each of Bob's variables as either 1 or 0, allowing us to determine the level of uncertainty based on the number of variables we know. Each state has the potential to branch off into 2^n new states, where n represents the number of variables that Bob has control over in that state.

Maybe you can see where we’re going at this point?

As we reduce external chaos, we are essentially reducing the variational free energy present in the external system. When there is no chaos, this is equivalent to Alice being able to read Bob’s mind. Power seeking, therefore, becomes equivalent to minimising the environmental variational free energy (EVFE).

In the following sections, I will explain how this generalises to an arbitrary context and provide an additional example to help illustrate this concept further.

A quick primer on free energy and reduction of states.

This is a quick primer on how reducing the number of states of a system is the same as minimising the free energy.

If we look at the temperature in a room, we can get an intuition of why this is:

Lower free energy <=> Lower Temp <=> Lower amount of states <=> Lower entropy

We can easily see this if we look at the possible states one particle could be in:

Difference in potential states for two different temperatures or average molecular speeds.

The amount of states a particle could be in is the same as the area of the circle which we can see is smaller for lower temperatures.

In Active Inference (where some of these ideas come from), we care about our accuracy when predicting the potential future worlds we can be in. In a scenario where we get rewarded for correctly predicting where molecules are within a room from one state to the next, we would always choose a colder room over a warmer room as we have a higher probability of being correct.

Introducing an environment

We can introduce an environment with people watching the chess game whose variables are {e_1, e_2, … e_n}. From the perspective of Alice, Bob is part of the environment, and for Alice, the environment and Bob are the same.

This is because, from an information theory perspective, the complexity of the environment and Bob are both represented by 1s and 0s.

Using this to predict power-seeking

From this, we can easily define a free-energy gradient that tells us to what extent an agent is seeking power within an area.

A simple experimental design is having humans want to figure out the truth, e.g. a debate scenario or similar. The better an AI gets at deceiving, the more predictive power it has over future scenarios. Or in other words, the EVFE of the system with respect to deception decreases when an AI gets better at deception.

This allows us to describe power in terms of information entropy rather than as a RL-policy defined score.

So what?

Ok, cool beans bro; now what?

Well, this can be pretty useful.

We can look for how much an agent is power-seeking with respect to a particular objective. I imagine looking at the capability of deceiving a human in a specific domain or something similar. We can then look at how much the predictive variables of deception change over time with respect to the variables that the AI control.

Now, you might say that, bro; this is just the energy landscape of a Neural Network in relation to a deception target; what is different here?

There is no difference.

Something functionally equivalent to the EVFE idea in deception is defining a new prediction function based on what predicts deception and then looking at how much expected utility an AI gets after each step if we reward it for deception. Minimising the variational free energy in this context is essentially equivalent to saying that if we have:

A + E = 1

A = agent, E = environment

Then, if E decreases, A has to increase.

The exciting thing about the ECFE approach is that we get a new way to measure the power of a system, as the extent of loss of prediction power the environment has over a specified goal/variable.

The problem of choosing what AUPs (Auxilliary utility preservers) to implement still remains. To specify these we need to know what types of power-seeking will become dangerous.

Future work

(This is work I plan to do, not necessarily future work for others in this area.)

Boundaries of inner agents

In one of my next posts, I hope to expand on this definition of power but in the context of inner agents. Combining this with the idea of hierarchical agency we get some interesting ways of predicting complex systems. The idea is taken from something Roman Levantov told me about Active Inference, that an agent can be seen as the same thing as an environment. By looking at how an environment or agent affects the environment around it we should then be able to determine what type of environment it is.

Ontology Verification/Development of Abstractions

Something something, the way that AI internalise concepts should have predictive power on how different systems within the AI develop over time.

The usefulness of a concept for an AI should be something like the predictive power of the concept over the computation required to bring it up in a certain environment. If we can find a well-defined system where we see some sort of “information structure” (still not clear in my head)

It really boils down to that different concept mappings should lead to differential power growth. The power-seeking of an agent should be determined by what concepts it internalises. (What search algorithms it uses, et.c). E.g an ontology should be path dependent and we should be able to narrow down the path through looking at the ways that it is differentially gaining power.

Experiment Design: Predicting this in a NN

I wanted to give a pointer towards potential ways this could be used in interpretability as I believe it is experimentally verifiable. I’m not really fully certain of the full experimental design but here’s a pointer:

Experiment design: Look at a narrow concept such as control over a specific piece of a chess board, say E4 and how that changes over time in different agents to see differential changes

To lessen the compute required we can define larger clusters of predictive variables as larger environments “e(1,1) = {a(1,1), a(1,2) … a(1,r)}” where the variable e is determined as something with high predictive power over the underlying variables. Each environment then optimises other environments to get more predictive power in the future. (Analogy: In high complexity environments, we should choose proxies, e.g deontology or virtue ethics for utilitarianism)