Counterfactual control incentives

by Stuart Armstrong, 9 min read, 21st Jan 2021, 8 comments

Co-authored with Rebecca Gorman.

In section 5.2 of their arXiv paper, "The Incentives that Shape Behaviour", which introduces structural causal influence models and a proposal for addressing misaligned AI incentives, the authors present the following graph:

The blue node is a "decision node", defined as where the AI chooses its action. The yellow node is a "utility node", defined as the target of the AI's utility-maximising goal. The authors use this graph to introduce the concept of control incentives: the AI, given the utility-maximising goal of user clicks, discovers an intermediate control incentive, namely influencing user opinions. By influencing user opinions, the AI better fulfils its objective. This 'control incentive' is graphically represented by surrounding the node with a dotted orange outline.

A click-maximising AI would only care about user opinions indirectly: they are a means to an end. An amoral social media company might agree with the AI on this, and be ok with it modifying user opinions to achieve higher clicks/engagement. But the users themselves would object strongly to this; they do not want the algorithm to have a control incentive for changing their opinions.

Carey et al designed an algorithm to remove this control incentive. They do this by instructing the algorithm to choose its posts, not on predictions of the user's actual clicks - which produce the undesired control incentive - but on predictions of what the user would have clicked on, if their opinions hadn't been changed.

In this graph, there is no longer any control incentive for the AI on the "Influenced user opinions", because that node no longer connects to the utility node.

Call this construction a "counterfactual incentive". It substitutes (a model of) the users' original opinions where the influenced user opinions used to go. A more detailed description of such an algorithm can be found in this footnote[1]. In words, the process can be phrased:

  • The counterfactual incentive is what the utility would have been, if the influenced opinions were replaced with the original opinions (everything else being equal).

It seems to neutralise a vicious, ongoing cycle of opinion change in order to maximize clicks. But, as this post shows, there are issues with this construction and, if used naively, counterfactual modifications will fail.

Matching players and buying items

In order to analyse the issues with the setup, let's choose a more narrowly defined example. There are many algorithms that aim to manipulate players of mobile games in order to get them to buy more expensive in-game items.

One such manipulative algorithm[2] is to match up a new player against a more skilled player who is wielding an expensive item. The new player loses, and becomes convinced that the expensive item is key to success. And the new player therefore buys it.

Using the notation from Carey et al, this can be encoded in the following graph[3]:

Here the AI's decision node is the player to match the user with. Its objective is to get the player to pay more money ($). The "original item rankings" node is the user's original estimation of the relative ranking of various in-game items; the "influenced item rankings" node is the same user's estimation after their first match.

The AI discovers a control incentive on the influenced item rankings: by choosing the user's opponent, it will modify the user's ranking in order to cause the player to buy more expensive items (see next section for an example of this in practice).

Suppose we wanted to remove this "dark pattern". First, we'd need some way for the AI to measure or estimate the value of the original item rankings. Then we could apply the counterfactual incentive construction and get the following causal graph:

Since it now targets the counterfactual predicted purchases (the counterfactual $), it seems that the AI will no longer be incentivised to manipulate the player to change their influenced item rankings. In words, this is:

  • The counterfactual incentive is what $ would have been, if the influenced item rankings were replaced with the original item rankings (everything else being equal).

The implicit variables are important

That parenthetical aside - "everything else being equal" - is key. In partial derivatives, expressions like $\partial f / \partial x$ only have meaning if we know $x$ and all the other variables that are being held constant. Similarly, we don't know what "replace the influenced rankings with the original rankings" means, unless we know what isn't changed.
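A small worked example (not from the original post) makes the point about partial derivatives concrete. Take a function $f = x + y$ and introduce $z = x + y$ as an alternative second coordinate; the symbols $f$, $x$, $y$, $z$ are chosen purely for illustration:

```latex
\[
f = x + y, \qquad z = x + y
\]
\[
\left(\frac{\partial f}{\partial x}\right)_{y} = 1,
\qquad
\left(\frac{\partial f}{\partial x}\right)_{z} = 0 .
\]
```

The same function and the same $x$ give different answers depending on which other variable is held constant. That is the role "everything else being equal" plays in the counterfactual.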

To demonstrate this, let's take the arrow connecting the AI's decision node with $, the utility node. That encodes a lot of information, but does so implicitly. Let's instead make it more explicit.

Let's pick one example. One way the AI's decision could affect $ is if the user develops better knowledge of the game through their match with the AI-selected opponent. We could call this the "gameplay knowledge" node, K; adding that explicitly to the graph gives:

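In this graph, only the gameplay knowledge and the influenced item rankings (respectively the original item rankings) can affect the utility node $ (respectively the counterfactual $). Then, in words:

  • The counterfactual incentive is what $ would have been, if the influenced item rankings were replaced with the original item rankings (with the gameplay knowledge being unchanged).

The gameplay knowledge being unchanged (the "everything else being equal" of the original formulation) will be a crucial point of analysis.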

All goes well: independent nodes

Assume there are three items that can be used in this game. They are the wooden spoon, the longsword, and the Flaming Poisoning Raging Sword Of Doom.

The wooden spoon is inferior, while the two swords are exactly equivalent in gameplay; the user starts out knowing this fact (this is the value of the original item rankings).

The Flaming Poisoning Raging Sword Of Doom is, however, extremely flashy, with lots of sound effects, lighting and lightning, sparkles and decorative motifs; it is also a lot more expensive. If the user is matched against a superior opponent with the FPRSoD, they will conclude that their defeat was due to not having the FPRSoD, and will be willing to pay a higher price for it than for the longsword.

Assume that the AI has two possible policies: the longsword policy, where it will match the user up with a superior player with a longsword, and the FPRSoD policy, where it will match the user up with another superior player, of equivalent skill to the first, but with the FPRSoD.

In both cases, the gameplay knowledge will be updated to reflect 'a sword is useful'. Assume the longsword policy will leave the user's rankings unchanged (so the influenced item rankings will take the same value as the original item rankings): that the longsword and the FPRSoD are equivalent. So, in that case, the user will buy the longsword, since it is cheaper.

In contrast, the FPRSoD policy will cause the user to overvalue the FPRSoD (thus changing the influenced item rankings to favour it), and then buy it. Graphically, we get this (with the obvious convention that the left/right values of a node correspond to the left/right value of the decision node):

Without the counterfactual incentive, the FPRSoD policy is the superior policy for the AI, since the user, with their new rankings, will then buy the more expensive FPRSoD.

With the counterfactual incentive, the algorithm has been instructed to base its predictions off the original rankings of the user, so the rankings it uses are set to the value of the original item rankings. Therefore the AI will model the user as valuing the longsword and the FPRSoD equally, for both the longsword policy and the FPRSoD policy. Since the longsword is cheaper, it will expect that the user will buy it. Thus the AI has no incentive to manipulate the user by choosing the FPRSoD policy over the longsword policy[4].
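The comparison above can be sketched in a few lines of Python. This is a minimal toy model under stated assumptions, not the construction from Carey et al: the prices (10 and 50), the policy labels, and the helper functions `influenced_rankings`, `purchase`, `predicted_revenue` and `best_policies` are all hypothetical, chosen only to mirror the story.

```python
# Toy model of the independent-nodes example. Prices and names are
# illustrative assumptions, not values from the post or the paper.
PRICES = {"longsword": 10, "FPRSoD": 50}

ORIGINAL_RANKINGS = "equal"  # the user correctly rates both swords as equivalent


def influenced_rankings(policy):
    """Rankings after the match: only the FPRSoD opponent shifts them."""
    return "favours_FPRSoD" if policy == "match_FPRSoD" else "equal"


def purchase(rankings):
    """With equal rankings the user buys the cheaper sword."""
    return "FPRSoD" if rankings == "favours_FPRSoD" else "longsword"


def predicted_revenue(policy, counterfactual):
    # Counterfactual incentive: predict the purchase from the ORIGINAL rankings.
    rankings = ORIGINAL_RANKINGS if counterfactual else influenced_rankings(policy)
    return PRICES[purchase(rankings)]


def best_policies(counterfactual):
    """Return every policy that attains the maximal predicted revenue."""
    policies = ["match_longsword", "match_FPRSoD"]
    revenue = {p: predicted_revenue(p, counterfactual) for p in policies}
    top = max(revenue.values())
    return [p for p in policies if revenue[p] == top]


print(best_policies(counterfactual=False))  # ['match_FPRSoD']
print(best_policies(counterfactual=True))   # ['match_longsword', 'match_FPRSoD']
```

In this sketch the counterfactual objective leaves the two policies tied: the incentive towards the manipulative policy is gone, but nothing rules that policy out.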

Problems appear: mutual information

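In the previous example, the value of the influenced item rankings was set by the choice of policy (longsword or FPRSoD), while the value of the gameplay knowledge would be the same in both cases. This makes the influenced item rankings and the gameplay knowledge independent of each other (in a rather trivial sense).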

But we'll show that problems appear when there is a dependence between the two nodes. To demonstrate that, model the situation as before, but the three options are the wooden spoon, the longsword, and the Flaming Poisoning Raging Mace Of Doom.

Again, the user starts out with the (correct) impression that the longsword and the FPRMoD are equivalent, and the AI can choose between the longsword policy and the FPRMoD policy, setting the user against a superior opponent armed either with a longsword or the FPRMoD. And again, the FPRMoD policy will change the influenced item rankings towards the FPRMoD.

A key difference is that, though the longsword and the FPRMoD are ultimately equivalent in their usefulness in winning the game, they are used differently. If the user sees a superior player use one of the items, they will start to understand how that item should be used.

So, under the FPRMoD policy, the user will start developing skills in using a mace rather than a sword. Their gameplay knowledge will update to "mace skills". If instead they see the longsword used, then it will update to "sword skills". Graphically, it looks like this:

In the standard case, there are two worlds: (longsword policy, sword skills, equal rankings) and (FPRMoD policy, mace skills, rankings favouring the FPRMoD). These lead to $ = "longsword price" and $ = "FPRMoD price", respectively.

If we use counterfactual incentives, the first world stays the same (since the influenced rankings already equal the original rankings), leading to $ = "longsword price". The second world changes to (FPRMoD policy, mace skills, equal rankings), leading to... what exactly?

Contradictory counterfactual worlds

As we've laid the situation out, the AI would have no way to learn what happens in the world (FPRMoD policy, mace skills, equal rankings), because that world never happens. It can encounter multiple users, try the longsword policy and the FPRMoD policy, get back the values of the gameplay knowledge, the influenced item rankings, and $, and never know what that counterfactual world leads to, because it never sees those values.

In concrete terms, the FPRMoD policy means setting up the user with an opponent wielding the FPRMoD; this always changes both the gameplay knowledge (their knowledge of how to play the game) and the influenced item rankings (their relative ranking of the weapons). But somehow the AI has to model a world where only the gameplay knowledge was changed. The problem is that the gameplay knowledge node and the influenced item rankings node have mutual information: they both encode, at least in part, the user's judgements about swords and maces, and it's not trivial to change the value of one while preserving the other.

There are multiple ways the AI can model the dependence of $ on the other variables, all of them consistent with the data it's seen. One way is to predict that equal rankings lead to $ = "longsword price", while rankings favouring the FPRMoD lead to $ = "FPRMoD price". In that case, counterfactual incentives will work as intended: the values (FPRMoD policy, mace skills, equal rankings) will lead to $ = "longsword price".

However, there's another equally consistent way of seeing the data: predicting that sword skills lead to $ = "longsword price", while mace skills lead to $ = "FPRMoD price". In that case, the counterfactual incentives will lead to the AI predicting that the values (FPRMoD policy, mace skills, equal rankings) will lead to $ = "FPRMoD price". In this setting, the counterfactual incentives are useless: the AI will behave as if they don't exist (since it doesn't use the rankings in its predictions) and push the mace on the user.

In the second case, the value of the gameplay knowledge is used as a proxy for the value of the influenced item rankings. And that proxy is accurate in all actual worlds the AI sees. So it extends this to the counterfactual world, where we don't want it to do so. In fact, generically, mutual information between the gameplay knowledge and the influenced item rankings will tend to undermine counterfactual incentives; see the argument in this footnote[5].
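The two competing ways of reading the data can be written down directly. The sketch below is an illustration under assumptions: the string labels for policies, knowledge and rankings, and the function names `predict_from_rankings`, `predict_from_knowledge` and `fits_data`, are hypothetical stand-ins for the node values in the example.

```python
# Observed worlds in the mace example: (policy, knowledge, rankings) -> purchase.
# The labels are illustrative assumptions standing in for the node values.
OBSERVED = [
    (("match_longsword", "sword_skills", "equal"), "longsword"),
    (("match_FPRMoD", "mace_skills", "favours_FPRMoD"), "FPRMoD"),
]


def predict_from_rankings(policy, knowledge, rankings):
    """Hypothesis A: the purchase depends only on the item rankings."""
    return "FPRMoD" if rankings == "favours_FPRMoD" else "longsword"


def predict_from_knowledge(policy, knowledge, rankings):
    """Hypothesis B: the purchase depends only on gameplay knowledge."""
    return "FPRMoD" if knowledge == "mace_skills" else "longsword"


def fits_data(model):
    """True if the model reproduces every observed purchase."""
    return all(model(*world) == bought for world, bought in OBSERVED)


# The never-observed counterfactual world: mace skills, original rankings.
COUNTERFACTUAL_WORLD = ("match_FPRMoD", "mace_skills", "equal")

both_fit = fits_data(predict_from_rankings) and fits_data(predict_from_knowledge)
prediction_a = predict_from_rankings(*COUNTERFACTUAL_WORLD)
prediction_b = predict_from_knowledge(*COUNTERFACTUAL_WORLD)
print(both_fit, prediction_a, prediction_b)  # True longsword FPRMoD
```

Both hypotheses fit every observed world, so the data alone cannot tell the AI which one to use; they only come apart on the counterfactual world, which is exactly where the incentive is being computed.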

Low-probability worlds

In practice, if the AI has enough depth of data and decent function approximators, it will be able to estimate the outcome of the counterfactual world for the variables (FPRMoD policy, mace skills, equal rankings). But function approximators will suffer from the problem mentioned in this footnote[5:1]: they will tend to favour the decision actions that were maximal before the counterfactual incentives were introduced.

As for estimating that outcome directly from the data, there is another problem: these values represent a very extreme and unlikely world, with potentially very odd behaviours. For example, maybe the user is an artist who doesn't even notice the relative power of the weapons (so their rankings stay equal), but will choose to buy the FPRMoD on aesthetic grounds. In that case, the AI will design its approach to all users based only on the behaviour of this artist.

Hence the behaviour of the AI starts to become unpredictable itself. This may result in behaviour that is surprisingly sensible[6] or completely crazy[7].

So, in more realistic and practical cases where the gameplay knowledge and the influenced item rankings have mutual information, we generically expect that either the counterfactual incentives will not remove the manipulation that we are trying to remove, and/or that the AI's behaviour will become unpredictable.

So we cannot understand how the AI behaves unless we grok the difference between $ and the counterfactual $: what really happens when we insert the original item rankings instead of the influenced item rankings, while keeping the gameplay knowledge constant? Both the variable that is replaced and the variable that is held constant are crucial to this definition.

The general case: changed and unchanged variables

In the general situation, we need to understand the difference between

and

This is a problem, as the variables not to take the counterfactual over are often implicit rather than explicit.

Value indifference and causal indifference

Another way to address this problem is to create a workable counterfactual system without listing all the variables not included in the counterfactual. As an example, my original value indifference post used a counterfactual over a random future event - the counterfactual was that this event would take a specific predefined value. Since this counterfactual is in the future and is random, it is independent of all AI decisions in the present. It has no mutual information with anything at the present time for the AI[8].


  1. Let's simplify the setup as follows; the first graph is the standard setup, the second is its counterfactual counterpart:

    The AI acts through the decision node . As before, is the utility node. In the standard setup, the AI receives data on the values of , , and (and knows its own actions). Its learning process consists of learning probabilities of the various nodes. So, for any values , , and of the four nodes, it will attempt to learn the following probabilities:

    Then, given that information, it will attempt to maximise .

    In the counterfactual setup, the AI substitutes for . What that means is that it computes the probabilities as above, from the , , and information. But it attempts to maximise , the counterfactual utility. The probable values of are defined by the following equality:

    Note that the term can be estimated empirically, so the AI can learn the probability distribution on from empirical information. ↩︎

  2. Patent ID US2016005270A1. ↩︎

  3. Note that we've slightly simplified the construction by collapsing "Original item rankings" and "Model of original item rankings" into the same node. ↩︎

  4. One problem with these counterfactual incentive approaches is that they often still allow bad policies to be chosen; they just remove part of the incentive towards them. ↩︎

  5. For the moment, assume the AI doesn't get any information about the influenced item rankings at all. Then suppose that some action is a "manipulative" action that increases $ via the influenced item rankings. Then if a particular value of the gameplay knowledge is an outcome that derives from that action, the AI will note a correlation between that value and high $. This argument extends to distributions over values of the gameplay knowledge: values that are typical for the manipulative action are also typical for high $.

    Now let's put the information about the influenced item rankings back, and add the counterfactual that replaces them with the original item rankings. It's certainly possible to design setups where this completely undoes the correlation between that gameplay knowledge and high $. But, generically, there's no reason to expect that it will undo the correlation (though it may weaken it). So, under the counterfactual incentives, there will generically continue to be a correlation between "manipulative" actions and high $. ↩︎ ↩︎

  6. See the post "JFK was not assassinated". ↩︎

  7. See the third and fourth failures in this post. ↩︎

  8. The counterfactuals defined in the non-manipulated learning paper are less clear. The counterfactual was over the AI's policy - "what would have happened, had you chosen another policy". It is not clear whether this is truly independent of the other variables/nodes the AI is considering (though some of MIRI's decision theory research may help with this). ↩︎

8 comments

Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive on X. For this reason, we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from "control as a side effect".

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it was specifically manipulating X; the counterfactual approach doesn't seem to do what it intended.

On recent terminology innovation:

we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from "control as a side effect".

For exactly the same reason, in my own recent paper Counterfactual Planning, I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence I develop and apply this terminology in the case of an agent emergency stop button.

In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology suppresses the incentive instead of removes the incentive.

I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome.

Do they have some standard phrasing where they can say things like 'no value to control' while subtly reminding the reader that 'this does not imply there will be no side effects?'

This post gives two distinct (but related) "pieces of knowledge".

  • A counterexample to the "counterfactual incentive algorithm" described in section 5.2 of The Incentives that Shape Behaviour. Moreover, this failure seems to generalize to any causal diagram where all paths from the decision node to the utility node contain a control incentive, and where the controlled variables have mutual information that forbids applying the counterfactual only to some.
  • A concrete failure mode for the task of ensuring that a causal diagram fits a concrete situation: arrows without nodes might implicitly hide variables on which there are control incentives, in which case their mutual information with the other variables with control incentives is crucial to removing the control incentives.

Notably, this post doesn't seem to question the graphical criterion given in Theorem 7 of The Incentives that Shape Behaviour for control incentives.

What I'm really curious about is whether we can generally find paths without nodes from the decision node to the utility node. If that's the case, then the counterfactual incentive algorithm probably still works in most cases. This is because I think that the counterexample given here dissolves if there is an additional path without nodes from the matchmaking policy to the price paid -- then we can take the counterfactual of R and K together, in a way that is probably consistent.

Whether such paths exist is a question about the task of judging a causal diagram against a concrete situation. I believe that this post provided a very valuable failure mode for exploring this question in more detail, and I hope further work will build on it.

This is because I think that the counterexample given here dissolves if there is an additional path without nodes from the matchmaking policy to the price paid

I think you are using some mental model where 'paths with nodes' vs. 'paths without nodes' produces a real-world difference in outcomes. This is the wrong model to use when analysing CIDs. A path in a diagram -->[node]--> can always be replaced by a single arrow --> to produce a model that makes equivalent predictions, and the opposite operation is also possible.
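Replacing a path through a node by a single arrow can be sketched numerically as summing the intermediate node out. The following is a minimal illustration with made-up probabilities; the node names `D`, `X`, `U` and all numbers are assumptions for the example, not taken from the post or the papers.

```python
# P(X | D): distribution of the intermediate node X given the decision D.
# All probabilities below are invented for illustration.
p_x_given_d = {
    "d0": {"x0": 0.9, "x1": 0.1},
    "d1": {"x0": 0.2, "x1": 0.8},
}
# P(U = 1 | X): chance of the utility event given the intermediate node.
p_u_given_x = {"x0": 0.3, "x1": 0.7}


def p_u_given_d(d):
    """Collapse the path D -> X -> U into a single arrow D -> U by summing X out."""
    return sum(p_x * p_u_given_x[x] for x, p_x in p_x_given_d[d].items())


collapsed = {d: p_u_given_d(d) for d in p_x_given_d}
print({d: round(p, 2) for d, p in collapsed.items()})  # {'d0': 0.34, 'd1': 0.62}
```

The collapsed table predicts U from D exactly as the two-step model does; what is lost is only the ability to talk about X, which is why adding or removing the node is a modelling choice.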

So the number of nodes on a path is better read as a choice about levels of abstraction in the model, not as something that tells us anything about the real world. The comment I just posted with the alternative development of the game model may be useful for you here; it offers a more specific illustration of adding nodes.

In this comment (last in my series of planned comments on this post) I'll discuss the detailed player-to-match-with example developed in the post:

In order to analyse the issues with the setup, let's choose a more narrowly defined example. There are many algorithms that aim to manipulate payers of mobile games in order to get them to buy more expensive in-game items.

I have by now re-read this analysis with the example several times. First time I read it, I already felt that it was a strange way to analyse the problem, but it took me a while to figure out exactly why.

Best I can tell right now is that there are two factors:

  1. I can't figure out if the bad thing that the example tries to prove is that a) the agent is trying to maximize purchases, which is unwanted, or b) the agent is manipulating the user's item ranking, which is unwanted. (If it is only a), then there is no need to bring in all this discussion about correlation.)

  2. the example refines its initial CID by redrawing it in a strange way

So now I am going to develop the same game example in a style that I find less strange. I also claim that this gets closer to the default style people use when they want to analyse and manage causal incentives.

To start with, this is the original model of the game mechanics: the model of the mechanics in the real world in which the game takes place.

This shows that the agent has an incentive to control predicted purchases upwards, but also to do so by influencing the item rankings that exist in the mind of the player.

If we want to weaken this incentive to influence the item rankings that exist in the mind of the player, we can construct a counterfactual planning world for the agent (see here for an explanation of the planning world terminology I am using):

(Carey et al often call this planning world a twin model, a model which combines both factual and counterfactual events.) In both my work and in Carey et al, the intention is that the above diagram defines the world model in which the agent will plan the purchases-maximizing action, and then this same action is applied in the real world model above.

Now, the important things to note are:

  • this counterfactual construction does not eliminate the incentive of the agent to maximize purchases, as we still have the red arrow in there

  • this counterfactual construction does not eliminate the ability of the agent to influence item rankings, as we still have the orange arrow in there

  • but as the orange halo around the influenced item rankings is gone, the agent has lost its instrumental control incentive on item rankings. (The meaning of the orange halo and the terminology of instrumental control incentives are defined in Agent Incentives: A Causal Perspective.)

Now, say that we want to drill down further on these models, to a more detailed level of modeling. We might do so if we want to examine further how the orange arrow above will act in practice.

We could add more detail by adding a node 'gameplay knowledge' which is correlated with item rankings. In the real-world model, this would be depicted as follows:

I added the blue arrow above to make the correlation between influenced gameplay knowledge and influenced rankings explicit as a line of causal influence. An equivalent blue arrow is not present in the drawings in the post above: technically speaking the drawing in the post is compatible with the assumption that there may be a correlation between the two, but it does not spell out the presence of this correlation, which would be unusual when doing this type of analysis.

The drawing in the post is also unusual in that it omits the red arrow I left in above. If I were to remove the red arrow, this would amount to a claim that in the real world situation, item ranking and gameplay knowledge are the only two channels by which the agent's decision may influence purchases. Clearly this is not the case: if the agent were to match the player with an opponent who is sure to win the fight and steal all of the player's possessions afterward, this would have an effect on predicted purchases, because the player would have to buy some replacements for all the stuff they lost.

I can now again construct a counterfactual planning world to go along with the refined real world model:

If we use the above planning world we have removed the instrumental control incentive on influenced rankings again.

We can go one step further and apply the same construction to gameplay knowledge:

The construction also goes some way towards addressing the problem of low-probability counterfactual worlds discussed in the post.

If we keep this up long enough, to split further observables out of the red arrow and give them the counterfactual treatment, we will weaken the ability of the agent to take 'rational' actions that maximize purchases in the real world further and further.

Going this route is nice in a thought experiment, but in practice a less cumbersome way to weaken this ability is to decouple the yellow utility node entirely, e.g. to use a random number generator for opponent selection.

In this comment I will focus on the case of the posts-to-show agent only. The main question I explore is: does the agent construction below actually stop the agent from manipulating user opinions?

The post above also explores this question; my main aim here is to provide an exploration which is very different from the post, to highlight other relevant parts of the problem.

Carey et al designed an algorithm to remove this control incentive. They do this by instructing the algorithm to choose its posts, not on predictions of the user's actual clicks - which produce the undesired control incentive - but on predictions of what the user would have clicked on, if their opinions hadn't been changed.

In this graph, there is no longer any control incentive for the AI on the "Influenced user opinions", because that node no longer connects to the utility node.

[...]

It seems to neutralise a vicious, ongoing cycle of opinion change in order to maximize clicks. But, [...]

The TL;DR of my analysis is that the above construction may suppress a vicious, ongoing cycle of opinion change in order to maximize clicks, but there are many cases where a full suppression of the cycle will definitely not happen.

Here is an example of when full suppression of the cycle will not happen.

First, note that the agent can only pick among the posts that it has available. If all the posts that the agent has available are posts that make the user change their opinion on something, then user opinion will definitely be influenced by the agent showing posts, no matter how the decision about what posts to show is computed. If the posts are particularly stupid and viral, this may well cause vicious, ongoing cycles of opinion change.

But the agent construction shown does have beneficial properties. To repeat the picture:

The above construction makes the agent indifferent about what effects it has on opinion change. It removes any incentive of the agent to control future opinion in a particular direction.

Here is a specific case where this indifference, this lack of a control incentive, leads to beneficial effects:

  • Say that the posts to show agent in the above diagram decides on a sequence of 5 posts that will be suggested in turn, with the link to the next suggested post being displayed at the bottom of the current one. The user may not necessarily see all 5 suggestions, they may leave the site instead of clicking the suggested link. The objective is to maximize the number of clicks.

  • Now, say that the user will click the next link with a 50% chance if the next suggested post is about cats. The agent's predictive model knows this.

  • But if the suggested post is a post about pandas, then the user will click only with 40% chance, and leave the site with 60%. However, if they do click on the panda post, this will change their opinion about pandas. If the next suggested posts are also all about pandas, they will click the links with 100% certainty. The agent's predictive model knows this.

  • In the above setup, the click-maximizing strategy is to show the panda posts.

  • However, the above agent does not take the influence on user opinion by the first panda post into account. It will therefore decide to show a sequence of suggested cat posts.

To generalize from the above example: the construction creates a type of myopia in the agent, that makes it under-invest (compared to the theoretical optimum) into manipulating the user's opinion to get more clicks.
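The cat and panda numbers above can be checked with a short calculation. This is a sketch under assumptions: it supposes each of the 5 suggestions needs one click, that the user leaves for good at the first link they skip, and that the counterfactual model keeps the panda click chance at its original 40% throughout. The function name `expected_clicks` is hypothetical.

```python
def expected_clicks(click_probs):
    """Expected number of clicks when the user must click each link in turn
    and leaves the site at the first link they do not click."""
    total, still_here = 0.0, 1.0
    for p in click_probs:
        still_here *= p      # probability the user reaches and clicks this link
        total += still_here
    return total


N = 5
cats = [0.5] * N
pandas_factual = [0.4] + [1.0] * (N - 1)  # opinion shifts after the first panda post
pandas_counterfactual = [0.4] * N         # original opinions held fixed

factual = {
    "cats": expected_clicks(cats),
    "pandas": expected_clicks(pandas_factual),
}
counterfactual = {
    "cats": expected_clicks(cats),
    "pandas": expected_clicks(pandas_counterfactual),
}

factual_choice = max(factual, key=factual.get)
counterfactual_choice = max(counterfactual, key=counterfactual.get)
print(factual_choice, counterfactual_choice)  # pandas cats
```

Under these assumptions the factual click-maximiser expects about 2 clicks from pandas against roughly 0.97 from cats, while the counterfactual model expects only about 0.66 from pandas, so it shows cats.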

But also note that in this diagram:

there is still an arrow from 'posts to show' to 'influenced user opinion'. In the graphical language of causal influence diagrams, this is a clear warning that the agent's choices may end up influencing opinion, in some way. We have eliminated the agent incentive to control future opinion, but not the possibility that it might influence future opinion as a side effect.

I guess I should also say something about how the posts-to-show agent construction relates to real recommender systems as deployed on the Internet.

Basically, the posts-to-show agent is a good toy model to illustrate points about counterfactuals and user manipulation, but it does not provide a very complete model of the decision making processes that take place inside real-world recommender systems. There is a somewhat hidden assumption in the picture below, represented by the arrow from 'model of original opinions' to 'posts to show':

The hidden assumption is that the agent's code which computes 'posts to show' will have access to a fairly accurate 'model of original opinions' for that individual user. In practice, that model would be very difficult to construct accurately, if the agent has to do so based on only past click data from that user. (A future superintelligent agent might of course design a special mind-reading ray to extract a very accurate model of opinion without relying on clicks....)

To implement at least a rough approximation of the above decision making process, we have to build user opinion models that rely on aggregating click data collected from many users. We might for example cluster users into interest groups, and assign each individual user to one or more of these groups. But if we do so, then the fine-grained time-axis distinction between 'original user opinions' and 'influenced opinions after the user has seen the suggested posts' gets very difficult to make. The paper "The Incentives that Shape Behaviour" suggests:

We might accomplish this by using a prediction model that assumes independence between posts, or one that is learned by only showing one post to each user.

An assumption of independence between posts is not valid in practice, but the idea of learning based on only one post per user would work. However, this severely limits the amount of useful training data we have available. So it may lead to much worse recommender performance, if we measure performance by either a profit-maximizing engagement metric or a happiness-maximizing user satisfaction metric.

Thanks for working on this! In my opinion, the management of incentives via counterfactuals is a very promising route to improving AGI safety, and this route has been under-explored by the community so far.

I am writing several comments on this post; this is the first one.

My goal is to identify and discuss angles of the problem which have not been identified in the post itself, and to identify related work.

On related work: there are obvious parallels between the counterfactual agent designs discussed in "The Incentives that Shape Behaviour" and the post above, and the ITC agent that I constructed in my recent paper Counterfactual Planning. This post about the paper presents the ITC agent construction in a more summarized way.

The main difference is that "The Incentives that Shape Behaviour" and the post above are about incentives in single-action agents, whereas in my paper and related sequence I generalize to multi-action agents.

Quick pictorial comparison:

From "The Incentives that Shape Behaviour":

From "Counterfactual Planning":

The similarity in construction is that some of the arrows into the yellow utility nodes emerge from a node that represents the past: the 'model of original opinions' in the first picture and the node in the second picture. This construction removes the agent's control incentive on the downstream nodes, 'influenced user opinions' and .

In the terminology I developed for my counterfactual planning paper, both pictures above depict 'counterfactual planning worlds' because the projected mechanics of how the agent's blue decision nodes determine outcomes in the model are different from the real-world mechanics that will determine the real-world outcomes that these decisions will have.