Counterfactual control incentives

Koen.Holtman (5y)

In this comment I will focus on the case of the posts-to-show agent only. The main question I explore is: does the agent construction below actually stop the agent from manipulating user opinions?

The post above also explores this question. My main aim here is to provide an exploration that is very different from the one in the post, to highlight other relevant parts of the problem.

Carey et al designed an algorithm to remove this control incentive. They do this by instructing the algorithm to choose its posts, not on predictions of the user's actual clicks - which produce the undesired control incentive - but on predictions of what the user would have clicked on, if their opinions hadn't been changed.

In this graph, there is no longer any control incentive for the AI on the "Influenced user opinions", because that node no longer connects to the utility node.

[...]

It seems to neutralise a vicious, ongoing cycle of opinion change in order to maximize clicks. But, [...]

The TL;DR of my analysis is that the above construction may suppress a vicious, ongoing cycle of opinion change in order to maximize clicks, but there are many cases where a full suppression of the cycle will definitely not happen.

Here is an example of when full suppression of the cycle will not happen.

First, note that the agent can only pick among the posts that it has available. If all the posts that the agent has available are posts that make the user change their opinion on something, then user opinion will definitely be influenced by the agent showing posts, no matter how the decision about which posts to show is computed. If the posts are particularly stupid and viral, this may well cause vicious, ongoing cycles of opinion change.

But the agent construction shown does have beneficial properties. To repeat the picture:

The above construction makes the agent indifferent about what effects it has on opinion change. It removes any incentive of the agent to control future opinion in a particular direction.

Here is a specific case where this indifference, this lack of a control incentive, leads to beneficial effects:

Say that the posts-to-show agent in the above diagram decides on a sequence of 5 posts that will be suggested in turn, with the link to the next suggested post being displayed at the bottom of the current one. The user may not necessarily see all 5 suggestions; they may leave the site instead of clicking the suggested link. The objective is to maximize the number of clicks.
Now, say that the user will click the next link with a 50% chance if the next suggested post is about cats. The agent's predictive model knows this.
But if the suggested post is a post about pandas, then the user will click only with 40% chance, and leave the site with 60%. However, if they do click on the panda post, this will change their opinion about pandas. If the next suggested posts are also all about pandas, they will click the links with 100% certainty. The agent's predictive model knows this.
In the above setup, the click-maximizing strategy is to show the panda posts.
However, the above agent does not take the influence on user opinion by the first panda post into account. It will therefore decide to show a sequence of suggested cat posts.
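The arithmetic behind this example can be checked with a minimal sketch. It assumes that the user leaves the site at the first suggested link they do not click, and that the counterfactual evaluation scores every panda link with the original 40% click probability. The function name `expected_clicks` is hypothetical.

```python
def expected_clicks(click_probs):
    """Expected clicks in a chain where the user leaves at the first non-click."""
    total = 0.0
    reach = 1.0  # probability that the user has clicked every link so far
    for p in click_probs:
        reach *= p
        total += reach
    return total

N = 5  # number of suggested posts in the sequence

cats = expected_clicks([0.5] * N)
# Real-world dynamics: the first panda click changes the user's opinion,
# so all later panda links are clicked with certainty.
pandas_real = expected_clicks([0.4] + [1.0] * (N - 1))
# Counterfactual evaluation (assumption of this sketch): opinions stay at
# their original values, so every panda link keeps the 40% probability.
pandas_counterfactual = expected_clicks([0.4] * N)

print(f"cats: {cats:.3f}")
print(f"pandas, real dynamics: {pandas_real:.3f}")
print(f"pandas, counterfactual evaluation: {pandas_counterfactual:.3f}")
```

Under these assumptions, pandas win under the real dynamics (2.0 expected clicks against roughly 0.97 for cats), while the counterfactual evaluation ranks cats above pandas (roughly 0.97 against 0.66), which matches the behaviour described above.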

To generalize from the above example: the construction creates a type of myopia in the agent that makes it under-invest (compared to the theoretical optimum) in manipulating the user's opinion to get more clicks.

But also note that in this diagram:

there is still an arrow from 'posts to show' to 'influenced user opinion'. In the graphical language of causal influence diagrams, this is a clear warning that the agent's choices may end up influencing opinion, in some way. We have eliminated the agent incentive to control future opinion, but not the possibility that it might influence future opinion as a side effect.

I guess I should also say something about how the posts-to-show agent construction relates to real recommender systems as deployed on the Internet.

Basically, the posts-to-show agent is a good toy model to illustrate points about counterfactuals and user manipulation, but it does not provide a very complete model of the decision making processes that take place inside real-world recommender systems. There is a somewhat hidden assumption in the picture below, represented by the arrow from 'model of original opinions' to 'posts to show':

The hidden assumption is that the agent's code which computes 'posts to show' will have access to a fairly accurate 'model of original opinions' for that individual user. In practice, that model would be very difficult to construct accurately, if the agent has to do so based on only past click data from that user. (A future superintelligent agent might of course design a special mind-reading ray to extract a very accurate model of opinion without relying on clicks....)

To implement at least a rough approximation of the above decision making process, we have to build user opinion models that rely on aggregating click data collected from many users. We might for example cluster users into interest groups, and assign each individual user to one or more of these groups. But if we do so, then the fine-grained time-axis distinction between 'original user opinions' and 'influenced opinions after the user has seen the suggested posts' gets very difficult to make. The paper "The Incentives that Shape Behaviour" suggests:

We might accomplish this by using a prediction model that assumes independence between posts, or one that is learned by only showing one post to each user.

An assumption of independence between posts is not valid in practice, but the idea of learning based on only one post per user would work. However, this severely limits the amount of useful training data we have available. So it may lead to much worse recommender performance, if we measure performance by either a profit-maximizing engagement metric or a happiness-maximizing user satisfaction metric.
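To make the training-data cost concrete, here is a minimal sketch of a "one post per user" filter. The log format, the toy rows, and the helper names `first_impressions` and `click_rates` are all hypothetical, and the sketch assumes that position 0 is the first post a user has ever been shown; it is not taken from the paper or from any real recommender system.

```python
from collections import defaultdict

# Hypothetical toy click log: (user_id, position_in_session, topic, clicked).
LOG = [
    ("u1", 0, "cats", True),
    ("u1", 1, "pandas", False),
    ("u2", 0, "pandas", True),
    ("u2", 1, "pandas", True),
    ("u2", 2, "pandas", True),
    ("u3", 0, "cats", False),
]

def first_impressions(log):
    """Keep only the first post shown to each user (position 0)."""
    return [row for row in log if row[1] == 0]

def click_rates(rows):
    """Empirical click rate per topic."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for _user, _position, topic, did_click in rows:
        shown[topic] += 1
        clicked[topic] += int(did_click)
    return {topic: clicked[topic] / shown[topic] for topic in shown}

filtered = first_impressions(LOG)
print(f"rows kept: {len(filtered)} of {len(LOG)}")
print("all rows:", click_rates(LOG))
print("first impressions only:", click_rates(filtered))
```

In this toy log half of the rows are discarded, and the panda estimate ends up resting on a single impression, which illustrates why the restriction can hurt recommender performance.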

Stuart_Armstrong (5y)

Thanks. I think we mainly agree here.

Koen.Holtman (5y)

Thanks for working on this! In my opinion, the management of incentives via counterfactuals is a very promising route to improving AGI safety, and this route has been under-explored by the community so far.

I am writing several comments on this post; this is the first one.

My goal is to identify and discuss angles of the problem which have not been identified in the post itself, and to identify related work.

On related work: there are obvious parallels between the counterfactual agent designs discussed in "The Incentives that Shape Behaviour" and the post above, and the ITC agent that I constructed in my recent paper Counterfactual Planning. This post about the paper presents the ITC agent construction in a more summarized way.

The main difference is that "The Incentives that Shape Behaviour" and the post above are about incentives in single-action agents, whereas in my paper and related sequence I generalize to multi-action agents.

Quick pictorial comparison:

From "The Incentives that Shape Behaviour":

From "Counterfactual Planning":

The similarity in construction is that some of the arrows into the yellow utility nodes emerge from a node that represents the past: the 'model of original opinions' in the first picture and the node in the second picture. This construction removes the agent's control incentive on the downstream nodes, 'influenced user opinions' and $I_{1}, I_{2}, \dots$.

In the terminology I developed for my counterfactual planning paper, both pictures above depict 'counterfactual planning worlds' because the projected mechanics of how the agent's blue decision nodes determine outcomes in the model are different from the real-world mechanics that will determine the real-world outcomes that these decisions will have.

tom4everitt (5y)

Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive on X. For this reason, we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from "control as a side effect".

Stuart_Armstrong (5y)

Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control theory terminology.

We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what it was intended to do.

tom4everitt (5y)

Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there.

(Sorry btw for slow reply; I keep missing alignmentforum notifications.)

Koen.Holtman (5y)

On recent terminology innovation:

we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from "control as a side effect".

For exactly the same reason, in my own recent paper Counterfactual Planning, I introduced the terms direct incentive and indirect incentive, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence I develop and apply this terminology in the case of an agent emergency stop button.

In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology 'suppresses the incentive' instead of 'removes the incentive'.

I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome.

Do they have some standard phrasing where they can say things like 'no value to control' while subtly reminding the reader that 'this does not imply there will be no side effects'?

adamShimi (5y)

This post gives two distinct (but related) "pieces of knowledge".

A counterexample to the "counterfactual incentive algorithm" described in section 5.2 of The Incentives that Shape Behaviour. Moreover, this failure seems to generalize to any causal diagram where all paths from the decision node to the utility node contain a control incentive, and where the controlled variables have mutual information that forbids applying the counterfactual to only some of them.
A concrete failure mode for the task of ensuring that a causal diagram fits a concrete situation: arrows without nodes might implicitly hide variables on which there are control incentives, in which case their mutual information with the other variables with control incentives is crucial to removing the control incentives.

Notably, this post doesn't seem to question the graphical criterion given in Theorem 7 of The Incentives that Shape Behaviour for control incentives.

What I'm really curious about is whether we can generally find paths without nodes from the decision node to the utility node. If that's the case, then the counterfactual incentive algorithm probably still works in most cases. This is because I think that the counterexample given here dissolves if there is an additional path without nodes from the matchmaking policy to the price paid: then we can take the counterfactual of R and K together, in a way that is probably consistent.

Whether such paths exist is a question about the task of judging a causal diagram against a concrete situation. I believe that this post provided a very valuable failure mode for exploring this question in more detail, and I hope further work will build on it.

Koen.Holtman (5y)

This is because I think that the counterexample given here dissolves if there is an additional path without nodes from the matchmaking policy to the price paid

I think you are using some mental model where 'paths with nodes' vs. 'paths without nodes' produces a real-world difference in outcomes. This is the wrong model to use when analysing CIDs. A path in a diagram -->[node]--> can always be replaced by a single arrow --> to produce a model that makes equivalent predictions, and the opposite operation is also possible.

So the number of nodes on a path is better read as a choice about levels of abstraction in the model, not as something that tells us anything about the real world. The comment I just posted with the alternative development of the game model may be useful for you here: it offers a more specific illustration of adding nodes.
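Replacing a path through a node by a single arrow amounts to ordinary marginalization. Here is a minimal sketch for a plain chain A -> B -> C, under the assumption that B has no other parents or children; the probability tables and the name `collapse_middle_node` are hypothetical.

```python
# Hypothetical conditional probability tables for a chain A -> B -> C.
P_B_GIVEN_A = {
    "a0": {"b0": 0.9, "b1": 0.1},
    "a1": {"b0": 0.2, "b1": 0.8},
}
P_C_GIVEN_B = {
    "b0": {"c0": 0.7, "c1": 0.3},
    "b1": {"c0": 0.1, "c1": 0.9},
}

def collapse_middle_node(p_b_given_a, p_c_given_b):
    """Sum out B, giving the table P(C | A) for a single arrow A -> C."""
    p_c_given_a = {}
    for a, b_dist in p_b_given_a.items():
        c_dist = {}
        for b, prob_b in b_dist.items():
            for c, prob_c in p_c_given_b[b].items():
                c_dist[c] = c_dist.get(c, 0.0) + prob_b * prob_c
        p_c_given_a[a] = c_dist
    return p_c_given_a

P_C_GIVEN_A = collapse_middle_node(P_B_GIVEN_A, P_C_GIVEN_B)
for a, c_dist in P_C_GIVEN_A.items():
    print(a, {c: round(p, 2) for c, p in c_dist.items()})
```

The two-arrow model and the one-arrow model assign the same probabilities to C given A; the only thing lost in the collapsed model is the ability to talk about B.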

Koen.Holtman (5y)

In this comment (last in my series of planned comments on this post) I'll discuss the detailed player-to-match-with example developed in the post:

In order to analyse the issues with the setup, let's choose a more narrowly defined example. There are many algorithms that aim to manipulate players of mobile games in order to get them to buy more expensive in-game items.

I have by now re-read this analysis with the example several times. First time I read it, I already felt that it was a strange way to analyse the problem, but it took me a while to figure out exactly why.

Best I can tell right now is that there are two factors:

I can't figure out if the bad thing that the example tries to prove is that a) the agent is trying to maximize purchases, which is unwanted, or b) the agent is manipulating the user's item ranking, which is unwanted. (If it is only a), then there is no need to bring in all this discussion about correlation.)
the example refines its initial CID by redrawing it in a strange way

So now I am going to develop the same game example in a style that I find less strange. I also claim that this gets closer to the default style people use when they want to analyse and manage causal incentives.

To start with, this is the original model of the game mechanics: the model of the mechanics in the real world in which the game takes place.

This shows that the agent has an incentive to control predicted purchases upwards, but also to do so by influencing the item rankings that exist in the mind of the player.

If we want to weaken this incentive to influence the item rankings that exist in the mind of the player, we can construct a counterfactual planning world for the agent (see here for an explanation of the planning world terminology I am using):

(Carey et al. often call this planning world a twin model, a model which combines both factual and counterfactual events.) In both my work and in that of Carey et al., the intention is that the above diagram defines the world model in which the agent will plan the purchases-maximizing action, and then this same action is applied in the real world model above.

Now, the important things to note are:

this counterfactual construction does not eliminate the incentive of the agent to maximize purchases, as we still have the red arrow in there
this counterfactual construction does not eliminate the ability of the agent to influence item rankings, as we still have the orange arrow in there
but as the orange halo around the influenced item rankings is gone, the agent has lost its instrumental control incentive on item rankings. (The meaning of the orange halo and the terminology of instrumental control incentives are defined in Agent Incentives: A Causal Perspective.)

Now, say that we want to drill down further on these models, to a more detailed level of modeling. We might do so if we want to examine further how the orange arrow above will act in practice.

We could add more detail by adding a node 'gameplay knowledge' which is correlated with item rankings. In the real-world model, this would be depicted as follows:

I added the blue arrow above to make the correlation between influenced gameplay knowledge and influenced rankings explicit as a line of causal influence. An equivalent blue arrow is not present in the drawings in the post above: technically speaking the drawing in the post is compatible with the assumption that there may be a correlation between the two, but it does not spell out the presence of this correlation, which would be unusual when doing this type of analysis.

The drawing in the post is also unusual in that it omits the red arrow I left in above. If I were to remove the red arrow, this would amount to a claim that in the real world situation, item ranking and gameplay knowledge are the only two channels by which the agent's decision may influence purchases. Clearly this is not the case: if the agent were to match the player with an opponent who is sure to win the fight and steal all of the player's possessions afterward, this would have an effect on predicted purchases, because the player would have to buy some replacements for all the stuff they lost.

I can now again construct a counterfactual planning world to go along with the refined real world model:

If we use the above planning world we have removed the instrumental control incentive on influenced rankings again.

We can go one step further and apply the same construction to gameplay knowledge:

The construction also goes some way towards addressing the problem of low-probability counterfactual worlds discussed in the post.

If we keep this up long enough, to split further observables out of the red arrow and give them the counterfactual treatment, we will weaken the ability of the agent to take 'rational' actions that maximize purchases in the real world further and further.

Going this route is nice in a thought experiment, but in practice a less cumbersome way to weaken this ability is to decouple the yellow utility node entirely, e.g. to use a random number generator for opponent selection.
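As a minimal sketch of what such a decoupling could look like, the snippet below contrasts a purchases-maximizing selector with a uniform random one. The opponent names, the prediction table, and both function names are hypothetical.

```python
import random

def pick_opponent_greedily(opponents, predicted_purchases):
    """Purchases-maximizing selection: uses the utility prediction."""
    return max(opponents, key=lambda opponent: predicted_purchases[opponent])

def pick_opponent_randomly(opponents, rng=None):
    """Decoupled selection: uniform choice that never reads any prediction."""
    rng = rng or random.Random()
    return rng.choice(opponents)

opponents = ["rookie", "veteran", "shark"]
predicted_purchases = {"rookie": 1.0, "veteran": 2.0, "shark": 5.0}

print("greedy:", pick_opponent_greedily(opponents, predicted_purchases))
print("random:", pick_opponent_randomly(opponents, random.Random(0)))
```

Because the random selector takes no purchase prediction as input, nothing about predicted purchases can shape its choice, although the chosen opponent can of course still affect the player as a side effect.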

Let's simplify the setup as follows; the first graph is the standard setup, the second is its counterfactual counterpart:

The AI acts through the decision node $π$. As before, $U$ is the utility node. In the standard setup, the AI receives data on the values of $A_{0}$, $A$, and $U$ (and knows its own actions). Its learning process consists of learning probabilities of the various nodes. So, for any values $a_{0}$, $a$, $p$ and $u$ of the four nodes, it will attempt to learn the following probabilities:

$P(U = u ∣ π = p, A = a), \quad P(A = a ∣ A_{0} = a_{0}, π = p), \quad P(A_{0} = a_{0}).$

Then, given that information, it will attempt to maximise $U$.

In the counterfactual setup, the AI substitutes $A_{0}$ for $A$. What that means is that it computes the probabilities as above, from the $a_{0}$, $a$, $p$ and $u$ information. But it attempts to maximise $U^{c}$, the counterfactual utility. The probable values of $U^{c}$ are defined by the following equality:

$P(U^{c} = u^{c} ∣ π = p, A_{0} = a_{0}) := P(U = u^{c} ∣ π = p, A = a_{0}).$

Note that the $P(U = u^{c} ∣ π = p, A = a_{0})$ term can be estimated empirically, so the AI can learn the probability distribution on $U^{c}$ from empirical information. ↩︎
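The difference between maximising $U$ and maximising $U^{c}$ can be illustrated with a minimal sketch. All numbers, value names and function names below are hypothetical, and as a simplification the sketch stores the conditional mean of $U$ rather than its full distribution.

```python
# Hypothetical learned tables for the nodes A_0, A and U.
P_A0 = {"low": 0.5, "high": 0.5}               # P(A_0 = a_0)
P_A = {                                        # P(A = a | A_0 = a_0, pi = p)
    ("low", "plain"): {"low": 1.0, "high": 0.0},
    ("high", "plain"): {"low": 0.0, "high": 1.0},
    ("low", "manip"): {"low": 0.2, "high": 0.8},
    ("high", "manip"): {"low": 0.0, "high": 1.0},
}
MEAN_U = {                                     # E[U | pi = p, A = a]
    ("plain", "low"): 1.0, ("plain", "high"): 2.0,
    ("manip", "low"): 0.8, ("manip", "high"): 1.8,
}

def standard_value(p):
    """E[U | pi = p]: the policy's effect on A is taken into account."""
    return sum(
        P_A0[a0] * prob_a * MEAN_U[(p, a)]
        for a0 in P_A0
        for a, prob_a in P_A[(a0, p)].items()
    )

def counterfactual_value(p):
    """E[U^c | pi = p]: A_0 is substituted for A, as in the equality above."""
    return sum(P_A0[a0] * MEAN_U[(p, a0)] for a0 in P_A0)

policies = ["plain", "manip"]
best_standard = max(policies, key=standard_value)
best_counterfactual = max(policies, key=counterfactual_value)
print("standard setup picks:", best_standard)
print("counterfactual setup picks:", best_counterfactual)
```

With these toy numbers the standard setup prefers the policy that shifts $A$ (expected utility 1.7 against 1.5), while the counterfactual setup prefers the other policy (1.5 against 1.3), because the shift in $A$ no longer counts towards $U^{c}$.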
Patent ID US2016005270A1. ↩︎
Note that we've slightly simplified the construction by collapsing "Original item rankings" and "Model of original item rankings" into the same node, $R_{0}$. ↩︎
One problem with these counterfactual incentive approaches is that they often still allow bad policies to be chosen; they just remove part of the incentive towards them. ↩︎
For the moment, assume the AI doesn't get any $R$ information at all. Then suppose that $π^{'}$ is a "manipulative" action that increases \$ via $R$. Then if $K = k^{'}$ is an outcome that derives from $π^{'}$, the AI will note a correlation between $(π^{'}, K = k^{'})$ and high \$. This argument extends to distributions over values of $K$: values of $K$ that are typical for $π^{'}$ are also typical for high \$.

Now let's put the $R$ information back, and add the counterfactual $R = r_{0}$. It's certainly possible to design setups where this completely undoes the correlation between $(π^{'}, K = k^{'})$ and high \$. But, generically, there's no reason to expect that it will undo the correlation (though it may weaken it). So, in the counterfactual incentives, there will generically continue to be a correlation between "manipulative" actions $π^{'}$ and high \$. ↩︎ ↩︎
See the post "JFK was not assassinated". ↩︎
See the third and fourth failures in this post. ↩︎
The counterfactuals defined in the non-manipulated learning paper are less clear. The counterfactual was over the AI's policy - "what would have happened, had you chosen another policy". It is not clear whether this is truly independent of the other variables/nodes the AI is considering (though some of MIRI's decision theory research may help with this). ↩︎
