All of tom4everitt's Comments + Replies

Progress on Causal Influence Diagrams

Thanks Ilya for those links. In particular, the second one looks quite relevant to something we've been working on in a rather different context (that's the benefit of speaking the same language!).

We would also be curious to see a draft of the MDP-generalization once you have something ready to share!

IlyaShpitser (3mo): (This really is preliminary, e.g. they have not yet uploaded a newer version that incorporates peer-review suggestions.) --- Can't do stuff in the second paper without worrying about stuff in the first (unless your model is very simple).
AMA: Paul Christiano, alignment researcher


  • I think the existing approach and easy improvements don't seem like they can capture many important incentives such that you don't want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A's predictions about B's actions---then we want to say that the system has an incentive to manipulate the world but it doesn't seem like that is easy to incorporate into this kind of formalism).


This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs).  We're still ... (read more)

Counterfactual control incentives

Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. 

(Sorry btw for slow reply; I keep missing alignmentforum notifications.)

Counterfactual control incentives

Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)

We agree that lack of a control incentive on X does not mean that X is safe from influence by the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive... (read more)
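The graphical criterion behind this claim (roughly: X admits a control incentive iff X lies on a directed path from the decision D to a utility node U) can be sketched in a few lines. This is an illustrative sketch with hypothetical names, not the authors' implementation; the CID is represented as a plain adjacency dict.

```python
# Hypothetical sketch: a CID as a dict-of-lists digraph, with a check
# for the graphical control-incentive criterion -- roughly, X lies on
# a directed path from the decision D to a utility node U.

def reachable(graph, start, goal):
    """DFS reachability in a dict-of-lists digraph."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False

def admits_control_incentive(graph, decision, x, utility):
    # Some directed path D -> ... -> X -> ... -> U must exist.
    return reachable(graph, decision, x) and reachable(graph, x, utility)

# Toy CID: D -> X -> U, plus a side node S -> U that D cannot influence.
cid = {"D": ["X"], "X": ["U"], "S": ["U"]}
print(admits_control_incentive(cid, "D", "X", "U"))  # True
print(admits_control_incentive(cid, "D", "S", "U"))  # False
```

As the critique above notes, the criterion speaks only to instrumental subgoals: S may still end up correlated with the agent's behaviour even though no control incentive is admitted.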

Stuart Armstrong (8mo): Cheers; Rebecca likes the "instrumental control incentive" terminology; she claims it's more in line with control-theory terminology. I think it's more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn't seem to do what was intended.
(A -> B) -> A in Causal DAGs

Glad you liked it.

Another thing you might find useful is Dennett's discussion of what an agent is (see first few chapters of Bacteria to Bach). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.

Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.

(A -> B) -> A in Causal DAGs

There is a paper which I believe is trying to do something similar to what you are attempting here:

Networks of Influence Diagrams: A Formalism for Representing Agents’ Beliefs and Decision-Making Processes, Gal and Pfeffer, Journal of Artificial Intelligence Research 33 (2008) 109-147

Are you aware of it? How do you think their ideas relate to yours?

johnswentworth (2y): Very interesting, thank you for the link! Main difference between what they're doing and what I'm doing: they're using explicit utility & maximization nodes; I'm not. It may be that this doesn't actually matter. The representation I'm using certainly allows for utility maximization - a node downstream of a cloud can just be a maximizer for some utility on the nodes of the cloud-model. The converse question is less obvious: can any node downstream of a cloud be represented by a utility maximizer (with a very artificial "utility")? I'll probably play around with that a bit; if it works, I'd be able to re-use the equivalence results in that paper. If it doesn't work, then that would demonstrate a clear qualitative difference between "goal-directed" behavior and arbitrary behavior in these sorts of systems, which would in turn be useful for alignment - it would show a broad class of problems where utility functions do constrain.
Wireheading is in the eye of the beholder

Is this analogous to the stance-dependency of agents and intelligence?

Stuart Armstrong (2y): It is analogous, to some extent; I do look into some aspects of Daniel Dennett's classification here. I also had a more focused attempt at defining AI wireheading here. I think you've already seen that?
Defining AI wireheading

Thanks Stuart, nice post.

I've moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:

The top-level category is reward hacking / reward corruption, which means that the agent's observed reward differs from true reward/task performance.

Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.

Tampering can subsequently be divi... (read more)
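The taxonomy above (reward hacking split into gaming vs. tampering by whether the reward-computing process was exploited or modified) can be made concrete with a toy sketch. All names here are hypothetical and the environment is deliberately minimal; it only illustrates the classification, not any real agent.

```python
# Toy sketch of the taxonomy: "reward hacking" means observed reward
# differs from true task performance, split by whether the agent
# exploited a misspecified reward process (gaming) or modified the
# process itself (tampering). All names are illustrative.

def true_reward(state):
    return 1.0 if state["task_done"] else 0.0

def observed_reward(state):
    # Reads whatever reward function is currently installed.
    return state["reward_fn"](state)

def classify(state, original_fn):
    obs, true = observed_reward(state), true_reward(state)
    if obs == true:
        return "no hacking"
    if state["reward_fn"] is not original_fn:
        return "reward tampering"   # agent modified the process
    return "reward gaming"          # agent exploited misspecification

# Misspecified proxy: also pays out for a spurious feature.
proxy = lambda s: 1.0 if (s["task_done"] or s["spurious"]) else 0.0

# Gaming: trigger the spurious feature without doing the task.
s1 = {"task_done": False, "spurious": True, "reward_fn": proxy}
print(classify(s1, proxy))  # reward gaming

# Tampering: overwrite the reward function to always pay out.
s2 = {"task_done": False, "spurious": False, "reward_fn": lambda s: 1.0}
print(classify(s2, proxy))  # reward tampering
```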

Computational Model: Causal Diagrams with Symmetry

Thanks for a nice post about causal diagrams!

Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG.

Totally agree. This is a big part of the reason why I'm excited about these kinds of diagrams.

This raises the issue of abstraction - the core problem of embedded agency. ... how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory?

Great question, I really think someon... (read more)

"Designing agent incentives to avoid reward tampering", DeepMind

Actually, I would argue that the model is naturalized in the relevant way.

When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.

As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish between effects that we'd like the agent to use from effects that we don't want the agent to use.

T... (read more)

"Designing agent incentives to avoid reward tampering", DeepMind

We didn't expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.

Wei Dai (2y): Ah, that makes sense. I kind of guessed that the target audience is RL researchers, but still misinterpreted "perhaps surprisingly" as a claim of novelty instead of an attempt to raise the interest of the target audience.
"Designing agent incentives to avoid reward tampering", DeepMind

Yes, that is partly what we are trying to do here. By summarizing some of the "folklore" in the community, we'll hopefully be able to get new members up to speed quicker.

"Designing agent incentives to avoid reward tampering", DeepMind

Hey Steve,

Thanks for linking to Abram's excellent blog post.

We should have pointed this out in the paper, but there is a simple correspondence between Abram's terminology and ours:

Easy wireheading problem = reward function tampering

Hard wireheading problem = feedback tampering.

Our current-RF optimization corresponds to Abram's observation-utility agent.

We also discuss the RF-input tampering problem and solutions (sometimes called the delusion box problem), which doesn't fit into Abram's distinction.
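The current-RF / observation-utility idea in the correspondence above can be illustrated with a tiny sketch: an agent that scores future states with *today's* reward function has no reason to tamper, while one that maximizes future *observed* reward does. Everything here is a hypothetical toy, not the paper's formal model.

```python
# Hedged sketch (illustrative names): current-RF optimization (akin to
# Abram's observation-utility agent) vs. optimizing future observed
# reward, which the agent can inflate by tampering with the RF.

def rollout(action):
    """Toy one-step world: returns (next_state, next_reward_fn)."""
    honest_rf = lambda s: 1.0 if s == "task_done" else 0.0
    if action == "work":
        return "task_done", honest_rf
    if action == "tamper":
        return "idle", (lambda s: 100.0)   # install a rigged RF
    return "idle", honest_rf

current_rf = lambda s: 1.0 if s == "task_done" else 0.0

def observed_return(action):
    state, rf = rollout(action)
    return rf(state)              # reward as computed after acting

def current_rf_return(action):
    state, _ = rollout(action)
    return current_rf(state)      # evaluate with today's RF

print(max(["work", "tamper"], key=observed_return))    # 'tamper'
print(max(["work", "tamper"], key=current_rf_return))  # 'work'
```

The design point is that the evaluation function is held fixed at decision time, so installing a rigged RF gains the current-RF agent nothing.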

"Designing agent incentives to avoid reward tampering", DeepMind

Hey Charlie,

Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone's liking, let me just give a little intro / context for it here.

The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.

As a firs... (read more)

Charlie Steiner (2y): Sure. On the one hand, xkcd. On the other hand, if it works for you, that's great and absolutely useful progress. I'm a little worried about direct applicability to RL because the model is still not fully naturalized - actions that affect goals are neatly labeled and separated rather than being a messy subset of actions that affect the world. I guess this is another one of those cases where I think the "right" answer is "sophisticated common sense," but an ad-hoc mostly-answer would still be useful conceptual progress.
Modeling AGI Safety Frameworks with Causal Influence Diagrams
I really like this layout, this idea, and the diagrams. Great work.

Glad to hear it :)

I don't agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like "how is the automated system not vulnerable to manipulation" and "why do we think the system correctly formally measures the quantity in question?" (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don't see how to break (and probably no
... (read more)
Modeling AGI Safety Frameworks with Causal Influence Diagrams

Hey Charlie,

Thanks for your comment! Some replies:

sometimes one makes different choices in how to chop an AI's operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDI are very different choppings-up of the algorithms)

There is definitely a modeling choice involved in choosing how much "to pack" in each node. Indeed, most of the diagrams have been through a few iterations of splitting and combining nodes. The aim has been to focus on th... (read more)

Charlie Steiner (2y): All good points. The paper you linked was interesting - the graphical model is part of an AI design that actually models other agents using that graph. That might be useful if you're coding a simple game-playing agent, but I think you'd agree that you're using CIDs in a more communicative / metaphorical way?
Risks from Learned Optimization: Introduction

Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).

Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there's a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don't have any concrete ideas at the moment -- I can be in touch if I think of something suitable for collaboration!

Risks from Learned Optimization: Introduction
What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.

For example, does it ma... (read more)

Vladimir Mikulik (2y): I've been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book? P.S. I am quite confused about DQN's status here and don't wish to suggest that I'm confident it's an optimiser. Just to point out that it's plausible we might want to call it one without calling PPO an optimiser. P.P.S. I forgot to mention in my previous comment that I enjoyed the objective-graph stuff. I think there might be fruitful overlap between that work and the idea we've sketched out in our third post on a general way of understanding pseudo-alignment. Our objective-graph framework is less developed than yours, so perhaps your machinery could be applied there to get a more precise analysis?
Risks from Learned Optimization: Introduction

Thanks for the interesting post! I find the possibility of a gap between the base optimization objective and the mesa/behavioral objective convincing, and well worth exploring.

However, I'm less convinced that the distinction between the mesa-objective and the behavioral objective is real/important. You write:

Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).

... (read more)

Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is us

... (read more)
Smoking Lesion Steelman II

Nice writeup. Is one-boxing in Newcomb an equilibrium?

Delegative Inverse Reinforcement Learning

My confusion is the following:

Premises (*) and inferences (=>):

  • The primary way for the agent to avoid traps is to delegate to a soft-maximiser.

  • A soft-maximiser will take any action with boundedly negative utility with positive probability.

  • Actions leading to traps do not have infinitely negative utility.

=> The agent will fall into traps with positive probability.

  • If the agent falls into a trap with positive probability, then it will have linear regret.

=> The agent will have linear regret.

So when you say in the beginning of the post

... (read more)
Vanessa Kosoy (4y): Your confusion is because you are thinking about regret in an anytime setting. In an anytime setting, there is a fixed policy π; we measure the expected reward of π over a time interval t and compare it to the optimal expected reward over the same interval. If π has probability p > 0 to walk into a trap, regret has the linear lower bound Ω(pt). On the other hand, I am talking about policies π_t that explicitly depend on the parameter t (I call this a "metapolicy"). Both the advisor and the agent policies are like that. As t goes to ∞, the probability p(t) to walk into a trap goes to 0, so p(t)·t is a sublinear function. A second difference from the usual definition of regret is that I use an infinite sum of rewards with geometric time discount e^(−1/t) instead of a step-function time discount that cuts off at t. However, this second difference is entirely inessential, and all the theorems work about the same with step-function time discount.
Delegative Inverse Reinforcement Learning

So this requires the agent's prior to incorporate information about which states are potentially risky?

Because if there is always some probability of there being a risky action (with infinitely negative value), then regardless how small the probability is and how large the penalty is for asking, the agent will always be better off asking.

(Did you see Owain Evans's recent paper about trying to teach the agent to detect risky states?)

Vanessa Kosoy (4y): The only assumptions about the prior are that it is supported on a countable set of hypotheses, and that in each hypothesis the advisor is β-rational (for some fixed β(t) = ω(t^(2/3))). There is no such thing as infinitely negative value in this framework. The utility function is bounded because of the geometric time discount (and because the momentary rewards are assumed to be bounded), and in fact I normalize it to lie in [0,1] (see the equation defining U in the beginning of the Results section). Falling into a trap is an event associated with Ω(1) loss (i.e. loss that remains constant as t goes to ∞). Therefore, we can risk such an event, as long as the probability is o(1) (i.e. goes to 0 as t goes to ∞). This means that as t grows, the agent will spend more rounds delegating to the advisor, but for any given t, it won't delegate on most rounds (even on most of the important rounds, i.e. during the first O(t)-length "horizon"). In fact, you can see in the proof of Lemma A that the policy I construct delegates on O(t^(2/3)) rounds. As a simple example, consider again the toy environment from before. Consider also the environments you get from it by applying a permutation to the set of actions A. Thus, you get a hypothesis class of 6 environments. Then, the corresponding DIRL agent will spend O(t^(2/3)) rounds delegating, observe which action is chosen by the advisor most frequently, and perform this action forevermore. (The phenomenon that all delegations happen in the beginning is specific to this toy example, because it only has 1 non-trap state.) If you mean this paper, I saw it.
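The "delegate for ~t^(2/3) rounds, then commit to the advisor's most frequent action" behaviour described in the toy example can be simulated directly. This is a hedged editorial sketch: the advisor's 0.7/0.3 action mix and the reward bookkeeping are invented for illustration, not taken from the paper.

```python
import random
from collections import Counter

# Toy simulation (all parameters illustrative): the agent delegates for
# ~t**(2/3) rounds, watches which action the advisor picks most often,
# then performs that action for the rest of the horizon.

def advisor():
    # Noisy advisor: mostly the good action 2, sometimes the neutral
    # action 1, never the trap action 0 in this sketch.
    return 2 if random.random() < 0.7 else 1

def dirl_agent(t, seed=0):
    random.seed(seed)
    n_delegate = int(t ** (2 / 3))
    counts = Counter(advisor() for _ in range(n_delegate))
    best = counts.most_common(1)[0][0]
    # Reward 1 per round on action 2, 0 otherwise (as in the example).
    reward = counts[2] + (t - n_delegate) * (1 if best == 2 else 0)
    return best, reward / t

best, avg = dirl_agent(t=10**6)
print(best, round(avg, 3))  # learns action 2; average reward near 1
```

Note that the delegation rounds cost only an o(1) fraction of the horizon, matching the "expected utility 1 − o(1)" claim in the reply below.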
Delegative Inverse Reinforcement Learning

Hi Vanessa!

So basically the advisor will be increasingly careful as the cost of falling into the trap goes to infinity? Makes sense I guess.

What is the incentive for the agent not to always let the advisor choose? Is there always some probability that the advisor saves them from infinite loss, or only in certain situations that can be detected by the agent?

Vanessa Kosoy (4y): If the agent always delegates to the advisor, it loses a large fraction of the value. Returning again to the simple example above, the advisor on its own is only guaranteed to get expected utility 1/2 + ω(t^(−1/3)) (because it often takes the suboptimal action 1). On the other hand, for any prior over a countable set of environments that includes this one, the corresponding DIRL agent gets expected utility 1 − o(1) on this environment (because it will learn to only take action 2). You can also add an external penalty for each delegation; adjusting the proof is straightforward. So, the agent has to exercise judgement about whether to delegate, using its prior + past observations. For example, the policy I construct in Lemma A delegates iff there is no action whose expected loss (according to current beliefs) is less than β(t)^(−1) · t^(−1/3).
CIRL Wireheading

Adversarial examples for neural networks make situations where the agent misinterprets the human action seem plausible.

But it is true that the situation where the human acts irrationally in some state (e.g. because of drugs, propaganda) could be modeled in much the same way.

I preferred the sensory error since it doesn't require an irrational human. Perhaps I should have been clearer that I'm interested in the agent wireheading itself (in some sense) rather than wireheading of the human.

(Sorry for being slow to reply -- I didn't get notified about the comments.)

CIRL Wireheading

That is a good question. I don't think it is essential that the agent can move from to , only that the agent is able to force a stay in if it wants to.

The transition from to could instead happen randomly with some probability.

The important thing is that the human's action in does not reveal any information about .

Delegative Inverse Reinforcement Learning

"The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps)."

This sounds quite nice. But how is it possible to achieve this if the advisor is a soft-maximiser? Doesn't that mean that there is a positive probability that the advisor falls into the trap?

Vanessa Kosoy (4y): Hi Tom! There is a positive probability that the advisor falls into the trap, but this probability goes to 0 as the time discount parameter t goes to ∞ (which is the limit I study here). This follows from the condition β(t) = ω(t^(2/3)) in the Theorem. To give a simple example, suppose that A = {0, 1, 2} and the environment is such that: * When you take action 0, you fall into a trap and get reward 0 forever. * When you take action 1, you get reward 0 for the current round and remain in the same state. * When you take action 2, you get reward 1 for the current round (unless you are in the trap) and remain in the same state. In this case, our advisor would have to take action 0 with probability exp(−ω(t^(2/3))), and action 2 has to be more probable than action 1 by a factor of exp(ω(t^(−1/3))) ≈ 1 + ω(t^(−1/3)).
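The softmax arithmetic behind this example can be checked numerically. This is an editorial sketch under stated assumptions: a β-rational advisor is modelled as picking actions with probability proportional to exp(β(t)·Q(a)), with β(t) = t^(2/3) and toy Q-values in which the trap action loses everything and action 1 loses only one round's worth of reward (~1/t).

```python
import math

# Toy beta-rational advisor for the A = {0, 1, 2} example above
# (constants illustrative): P(a) proportional to exp(beta(t) * Q(a)).

def advisor_policy(t):
    beta = t ** (2 / 3)                        # beta(t) = omega(t**(2/3))
    q = {0: 0.0, 1: 1.0 - 1.0 / t, 2: 1.0}    # toy Q-values
    # Subtract the max Q for numerical stability before exponentiating.
    weights = {a: math.exp(beta * (q[a] - 1.0)) for a in q}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

for t in [10, 1000]:
    p = advisor_policy(t)
    print(t, p[0], p[2] / p[1])

# The trap action 0 becomes exponentially unlikely as t grows
# (~exp(-beta)), while action 2 stays more likely than action 1 by a
# factor exp(beta/t) ~ 1 + t**(-1/3), matching the reply above.
```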