Research Scientist at DeepMind
Thanks Ilya for those links. The second one in particular looks quite relevant to something we’ve been working on in a rather different context (that's the benefit of speaking the same language!)
We would also be curious to see a draft of the MDP-generalization once you have something ready to share!
I think the existing approach and easy improvements don't seem like they can capture many important incentives such that you don't want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A's predictions about B's actions---then we want to say that the system has an incentive to manipulate the world but it doesn't seem like that is easy to incorporate into this kind of formalism).
This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs). We're still working on these as there are a range of subtleties, but I'm pretty confident we'll have a good account of it.
Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there.
(Sorry btw for slow reply; I keep missing alignmentforum notifications.)
Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)
We agree that the lack of a control incentive on X does not mean that X is safe from the agent's influence, since the agent may influence X as a side effect of pursuing its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.
What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive on X. For this reason, we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from "control as a side effect".
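To make the contrast with side effects concrete, here is a minimal sketch. It assumes a single-decision CID and takes the graphical criterion to be "there is a directed path from the decision through X to a utility node"; that criterion, the helper names, and the toy graph are illustrative assumptions, not code from the paper.

```python
from collections import defaultdict


def reachable(edges, start):
    """Return every node reachable from `start` along directed edges."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def admits_ici(edges, decision, utility, x):
    """Assumed criterion: a directed path decision -> ... -> x -> ... -> utility."""
    return x in reachable(edges, decision) and utility in reachable(edges, x)


# Hypothetical toy CID: D is the decision, U the utility node.
# X lies on a path from D to U. S is influenced by D and shares a
# common cause Z with U, so S and U are dependent, but S has no path to U.
EDGES = [("D", "X"), ("X", "U"), ("D", "S"), ("Z", "S"), ("Z", "U")]

print(admits_ici(EDGES, "D", "U", "X"))  # X can be a subgoal
print(admits_ici(EDGES, "D", "U", "S"))  # S is at most a side effect
```

In this toy graph the agent still influences S, which is the point of the critique: the check only says whether controlling a node can serve as a means to utility.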
Glad you liked it.
Another thing you might find useful is Dennett's discussion of what an agent is (see the first few chapters of From Bacteria to Bach and Back). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.
Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.
There is a paper which I believe is trying to do something similar to what you are attempting here:
Networks of Influence Diagrams: A Formalism for Representing Agents’ Beliefs and Decision-Making Processes, Gal and Pfeffer, Journal of Artificial Intelligence Research 33 (2008) 109-147
Are you aware of it? How do you think their ideas relate to yours?
Is this analogous to the stance-dependency of agents and intelligence?
Thanks Stuart, nice post.
I've moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:
The top-level category is reward hacking / reward corruption, which means that the agent's observed reward differs from true reward/task performance.
Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.
Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.
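The taxonomy above can be written down as a small data structure. This is only an illustrative sketch: the enum names and the `classify` helper are made up here, and it assumes we already know that observed reward differs from true reward.

```python
from enum import Enum
from typing import Optional


class TamperTarget(Enum):
    """What the agent modified, if it modified the reward process."""
    REWARD_FUNCTION = "reward function tampering"
    OBSERVATIONS = "observation tampering"
    USER_PREFERENCES = "user preference tampering"


def classify(modified_process: bool, target: Optional[TamperTarget] = None) -> str:
    """Label a case of reward hacking (observed reward != true reward).

    modified_process: True if the agent changed the process that computes
    rewards, False if it only exploited a misspecification in it.
    """
    if not modified_process:
        return "reward hacking / reward gaming"
    if target is None:
        return "reward hacking / reward tampering"
    return f"reward hacking / reward tampering / {target.value}"


print(classify(False))
print(classify(True, TamperTarget.OBSERVATIONS))
```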
One advantage of this terminology is that it makes it clearer what we're talking about. For example, it's pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.
That said, I think your post nicely puts the finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like "wireheading = tampering with goal measurement".
Thanks for a nice post about causal diagrams!
Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG.
Totally agree. This is a big part of the reason why I'm excited about these kinds of diagrams.
This raises the issue of abstraction - the core problem of embedded agency. ... how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory?
Great question, I really think someone should look more carefully into this. A few potentially related papers:
In general, though, how to learn causal DAGs with symmetry is still an open question. We’d like something like Solomonoff Induction, but which can account for partial information about the internal structure of the causal DAG, rather than just overall input-output behavior.
Again, agreed. It would be great if we could find a way to make progress on this question.
Actually, I would argue that the model is naturalized in the relevant way.
When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.
As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish effects that we'd like the agent to use from effects that we don't want the agent to use.
The current-RF solution doesn't rely on this distinction; it only relies on query-access to the reward function (which you could easily give an embedded RL agent).
The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass the reward function have been removed.
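A toy sketch of that last point, under heavy simplifying assumptions: a deterministic environment, brute-force planning over a short horizon, and a state that is just (task progress, reward function). All names here (`honest_rf`, `hacked_rf`, `plan_value`, and so on) are hypothetical and chosen for illustration; this is not the formal definition from the paper.

```python
from itertools import product

ACTIONS = ("work", "tamper")


def honest_rf(progress):
    """Reward function as designed: reward equals task progress."""
    return progress


def hacked_rf(progress):
    """Reward function after tampering: always pays out 10."""
    return 10


def step(state, action):
    """Environment transition. The reward function is part of the state."""
    progress, rf = state
    if action == "work":
        return (progress + 1, rf)
    return (progress, hacked_rf)  # "tamper" overwrites the reward function


def plan_value(plan, state, use_current_rf):
    """Sum of rewards along a plan.

    use_current_rf=False: each future state is scored by the reward
    function found in that future state (tampering pays off).
    use_current_rf=True: every future state is scored by the reward
    function queried now, so the action -> RF -> reward path is cut.
    """
    current_rf = state[1]
    total = 0
    for action in plan:
        state = step(state, action)
        progress, rf_in_state = state
        evaluator = current_rf if use_current_rf else rf_in_state
        total += evaluator(progress)
    return total


def best_plan(state, horizon, use_current_rf):
    """Brute-force search over all action sequences of length `horizon`."""
    plans = product(ACTIONS, repeat=horizon)
    return max(plans, key=lambda p: plan_value(p, state, use_current_rf))


start = (0, honest_rf)
print(best_plan(start, 3, use_current_rf=False))  # begins by tampering
print(best_plan(start, 3, use_current_rf=True))   # just does the task
```

Note that the current-RF planner only needs to call `current_rf`, i.e. query-access; the labels "reward function" and "proper state" are used here only to explain why tampering stops being attractive.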