Sure, I think we're saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don't actually talk about the same random variable).
How big a problem is it? In practice it seems usually fine, if we're careful to test our sensor / double-check we're using language in the same way. In theory, scaled up to superintelligence, it's not impossible that it would be a problem.
But I would also like to emphasize that the problem yo...
The way I think about this, is that the variables constitute a reference frame. They define particular well-defined measurements that can be done, which all observers would agree about. In order to talk about interventions, there must also be a well-defined "set" operation associated with each variable, so that the effect of interventions is well-defined.
Once we have the variables, and a "set" and "get" operation for each (i.e. intervene and observe operations), then causality is an objective property of the universe. Regardless of who does the experiment (i....
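To make the "get"/"set" picture concrete, here is a minimal sketch (my own toy example, with made-up variable names) where sampling plays the role of the "get" operation and the do-style override plays the role of "set":

```python
import random

def sample_world(do_sprinkler=None):
    """One draw from a toy two-variable model: rain and sprinkler both cause
    wet grass. Passing `do_sprinkler` implements the "set" (intervention) operation."""
    rain = random.random() < 0.3
    # An intervention replaces the variable's usual mechanism with a fixed value.
    sprinkler = (random.random() < 0.5) if do_sprinkler is None else do_sprinkler
    wet = rain or sprinkler
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

# "get": any observer running this measurement procedure sees the same distribution.
observational = [sample_world() for _ in range(10_000)]
# "set": the effect of intervening on the sprinkler is likewise observer-independent.
interventional = [sample_world(do_sprinkler=True) for _ in range(10_000)]

print(sum(w["wet"] for w in observational) / len(observational))    # ~0.65
print(sum(w["wet"] for w in interventional) / len(interventional))  # ~1.0
```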
The idea ... works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes. ... But to apply this to a physical system, we would need a way to obtain such a partition of those variables
Agree, the formalism relies on a division of variables. One thing that I think we should perhaps have highlighted much more is Appendix B in the paper, which shows how you get a natural partition of the variables from just knowing the object-level variables of a repeated game.
Does a spinal reflex count as a policy?
A spinal reflex would be...
This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause can lead to x-risk under the right (i.e. wrong) societal conditions.
For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions?
Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won't be systematically deceiving humans to pursue some particular agenda of the agent.
As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.
The way I see it, the primary value of this work (as well as other CID work) is conceptual clarification. Causality is a really fundamental concept, which many other AI-safety relevant concepts build on (influence, response, incentives, agency, ...). The primary aim is to clarify the relationships between concepts and to derive relevant implications. Whether there are practical causal inference algorithms or not is almost irrelevant.
TLDR: Causality > Causal inference :)
Sure, humans are sometimes inconsistent, and we don't always know what we want (thanks for the references, that's useful!). But I suspect we're mainly inconsistent in borderline cases, which aren't catastrophic to get wrong. I'm pretty sure humans would reliably state that they don't want to be killed, or that lots of other people die, etc. And that when they have a specific task in mind, they state that they want the task done rather than not. All of this is subject to them actually understanding the main considerations for whatever plan or outcome is in question, but that is exactly what debate and RRM are for.
alignment of strong optimizers simply cannot be done without grounding out in something fundamentally different from a feedback signal.
I don't think this is obvious at all. Essentially, we have to make sure that humans give feedback that matches their preferences, and that the agent isn't changing the human's preferences to be more easily optimized.
We have the following tools at our disposal:
Minor rant about this in particular:
Essentially, we have to make sure that humans give feedback that matches their preferences...
Humans' stated preferences do not match their preferences-in-hindsight, neither of those matches humans' self-reported happiness/satisfaction in-the-moment, none of that matches humans' revealed preferences, and all of those are time-inconsistent. IIRC the first section of Kahneman's textbook Well-Being: The Foundations of Hedonic Psychology is devoted entirely to the problem of getting feedback from humans on what they actually...
Nice post! The Game Theory / Bureaucracy idea is interesting. It reminds me of Drexler's CAIS proposal, where services are combined into an intelligent whole. But I (and Drexler, I believe) agree that much more work could be spent on figuring out how to actually design/combine these systems.
Thanks Ilya for those links, in particular the second one looks quite relevant to something we’ve been working on in a rather different context (that's the benefit of speaking the same language!)
We would also be curious to see a draft of the MDP-generalization once you have something ready to share!
- I think the existing approach and easy improvements don't seem like they can capture many important incentives such that you don't want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A's predictions about B's actions---then we want to say that the system has an incentive to manipulate the world but it doesn't seem like that is easy to incorporate into this kind of formalism).
This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs). We're still ...
Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there.
(Sorry btw for slow reply; I keep missing alignmentforum notifications.)
Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)
We agree that lack of control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.
What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive...
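For concreteness, here is a rough sketch (toy code of my own, not from the paper; see the paper for the exact graphical criterion) of roughly the kind of check this corresponds to: a control incentive on X requires a directed path from the decision to a utility node that passes through X.

```python
import networkx as nx

# A toy single-decision CID (hypothetical structure, just for illustration):
# decision D, variable X, utility node U.
cid = nx.DiGraph([("D", "X"), ("X", "U"), ("D", "U")])

def admits_control_incentive(graph, decision, variable, utility):
    """Rough graphical check: is `variable` on some directed path
    from `decision` to `utility`?"""
    return any(
        variable in path
        for path in nx.all_simple_paths(graph, decision, utility)
    )

print(admits_control_incentive(cid, "D", "X", "U"))  # True: D -> X -> U
```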
Glad you liked it.
Another thing you might find useful is Dennett's discussion of what an agent is (see first few chapters of Bacteria to Bach). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.
Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.
There is a paper which I believe is trying to do something similar to what you are attempting here:
Are you aware of it? How do you think their ideas relate to yours?
Thanks Stuart, nice post.
I've moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:
The top-level category is reward hacking / reward corruption, which means that the agent's observed reward differs from true reward/task performance.
Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.
Tampering can subsequently be divi...
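As a toy illustration of the two subtypes (entirely hypothetical functions, just for intuition):

```python
def true_performance(state):
    """What we actually want: the bucket gets filled."""
    return state["bucket_filled"]

def reward_process(state):
    """A misspecified proxy: rewards the water-meter reading instead."""
    return state["water_meter"]

# Reward gaming: the agent exploits the misspecification without touching the
# reward process, e.g. running water past the bucket to spin the meter.
gamed_state = {"bucket_filled": 0.0, "water_meter": 10.0}
assert reward_process(gamed_state) > true_performance(gamed_state)

# Reward tampering: the agent modifies the process that computes the reward,
# e.g. overwriting it so that every state yields maximal reward.
reward_process = lambda state: float("inf")
```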
Thanks for a nice post about causal diagrams!
Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG.
Totally agree. This is a big part of the reason why I'm excited about these kinds of diagrams.
This raises the issue of abstraction - the core problem of embedded agency. ... how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory?
Great question, I really think someon...
Actually, I would argue that the model is naturalized in the relevant way.
When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.
As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish the effects that we'd like the agent to use from the effects that we don't want the agent to use.
T...
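A toy version of this setup (all names made up, just to illustrate the labelling) might look like:

```python
class ToyEnv:
    """The "reward function" is just another part of the environment state
    that some actions happen to modify."""

    def __init__(self):
        self.proper_state = 0                 # the part we want the agent to influence
        self.reward_params = {"target": 5}    # the part we label "reward function"

    def step(self, action):
        if action == "work":
            self.proper_state += 1            # the effect we want the agent to use
        elif action == "tamper":
            # From the agent's perspective this is just another environment effect.
            self.reward_params["target"] = self.proper_state
        return -abs(self.proper_state - self.reward_params["target"])

env = ToyEnv()
print(env.step("tamper"))  # maximal reward (0) without doing any of the task
```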
We didn't expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.
Yes, that is partly what we are trying to do here. By summarizing some of the "folklore" in the community, we'll hopefully be able to get new members up to speed quicker.
Hey Steve,
Thanks for linking to Abram's excellent blog post.
We should have pointed this out in the paper, but there is a simple correspondence between Abram's terminology and ours:
Easy wireheading problem = reward function tampering
Hard wireheading problem = feedback tampering.
Our current-RF optimization corresponds to Abram's observation-utility agent.
We also discuss the RF-input tampering problem and solutions (sometimes called the delusion box problem), which doesn't fit neatly into Abram's distinction.
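To make the correspondence concrete, here is a toy sketch (my own illustration, not the paper's formalism) of the difference between evaluating a plan with the reward function that will exist in the future versus the one the agent holds now:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class State:
    x: int                    # task progress
    reward_function: Callable # the RF stored in the environment

def honest_rf(s):
    return float(s.x)         # rewards doing the task

def hacked_rf(s):
    return 1e9                # the RF after tampering

def plan_tamper(s):
    return State(x=s.x, reward_function=hacked_rf)              # rewrite the RF

def plan_work(s):
    return State(x=s.x + 1, reward_function=s.reward_function)  # do the task

start = State(x=0, reward_function=honest_rf)

# Future-RF evaluation: tampering looks like the best plan.
print(plan_tamper(start).reward_function(plan_tamper(start)))   # 1e9
# Current-RF evaluation (Abram's observation-utility agent): tampering is worthless.
print(start.reward_function(plan_tamper(start)))                # 0.0
print(start.reward_function(plan_work(start)))                  # 1.0
```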
Hey Charlie,
Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone's liking, let me just give a little intro / context for it here.
The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.
As a firs...
I really like this layout, this idea, and the diagrams. Great work.
Glad to hear it :)
I don't agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like "how is the automated system not vulnerable to manipulation" and "why do we think the system correctly formally measures the quantity in question?" (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don't see how to break (and probably no...
Hey Charlie,
Thanks for your comment! Some replies:
sometimes one makes different choices in how to chop an AI's operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDI are very different choppings-up of the algorithms)
There is definitely a modeling choice involved in choosing how much "to pack" in each node. Indeed, most of the diagrams have been through a few iterations of splitting and combining nodes. The aim has been to focus on th...
Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).
Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there's a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don't have any concrete ideas at the moment -- I can be in touch if I think of something suitable for collaboration!
What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?
Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.
For example, does it ma...
Thanks for the interesting post! I find the possibility of a gap between the base optimization objective and the mesa/behavioral objective convincing, and well worth exploring.
However, I'm less convinced that the distinction between the mesa-objective and the behavioral objective is real/important. You write:
...Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).[
Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.
What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?
Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is us
...My confusion is the following:
Premises (*) and inferences (=>):
* The primary way for the agent to avoid traps is to delegate to a soft-maximiser.
* A soft-maximiser takes any action with boundedly negative utility with positive probability (see the sketch below).
* Actions leading to traps do not have infinitely negative utility.
=> The agent will fall into traps with positive probability.
=> The agent will have linear regret.
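To spell out why I think premise 2 holds (my notation, not yours): for a softmax policy with inverse temperature $\beta$ over actions whose values satisfy $|Q(a)| \le B$,

$$\pi(a) \;=\; \frac{e^{\beta Q(a)}}{\sum_{a'} e^{\beta Q(a')}} \;\ge\; \frac{e^{-\beta B}}{|A|\,e^{\beta B}} \;=\; \frac{e^{-2\beta B}}{|A|} \;>\; 0,$$

so every action, including a trap-entering one, is taken with probability bounded away from zero; and once the agent is trapped, its per-step regret is bounded below, which is where the linear regret would come from.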
So when you say in the beginning of the post
...So this requires the agent's prior to incorporate information about which states are potentially risky?
Because if there is always some probability of there being a risky action (with infinitely negative value), then regardless how small the probability is and how large the penalty is for asking, the agent will always be better off asking.
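To spell out the expected-value comparison I have in mind (informally, with $c$ the penalty for asking, $p > 0$ the probability that acting alone hits the infinitely bad trap, and assuming asking avoids the trap):

$$\mathbb{E}[U \mid \text{act alone}] \;\le\; (1-p)\,U_{\max} + p\cdot(-\infty) \;=\; -\infty \;<\; -c \;\le\; \mathbb{E}[U \mid \text{ask}],$$

so no finite penalty and no small $p$ can make acting alone preferable.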
(Did you see Owain Evans' recent paper about trying to teach the agent to detect risky states?)
Hi Vanessa!
So basically the advisor will be increasingly careful as the cost of falling into the trap goes to infinity? Makes sense I guess.
What is the incentive for the agent not to always let the advisor choose? Is there always some probability that the advisor saves them from infinite loss, or only in certain situations that can be detected by the agent?
Adversarial examples for neural networks make situations where the agent misinterprets the human action seem plausible.
But it is true that the situation where the human acts irrationally in some state (e.g. because of drugs, propaganda) could be modeled in much the same way.
I preferred the sensory error since it doesn't require an irrational human. Perhaps I should have been clearer that I'm interested in the agent wireheading itself (in some sense) rather than wireheading of the human.
(Sorry for being slow to reply -- I didn't get notified about the comments.)
That is a good question. I don't think it is essential that the agent can move from to , only that the agent is able to force a stay in if it wants to.
The transition from to could instead happen randomly with some probability.
The important thing is that the human's action in does not reveal any information about .
"The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps)."
This sounds quite nice. But how is it possible to achieve this if the advisor is a soft-maximiser? Doesn't that mean that there is a positive probability that the advisor falls into the trap?
I really like this articulation of the problem!
To me, a way to point to something similar is to say that preservation (and enhancement) of human agency is important (value change being one important way that human agency can be reduced). https://www.alignmentforum.org/s/pcdHisDEGLbxrbSHD/p/Qi77Tu3ehdacAbBBe
One thing I've been trying to argue for is that we might try to pivot agent foundations research to focus more on human agency instead of artificial agency. For example, I think value change is an example of self-modification, which has been studied a fair bit for artificial agents.