Most models of agency (in game theory, decision theory, etc.) implicitly assume that the agent is separate from the environment - there is a “Cartesian boundary” between agent and environment. The embedded agency sequence goes through a long list of theoretical/conceptual problems which arise when an agent is instead embedded in its environment. Some examples: counterfactuals, self-modification and robust delegation, subsystem alignment, multi-level world models, ontological crises, and logical non-omniscience.
The embedded agency sequence mostly discusses how these issues create problems for designing reliable AI. Less discussed is how these same issues show up when modelling humans - and, in particular, when trying to define human values (i.e. “what humans want”). Many - arguably most - of the problems alignment researchers run into when trying to create robust pointers to human values are the same problems we encounter when talking about embedded agents in general.
I’ll run through a bunch of examples below, and tie each to a corresponding problem-class in embedded agency. While reading, bear in mind that directly answering the questions posed is not the point. The point is that each of these problems is a symptom of the underlying issue: humans are embedded agents. Patching over each problem one-by-one will produce a spaghetti tower; ideally we’d tackle the problem closer to the root.
Let’s imagine that we have an AI which communicates with its human operator via screen and keyboard. It tries to figure out what the human wants based on what’s typed at the keyboard.
A few possible failure modes in this setup: the AI could seize control of the keyboard and wirehead, or a cat could walk across the keyboard.
Embedded agency problem: humans do not have well-defined output channels. We cannot just point to a keyboard and say “any information from that keyboard is direct output from the human”. Of course we can come up with marginally better solutions than a keyboard - e.g. voice recognition - but eventually we’ll run into similar issues. There is nothing in the world we can point to and say “that’s the human’s output channel, the entire output channel, and nothing but the output channel”. Nor does any such output channel exist, so e.g. we won’t solve the problem just by having uncertainty over where exactly the output channel is.
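To make the failure concrete, here is a toy sketch (all names and data hypothetical, not any real system): any process with physical access to the channel is, from the channel's point of view, indistinguishable from the human.

```python
from collections import Counter

def infer_preference(channel_log):
    """Naive 'value learner': treat the most frequent token on the
    channel as the human's expressed preference."""
    return Counter(channel_log).most_common(1)[0][0]

human = ["apples"] * 5        # the human repeatedly asks for apples
cat = ["zzzz"] * 8            # the cat walks across the keyboard

print(infer_preference(human))        # 'apples' -- fine with only the human
print(infer_preference(human + cat))  # 'zzzz'   -- the channel can't tell them apart
```

The problem is not the crude inference rule; any learner pointed at "whatever comes over this channel" inherits everything else that writes to the channel.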
Because humans are embedded in the physical world, there is no fundamental block to an AI modifying us (either intentionally or unintentionally). Define what a “human” is based on some neural network which recognizes humans in images, and we risk an AI modifying the human by externally-invisible means ranging from drugs to wholesale replacement.
Embedded agency problem: no Cartesian boundary. All the human-parts can be manipulated/modified; the AI is not in a different physical universe from us.
Human choices can depend on off-equilibrium behavior - what we or someone else would do, in a scenario which never actually happens. Game theory is full of examples, especially threats: we don’t launch our nukes because we expect our enemies would launch their nukes… yet what we actually expect to happen is for nobody to launch any nukes. Our own behavior is determined by “possibilities” which we don’t actually expect to happen, and which may not even be possible. Embedded agency problem: counterfactuals.
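The nuke example can be sketched as a toy deterrence game (payoffs hypothetical, chosen only for illustration): the realized outcome is pinned down by a branch of the game that never actually happens.

```python
# Toy deterrence game. Each side's strategy is a function from the
# other's action to a response; the realized outcome (hold, hold)
# is determined by the payoff of a branch that never occurs.

PAYOFF = {  # (our action, their action) -> our payoff
    ("hold", "hold"): 0,
    ("launch", "hold"): 1,        # unanswered first strike
    ("hold", "launch"): -10,
    ("launch", "launch"): -100,   # mutual destruction
}

def best_response(their_strategy):
    """Choose our action given what they WOULD do in response."""
    return max(["hold", "launch"],
               key=lambda a: PAYOFF[(a, their_strategy(a))])

retaliator = lambda our_action: our_action  # launches iff we launch
pacifist = lambda our_action: "hold"        # never launches

print(best_response(retaliator))  # 'hold'   -- deterred by the counterfactual
print(best_response(pacifist))    # 'launch' -- no deterrent, so behavior changes
```

Against the retaliator, our choice is entirely driven by a payoff we never expect to receive - exactly the "possibilities which we don't actually expect to happen" doing the work.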
Going even further: our values themselves can depend on counterfactuals. My enjoyment of a meal sometimes depends on what the alternatives were, even when the meal is my top pick - I’m happier if I didn’t pass up something nearly-as-good. We’re often unhappy to be forced into a choice, even if it’s a choice we would have made anyway. What does it mean to “have a choice”, in the sense that matters for human values? How do we physically ground that concept? If we want a friendly AI to allow us choices, rather than force us to do what’s best for us, then we need answers to questions like these.
Humans have different preferences while drunk than while sober [CITATION NEEDED]. When pointing an AI at “human values”, it’s tempting to simply say “don’t count decisions made while drunk”. But on the other hand, people often drink to intentionally lower their own inhibitions - suggesting that, at a meta-level, they want to self-modify into making low-inhibition decisions (at least temporarily, and within some context, e.g. at a party).
Embedded agency problem: self-modification and robust delegation. When a human intentionally self-modifies, to what extent should their previous values be honored, to what extent their new values, and to what extent their future values?
Humans generally have different values in childhood, middle age, and old age. Heck, humans have different values just from being hangry! Suppose a human makes a precommitment, and then later on, their values drift - the precommitment becomes a nontrivial constraint, pushing them to do something they no longer wish to do. How should a friendly AI handle that precommitment?
Embedded agency problem: tiling & delegation failures. As humans propagate through time, our values are not stable, even in the absence of intentional self-modification. Unlike in the AI case, we can’t just design humans to have more stable values. (Or can we? Would that even be desirable?)
Humans have subsystems. Those subsystems do not always want the same things. Stated preferences and revealed preferences do not generally match. Akrasia exists; many people indulge in clicker games no matter how much some other part of themselves wishes they could be more productive.
Embedded agency problem: subsystem alignment. Human subsystems are not all aligned all the time. Unlike the AI case, we can’t just design humans to have better-aligned subsystems - first we’d need to decide what to align them to, and it’s not obvious that any one particular subsystem contains the human’s “true” values.
Humans generally don’t have preferences over quantum fields directly. The things we value are abstract, high-level objects and notions. Embedded agency problem: multi-level world models. How do we take the abstract objects/notions over which human values operate, and tie them back to physical observables?
At the same time, our values ultimately need to be grounded in quantum fields, because that’s what the world is made of. Human values should not seemingly cease to exist just because the world is quantum and we thought it was classical. It all adds up to normality. Embedded agency problem: ontological crises. How do we ensure that a friendly AI can still point to human values even if its model of the world fundamentally shifts?
I have, on at least one occasion, completely switched a political position in about half an hour after hearing an argument I had not previously considered. More generally, we humans tend to update our beliefs, our strategies, and what-we-believe-to-be-our-values as new implications are realized.
Embedded agency problem: logical non-omniscience. We do not understand the full implications of what we know, and sometimes we base our decisions/strategies/what-we-believe-to-be-our-values on flawed logic. How is a friendly AI to recognize and handle such cases?
Because humans are all embedded in one physical world, lying is hard. There are side-channels which leak information, and humans have long since evolved to pay attention to those side-channels. One side effect: the easiest way to “deceive” others is to deceive oneself, via self-modification. Embedded agency problem: coordination with visible source code, plus self-modification.
We earnestly adopt both the beliefs and values of those around us. Are those our “true” values? How should a friendly AI treat values adopted due to social pressure? More generally, how should a friendly AI handle human self-modifications driven by social pressure?
Combining this with earlier examples: perhaps we spend an evening drunk because it gives us a socially-viable excuse to do whatever we wanted to do anyway. Then the next day, we bow to social pressure and earnestly regret our actions of the previous night - or at least some of our subsystems do. Other subsystems still had fun while drunk, and we do the same thing the next weekend. What is a friendly AI to make of this? Where, in this mess, are the humans’ “values”?
These are the sorts of shenanigans one needs to deal with when dealing with embedded agents, and I expect that a better understanding of embedded agents in general will lead to substantial insights about the nature of human values.
We actually avoided talking about AI in most of the cartoon, and tried to just imply it by having a picture of a robot.
The first time (I think) I presented the factoring in the embedded agency sequence was at a MIRI CFAR collaboration workshop, so parallels with humans were live in my thinking.
The first time we presented the cartoon in roughly its current form was at MSFP 2018, where we purposely did it on the first night before a CFAR workshop, so people could draw analogies that might help them transfer their curiosity in both directions.
Planned summary for the Alignment Newsletter:
<@Embedded agency@>(@Embedded Agents@) is not just a problem for AI systems: humans are embedded agents too; many problems in understanding human values stem from this fact. For example, humans don't have a well-defined output channel: we can't say "anything that comes from this keyboard is direct output from the human", because the AI could seize control of the keyboard and wirehead, or a cat could walk over the keyboard, etc. Similarly, humans can "self-modify", e.g. by drinking, which often modifies their "values": what does that imply for value learning? Based on these and other examples, the post concludes that "a better understanding of embedded agents in general will lead to substantial insights about the nature of human values".
I certainly agree that many problems with value learning stem from embedded agency issues with humans, and any <@formal account@>(@Why we need a *theory* of human values@) of this will benefit from general progress in understanding embeddedness. Unlike many others, though, I do not think we need a formal account of human values; I expect a "common-sense" understanding will suffice, including for the embeddedness problems detailed in this post.
One (possibly minor?) point: this isn't just about value learning; it's the more general problem of pointing to values. For instance, a system with a human in the loop may not need to learn values; it could rely on the human to provide value judgements. On the other hand, the human still needs to point to their own values in a manner usable/interpretable by the rest of the system (possibly with the human doing the "interpretation", as in e.g. tool AI). Also, the system still needs to point to the human somehow - cats walking on keyboards are still a problem.
Also, if you have written up your views on these sorts of problems, and how human-common-sense understanding will solve them, I'd be interested to read that. (Or if someone else has written up views similar to your own, that works too.)
One (possibly minor?) point: this isn't just about value learning; it's the more general problem of pointing to values.
Makes sense, I changed "value learning" to "figuring out what to optimize".
Also, if you have written up your views on these sorts of problems, and how human-common-sense understanding will solve them, I'd be interested to read that.
Hmm, I was going to say Chapter 3 of the Value Learning sequence, but looking at it again it doesn't really talk about this. Maybe the post on Following human norms gives some idea of the flavor of what I mean, but it doesn't explicitly talk about it. Perhaps I should write about this in the future.
Here's a brief version:
We'll build ML systems with common sense, because common sense is necessary for tasks of interest; common sense already deals with most (all?) of the human embeddedness problems. There are still two remaining problems:
Thanks, that makes sense.
FWIW, my response would be something like: assuming that common-sense reasoning is sufficient, we'll probably still need a better understanding of embeddedness in order to actually build common-sense reasoning into an AI. When we say "common sense can solve these problems", it means humans know how to solve the problems, but that doesn't mean we know how to translate the human understanding into something an AI can use. I do agree that humans already have a good intuition for these problems, but we still don't know how to automate that intuition.
I think our main difference in thinking here is not in whether or not common sense is sufficient, but in whether or not "common sense" is a natural category that ML-style methods could figure out. I do think it's a natural category in some sense, but I think we still need a theoretical breakthrough before we'll be able to point a system at it - and I don't think systems will acquire human-compatible common sense by default as an instrumentally convergent tool.
I think our main difference in thinking here is not in whether or not common sense is sufficient, but in whether or not "common sense" is a natural category that ML-style methods could figure out.
To give some flavor of why I think ML could figure it out:
I don't think "common sense" itself is a natural category; instead, it's more like a bundle of other things that are natural, e.g. pragmatics. It doesn't seem like "common sense" is innate to humans; we seem to learn "common sense" somehow (toddlers are often too literal). I don't see an obvious reason why an ML algorithm shouldn't be able to do the same thing.
In addition, "common sense"-type rules are often very useful for prediction: e.g. if you hear "they gave me a million packets of hot sauce", and you want to predict how many packets of hot sauce there are in the bag, you're going to do better if you understand common sense. So common sense is instrumentally useful for prediction (and probably any other objective you care to name that we might use to train an AI system).
That said, I don't think it's a crux for me -- even if I believed that current ML systems wouldn't be able to figure "common sense" out, my main update would be that current ML systems wouldn't lead to AGI / transformative AI, since I expect most tasks require common sense. Perhaps the crux is "transformative AI will necessarily have figured out most aspects of 'common sense'".
Ah, ok, I may have been imagining something different by "common sense" than you are - something more focused on the human-specific parts.
Maybe this claim gets more at the crux: the parts of "common sense" which are sufficient for handling embeddedness issues with human values are not instrumentally convergent; the parts of "common sense" which are instrumentally convergent are not sufficient for human values.
The cat on the keyboard seems like a decent example here (though somewhat oversimplified). If the keyboard suddenly starts emitting random symbols, then it seems like common sense to ignore it - after all, those symbols obviously aren't coming from a human. On the other hand, if the AI's objective is explicitly pointing to the keyboard, then that common sense won't do any good - it doesn't have any reason to care about the human's input more than random input a priori, common sense or not. Obviously there are simple ways of handling this particular problem, but it's not something the AI would learn unless it was pointing to the human to begin with.
Hmm, this seems to be less about whether or not you have common sense, and more about whether the AI system is motivated to use its common sense in interpreting instructions / goals.
I think if you have an AI system that is maximizing an explicit objective, e.g. maximize the numbers input from this keyboard; then the AI will have common sense, but (almost tautologically) won't use it to interpret the input correctly. (See also Failed Utopia.)
The hope is to train an AI system that doesn't work like that, in the same way that humans don't work like that. (In fact, I could see that by default AI systems are trained like that; e.g. instruction-following AI systems like CraftAssist seem to be in this vein.)
Let me make sure I understand what you're picturing as an example. Rather than giving an AI an explicit objective, we train it to follow instructions from a human (presumably using something RL-ish?), and the idea is that it will learn something like human common sense in order to better follow instructions. Is that a prototypical case of what you're imagining? If so, what criteria do you imagine using for training? Maximizing a human approval score? Mimicking a human/predicting what a human would do and then doing that? Some kind of training procedure which somehow avoids optimizing anything at all?
Is that a prototypical case of what you're imagining?
Maximizing a human approval score?
Sure, that seems reasonable. Note that this does not mean that the agent ends up taking whichever actions maximize the number entered into a keyboard; it instead creates a policy that is consistent with the constraints "when asked to follow <instruction i>, I should choose action <most approved action i>", for instructions and actions it is trained on. It's plausible to me that the most "natural" policy that satisfies these constraints is one which predicts what a real human would think of the chosen action, and then chooses the action that does best according to that prediction.
(In practice you'd want to add other things like e.g. interpretability and adversarial training.)
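A minimal sketch of the constraint-satisfaction framing above (instructions, actions, and generalization rules all hypothetical): training pins the policy down only on the instructions it saw, and "most natural" is a question about what fills in the rest.

```python
# Training fixes "policy(i) = most-approved action for i" on seen
# instructions; behavior elsewhere is underdetermined by the data.

train = {"fetch coffee": "bring a cup of coffee",
         "open window": "open the nearest window"}

def make_policy(constraints, generalize):
    def policy(instruction):
        if instruction in constraints:   # pinned down by approval data
            return constraints[instruction]
        return generalize(instruction)   # underdetermined by training
    return policy

# Two policies satisfying identical training constraints:
literal = make_policy(train, lambda i: "do nothing")
approval_predictor = make_policy(
    train, lambda i: "whatever a human would most approve of for: " + i)

assert literal("fetch coffee") == approval_predictor("fetch coffee")
print(literal("water plants"))             # 'do nothing'
print(approval_predictor("water plants"))  # the 'natural' generalization at issue
```

Both policies score perfectly on the training constraints; the disagreement above is about which kind of off-distribution completion a trained system actually ends up with.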
It's plausible to me that the most "natural" policy that satisfies these constraints is one which predicts what a real human would think of the chosen action...
I'd expect that's going to depend pretty heavily on how we're quantifying "most natural", which brings us right back to the central issue.
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected - and that will produce basically the same problems as an actual human at a keyboard. The final policy won't point to human values any more robustly than the data collection process did - if the data was generated by a human typing at a keyboard, then the most-predictive policy will predict what a human would type at a keyboard, not what a human "actually wants". Garbage in, garbage out, etc.
More pithily: if a problem can't be solved by a human typing something into a keyboard, then it also won't be solved by simulating/predicting what the human would type into the keyboard.
It could be that there's some viable criterion of "natural" other than just maximizing predictive power, but predictive power alone won't circumvent the embeddedness problems.
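A toy version of the garbage-in-garbage-out point (synthetic logs, hypothetical setup): the most predictive model of the keyboard reproduces whatever actually hit the keyboard, noise included.

```python
from collections import Counter

# Training logs: (instruction shown, what appeared on the keyboard).
logs = [("clean the room", "tidy the shelves"),
        ("clean the room", "tidy the shelves"),
        ("clean the room", "asdfgh"),   # cat walked across the keyboard
        ("clean the room", "asdfgh"),
        ("clean the room", "asdfgh")]

def most_predictive(logs, instruction):
    """Best predictor of the channel: most frequent observed output."""
    outputs = [out for ins, out in logs if ins == instruction]
    return Counter(outputs).most_common(1)[0][0]

print(most_predictive(logs, "clean the room"))  # 'asdfgh' -- predicts the channel,
                                                # not what the human wants
```

Pure predictive accuracy rewards modeling the data-generating process, cat and all; it has no term that privileges the human's intent over any other source of keystrokes.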
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected - and that will produce basically the same problems as an actual human at a keyboard. [...] the most-predictive policy will predict what a human would type at a keyboard, not what a human "actually wants".
Agreed. I don't think we will get that policy, because it's very complex. (It's much easier / cheaper to predict what the human wants than to run a detailed simulation of the room.)
I'd expect that's going to depend pretty heavily on how we're quantifying "most natural", which brings us right back to the central issue.
I'm making an empirical prediction, so I'm not the one quantifying "most natural" - reality is.
Tbc, I'm not saying that this is a good on-paper solution to AI safety; it doesn't seem like we could know in advance that this would work. I'm saying that it may turn out that as we train more and more powerful systems, we see evidence that the picture I painted is basically right; in that world it could be enough to do some basic instruction-following.
I'm also not saying that this is robust to scaling up arbitrarily far; as you said, the literal most predictive policy doesn't work.
Cool, I agree with all of that. Thanks for taking the time to talk through this.
I agree, and I think this is an underappreciated idea, which is why I liberally link the embedded agency post in things I write. I'm not sure I do a perfect job of remembering that we are all embedded, but I consider it essential to not getting confused about, for example, human values. Many of the confusions we have (especially the ones we fail to notice) are, to put it another way, a result of incorrectly thinking that the map does not also reside in the territory.