This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, and Libsyn.
This is Part 1 of my «Boundaries» Sequence on LessWrong.
Summary: «Boundaries» are a missing concept from the axioms of game theory and bargaining theory, which might help pin down certain features of multi-agent rationality (this post), and have broader implications for effective altruism discourse and x-risk (future posts).
Epistemic status: me describing what I mean.
With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demsky, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.
When I say boundary, I don't just mean an arbitrary constraint or social norm. I mean something that could also be called a membrane in a generalized sense, i.e., a layer of stuff-of-some-kind that physically or cognitively separates a living system from its environment, that 'carves reality at the joints' in a way that isn't an entirely subjective judgement of the living system itself. Here are some examples that I hope will convey my meaning:
Comparison to Cartesian Boundaries.
For those who'd like a comparison to 'Cartesian boundaries', as in Scott Garrabrant's Cartesian Frames work, I think what I mean here is almost exactly the same concept. The main differences are these:
Comparison to social norms.
Certain social norms exist to maintain separations between living systems. For instance:
Epistemic status: uncontroversial overview and explanation of well-established research.
Game theory usually represents players as having utility functions (payoff functions), and often tries to view the outcome of the game as arising as a consequence of the players' utilities. However, for any given concept of "equilibrium" attempting to predict how players will behave, there are often many possible equilibria. In fact, there are a number of theorems in game theory called "folk theorems" (reference: Wikipedia) that show very large spaces of possible equilibria result when games have certain features approximating real-world interaction, such as
Here's a nice illustration of a folk theorem from a Chegg.com homework set:
The zillions of possible equilibria arising from repeated interactions leave us with not much of a prediction about what will actually happen in a real-world game, and not much of a normative prescription of what should happen, either.
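To make the multiplicity concrete, here is a minimal sketch of the simplest folk-theorem-style calculation: checking when mutual cooperation in an infinitely repeated Prisoner's Dilemma can be sustained by a "grim trigger" threat (cooperate until the other player defects, then defect forever). The payoff names T, R, P and the numbers used are illustrative assumptions, not taken from the homework set above.

```python
def grim_trigger_sustains_cooperation(T, R, P, delta):
    """Check whether grim trigger deters a one-shot deviation.

    T: temptation payoff for defecting against a cooperator
    R: reward payoff when both cooperate
    P: punishment payoff when both defect
    delta: discount factor, with 0 <= delta < 1
    """
    # Cooperating forever yields R + delta*R + delta^2*R + ... = R / (1 - delta).
    cooperate_forever = R / (1 - delta)
    # Deviating yields T once, then mutual defection (P) in every later round.
    deviate_once = T + delta * P / (1 - delta)
    return cooperate_forever >= deviate_once


def critical_discount_factor(T, R, P):
    """Smallest delta at which the threat is strong enough (assumes T > R > P)."""
    return (T - R) / (T - P)


if __name__ == "__main__":
    T, R, P = 5, 3, 1  # hypothetical Prisoner's Dilemma payoffs
    print(critical_discount_factor(T, R, P))
    for delta in (0.25, 0.75):
        print(delta, grim_trigger_sustains_cooperation(T, R, P, delta))
```

With these assumed payoffs, cooperation is sustainable once delta is at least 0.5. Defecting forever remains an equilibrium as well, and folk theorems show that many intermediate payoff profiles can be supported by similar threats.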
Bargaining theory attempts to predict and/or prescribe how agents end up "choosing an equilibrium", usually by writing down some axioms to pick out a special point on the Pareto frontier of possible outcomes, such as the Nash Bargaining Solution and Kalai-Smorodinsky Bargaining Solution (reference: Wikipedia). It's not crucial to understand these figures for the remainder of the post, but if you don't, I do think it's worth learning about them sometime, starting with the Wikipedia article:
The main thing to note about the above bargaining solutions is that they both depend on the existence of a constant point d, called a "disagreement point", representing a pair of constant utility levels that each player will fall back on attaining if the process of negotiation breaks down.
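The dependence on d can be seen in a small numerical sketch. It assumes a hypothetical bargaining problem (splitting one unit of a resource, with player 1's utility linear in their share and player 2's utility the square root of their share) and approximates both solutions on a grid. The function names are illustrative, not from any library.

```python
from math import sqrt


def feasible_set(steps=1000):
    """Utility pairs (u1, u2) when player 1 receives share x and player 2 the rest.

    Assumed utilities (hypothetical): u1 = x, u2 = sqrt(1 - x).
    """
    return [(i / steps, sqrt(1 - i / steps)) for i in range(steps + 1)]


def individually_rational(points, d):
    """Keep only points where both players do at least as well as at d."""
    return [p for p in points if p[0] >= d[0] and p[1] >= d[1]]


def nash_solution(points, d):
    """Maximize the product of gains over the disagreement point d."""
    candidates = individually_rational(points, d)
    return max(candidates, key=lambda p: (p[0] - d[0]) * (p[1] - d[1]))


def kalai_smorodinsky_solution(points, d):
    """Grid approximation: equalize each player's gain as a fraction of their ideal gain.

    Assumes each player's ideal payoff is strictly above their disagreement payoff.
    """
    candidates = individually_rational(points, d)
    ideal = (max(p[0] for p in candidates), max(p[1] for p in candidates))
    return max(
        candidates,
        key=lambda p: min(
            (p[0] - d[0]) / (ideal[0] - d[0]),
            (p[1] - d[1]) / (ideal[1] - d[1]),
        ),
    )


if __name__ == "__main__":
    points = feasible_set()
    for d in [(0.0, 0.0), (0.0, 0.5)]:
        print(d, nash_solution(points, d), kalai_smorodinsky_solution(points, d))
```

Raising player 2's fallback from 0 to 0.5 moves both solutions in player 2's favor, even though the set of feasible agreements is unchanged.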
(See also this concurrently written recent LessWrong post about Kalai & Kalai's cooperative/competitive 'coco' bargaining solution. The coco solution doesn't assume a constant disagreement point, but it does assume transferrable utility, which has its own problems, due to difficulties with defining interpersonal comparisons of utility [source: lots].)
The utility achieved by a player at the disagreement point is sometimes called their best alternative to negotiated agreement (BATNA):
Within the game, the disagreement point, i.e., the pair of BATNAs, may be viewed as defining what "zero" (marginal) utility means for each player.
(Why does zero need a definition, you might ask? Recall that the most broadly accepted axioms for the utility-theoretic foundations of game theory, namely the von Neumann–Morgenstern rationality axioms [reference: Wikipedia], only determine a player's utility function modulo a positive affine transformation (x ↦ ax + b, a > 0). So, in the wild, there's no canonical way to look at an agent and say what is or isn't a zero-utility outcome for that agent.)
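A short sketch can illustrate why the choice of zero matters. It uses a hypothetical resource-splitting problem (player 1's utility is their share x, player 2's is sqrt(1 - x)), rescales player 2's utility by an arbitrary positive affine map, and compares the Nash bargaining outcome when the fallback payoff is rescaled along with it versus when 0 is naively treated as the fallback. All names and numbers here are assumptions for illustration.

```python
from math import sqrt

# Possible agreements: the share x of one unit of a resource that player 1 receives.
OUTCOMES = [i / 100 for i in range(101)]


def u1(x):
    """Player 1's utility (hypothetical): linear in their share."""
    return x


def u2(x):
    """Player 2's utility (hypothetical): square root of their share."""
    return sqrt(1 - x)


def positive_affine(util, a, b):
    """Return the utility function x -> a * util(x) + b, with a > 0."""
    if a <= 0:
        raise ValueError("a must be positive")
    return lambda x: a * util(x) + b


def nash_outcome(util1, util2, d1, d2):
    """Agreement maximizing the product of gains over the fallback payoffs (d1, d2)."""
    candidates = [x for x in OUTCOMES if util1(x) >= d1 and util2(x) >= d2]
    return max(candidates, key=lambda x: (util1(x) - d1) * (util2(x) - d2))


if __name__ == "__main__":
    a, b = 3.0, 5.0
    v2 = positive_affine(u2, a, b)  # represents the same preferences as u2

    print(nash_outcome(u1, u2, 0.0, 0.0))          # original units, fallback 0
    print(nash_outcome(u1, v2, 0.0, a * 0.0 + b))  # fallback rescaled consistently
    print(nash_outcome(u1, v2, 0.0, 0.0))          # fallback naively left at 0
```

The first two calls select the same agreement, while the third hands player 1 a noticeably larger share. The utility function alone does not say where zero is, so something outside it has to.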
While it's appealing to think in terms of BATNAs, in physical reality, payoffs outside of negotiations can depend very much on the players' behavior inside the negotiations, and thus are not constant. Nash himself wrote about this limitation (Nash, 1953) just three years after originally proposing the Nash bargaining solution. For instance, if someone makes an unacceptable threat against you during a business negotiation, you might go to the police and have them arrested, versus just going home and minding your business if the negotiations had failed in a more normal/acceptable way. In other words, you have the ability to control their payoff outside the negotiation, based on what you observe during the negotiation. It's not a constant; you can affect it.
So, the disagreement point or BATNA concept isn't really applicable on its own, unless something is protecting the BATNA from what happens in the negotiation, making it effectively constant. Basically, the two players need a safe/protected/stable place to walk away to in order for a constant "walk away price" to be meaningful. For many people in many situations, that place is their home:
Thus, to the extent that we maintain social norms like "mind your own business" and "don't threaten to attack people" and "people can do whatever they want in the privacy of their own homes", we also simplify bargaining dynamics outside the home, by maintaining a well-defined fallback option for each person (a disagreement point), of the form "go home and do your own thing".
Epistemic status: research ideas, both for pinning down technical bargaining solutions, and for fixing game theory to be more applicable to real-life geopolitics and human interactions.
Since BATNAs need protection in order to be meaningful in negotiations, to identify BATNAs, we must ask: what protections already exist, going into the negotiation?
In real-world high-stakes negotiations between states — wars — almost the whole interaction is characterized by
Finally, the issue of whether AI technology will cause human extinction is very much an issue of whether certain boundaries can be respected and maintained, such as the boundaries of the human body and mind that protect individuals, as well as boundaries around physical territories and cyberspace that (should) protect human civilization.
That, however, will be a topic of a future post. For now, the main take-aways I'd like to re-iterate are that boundaries of living systems are important, and that they have a technical role to play in the theory and practice of how agents interact, including in formal descriptions of how one or more agents will or should reach agreements in cases of conflict.
In the next post, I'll talk more about how that concept of boundaries could be better integrated into discourse on effective altruism.
In this post, I laid out what I mean by boundaries (of living systems), described how a canonical choice of a "zero point" or "disagreement point" is missing from utility theory and bargaining theory, proposed that living system boundaries have a role to play in defining those disagreement points, and briefly alluded to the importance of boundaries in navigating existential risk.
This was Part 1 of my «Boundaries» Sequence.
While I generally like the post, I somewhat disagree with this summary of the state of understanding, which seems to ignore quite a lot of academic research. In particular:
- Friston et al certainly understand this (cf ... dozens to hundreds of papers claiming and explaining the importance of boundaries for living systems)
- the whole autopoiesis field
- various biology-inspired papers (eg this)
I do agree this way of thinking is less common among people stuck too much in the VNM basin, such as most of econ or most of game theory.
Jan, I agree with your references, especially Friston et al. I think those kinds of understanding, as you say, have not adequately made their way into utility-theoretic fields like econ and game theory, so I think the post is valid as a statement about the state of understanding in those utility-oriented fields. (Note that the post is about "a missing concept from the axioms of game theory and bargaining theory" and "a key missing concept from utility theory", and not "concepts missing from the mind of all of humanity".)
In Part 3 of this series, I plan to write a shallow survey of 8 problems relating to AI alignment, and the relationship of the «boundary» concept to formalizing them. To save time, I'd like to do a deep dive into just one of the eight problems, based on what commenters here would find most interesting. If you have a moment, please use the "agree" button (and where desired, "disagree") to vote for which of the eight topics I should go into depth about. Each topic is given as a subcomment below (not looking for karma, just agree/disagree votes). Thanks!
7. Preference plasticity — the possibility of changes to human preferences over time, and the challenge of defining alignment in light of time-varying preferences (Russell, 2019, p.263).
6. Mesa-optimizers — instances of learned models that are themselves optimizers, which give rise to the so-called inner alignment problem (Hubinger et al, 2019).
5. Counterfactuals in decision theory — the problem of defining what would have happened if an AI system had made a different choice, such as in the Twin Prisoner's Dilemma (Yudkowsky & Soares, 2017).
3. Mild optimization — the problem of designing AI systems and objective functions that, in an intuitive sense, don’t optimize more than they have to (Taylor et al, 2016).
2. Corrigibility — the problem of constructing a mind that will cooperate with what its creators regard as a corrective intervention (Soares et al, 2015).
8. (Unscoped) Consequentialism — the problem that an AI system engaging in consequentialist reasoning, for many objectives, is at odds with corrigibility and containment (Yudkowsky, 2022, no. 23).
4. Impact regularization — the problem of formalizing "change to the environment" in a way that can be effectively used as a regularizer penalizing negative side effects from AI systems (Amodei et al, 2016).
1. AI boxing / containment — the method and challenge of confining an AI system to a "box", i.e., preventing the system from interacting with the external world except through specific restricted output channels (Bostrom, 2014, p.129).
Curated. It's not every day that someone attempts to add concepts to the axioms of game theory/bargaining theory/utility theory and I'm pretty excited for where this is headed, especially if the implications are real for EA and x-risk.
In other words, you have the ability to control their payoff outside the negotiation, based on what you observe during the negotiation.
This suggests some sort of (possibly acausal) bargaining within the BATNAs, so points to a hierarchy of bargains. Each bargain must occur without violating boundaries of agents, but if it would, then the encounter undergoes escalation, away from trade and towards conflict. After a step of escalation, another bargain may be considered, one that runs off tighter, less comfortable boundaries. If it also falls through, there is a next level of escalation, and so on.
Possibly the sequence of escalation goes on until the goodhart boundary where agents lose ability to assess value of outcomes. It's unclear what happens when that breaks down as well and one of the agents moves the environment into the other's crash space.
Note that this is not destruction of the other agent, which is unexpected for the last stage of escalation of conflict. Destruction of the other agent is merely how the game aborts before reaching its conclusion, while breaking into the crash space of the other agent is the least acceptable outcome in terms of agent boundaries (though it's not the worst outcome, it could even have high utility; these directions of badness are orthogonal, goodharting vs. low utility). This is a likely outcome of failed AI alignment (all boundaries of humanity are ignored, leading to something normatively worthless), as well as of some theoretical successes of AI alignment that are almost certainly impossible in practice (all boundaries of humanity are ignored, the world is optimized towards what is the normatively best outcome for humanity).
Great post! One relatively-minor nitpick:
The coco solution doesn't assume a constant disagreement point, but it does assume transferrable utility, which has its own problems, due to difficulties with defining interpersonal comparisons of utility [source: lots].
Interpersonal comparisons of utility in general make no sense at all, because each agent's utility can be scaled/shifted independently. But I don't think that's a problem for transferrable utility, which is what we need for coco. Transferrable utility just requires money (or some analogous resource), and it requires that the amounts of money-equivalent involved in the game are small enough that utility is roughly linear in money. We don't need interpersonal comparability of utility for that.
For the games that matter most, the amounts of money-equivalent involved are large enough that utility is not roughly linear in it. (Example: Superintelligences deciding what to do with the cosmic endowment.) Or so it seems to me, I'd love to be wrong about this.
Seems true, though I would guess that the coco idea could probably be extended to weaker conditions, e.g. expected utility being a smooth function of money. I haven't looked into this, but my guess would be that it only needs linearity on the margin, based on how things-like-this typically work in economics.
Interesting. I hope you are wrong.
Heh. Beware lest you wish yourself from the devil you know to the devil you don't.
Rambling/riffing: Boundaries typically need holes in order to be useful. Depending on the level of abstraction, different things can be thought of as holes. One way to think of a boundary is a place where a rule is enforced consistently, and this probably involves pushing what would be a continuous condition into a condition with a few semi discrete modes (in the simplest case enforcing a bimodal distribution of outcomes). In practice, living systems seem to have settled on stacking a bunch of one dimensional gate keepers together as presumably the modularity of such a thing was easier to discover in the search space than things with higher path dependencies due to entangled condition measurement. This highlights the similarity between boolean circuit analysis and a biological boundary. In a boolean circuit, the configurations of 'cheap' energy flows/gradients can be optimized for benefit, while the walls to the vast alternative space of other configurations can be artificially steepened/shored up (see: mitigation efforts to prevent electron tunneling in semiconductors).