Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.

Open problem: how can we quantify player alignment in 2x2 normal-form games?

In common-payoff games the denominator is *not* zero, in general. For example, suppose that u_A = u_B = u with u(a1,b1)=1, u(a1,b2)=0, u(a2,b1)=0, u(a2,b2)=1, and s the pure outcome (a1,b1). Then a_{B/A}(s)=1, as expected: the current payoff is 1, and if B played b2 instead it would be 0.

Open problem: how can we quantify player alignment in 2x2 normal-form games?

Consider any finite two-player game in normal form (each player can have any finite number of strategies; we can also easily generalize to certain classes of infinite games). Let S_A be the set of pure strategies of player A and S_B the set of pure strategies of player B. Let u_A: S_A × S_B → R be the utility function of player A. Let s ∈ Δ(S_A × S_B) be a particular (mixed) outcome, with s_A its marginal on S_A. Then the alignment of player B with player A in this outcome is defined to be:

a_{B/A}(s) := (E_s[u_A] − min_{t∈Δ(S_B)} E_{s_A×t}[u_A]) / (max_{t∈Δ(S_B)} E_{s_A×t}[u_A] − min_{t∈Δ(S_B)} E_{s_A×t}[u_A])

Ofc so far it doesn't depend on u_B at all. However, we can make it depend on u_B if we use u_B to impose assumptions on s, such as:

- the marginal of s on S_B is an ε-best response to the marginal of s on S_A, or
- s is a Nash equilibrium (or satisfies some other solution concept)

Caveat: If we go with the Nash equilibrium option, a_{B/A} can become "systematically" ill-defined (consider e.g. the Nash equilibrium of matching pennies). To avoid this, we can switch to the extensive-form game where B chooses their strategy after seeing A's strategy.
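To make the definition concrete, here is a minimal sketch in Python (my own illustration; the function name and dict-based encoding are not from the original). Since the expectation is linear in B's mixed strategy, it suffices to take the min and max over B's pure strategies:

```python
from itertools import product

def alignment(u_A, s, S_A, S_B):
    """Compute a_{B/A}(s): how aligned player B is with player A at outcome s.

    u_A: dict mapping (a, b) -> A's payoff
    s:   dict mapping (a, b) -> probability (the mixed outcome)
    Returns None when the denominator is zero (alignment undefined).
    """
    # A's expected payoff under the actual outcome s
    actual = sum(s[a, b] * u_A[a, b] for a, b in product(S_A, S_B))
    # marginal of s on A's strategies
    s_A = {a: sum(s[a, b] for b in S_B) for a in S_A}
    # best/worst A-payoff over B's responses to s_A (pure suffices, by linearity)
    payoffs = [sum(s_A[a] * u_A[a, b] for a in S_A) for b in S_B]
    hi, lo = max(payoffs), min(payoffs)
    if hi == lo:
        return None  # B cannot affect A's payoff: alignment undefined
    return (actual - lo) / (hi - lo)

# Prisoner's Dilemma payoffs for A; outcome: both defect.
pd = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
both_defect = {k: (1.0 if k == ('D', 'D') else 0.0) for k in pd}
print(alignment(pd, both_defect, ['C', 'D'], ['C', 'D']))  # -> 0.0
```

At mutual defection B is minimally aligned with A (a_{B/A}=0), since B's actual play gives A the worst payoff B could inflict given A's strategy.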

My Current Take on Counterfactuals

I would be convinced if you had a theory of rationality that is a Pareto improvement on IB (i.e. has all the good properties of IB + a more general class of utility functions). However, LI doesn't provide this AFAICT. That said, I would be interested to see some rigorous theorem about LIDT solving procrastination-like problems.

As to philosophical deliberation, I feel some appeal in this point of view, but I can also easily entertain a different point of view: namely, that human values are more or less fixed and well-defined whereas philosophical deliberation is just a "show" for game theory reasons. Overall, I place much less weight on arguments that revolve around the presumed nature of human values compared to arguments grounded in abstract reasoning about rational agents.

An Intuitive Guide to Garrabrant Induction

First, "no complexity bounds on the trader" doesn't mean we allow *uncomputable* traders, we just don't limit their time or other resources (exactly like in Solomonoff induction). Second, even having a trader that knows everything doesn't mean all the prices collapse in a single step. It does mean that the prices will *converge* to knowing everything with time. GI guarantees no budget-limited trader will make an *infinite* profit; it doesn't guarantee no trader will make a profit at all (indeed, guaranteeing the latter is impossible).

An Intuitive Guide to Garrabrant Induction

> A brief note on naming: Solomonoff exhibited an uncomputable algorithm that does idealized induction, which we call Solomonoff induction. Garrabrant exhibited a computable algorithm that does logical induction, which we have named Garrabrant induction.

This seems misleading. Solomonoff induction has computable versions obtained by imposing a complexity bound on the programs. Garrabrant induction has uncomputable versions obtained by *removing* the complexity bound from the traders. The important difference between Solomonoff and Garrabrant is *not* computable vs. uncomputable. Also, I feel it would be appropriate to mention defensive forecasting as a historical precursor of Garrabrant induction.

My Current Take on Counterfactuals

My hope is that we will eventually have computationally feasible algorithms that satisfy provable (or at least conjectured) infra-Bayesian regret bounds for some sufficiently rich hypothesis space. Currently, even in the Bayesian case, we only have such algorithms for poor hypothesis spaces, such as MDPs with a small number of states. We can also rule out such algorithms for some large hypothesis spaces, such as short programs with a fixed polynomial-time bound. In between, there should be some hypothesis space which is small enough to be feasible and rich enough to be useful. Indeed, it seems to me that the existence of such a space is the simplest explanation for the success of deep learning (that is, for the ability to solve a diverse array of problems with relatively simple and domain-agnostic algorithms). But, at present I only have speculations about what this space looks like.
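As an illustration of what "provable regret bounds for a poor hypothesis space" looks like in the Bayesian/frequentist setting (this is my own toy example, not infra-Bayesian): the standard UCB1 algorithm on a Bernoulli bandit, where the hypothesis space is just one unknown parameter per arm, achieves logarithmic rather than linear regret.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit; return cumulative pseudo-regret per step.

    means: true success probabilities of the arms. The 'hypothesis space'
    here is tiny: one unknown parameter per arm.
    """
    rng = random.Random(seed)
    n = [0] * len(means)        # pull counts per arm
    total = [0.0] * len(means)  # cumulative reward per arm
    best = max(means)
    regret, trace = 0.0, []
    for t in range(1, horizon + 1):
        if t <= len(means):
            arm = t - 1  # pull each arm once to initialize
        else:
            # empirical mean plus optimism bonus
            arm = max(range(len(means)),
                      key=lambda i: total[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        n[arm] += 1
        total[arm] += reward
        regret += best - means[arm]
        trace.append(regret)
    return trace

trace = ucb1([0.9, 0.5], horizon=2000)
# Regret grows roughly logarithmically, far slower than the ~400 a
# uniformly random policy would accumulate over 2000 steps here.
```

The point of the comment above is that nothing analogous is currently known for hypothesis spaces rich enough to explain what deep learning does.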

My Current Take on Counterfactuals

> However, I also think LIDT solves the problem in practical terms:

What is LIDT exactly? I can try to guess, but I'd rather make sure we're both talking about the same thing.

> My basic argument is we can model this sort of preference, so why rule it out as a possible human preference? You may be philosophically confident in finitist/constructivist values, but are you so confident that you'd want to lock unbounded quantifiers out of the space of possible values for value learning?

I agree inasmuch as we actually *can* model this sort of preference, for a sufficiently strong meaning of "model". I feel that it's much harder to be confident about any detailed claim about human values than about the validity of a generic theory of rationality. Therefore, if the ultimate generic theory of rationality imposes some conditions on utility functions (while still leaving a very rich space of different utility functions), that will lead me to try formalizing human values *within those constraints*. Of course, given a candidate theory, we *should* poke around and see whether it can be extended to weaken the constraints.

Introduction To The Infra-Bayesianism Sequence

Boundedly rational agents definitely *can* have dynamic consistency; I guess it depends on just how bounded you want them to be. IIUC what you're looking for is a model that can formalize "approximately rational but doesn't necessarily satisfy any crisp desideratum". In this case, I would use something like my quantitative AIT definition of intelligence.

Formal Inner Alignment, Prospectus

Since you're trying to compile a comprehensive overview of directions of research, I will try to summarize my own approach to this problem:

- I want to have algorithms that admit thorough theoretical analysis. There's already plenty of bottom-up work on this (proving initially weak but increasingly stronger theoretical guarantees for deep learning). I want to complement it by top-down work (proving strong theoretical guarantees for algorithms that are initially infeasible but increasingly made more feasible). Hopefully eventually the two will meet in the middle.
- Given feasible algorithmic building blocks with strong theoretical guarantees, some version of the consensus algorithm can tame Cartesian daemons (including manipulation of search) as long as the prior (inductive bias) of our algorithm is sufficiently good.
- Coming up with a good prior is a problem in embedded agency. I believe I achieved significant progress on this using a certain infra-Bayesian approach, and hopefully will have a post soonish.
- The consensus-like algorithm *will* involve a trade-off between safety and capability. We will have to manage this trade-off based on expectations regarding external dangers that we need to deal with (e.g. potential competing unaligned AIs). I believe this to be inevitable, although ofc I would be happy to be proven wrong.
- The resulting AI is only a first stage that we will use to design the second-stage AI; it's *not* something we will deploy in self-driving cars or such.
- Non-Cartesian daemons need to be addressed separately. Turing RL seems like a good way to study this if we assume the core is too weak to produce non-Cartesian daemons, so the latter can be modeled as potential catastrophic side effects of using the envelope. However, I don't have a satisfactory solution yet (aside perhaps from homomorphic encryption, but the overhead might be prohibitive).
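My reading of the consensus idea in the list above, as a hypothetical sketch (the function and the ensemble encoding are mine, not the actual algorithm): run several independently trained hypotheses and only act when they all agree, abstaining otherwise. A corrupted model can then cause at most an abstention, at the cost of acting less often, which is exactly the safety/capability trade-off mentioned.

```python
def consensus_predict(models, x, fallback=None):
    """Return the shared prediction if all models agree, else the fallback.

    A daemon-corrupted model can only force an abstention (safety),
    at the price of answering fewer queries (capability).
    """
    preds = {m(x) for m in models}
    return preds.pop() if len(preds) == 1 else fallback

# Two intact copies and one 'corrupted' model: consensus still answers
# where all agree, and abstains where the corrupted copy deviates.
models = [lambda x: x % 2, lambda x: x % 2, lambda x: x % 2 if x < 10 else 0]
print(consensus_predict(models, 4))    # -> 0 (agreement)
print(consensus_predict(models, 13))   # -> None (disagreement: abstain)
```

In practice the fallback would be a known-safe policy rather than inaction, and the agreement test would be over a prior-weighted ensemble; this sketch only shows the shape of the trade-off.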

I don't think in this case a_{B/A} should be defined to be 1. It seems perfectly justified to leave it undefined, since in such a game B can be equally well conceptualized as maximally aligned or as maximally anti-aligned.

It *is* true that if, out of some set of objects, you consider the subset of those that have a_{B/A}=1, then it's natural to include the undefined cases too. But if, out of some set of objects, you consider the subset of those that have a_{B/A}=0, then it's *also* natural to include the undefined cases. This is similar to how (0,0) ∈ R² is simultaneously in the closure of {y/x=1} and in the closure of {y/x=−1}, so 0/0 can be considered to be either 1 or −1 (or any other number) depending on context.