Vanessa Kosoy

AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Yes, absolutely! The contest is not a publication venue.

A major impediment in applying RL theory to any realistic scenario is that even the control problem^{[1]} is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:

- In real life, processes can often be modeled as made of independent co-existing parts. For example, if I need to decide on my exercise routine for the next month and also on my research goals for the next month, the two can be optimized more or less independently.
- In real life, planning can often be decomposed across timescales, s.t. you don't need to make short timescale plans for steps that only happen later on the long timescale. For example, if I'm in the process of planning a trip to Paris, I might need to worry about (i) booking hotel and tickets (long timescale), (ii) navigating the website I'm using to find a flight (medium timescale) and (iii) moving my finger towards the correct key for entering some specific text into a field (short timescale). But I don't need to worry about walking down the escalator in the airport at this moment.

Here's an attempt to formalize these properties.

We will define a certain formal language for describing environments. These environments are going to be certain *asymptotic regions* in the space of MDPs.

- Each term has a type, which consists of a tuple of inputs and a single output. Each input is associated with an HV-polytope^{[2]}. The output is associated with an H-polytope^{[3]}. The inputs represent action spaces (to get a discrete action set, we use the simplex of probability distributions on this set). The output represents the space of admissible equilibria.
- The atomic terms are finite communicating^{[4]} MDPs, in which each state is associated with a particular input and a transition kernel which has to be an affine mapping. For an atomic term, the output is the polytope of stationary state-action distributions. Notice that it's efficiently computable.
- Given two terms and , we can construct a new term . We set . This represents a process made of two independent parts.
- Given a term , terms and surjective affine mappings , we can construct a new term . This represents an environment governed by on long timescales and by on short timescales. Notice that it's possible to efficiently verify that is a surjection, which is why we use HV-polytopes for inputs^{[5]}.
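To make the "efficiently computable" claim concrete, here is a minimal sketch (illustrative names, not from the post): the polytope of stationary state-action distributions of a finite MDP is cut out by linear constraints (nonnegativity, total mass one, and flow conservation), so membership can be checked directly.

```python
# Sketch, assuming the standard characterization: mu(s, a) lies in the
# stationary state-action polytope iff mu >= 0, sum mu = 1, and for every
# state s:  sum_a mu(s, a) = sum_{s', a'} mu(s', a') * P(s', a', s).

def in_stationary_polytope(mu, P, tol=1e-9):
    """mu: dict (state, action) -> prob; P: dict (state, action, next) -> prob."""
    states = {s for (s, a) in mu}
    if any(p < -tol for p in mu.values()):
        return False
    if abs(sum(mu.values()) - 1.0) > tol:
        return False
    for s in states:
        outflow = sum(p for (s0, a), p in mu.items() if s0 == s)
        inflow = sum(p * P.get((s0, a, s), 0.0) for (s0, a), p in mu.items())
        if abs(inflow - outflow) > tol:
            return False
    return True

# Two-state MDP: action 0 stays put, action 1 switches state.
P = {(0, 0, 0): 1.0, (1, 0, 1): 1.0, (0, 1, 1): 1.0, (1, 1, 0): 1.0}

# Always playing "switch" spends half the time in each state: stationary.
assert in_stationary_polytope({(0, 1): 0.5, (1, 1): 0.5}, P)
# All mass in state 0 while always switching violates flow conservation.
assert not in_stationary_polytope({(0, 1): 1.0}, P)
```

The linear constraints above are exactly why the polytope admits an efficient description for atomic terms.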

It might be useful to think of as vertical composition and as horizontal composition, in the category-theoretic sense.

In order to assign semantics to this language, we need to define the environment associated with each term . We will do so by assigning a state space , each state an input (which determines the action space at this state) and a transition kernel. This is done recursively:

For the atomic terms, it is straightforward.

For :

- . Here, the last factor represents which subenvironment is active. This is needed because we want the two subenvironments to be asynchronous, i.e. their time dynamics don't have to be in lockstep.
- The transition kernel at is defined by updating according to the transition kernel of and then changing according to some *arbitrary* probabilistic rule, as long as this rule switches the active subenvironment sufficiently often. The degrees of freedom here are one reason we get an asymptotic region in MDP-space rather than a specific MDP.
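As a toy illustration (hypothetical names and transition rules, not the post's formalism), the asynchronous product semantics can be sketched as a step function on joint states `(x1, x2, active)`: only the active subenvironment advances, and an arbitrary probabilistic rule (here, a Bernoulli switch) decides which part is active next.

```python
import random

def product_step(state, actions, steps, switch_prob, rng):
    """state = (x1, x2, active); steps = (step1, step2) are the component
    transition functions; actions = (a1, a2)."""
    x = [state[0], state[1]]
    active = state[2]
    x[active] = steps[active](x[active], actions[active])  # advance active part only
    if rng.random() < switch_prob:  # the rule must switch sufficiently often
        active = 1 - active
    return (x[0], x[1], active)

# Two toy subenvironments: counters mod 3 and mod 5; the action is the increment.
step1 = lambda x, a: (x + a) % 3
step2 = lambda x, a: (x + a) % 5

rng = random.Random(0)
# With switch_prob = 0 only the first subenvironment ever moves...
assert product_step((0, 0, 0), (1, 1), (step1, step2), 0.0, rng) == (1, 0, 0)
# ...while switch_prob = 1 hands control to the other part after each step.
assert product_step((0, 0, 0), (1, 1), (step1, step2), 1.0, rng) == (1, 0, 1)
```

The freedom in choosing `switch_prob` (or any other sufficiently-often switching rule) is the source of the asymptotic-region degrees of freedom mentioned above.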

For :

- , where we abuse notation to identify the input with its index inside the tuple.
- is extended from in the obvious way.
- Given and , the -transition kernel at is defined as follows: (i) with high probability (a "type I" transition), is updated according to the transition kernel of ; (ii) with low probability (a "type II" transition), is updated according to the transition kernel of , where the action is determined by the *frequency* of state-action pairs since the last type II transition. It is easy to see that this set of frequencies is always a polytope in an appropriately defined space of state-action distributions.
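A minimal sketch of this two-timescale dynamic (hypothetical names and toy transition rules, not the post's construction): the short-timescale state steps every tick, while a rare type II step advances the long-timescale state, feeding it the empirical frequency of short-timescale state-action pairs accumulated since the previous type II step.

```python
import random
from collections import Counter

def hierarchical_step(state, g_action, step_g, step_f, type2_prob, rng):
    """state = (f_state, g_state, counts), where counts tallies the
    short-timescale (g_state, g_action) pairs since the last type II step."""
    f_state, g_state, counts = state
    if rng.random() < type2_prob:  # rare long-timescale (type II) step
        total = sum(counts.values()) or 1
        freq = {sa: n / total for sa, n in counts.items()}
        f_state = step_f(f_state, freq)  # long timescale sees only frequencies
        counts = Counter()               # restart the empirical tally
    counts = counts + Counter({(g_state, g_action): 1})
    g_state = step_g(g_state, g_action)  # short-timescale (type I) step
    return (f_state, g_state, counts)

step_g = lambda x, a: a                           # toy: G just moves to the action
step_f = lambda x, freq: max(freq, key=freq.get)  # toy: F records the modal pair

rng = random.Random(0)
state = ('start', 0, Counter())
state = hierarchical_step(state, 'a', step_g, step_f, 0.0, rng)  # type I only
state = hierarchical_step(state, 'a', step_g, step_f, 1.0, rng)  # forces type II
# F saw the frequencies of G's pairs since the start: all mass on (0, 'a').
assert state[0] == (0, 'a') and state[1] == 'a'
```

The point of the sketch is that the long-timescale environment never sees individual short-timescale steps, only their aggregated frequencies.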

The upshot is that, given a list of term definitions (which has a structure similar to a directed acyclic graph, since the definition of each term can refer to previously defined terms), we get an environment that can have an exponentially large number of states, but the control problem can be solved in time polynomial in the size of this description, given some assumptions about the reward function. Specifically, we "decorate" our terms with reward functions in the following way:

- For atomic terms, we just specify the reward function in the straightforward way.
- For , we specify some . The reward is then a linear combination of the individual rewards with these coefficients (and doesn't depend on which subenvironment is active).
- For a term of the form , we need that for some affine mapping which is part of the decoration. This can be validated efficiently (here it's important again that the input is an HV-polytope). In addition, we specify some , and the reward is a linear combination, with these coefficients, of the -reward and the -reward.
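As a tiny sketch of the product case (illustrative names, not the post's notation), the decorated reward is just a fixed convex combination of the component rewards, independent of which subenvironment is currently active:

```python
# Sketch: combine two component reward functions with a weight w in [0, 1].

def product_reward(r1, r2, w):
    """Composite reward for a product term; w weighs r1 against r2."""
    return lambda sa1, sa2: w * r1(*sa1) + (1 - w) * r2(*sa2)

r1 = lambda s, a: float(s == a)  # toy component rewards
r2 = lambda s, a: float(s != a)
r = product_reward(r1, r2, 0.25)
assert r((1, 1), (0, 1)) == 1.0  # 0.25 * 1.0 + 0.75 * 1.0
assert r((1, 0), (0, 0)) == 0.0  # neither component rewards these pairs
```

Linearity is what keeps the composite control problem decomposable across the two parts.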

In the case of timescale decomposition, this planning algorithm can be regarded as a formalization of instrumental goals.

An important open problem is understanding the sample complexity of learning hypothesis classes made of such environments: first in the computationally unbounded case, and then with polynomial-time learning algorithms.

"Control" means finding the optimal policy given a known transition kernel and reward function. ↩︎

An HV-polytope is a polytope described by a list of inequalities *and* a list of vertices (notice that it's possible to efficiently validate such a description). ↩︎

An H-polytope is a polytope described by a list of inequalities. ↩︎

Maybe we can drop this requirement and use the polytope of *reachable* stationary state-action distributions for . ↩︎

According to Tiwary 2008, projection of H-polytopes is NP-hard even in the output-sensitive sense, but for non-degenerate projection directions it is output-sensitive polynomial time. In particular, this means we should be able to efficiently verify surjectivity in the non-degenerate case even for H-polytopes on the inputs. However, the proof given there seems poorly written and the paper is not peer-reviewed AFAICT. ↩︎

A question that often comes up in discussions of IRL: are agency and values purely behavioral concepts, or do they depend on *how* the system produces its behavior? The cartesian measure of agency I proposed seems purely behavioral, since it only depends on the policy. The physicalist version seems less so, since it depends on the source code, but this difference might be minor: the role of the source code is merely telling the agent "where" it is in the universe. However, on closer examination, the physicalist version is far from purely behavioral, and this is true even for cartesian Turing RL. Indeed, the policy describes not only the agent's interaction with the actual environment but also its interaction with the "envelope" computer. In a sense, the policy can be said to reflect the agent's "conscious thoughts".

This means that specifying an agent requires not only specifying its source code but also the "envelope semantics" (possibly we also need to penalize for the complexity of in the definition of ). Identifying that an agent exists requires not only that its source code is running, but also, at least, that its history is -consistent with the variable of the bridge transform. That is, for any we must have for some destiny . In other words, we want any computation the agent ostensibly runs on the envelope to be one that is physically manifest (it might be that this condition isn't sufficiently strong, since it doesn't seem to establish a causal relation between the manifesting and the agent's observations, but it's at least necessary).

Notice also that the computational power of the envelope implied by becomes another characteristic of the agent's intelligence, together with as a function of the cost of computational resources. It might be useful to come up with natural ways to quantify this power.

The spectrum you're describing is related, I think, to the spectrum that appears in the AIT definition of agency, where there is dependence on the *cost of computational resources*. This means that the same system can appear agentic from a resource-scarce perspective but non-agentic from a resource-abundant perspective. The former then corresponds to the Vingean regime and the latter to the predictable regime. However, the framework does have a notion of prior and not just utility, so it *is* possible to ascribe beliefs to Vingean agents. I think it makes sense: the beliefs of another agent can predictably differ from your own beliefs, if only because there is some evidence that you have seen but the other agent, to the best of your knowledge, has not^{[1]}.

You need to allow for the possibility that the other agent inferred this evidence from some pattern you are not aware of, but you should not be confident of this. For example, even an arbitrarily intelligent AI that received zero external information should have a hard time inferring certain things about the world that we know. ↩︎

There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis , for , we consider its bridge transform . Given some subset of programs we can define then project to ^{[1]}. We can then take bridge transform *again* to get some . The factor now tells us which programs causally affect the manifestation of programs in . Notice that by Proposition 2.8 in the IBP article, when we just get all programs that are running, which makes sense.

The version of PreDCA without any explicit malign hypothesis filtering might be immune to malign hypotheses, and here is why. It seems plausible that IBP admits an agreement theorem (analogous to Aumann's) which informally amounts to the following: given two agents Alice and Bobcat that (i) share the same physical universe, (ii) have a sufficiently tight causal relationship (each can see what the other sees), (iii) have unprivileged locations inside the physical universe, (iv) start from similar/compatible priors and (v) [maybe needed?] have similar utility functions, they converge to similar/compatible beliefs, regardless of the complexity of translation between their subjective viewpoints. This is plausible because (i) as opposed to the cartesian framework, different bridge rules don't lead to different probabilities, and (ii) if Bobcat considers a simulation hypothesis plausible, and the simulation is sufficiently detailed to fool it indefinitely, then the simulation contains a detailed simulation of Alice, and hence Alice must also consider this to be a plausible hypothesis.

If the agreement conjecture is true, then the AI will converge to hypotheses that all contain the user, in a causal relationship with the AI that affirms them as the user. Moreover, those hypotheses will be compatible with the user's own posterior (i.e. the differences can be attributed to the AI's superior reasoning). Therefore, the AI will act on the user's behalf, leaving no room for mesa-optimizers. Any would-be mesa-optimizer has to take the shape of a hypothesis that the user should also believe, within which the pointer-to-values still points to the right place.

Two nuances:

- Maybe in practice there's still room for simulation hypotheses of the AI which contain coarse-grained simulations of the user. In this case, the user detection algorithm might need to allow for coarsely simulated agents.
- If the agreement theorem needs condition (v), we get a self-referential loop: if the AI and the user converge to the same utility function, the theorem guarantees that they converge to the same utility function, but otherwise it doesn't. This might make the entire thing a useless tautology, or there might be a way to favorably resolve the self-reference, vaguely analogous to how Löb's theorem allows resolving the self-reference in prisoner's dilemma games between FairBots.

There are actually two ways to do this, corresponding to the two natural mappings . The first is just projecting the subset of to a subset of , the second is analogous to what's used in Proposition 2.16 of the IBP article. I'm not entirely sure what's correct here. ↩︎

The problem of future unaligned AI leaking into human imitation is something I wrote about before. Notice that IDA-style recursion helps a lot, because instead of simulating a process going deep into the external timeline's future, you're simulating a "groundhog day" where the researcher wakes up over and over at the same external time (more realistically, the restart time drifts forward with the time outside the simulation) with a written record of all their previous work (but no memory of it). There can still be a problem if there is a positive probability of unaligned AI takeover in the present (i.e. during the time interval of the simulated loop), but it's a milder problem. It can be further ameliorated if the AI has enough information about the external world to make confident predictions about the possibility of unaligned takeover during this period. The out-of-distribution problem is also less severe: the AI can occasionally query the real researcher to make sure its predictions are still on track.

I think it's a terrible idea to automatically adopt an equilibrium notion which incentivises the players to come up with increasingly nasty threats as fallback if they don't get their way. And so there seems to be a good chunk of remaining work to be done, involving poking more carefully at the CoCo value and seeing which assumptions going into it can be broken.

I'm not convinced there is any real problem here. The intuitive negative reaction we have to this "ugliness" is because of (i) empathy and (ii) morality. Empathy is just a part of the utility function which, when accounted for, already ameliorates some of the ugliness. Morality is a reflection of the fact that we are already in some kind of bargaining equilibrium. Therefore, thinking about all the threats invokes a feeling of all existing agreements getting dissolved, sending us back to "square one" of the bargaining. And the latter is something that, reasonably, nobody wants to do. But none of this implies that this is not the correct ideal notion of bargaining equilibrium.

This is a fascinating result, but there is a caveat worth noting. When we say that e.g. AlphaGo is "superhuman at go", we are comparing it to humans who (i) spent *years* training on the task and (ii) were selected for being the best at it among a sizable population. On the other hand, with next-token prediction we're nowhere near that amount of optimization on the human side. (That said, I also agree that optimizing a model on next-token prediction is very different from what optimizing it for text coherence would be, if we could accomplish the latter.)

The short answer is, I don't know.

The long answer is, here are some possibilities, roughly ordered from "boring" to "weird":

- The framework is wrong.
- The framework is incomplete: there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier, and without further research it's hard to say whether they break important things or not.
- Humans are just not physicalist agents; you're not supposed to model them using this framework, even if this framework can be useful for AI. This is why it took humans so much time to come up with science.
- Like #3, and also if we thought long enough we would become convinced of some kind of simulation/deity hypothesis (where the simulator/deity is a physicalist), and this is normatively correct for us.
- Because the universe is effectively finite (since it's asymptotically de Sitter), there are only so many computations that can run. Therefore, even if you only assign positive value to running certain computations, it effectively implies that running other computations is bad. Moreover, the fact that the universe is finite is unsurprising, since infinite universes tend to have *all* possible computations running, which makes them roughly irrelevant hypotheses for a physicalist.
- We are just confused about hell being worse than death. For example, maybe people in hell have no qualia. This makes some sense if you endorse the (natural for physicalists) anthropic theory that only the best-off future copy of you matters. You can imagine there always being a "dead copy" of you, so that if something worse-than-death happens to the apparent-you, your subjective experiences go into the "dead copy".

There's also the ALTER prize for progress on the learning-theoretic agenda.