Vanessa Kosoy

Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate

I think theoretical work on AI safety has multiple different benefits, but I prefer a slightly different categorization. I like categorizing in terms of the sort of safety guarantees we can get, on a spectrum from "stronger but harder to get" to "weaker but easier to get". Specifically, the reasonable goals for such research IMO are as follows.

Plan A is having (i) a mathematical formalization of alignment (ii) a specific practical algorithm (iii) a proof that this algorithm is aligned, or at least a solid base of theoretical and empirical evidence, similarly to the situation in cryptography. This more or less correspond to World 2.

Plan B is having (i) a mathematical formalization of alignment (ii) a specific practical algorithm (iii) a specific impractical but provably aligned algorithm (iv) informal and empirical arguments suggesting that the former algorithm is as aligned as the latter. As an analogy consider Q-learning (an impractical algorithm with provable convergence guarantees) and deep Q-learning (a practical algorithm with no currently known convergence guarantees, designed by analogy to the former). This sort of still corresponds to World 2 but not quite.

Plan C is having enough theory to at least have rigorous models of all possible failure modes, and theory-inspired informal and empirical arguments why a certain algorithm avoids them. As an analogy, concepts such as VC dimension and Rademacher complexity allow us being more precise in our reasoning about underfitting and overfitting, even if we don't know how to compute them in practical scenarios. This corresponds to World 1, I guess?

In a sane civilization the solution would be not building AGI until we can implement Plan A. In the real civilization, we should go with the best plan that will be ready by the time competing projects become too dangerous to ignore.

World 3 seems too ambitious to me, since analyzing arbitrary code is almost always an intractable problem (e.g. Rice's theorem). You would need at least *some* constraints on how your agent is designed.

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Like I wrote before, slow takeoff might be actually *worse* than fast takeoff. This is because even if the first powerful AIs are aligned, their head start on unaligned AIs will not count as much, and alignment might (and probably will) require overhead that will give the unaligned AIs an advantage. Therefore, success would require that the institutions will either prevent or quickly shut down unaligned AIs for enough time that aligned AIs gain the necessary edge.

Vanessa Kosoy's Shortform

Actually, as opposed to what I claimed before, we don't need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself.

Given , we define that " has (unbounded) goal-directed intelligence (at least) " when there is a prior and utility function s.t. for any policy , if then . Here, is the Solomonoff prior and is Kolmogorov complexity. When (i.e. no computable policy can match the expected utility of ; in particular, this implies is optimal since any policy can be *approximated* by a computable policy), we say that is "perfectly (unbounded) goal-directed".

Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable "hell" state. On the other hand, goal-directed intelligence differs only by between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn't depend on the UTM at all.

I think that it's also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any there is s.t. the probability to get intelligence at least is smaller than .

Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by , and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for *any* price vector. Informally speaking, an agent that is very smart relatively to a context with cheap computational resources is still very smart relatively to a context where they are expensive, which makes intuitive sense.

If we choose just one computational resource, we can speak of the minimal price for which a given policy is perfectly goal-directed, which is another way to measure intelligence with a more restricted domain. Curiously, our bounded Solomonoff-like prior has the shape of a Maxwell-Boltzmann distribution in which the prices are thermodynamic parameters. Perhaps we can regard the minimal price as the point of a phase transition.

Vanessa Kosoy's Shortform

*This idea was inspired by a correspondence with Adam Shimi.*

It seem very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning settings: we have a set of actions , a set of observations , a policy is a mapping , a reward function is a mapping , the utility function is a time discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows , or the prior can believe that behavior not according to leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, description complexity is only naturally well-defined up to an additive constant. So, if we want to have a crisp concept, we need to consider an asymptotic in which the complexity of *something* goes to infinity. Indeed, it seems natural to ask that the complexity of the policy should be much higher than the complexity of the prior and the reward function: in this case we can say that the "intentional stance" is an efficient description. However, this doesn't make sense with description complexity: the description "optimal policy for and " is of size ( stands for "description complexity of ").

To salvage this idea, we need to take not only description complexity but also *computational* complexity into account. [EDIT: I was wrong, and we can get a well-defined concept in the unbounded setting too, see child comment. The bounded concept is still interesting.] For the intentional stance to be non-vacuous we need to demand that the policy does some "hard work" in order to be optimal. Let's make it formal. Consider any function of the type where and are some finite alphabets. Then, we can try to represent it by a probabilistic automaton , where is the finite set space, is the transition kernel, and we're feeding symbols into the automaton one by one. Moreover, can be represented as a boolean circuit and this circuit can be the output of some program executed by some fixed universal Turing machine. We can associate with this object 5 complexity parameters:

- The description complexity, which is the length of .
- The computation time complexity, which is the size of .
- The computation space complexity, which is the maximum between the depth of and .
- The precomputation time complexity, which is the time it takes to run.
- The precomputation space complexity, which is the space needs to run.

It is then natural to form a single complexity measure by applying a logarithm to the times and taking a linear combination of all 5 (we apply a logarithm so that a brute force search over bits is roughly equivalent to hard-coding bits). The coefficients in this combination represent the "prices" of the various resources (but we should probably fix the price of description complexity to be 1). Of course not all coefficients must be non-vanishing, it's just that I prefer to keep maximal generality for now. We will denote this complexity measure .

We can use such automatons to represent policies, finite POMDP environments and reward functions (ofc not *any* policy or reward function, but any that can be computed on a machine with finite space). In the case of policies, the computation time/space complexity can be regarded as the time/space cost of applying the "trained" algorithm, whereas the precomputation time/space complexity can be regarded as the time/space cost of training. If we wish, we can also think of the boolean circuit as a recurrent neural network.

We can also use to define a prior , by ranging over programs that output a valid POMDP and assigning probability proportional to to each instance. (Assuming that the environment has a finite state space might seem restrictive, but becomes quite reasonable if we use a quasi-Bayesian setting with quasi-POMDPs that are not meant to be complete descriptions of the environment; for now we won't go into details about this.)

Now, return to our policy . Given , we define that " has goal-directed intelligence (at least) " when there is a suitable prior and utility function s.t. for any policy , if then . When (i.e. no finite automaton can match the expected utility of ; in particular, this implies is optimal since any policy can be *approximated* by a finite automaton), we say that is "perfectly goal-directed". Here, serves as a way to measure the complexity of , which also ensures is non-dogmatic in some rather strong sense.

With *this* definition we cannot "cheat" by encoding the policy into the prior or into the utility function, since that would allow no complexity difference. Therefore this notion seems like a non-trivial requirement on the policy. On the other hand, this requirement *does* hold sometimes, because solving the optimization problem can be much more computationally costly than just evaluating the utility function or sampling the prior.

Using vector fields to visualise preferences and make them consistent

Regarding higher-dimensional space. For a Riemannian manifold of any dimension, and a smooth vector field , we can pose the problem: find smooth that minimizes , where is the canonical measure on induced by the metric. If either is compact or we impose appropriate boundary conditions on and , then I'm pretty sure this equivalent to solving the elliptic differential equation . Here, the Laplacian and are defined using the Levi-Civita connection. If is connected then, under these conditions, the equation has a unique solution up to an additive constant.

An Orthodox Case Against Utility Functions

First, it seems to me rather clear what macroscopic physics I attach utility to...

This does not strike me as the sort of thing which will be easy to write out.

Of course it is not easy to write out. Humanity preferences are highly complex. By "clear" I only meant that it's clear something like this exists, not that I or anyone can write it out.

What if humans value something like observer-independent beauty? EG, valuing beautiful things existing regardless of whether anyone observes their beauty.

This seems ill-defined. What is a "thing"? What does it mean for a thing to "exist"? I can imagine valuing beautiful wild nature, by having "wild nature" be a part of the innate ontology. I can even imagine preferring certain computations to have results with certain properties. So, we can consider a preference that some kind of simplicity-prior-like computation outputs bit sequences with some complexity theoretic property we call "beauty". But if you want to go even more abstract than that, I don't know how to make sense of that ("make sense" *not* as "formalize" but just as "understand what you're talking about").

It would be best if you had a simple example, like a diamond maximizer, where it's more or less clear that it makes sense to speak of agents with this preference.

What I have in mind is complicated interactions between different ontologies. Suppose that we have one ontology -- the ontology of classical economics -- in which...

And we have another ontology -- the hippie ontology -- in which...

And suppose what we want to do is try to reconcile the value-content of these two different perspectives.

Why do we want to reconcile them? I think that you might be mixing two different questions here. The first question is what kind of preferences ideal "non-myopic" agents can have. About this I maintain that my framework provides a good answer, or at least a good first approximation of the answer. The second question is what kind of preferences *humans* can have. But humans are agents with only semi-coherent preferences, and I see no reason to believe things like reconciling classical economics with hippies should follow from any natural mathematical formalism. Instead, I think we should model humans as having preferences that change over time, and the detailed dynamics of the change is just a function the AI needs to learn, not some consequence of mathematical principles of rationality.

An Orthodox Case Against Utility Functions

Humans are not clear on what macroscopic physics we attach utility to. It is possible that we can emulate human judgement sufficiently well by learning over macroscopic-utility hypotheses (ie, partial hypotheses in your framework). But perhaps no individual hypothesis will successfully capture the way human value judgements fluidly switch between macroscopic ontologies...

First, it seems to me rather clear what macroscopic physics I attach utility to. If I care about people, this means my utility function comes with some model of what a "person" is (that has many free parameters), and if something falls within the parameters of this model then it's a person, and if it doesn't then it isn't a person (ofc we can also have a fuzzy boundary, which is supported in quasi-Bayesianism).

Second, what does it mean for a hypothesis to be "individual"? If we have a prior over a family of hypotheses, we can take their convex combination and get a new individual hypothesis. So I'm not sure what sort of "fluidity" you imagine that is not supported by this.

Your way of handling macroscopic ontologies entails knightian uncertainty over the microscopic possibilities. Isn't that going to lack a lot of optimization power? EG, if humans reasoned this way using intuitive physics, we'd be afraid that any science experiment creating weird conditions might destroy the world, and try to minimize chances of those situations being set up, or something along those lines?

The agent doesn't have full Knightian uncertainty over all microscopic possibilities. The prior is composed of *refinements* of an "ontological belief" that has this uncertainty. You can even consider a version of this formalism that is entirely Bayesian (i.e. each refinement has to be maximal), but then you lose the ability to retain an "objective" macroscopic reality in which the agent's point of view is "unspecial", because if the agent's beliefs about this reality have no Knightian uncertainty then it's inconsistent with the agent's free will (you could "avoid" this problem using an EDT or CDT agent but this would be bad for the usual reasons EDT and CDT are bad, and ofc you need Knightian uncertainty anyway because of non-realizability).

An Orthodox Case Against Utility Functions

IIUC, you argue that for an embedded agent to have an explicit utility function, it needs to be a function of the microscopic description of the universe. This is unsatisfactory since the agent shouldn't start out knowing microscopic physics. The alternative you suggest is using the more exotic Jeffrey-Bolker approach. However, this is not how I believe embedded agency should work.

Instead, you should consider a utility function that depends on the universe described *in whatever ontology the utility function is defined* (which we may call "macroscopic"). Microscopic physics comes in when the agent learns a fine-grained model of the dynamics in the macroscopic ontology. In particular, this fine-grained model can involve a fine-grained state space.

The other issue discussed is utility functions of the sort exemplified by the procrastination paradox. I think that besides being uncomputable, this brings in other pathologies. For example, since the utility functions you consider are discontinuous, it is no longer guaranteed an optimal policy exists at all. Personally, I think discontinuous utility functions are strange and poorly motivated.

Two Alternatives to Logical Counterfactuals

Policy-dependent source code, then, corresponds to Omega making different predictions depending on the agent's intended policy, such that when comparing policies, the agent has to imagine Omega predicting differently (as it would imagine learning different source code under policy-dependent source code).

Well, in quasi-Bayesianism for each policy you have to consider the worst-case environment in your belief set, which depends on the policy. I guess that in this sense it is analogous.

Well, HRAD certainly has relations to my own research programme. Embedded agency seems important since human values are probably "embedded" to some extent, counterfactuals are important for translating knowledge from the user's subjective vantage point to the AI's subjective vantage point, reflection is important if it's required for high capability (as Turning RL suggests). I do agree that having a high level plan for solving the problem is important to focus the research in the right directions.