Vanessa Kosoy

AGI safety from first principles: Goals and Agency

By contrast, in this section I’m interested in what it means for an agent to have a goal of its own. Three existing frameworks which attempt to answer this question are Von Neumann and Morgenstern’s expected utility maximisation, Daniel Dennett’s intentional stance, and Hubinger et al’s mesa-optimisation. I don’t think any of them adequately characterises the type of goal-directed behaviour we want to understand, though. While we can prove elegant theoretical results about utility functions, they are such a broad formalism that practically any behaviour can be described as maximising some utility function.

There is my algorithmic-theoretic definition which might be regarded as a formalization of the intentional stance, and which avoids the degeneracy problem you mentioned.

Vanessa Kosoy's Shortform

There is a formal analogy between infra-Bayesian decision theory (IBDT) and modal updateless decision theory (MUDT).

Consider a one-shot decision theory setting. There is a set of unobservable states , a set of actions and a reward function . An IBDT agent has some belief ^{[1]}, and it chooses the action .

We can construct an equivalent scenario, by augmenting this one with a *perfect predictor of the agent* (Omega). To do so, define , where the semantics of is "the unobservable state is and Omega predicts the agent will take action ". We then define by and by ( is what we call the *pullback* of to , i.e we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction.

The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule.

For any , define by . That is, is an infra-Bayesian representation of the belief "Omega will make prediction ". For any , define by . can be interpreted as the belief "assuming Omega is accurate, the expected reward will be at least ".

We will also need to use the order on defined by: when . The reversal is needed to make the analogy to logic intuitive. Indeed, can be interpreted as " implies "^{[2]}, the meet operator can be interpreted as logical conjunction and the join operator can be interpreted as logical disjunction.

Claim:

(Actually I only checked it when we restrict to crisp infradistributions, in which case is intersection of sets and is set containment, but it's probably true in general.)

Now, can be interpreted as "the conjunction of the belief and implies ". Roughly speaking, "according to , if the predicted action is then the expected reward is at least ". So, our decision rule says: choose the action that maximizes the value for which this logical implication holds (but "holds" is better thought of as "is provable", since we're talking about the agent's *belief*). Which is exactly the decision rule of MUDT!

Vanessa Kosoy's Shortform

Master post for ideas about infra-Bayesianism.

What to do with imitation humans, other than asking them what the right thing to do is?

You could try to infer human values from the "sideload" using my "Conjecture 5" about the AIT definition of goal-directed intelligence. However, since it's not an upload and, like you said, it can go off-distribution, that doesn't seem very safe. More generally, alignment protocols should never be open-loop.

I'm also skeptical about IDA, for reasons not specific to your question (in particular, this), but making it open-loop is worse.

Gurkenglas' answer seems to me like something that can work, if we can somehow be sure the sideload doesn't become superintelligent, for example, given an imitation plateau.

Vanessa Kosoy's Shortform

Another thing that might happen is a data bottleneck.

Maybe there will be a good enough dataset to produce a sideload that simulates an "average" person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are a lot less competent AI researchers than people in general).

Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That's because, unless you're training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your "episodes". Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length.

In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And, simulating progress requires by design going off-distribution along certain dimensions which might make things worse.

Vanessa Kosoy's Shortform

The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a *cause* of low computational overhead, whereas a success to capture all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an "intrinsic" AGI might lead to the classical "foom" scenario.

EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.

Vanessa Kosoy's Shortform

I don't see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior "stream of thought". In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of though that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.

Vanessa Kosoy's Shortform

That *is* similar to gaining uploads (borrowing terminology from Egan, we can call them "sideloads"), but it's not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you're distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that's not better than plain speedup.

Vanessa Kosoy's Shortform

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, *imitation learning algorithms ^{[1]} might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have*. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevertheless, some superhuman performance might be produced by sped-up simulation, reducing noise in human behavior and controlling the initial conditions (e.g. simulating a human on a good day). As a result, we will have some period of time during which AGI is already here, automation is in full swing, but there's little or no further escalation. At the end of this period, the missing ingredients will be assembled (maybe with the help of AI researchers) and superhuman AI (possibly a fast takeoff) begins.

It's interesting to try and work out the consequences of such a scenario, and the implications on AI strategy.

Such as GPT-n ↩︎

In the anthropic trilemma, Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.

Consider a population of robots, each of which in a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the "physical" baseline. Indeed, suppose that a particular observation c has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have c much more often in their history than the physical frequency with which c is encountered, due to the resulting "selection bias". We call this setting "anthropic RL" (ARL).

The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the "physical" perspective. For example, we can consider an "anthropic MDP" (AMDP). An AMDP has finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗). The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are

mutually inconsistent.The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausality: using the IB causal-acausal correspondence! The result can be described as follows: Murphy chooses a time moment n∈N and guesses the robot policy π until time n. Then, a simulation of the dynamics of (π,T) is performed until time n, and a single history is sampled from the resulting population. Finally, the observations of the chosen history unfold in reality. If the agent chooses an action different from what is prescribed, Nirvana results. Nirvana also happens after time n (we assume Nirvana reward 1 rather than ∞).

This IB hypothesis is consistent with what the average robot sees at any given moment of time. Therefore, the average robot will learn this hypothesis (assuming learnability). This means that for n≫11−γ≫0, the population of robots at time n has expected average utility with a lower bound close to the optimum for this hypothesis. I think that for an AMDP this should equal the optimum expected average utility you can possibly get, but it would be interesting to verify.

Curiously, the same conclusions should hold if we do a weighted average over the population, with any fixed method of weighting. Therefore, the posterior of the average robot behaves adaptively depending on which sense of "average" you use. So, your epistemology doesn't have to fix a particular method of counting minds. Instead different counting methods are just different "frames of reference" through which to look, and you can be simultaneously rational in all of them.