## AI ALIGNMENT FORUM

Vanessa Kosoy

AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Vanessa Kosoy's Shortform

Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" we should expect the agent to acquire.

Does mathematics have finite information content?

First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of those new facts are essentially random noise, rather than "meaningful" information?

Is there a difference of principle between "noise" and "meaningful content"? It is not obvious, but the answer is "yes": in algorithmic statistics there is the notion of "sophistication" which measures how much "non-random" information is contained in some data. In our setting, the question can be operationalized as follows: is it possible to have an algorithm $A$ plus an infinite sequence of bits $R$, s.t. $R$ is random in some formal sense (e.g. Martin-Lof) and $A$ can decide the output of any finite computation if it's also given access to $R$?

The answer to the question above is "yes"! Indeed, Chaitin's constant is Martin-Lof random. Given access to Chaitin's constant, it is possible to construct a halting oracle, therefore $A$ can decide whether the computation halts, and if it does, run it (and if it doesn't, output N/A or whatever).

[EDIT: Actually, this is not quite right. The way you use Chaitin's constant to emulate a halting oracle produces something that's only guaranteed to halt if you give it the correct Chaitin's constant.]
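To make the construction (and the caveat in the EDIT) concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: `TOY_PROGRAMS` is a hypothetical four-program prefix-free "machine" whose halting times are simply looked up in a table, and `toy_omega` is its analogue of Chaitin's constant. The real constant is uncomputable, and no such table exists for a universal machine. The point is only the dovetailing logic: keep enumerating halting programs until their total weight reaches the given prefix of the constant, after which every program of length at most `n_bits` that has not halted never will. With an incorrect, too large value the enumeration never reaches its target, which is emulated here by a step cap.

```python
from fractions import Fraction

# Hypothetical toy "machine": prefix-free program -> step at which it halts
# (None means it never halts). A real universal machine has no such table.
TOY_PROGRAMS = {"0": 3, "10": None, "110": 50, "111": None}


def toy_omega(programs):
    """Halting probability of the toy machine: sum of 2^-len(p) over halting p."""
    return sum(
        (Fraction(1, 2 ** len(p)) for p, t in programs.items() if t is not None),
        Fraction(0),
    )


def halts(program, omega_prefix, n_bits, programs, max_steps=10_000):
    """Decide halting for a program of length <= n_bits.

    omega_prefix is the value given by the first n_bits binary digits of the
    halting probability. Returns True/False, or None if the dovetailing did not
    finish within max_steps (which is what happens for a too-large omega_prefix).
    """
    assert len(program) <= n_bits
    mass = Fraction(0)   # total weight of programs seen to halt so far
    halted = set()
    for step in range(1, max_steps + 1):
        # "Run every program for one more step" (here: a table lookup).
        for p, t in programs.items():
            if t == step:
                halted.add(p)
                mass += Fraction(1, 2 ** len(p))
        # Once mass >= omega_prefix, the remaining weight is below 2^-n_bits,
        # so no further program of length <= n_bits can ever halt.
        if mass >= omega_prefix:
            return program in halted
    return None
```

For the toy machine the constant is 5/8, so `halts("10", Fraction(5, 8), 3, TOY_PROGRAMS)` correctly reports non-halting, whereas feeding in the wrong value 3/4 makes the procedure run until the cap.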

But, this is a boring solution. In practice we are interested in efficient methods of answering mathematical questions, and in beliefs acquired by resource-bounded agents. Hence, the question becomes: given a resource bound $B$ (e.g. a bound on space or time complexity), is it possible to have an algorithm $A$ and a bit sequence $R$ similar to the above, s.t. $A$ respects the bound $B$ and $R$ is pseudorandom in some formal sense w.r.t. the bound $B$?

[EDIT: I guess that the analogous thing to the unbounded setting would be, $A$ only has to respect $B$ when given the correct $R$. But the real conclusion is probably that we should look for something else instead, e.g. some kind of infradistribution.]

This is a fun question, because any answer would be fascinating in its own way: either computable mathematics has finite content in some strong formal sense (!) or mathematics is infinitely sophisticated in some formal sense (!)

We can also go in the other direction along the "hierarchy of feasibility", although I'm not sure how useful that is. Instead of computable mathematics, let's consider determining the truth (not provability, but actual truth) of sentences in e.g. Peano Arithmetic. Do $A$ and $R$ as above still exist? This would require e.g. a Martin-Lof random sequence which allows making any finite number of Turing jumps.

Vanessa Kosoy's Shortform

Two more remarks.

## User Detection

It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.

Given two agents $G$ and $H$, we can ask which points on $G$'s timeline are in the causal past of which points of $H$'s timeline. To answer this, consider the counterfactual in which $G$ takes a random action (or sequence of actions) at some point (or interval) on $G$'s timeline, and measure the mutual information between this action(s) and $H$'s observations at some interval on $H$'s timeline.

Using this, we can effectively construct a future "causal cone" emanating from the AI's origin, and also a past causal cone emanating from some time $t$ on the AI's timeline. Then, "nearby" agents will meet the intersection of these cones for low values of $t$ whereas "faraway" agents will only meet it for high values of $t$ or not at all. To first approximation, the user would be the "nearest" precursor[1] agent i.e. the one meeting the intersection for the minimal $t$.

More precisely, we expect the user's observations to have nearly maximal mutual information with the AI's actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI's sensors measure every nerve signal emanating from the user's brain? To address this, we can fix $t$ to a value s.t. we expect only the user to meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold.

This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.
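As a rough illustration of the mutual-information test in this section, here is a minimal Python sketch. It is a toy under stated assumptions, not the actual infra-Bayesian counterfactual: each candidate agent's random action is a single uniform bit, the AI's observation is a noisy copy of it, and the helper names (`mutual_information_bits`, `noisy_copy_joint`, `select_user`) and the flip probabilities are all hypothetical. It only shows how "the agent meeting the intersection for the highest mutual information threshold" could be selected once the joint distributions are in hand.

```python
import math


def mutual_information_bits(joint):
    """Mutual information (in bits) of a joint table joint[a][o] = P(action=a, obs=o)."""
    p_action = [sum(row) for row in joint]
    p_obs = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for o, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_action[a] * p_obs[o]))
    return mi


def noisy_copy_joint(flip_prob):
    """Toy causal link: a uniform random action bit is observed with flip noise."""
    keep = 0.5 * (1 - flip_prob)
    flip = 0.5 * flip_prob
    return [[keep, flip], [flip, keep]]


def select_user(candidates):
    """Pick the candidate whose random actions are most informative to the AI.

    candidates maps a (hypothetical) agent name to the joint distribution of
    that agent's random action and the AI's observation.
    """
    return max(candidates, key=lambda name: mutual_information_bits(candidates[name]))
```

For example, a candidate whose action reaches the AI's sensors with 1% noise carries almost one full bit, a bystander at 40% noise carries very little, and an agent with no causal influence (50% noise) carries exactly zero.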

## More on Counterfactuals

In the parent post I suggested "instead of examining only $\Theta$ we also examine coarsenings of $\Theta$ which are not much more complex to describe" (where $\Theta$ is the hypothesis in question). A possible elegant way to implement this:

• Consider the entire portion $\Xi$ of our (simplicity) prior which consists of coarsenings of $\Theta$.
• Apply the counterfactual to $\Xi$.
• Renormalize the result from HUC to HUD.

1. We still need precursor detection, otherwise the AI can create some new agent and make it the nominal "user". ↩︎

Infra-Topology

All credit for this beautiful work goes to Alex.

Vanessa Kosoy's Shortform

## Non-Cartesian Daemons

These are notoriously difficult to deal with. The only methods I know that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.

## Weaknesses

My main concerns with this approach are:

• The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent interactions in particular is required to gain sufficient confidence.

• The feasibility of a good enough classifier. At present, I don't have a concrete plan for attacking this, as it requires inputs from outside of computer science.

• Inherent "incorrigibility": once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won't defer to the users any more than the resulting utility function demands. On the second hand, I think the concept of corrigibility is underspecified so much that I'm not sure it is solved (rather than dissolved) even in the Book. Moreover, the concern can be ameliorated by sufficiently powerful interpretability tools. It is therefore desirable to think more of how to achieve interpretability in this context.

Vanessa Kosoy's Shortform

# Precursor Detection, Classification and Assistance (PreDCA)

Infra-Bayesian physicalism provides us with two key building blocks:

• Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)
• Given a program, we can tell whether it is an agent, and if so, which utility function it has[1] (the "evaluating agent" section of the article).

I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:

• For each hypothesis in the prior, check which agents are precursors of our agent according to this hypothesis.
• Among the precursors, check whether some are definitely neither humans nor animals nor previously created AIs.
• If there are precursors like that, discard the hypothesis (it is probably a malign simulation hypothesis).
• If there are no precursors like that, decide which of them are humans.
• Follow an aggregate of the utility functions of the human precursors (conditional on the given hypothesis).
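The five steps above can be sketched as a small Python pipeline. This is only a schematic under loud assumptions: the `Precursor` and `Hypothesis` records, the string labels, the plain averaging of human utility functions and the prior-weighted mixing across hypotheses are all placeholders chosen for illustration. The detection, classification and utility-reconstruction steps are assumed to have already produced the inputs, and what to do with a hypothesis that has no human precursors is not specified in the outline (here it is simply skipped).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Utility = Callable[[str], float]  # maps an outcome label to a utility value


@dataclass
class Precursor:
    name: str
    label: str        # hypothetical classifier output: "benign", "malign" or "irrelevant"
    utility: Utility  # hypothetical reconstructed utility function


@dataclass
class Hypothesis:
    name: str
    prior_mass: float
    precursors: List[Precursor]  # assumed output of the detection step


def hypothesis_utility(h: Hypothesis) -> Optional[Utility]:
    """Aggregate utility for one hypothesis, or None if the hypothesis is dropped."""
    # Any malign precursor marks a probable malign simulation hypothesis: discard.
    if any(p.label == "malign" for p in h.precursors):
        return None
    humans = [p for p in h.precursors if p.label == "benign"]
    if not humans:
        return None  # assumption: the outline does not cover this case
    # Assumption: plain average as the aggregation rule.
    return lambda outcome: sum(p.utility(outcome) for p in humans) / len(humans)


def overall_utility(hypotheses: List[Hypothesis]) -> Utility:
    """Prior-weighted mixture of per-hypothesis utilities (weights renormalized)."""
    kept = []
    for h in hypotheses:
        u = hypothesis_utility(h)
        if u is not None:
            kept.append((h.prior_mass, u))
    total = sum(mass for mass, _ in kept)
    if total == 0:
        raise ValueError("no hypothesis survived the malign-precursor filter")
    return lambda outcome: sum(mass * u(outcome) for mass, u in kept) / total
```

The design choice worth noting is that filtering happens per hypothesis before any utility is aggregated, so a discarded hypothesis contributes nothing to the final utility function regardless of its prior mass.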

## Detection

How to identify agents which are our agent's precursors? Let our agent be $G$ and let $H$ be another agent which exists in the universe according to hypothesis $\Theta$[2]. Then, $H$ is considered to be a precursor of $G$ in universe $\Theta$ when there is some $H$-policy $\sigma$ s.t. applying the counterfactual "$H$ follows $\sigma$" to $\Theta$ (in the usual infra-Bayesian sense) causes $G$ not to exist (i.e. its source code doesn't run).

A possible complication is, what if $\Theta$ implies that $H$ creates $G$ / doesn't interfere with the creation of $G$? In this case $H$ might conceptually be a precursor, but the definition would not detect it. It is possible that any such $\Theta$ would have a sufficiently large description complexity penalty that it doesn't matter. On the second hand, if $\Theta$ is unconditionally Knightian uncertain about $H$ creating $G$ then the utility will be upper bounded by the scenario in which $G$ doesn't exist, which is liable to make $\Theta$ an effectively falsified hypothesis. On the third hand, it seems plausible that the creation of $G$ by $H$ would be contingent on $G$'s behavior (Newcomb-style, which we know how it works in infra-Bayesianism), in which case $\Theta$ is not falsified and the detection works. In any case, there is a possible variant of the definition to avoid the problem: instead of examining only $\Theta$ we also examine coarsenings of $\Theta$ which are not much more complex to describe (in the hope that some such coarsening would leave the creation of $G$ uncertain).

Notice that any agent whose existence is contingent on $G$'s policy cannot be detected as a precursor: the corresponding program doesn't even "run", because we don't apply a $G$-policy-counterfactual to the bridge transform.

## Classification

How to decide which precursors are which? One tool we have is the $g$ parameter and the computational resource parameters in the definition of intelligence. In addition we might be able to create a very rough neuroscience-based model of humans. Also, we will hopefully have a lot of information about other AIs that can be relevant. Using these, it might be possible to create a rough benign/malign/irrelevant classifier, s.t.

• Humans are classified as "benign".
• Most (by probability mass) malign simulation hypotheses contain at least one precursor classified as "malign".
• Non-human agents that exist in the causal past of our AI in the null (non-simulation) hypothesis are classified as "irrelevant".

## Assistance

Once we have detected and classified the precursors in each hypothesis, we discard all hypotheses that contain malign precursors. In the remaining hypotheses, we perform some kind of aggregation on the utility functions of the benign precursors (for example, this). The utility functions from different hypotheses are somehow normalized to form the overall utility function. Alternatively, we do a maximal lottery vote for the policy, where each hypothesis is a voter with weight proportional to its prior probability mass.

## Inner Alignment

Why can this solve inner alignment? In any model-based approach, the AI doesn't train the policy directly. Instead, it trains models and uses them to compute the policy. I suspect that the second step cannot create mesa-optimizers, since it only involves control and not learning[3]. Hence, any mesa-optimizer has to originate from the first step, i.e. from the model/hypothesis. And, any plausible physicalist hypothesis which contains a mesa-optimizer has to look like a malign simulation hypothesis.

## Outer Alignment

Why can this solve outer alignment? Presumably, we are aggregating human utility functions. This doesn't assume humans are perfect agents: the intelligence parameter $g$ can be less than infinity. I suspect that when $g < \infty$ the utility function becomes somewhat ambiguous, but the ambiguity can probably be resolved arbitrarily or maybe via a risk-averse method. What if the AI modifies the humans? Then only pre-modification humans are detected as precursors, and there's no problem.

Moreover, the entire method can be combined with the Hippocratic principle to avoid catastrophic mistakes out of ignorance (i.e. to go from intent alignment to impact alignment).

1. We do need a lot more research to fully specify this "utility reconstruction" and check that it satisfies reasonable desiderata. But, the existence of a natural utility-function-dependent measure of intelligence suggests it is possible. ↩︎

2. I'm ignoring details like "what if only exists with certain probability". The more careful analysis is left for later. ↩︎

3. In modern deep RL systems, there might not be a clear line between learning and control. For example, if we use model-free RL to produce the policy for a given hypothesis, then there is learning happening there as well. In such an architecture, the value function or Q-function should be regarded as part of the hypothesis for our purpose. ↩︎

Vanessa Kosoy's Shortform

Infradistributions admit an information-theoretic quantity that doesn't exist in classical theory. Namely, it's a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:

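Let $X$ be a finite set and $\Theta$ a crisp infradistribution (credal set) on $X$, i.e. a closed convex subset of $\Delta X$. Then, imagine someone trying to communicate a message by choosing a distribution out of $\Theta$. Formally, let $Y$ be any other finite set (space of messages), $\theta \in \Delta Y$ (prior over messages) and $K: Y \to \Theta$ (communication protocol). Consider the distribution $\eta := \theta \ltimes K \in \Delta(Y \times X)$. Then, the information capacity of the protocol is the mutual information between the projection on $Y$ and the projection on $X$ according to $\eta$, i.e. $I_\eta(\mathrm{pr}_Y ; \mathrm{pr}_X)$. The "Knightian entropy" of $\Theta$ is now defined to be the maximum of $I_\eta(\mathrm{pr}_Y ; \mathrm{pr}_X)$ over all choices of $Y$, $\theta$, $K$. For example, if $\Theta$ is Bayesian (a single distribution) then it's $0$, whereas if $\Theta = \Delta X$ (total Knightian uncertainty), it is $\log |X|$.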
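A small Python sketch can check the two boundary examples numerically. It is a sketch under assumptions: the function name `protocol_information`, the list-of-rows encoding of the protocol and the use of nats are my own choices, and the function only evaluates the mutual information of one given protocol, so in general it yields a lower bound on the Knightian entropy rather than the maximum itself. In the two examples the bound happens to be tight: a single-distribution credal set forces every message to the same row, and on the full simplex the identity protocol attains the upper bound $\log |X|$.

```python
import math


def protocol_information(prior, protocol):
    """Mutual information I(Y; X) in nats for one communication protocol.

    prior[y] is the probability of message y; protocol[y] is the distribution
    over X (a list of probabilities) chosen from the credal set for message y.
    """
    n_x = len(protocol[0])
    # Marginal distribution of X under the joint distribution of (message, X).
    marginal_x = [
        sum(prior[y] * protocol[y][x] for y in range(len(prior)))
        for x in range(n_x)
    ]
    info = 0.0
    for y, p_y in enumerate(prior):
        for x, p_x_given_y in enumerate(protocol[y]):
            joint = p_y * p_x_given_y
            if joint > 0:
                info += joint * math.log(p_x_given_y / marginal_x[x])
    return info


# Bayesian case: the credal set is one distribution, so every message maps to it.
single = [0.2, 0.3, 0.5]
bayesian_info = protocol_information([0.5, 0.5], [single, single])

# Full simplex on |X| = 4: message y deterministically produces outcome y.
identity = [[1.0 if x == y else 0.0 for x in range(4)] for y in range(4)]
simplex_info = protocol_information([0.25] * 4, identity)
```

Here `bayesian_info` comes out as zero and `simplex_info` as $\log 4$ nats, i.e. two bits.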

Here is one application[1] of this concept, orthogonal to infra-Bayesianism itself. Suppose we model inner alignment by assuming that some portion of the prior consists of malign hypotheses. And we want to design e.g. a prediction algorithm that will converge to good predictions without allowing the malign hypotheses to attack, using methods like confidence thresholds. Then we can analyze the following metric for how unsafe the algorithm is.

Let $O$ be the set of observations and $A$ the set of actions (which might be "just" predictions) of our AI, and for any environment $\tau$ and prior $\xi$, let $D^\xi_\tau(n) \in \Delta (A \times O)^n$ be the distribution over histories resulting from our algorithm starting with prior $\xi$ and interacting with environment $\tau$ for $n$ time steps. We have $\xi = \beta\mu + (1-\beta)\nu$, where $\mu$ is the malign part of the prior and $\nu$ the benign part. For any $\mu'$, consider $D^{\beta\mu' + (1-\beta)\nu}_\tau(n)$. The closure of the convex hull of these distributions for all choices of $\mu'$ ("attacker policy") is some $\Theta^{\beta\nu}_\tau(n) \in \square (A \times O)^n$. The maximal Knightian entropy of $\Theta^{\beta\nu}_\tau(n)$ over all admissible $\tau$ and $\nu$ is called the malign capacity of the algorithm. Essentially, this is a bound on how much information the malign hypotheses can transmit into the world via the AI during a period of $n$ time steps. The goal then becomes finding algorithms with simultaneously good regret bounds and good (in particular, at most polylogarithmic in $n$) malign capacity bounds.

1. This is an idea I'm collaborating on with Johannes Treutlein. ↩︎

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Multiple branches can only exist transiently during the weird experiment (i.e. neither before nor after). Naturally, if the agent knows in advance the experiment is going to happen, then it anticipates those branches to appear.

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

The wavefunction has other branches, because it's the same mathematical object governed by the same equations. Only, the wavefunction doesn't exist physically, it's just an intermediate variable in the computation. The things that exist (corresponding to the $\Phi$ variable in the formalism) and the things that are experienced (corresponding to some function of the $2^\Gamma$ variable in the formalism) only have one branch.

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Btw, there is some amount of philosophical convergence between this and some recent work I did on critical agential physics;

Thanks, I'll look at that!

It seems like "infra-Bayesianism" may be broadly compatible with frequentism;

Yes! In frequentism, we define probability distributions as limits of frequencies. One problem with this is, what to do if there's no convergence? In the real world, there won't be convergence unless you have an infinite sequence of truly identical experiments, which you never have. At best, you have a long sequence of similar experiments. Arguably, infrabayesianism solves it by replacing the limit with the convex hull of all limit points.

But, I view infrabayesianism more as a synthesis between bayesianism and frequentism. Like in frequentism, you can get asymptotic guarantees. But, like in bayesianism, it makes sense to talk of priors (and even updates), and measure the performance of your policy regardless of the particular decomposition of the prior into hypotheses (as opposed to regret which does depend on the decomposition). In particular, you can define the optimal infrabayesian policy even for a prior which is not learnable and hence doesn't admit frequentism-style guarantees.
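The "convex hull of all limit points" idea can be illustrated with a toy Python example. This is only a finite-horizon sketch under assumptions of my own: the bit sequence (alternating blocks of ones and zeros with doubling lengths) and the helper names are hypothetical, and a finite tail can only approximate the true set of limit points. For a binary outcome the frequency is a single number, so the convex hull of its limit points is just the interval between the liminf and the limsup of the running frequency.

```python
def block_sequence(num_blocks):
    """Blocks of length 1, 2, 4, ... alternating between all-ones and all-zeros."""
    bits = []
    for k in range(num_blocks):
        bits.extend([1 - k % 2] * (2 ** k))
    return bits


def running_frequencies(bits):
    """Fraction of ones among the first i outcomes, for every i."""
    freqs = []
    ones = 0
    for i, bit in enumerate(bits, start=1):
        ones += bit
        freqs.append(ones / i)
    return freqs


def tail_interval(freqs, tail_start):
    """Range of the running frequency on a tail: a stand-in for [liminf, limsup]."""
    tail = freqs[tail_start:]
    return min(tail), max(tail)


freqs = running_frequencies(block_sequence(16))
low, high = tail_interval(freqs, tail_start=2 ** 12 - 1)
```

The running frequency never settles: it keeps swinging between roughly 1/3 and 2/3, so no single probability describes the sequence, while the interval `[low, high]` does capture its long-run behavior.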

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

By "weird experiment" I mean things like, reversing decoherence. That is, something designed to cause interference between branches of the wavefunction with minds that remember different experiences[1]. Which obviously requires levels of technology we are nowhere near to reaching[2]. As long as decoherence happens as usual, there is only one copy.

1. Ofc it requires erasing their contradicting memories among other things. ↩︎

2. There is a possible "shortcut" though, namely simulating minds on quantum computers. Naturally, in this case only the quantum-uploaded-minds can have multiple copies. ↩︎