Vanessa Kosoy

Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I'm also supported by the LTFF. See also LinkedIn.

E-mail: {first name}@alter.org.il

Two thoughts about the role of quining in IBP:

- Quines are non-unique (there can be multiple fixed points). This means that, viewed as a *prescriptive* theory, IBP produces multi-valued prescriptions. It might be the case that this multi-valuedness can resolve problems with UDT such as Wei Dai's 3-player Prisoner's Dilemma and the anti-Newcomb problem^{[1]}. In these cases, a particular UDT/IBP agent (corresponding to a particular quine) loses to CDT. But a different UDT/IBP agent (corresponding to a different quine) might do as well as CDT.
- What to do about agents that don't know their own source code? (Arguably, humans are such agents.) Upon reflection, this is not really an issue. If we use IBP *prescriptively*, then we can always assume quining: IBP is just telling you to follow a procedure that uses quining to access its own (i.e. the procedure's) source code. Effectively, you are instantiating an IBP agent inside yourself with your own prior and utility function. On the other hand, if we use IBP *descriptively*, then we don't need quining: any agent can be assigned "physicalist intelligence" (Definition 1.6 in the original post, which can also be extended to not require a known utility function and prior, along the lines of ADAM) as long as *the procedure doing the assigning* knows the agent's source code. The agent doesn't need to know its own source code in any sense.

^{^}@Squark is my own old LessWrong account.
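As a toy illustration of quining (a hypothetical example, not part of the original argument), here is a minimal Python program whose output is its own source code; a procedure that "uses quining to access its own source code" can be built along the same lines:

```python
# The two lines below form a quine: running them prints exactly those two lines.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Note that many distinct programs have this fixed-point property (e.g. padding `s` with a no-op yields another quine), mirroring the non-uniqueness of quines discussed above.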

*I believe that all or most of the claims here are true, but I haven't written all the proofs in detail, so take it with a grain of salt.*

An ambidistribution is a mathematical object that simultaneously generalizes infradistributions and ultradistributions. It is useful for representing how much power an agent has over a particular system: which degrees of freedom it can control, which obey a known probability distribution, and which are completely unpredictable.

**Definition 1:** Let $X$ be a compact Polish space. A (crisp) *ambidistribution on* $X$ is a function $Q:C(X)\to\mathbb{R}$ s.t.

- (Monotonicity) For any $f,g\in C(X)$, if $f\leq g$ then $Q(f)\leq Q(g)$.
- (Homogeneity) For any $f\in C(X)$ and $\lambda\geq 0$, $Q(\lambda f)=\lambda Q(f)$.
- (Constant-additivity) For any $f\in C(X)$ and $c\in\mathbb{R}$, $Q(f+c)=Q(f)+c$.

Conditions 1+3 imply that $Q$ is 1-Lipschitz. We could introduce non-crisp ambidistributions by dropping conditions 2 and/or 3 (and e.g. requiring 1-Lipschitzness instead), but we will stick to crisp ambidistributions in this post.
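For intuition, here is a small Python sketch (a hypothetical finite-space encoding, not from the post) that spot-checks the three conditions for two familiar special cases: the expectation functional of an ordinary distribution, and the worst-case functional of a credal set (a crisp infradistribution):

```python
import random

# Represent a "continuous function on X" as a list of reals indexed by the
# points of a finite X, and an ambidistribution as a functional on such lists.
X = range(3)

def expectation(p):
    """Ambidistribution given by an ordinary probability distribution p."""
    return lambda f: sum(p[x] * f[x] for x in X)

def inf_over(credal):
    """Crisp infradistribution: worst case over a finite credal set."""
    return lambda f: min(sum(p[x] * f[x] for x in X) for p in credal)

def check_ambi(Q, trials=200):
    """Spot-check monotonicity, homogeneity and constant-additivity."""
    rng = random.Random(0)
    for _ in range(trials):
        f = [rng.uniform(-1, 1) for _ in X]
        g = [f[x] + rng.uniform(0, 1) for x in X]  # g >= f pointwise
        lam, c = rng.uniform(0, 2), rng.uniform(-1, 1)
        assert Q(f) <= Q(g) + 1e-9                               # monotonicity
        assert abs(Q([lam * v for v in f]) - lam * Q(f)) < 1e-9  # homogeneity
        assert abs(Q([v + c for v in f]) - (Q(f) + c)) < 1e-9    # constant-additivity
    return True

credal = [(0.5, 0.5, 0.0), (0.2, 0.3, 0.5)]
print(check_ambi(expectation((0.1, 0.6, 0.3))))
print(check_ambi(inf_over(credal)))
```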

The space of all ambidistributions on will be denoted .^{[1]} Obviously, (where stands for (crisp) infradistributions), and likewise for ultradistributions.

**Example 1:** Consider compact Polish spaces and a continuous mapping . We can then define by

That is, is the value of the zero-sum two-player game with strategy spaces and and utility function .
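As a hypothetical finite illustration of Example 1, with the composite payoff already folded into a table, the pure-strategy max-min order (an assumption; the elided formula fixes the actual order of play) gives:

```python
# payoff[x][y]: the maximizer picks the row x, the minimizer the column y.
# The resulting functional is the pure maximin value of the zero-sum game.
def game_value(payoff):
    return max(min(row) for row in payoff)

payoff = [
    [3.0, 1.0],
    [2.0, 2.0],
]
print(game_value(payoff))  # -> 2.0: the maximizer's safe row is the second one
```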

Notice that in Example 1 can be regarded as a Cartesian frame: this seems like a natural connection to explore further.

**Example 2:** Let and be finite sets representing actions and observations respectively, and be an infra-Bayesian law. Then, we can define by

In fact, this is a faithful representation: can be recovered from .

**Example 3:** Consider an infra-MDP with finite state set , initial state and transition infrakernel . We can then define the "ambikernel" by

Thus, every infra-MDP induces an "ambichain". Moreover:

**Claim 1:** is a monad. In particular, ambikernels can be composed.

This allows us to define

This object is the infra-Bayesian analogue of the convex polytope of accessible state occupancy measures in an MDP.
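For intuition about Claim 1, here is a hypothetical finite sketch of composing two ambikernels; since the composition formula is elided above, we assume the standard Kleisli form (K1;K2)(s)(f) = K1(s)(s' ↦ K2(s')(f)):

```python
# Two toy "ambikernels" on a two-state space: each maps a state to an
# ambidistribution over next states. State 0 transitions stochastically,
# state 1 adversarially (worst or best case).
S = range(2)

def E(p, f):
    return sum(p[s] * f[s] for s in S)

K1 = {0: lambda f: E((0.5, 0.5), f), 1: lambda f: min(f)}
K2 = {0: lambda f: E((0.9, 0.1), f), 1: lambda f: max(f)}

def compose(K1, K2):
    # ASSUMED Kleisli composition: evaluate f through K2 at every state,
    # then feed the resulting function on states to K1.
    return {s: (lambda f, s=s: K1[s]([K2[t](f) for t in S])) for s in S}

K = compose(K1, K2)
f = [0.0, 1.0]
print(K[0](f))  # 0.5*(0.9*0 + 0.1*1) + 0.5*max(0, 1) = 0.55
```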

**Claim 2:** The following limit always exists:

**Definition 3:** Let be a convex space and . We say that *occludes* when for any , we have

Here, stands for convex hull.

We denote this relation . The reason we call this "occlusion" is apparent for the case.

Here are some properties of occlusion:

- For any , .
- More generally, if then .
- If and then .
- If and then .
- If and for all , then .
- If for all , and also , then .

Notice that occlusion has similar algebraic properties to logical entailment, if we think of as " is a weaker proposition than ".

**Definition 4:** Let be a compact Polish space. A *cramble set*^{[2]} over is s.t.

- is non-empty.
- is topologically closed.
- For any finite and , if then . (Here, we interpret elements of as credal sets.)

**Question:** If instead of condition 3 we only require binary occlusion, do we get the same concept?

Given a cramble set , its *Legendre-Fenchel dual* ambidistribution is

**Claim 3:** Legendre-Fenchel duality is a bijection between cramble sets and ambidistributions.

The space is equipped with the obvious partial order: when for all . This makes into a distributive lattice, with

This is in contrast to which is a non-distributive lattice.

The bottom and top elements are given by

Ambidistributions are closed under pointwise suprema and infima, and hence is complete and satisfies both infinite distributive laws, making it a complete Heyting and co-Heyting algebra.

is also a De Morgan algebra with the involution

For , is not a Boolean algebra: and for any we have .
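The involution formula is elided above; a hypothetical finite sketch, assuming it is the standard choice Q*(f) = -Q(-f), illustrates how it exchanges the worst-case (infra) and best-case (ultra) functionals over the same credal set:

```python
import random

# Finite encoding: functions on X as lists, ambidistributions as functionals.
X = range(3)
credal = [(0.5, 0.5, 0.0), (0.2, 0.3, 0.5)]

def E(p, f):
    return sum(p[x] * f[x] for x in X)

infra = lambda f: min(E(p, f) for p in credal)  # worst case over the credal set
ultra = lambda f: max(E(p, f) for p in credal)  # best case over the same set

def star(Q):
    # ASSUMED De Morgan involution: Q*(f) = -Q(-f).
    return lambda f: -Q([-v for v in f])

rng = random.Random(1)
for _ in range(100):
    f = [rng.uniform(-1, 1) for _ in X]
    # the involution exchanges infra- and ultradistributions...
    assert abs(star(infra)(f) - ultra(f)) < 1e-9
    # ...and is indeed an involution
    assert abs(star(star(infra))(f) - infra(f)) < 1e-9
print("ok")
```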

One application of this partial order is formalizing the "no traps" condition for infra-MDPs:

**Definition 2:** A finite infra-MDP is *quasicommunicating* when for any

**Claim 4:** The set of quasicommunicating finite infra-MDPs (or even infra-RDPs) is learnable.

Going to the cramble set representation, iff .

is just , whereas is the "occlusion hull" of and .

The bottom and the top cramble sets are

Here, is the top element of (corresponding to the credal set ).

The De Morgan involution is

**Definition 5:** Given compact Polish spaces and a continuous mapping , we define the *pushforward* by

When is surjective, there are both a left adjoint and a right adjoint to , yielding two pullback operators :

Given and we can define the semidirect product by

There are probably more natural products, but I'll stop here for now.

**Definition 6:** The *polytopic* ambidistributions are the (incomplete) sublattice of generated by .

Some conjectures about this:

- For finite , an ambidistribution is polytopic iff there is a finite polytope complex on s.t. for any cell of , is affine.
- For finite , a cramble set is polytopic iff it is the occlusion hull of a finite set of polytopes in .
- and from Example 3 are polytopic.

^{^}The non-convex shape reminds us that ambidistributions need not be convex or concave.

^{^}The expression "cramble set" is meant to suggest a combination of "credal set" with "ambi".

Here's the sketch of an AIT toy model theorem which says that **in complex environments without traps, applying selection pressure reliably produces *learning* agents**. I view it as an example of Wentworth's "selection theorem" concept.

Consider any environment of infinite Kolmogorov complexity (i.e. uncomputable). Fix a computable reward function

Suppose that there exists a policy of finite Kolmogorov complexity (i.e. computable) that's optimal for in the slow discount limit. That is,

Then, cannot be the only environment with this property. Otherwise, this property could be used to define using a finite number of bits, which is impossible^{[1]}. Since requires infinitely many more bits to specify than and , there have to be infinitely many environments with the same property^{[2]}. Therefore, is a reinforcement learning algorithm for some infinite class of hypotheses.

Moreover, there are natural examples of as above. For instance, let's construct as an infinite sequence of finite communicating infra-RDP refinements that converges to an unambiguous (i.e. "not infra") environment. Since each refinement involves some arbitrary choice, "most" such have infinite Kolmogorov complexity. In this case, exists: it can be any learning algorithm for finite communicating infra-RDPs with an arbitrary number of states.

Besides making this a rigorous theorem, there are many additional questions for further investigation:

- Can we make similar claims that incorporate computational complexity bounds? It seems that it should be possible to at least constrain our algorithms to be PSPACE in some sense, but it's not obvious how to go beyond that (maybe it would require the frugal universal prior).
- Can we argue that must be an *infra-Bayesian* learning algorithm? Relatedly, can we make a variant where computable/space-bounded policies can only attain some *part* of the optimal asymptotic reward of ?
- The setting we described requires that all the traps in can be described in a finite number of bits. If this is not the case, can we make a similar sort of argument that implies is *Bayes-optimal* for some prior over a large hypothesis class?

^{^}Probably, making this argument rigorous requires replacing the limit with a particular regret bound. I ignore this for the sake of simplifying the core idea.

^{^}There probably is something more precise that can be said about how "large" this family of environments is. For example, maybe it must be uncountable.

Can you explain what your definition of "accuracy" is? (the 87.7% figure)

Does it correspond to some proper scoring rule?

I can see that research into proof assistants might lead to better techniques for combining foundation models with RL. Is there anything more specific that you imagine? Outside of math, the problems are very different because there is no easy way to synthetically generate a lot of labeled data (as opposed to formally verifiable proofs).

While some AI techniques developed for proof assistants might be transferable to other problems, I can easily imagine a responsible actor^{[1]} producing a net positive. Don't disclose your techniques (except maybe very judiciously), don't open your source, maintain information security, maybe only provide access as a service, maybe only provide access to select people/organizations.

^{^}To be clear, I don't consider Alphabet to be a responsible actor.

The recent success of AlphaProof updates me in the direction of "working on AI proof assistants is a good way to reduce AI risk". If these assistants become good enough, they will supercharge agent foundations research^{[1]} and might make the difference between success and failure. It's especially appealing that it leverages AI capability advancement for the purpose of AI alignment in a relatively^{[2]} safe way, so that the deeper we go into the danger zone, the greater the positive impact^{[3]}.

EDIT: To be clear, I'm *not* saying that working on proof assistants in e.g. DeepMind is net positive. I'm saying that a hypothetical safety-conscious project aiming to create proof assistants for agent foundations research, that neither leaks dangerous knowledge nor repurposes it for other goals, would be net positive.

^{^}Of course, agent foundations research doesn't *reduce* to solving formally stated mathematical problems. A lot of it is searching for the right formalizations. However, obtaining proofs is a critical arc in the loop.

^{^}There are some ways for proof assistants to feed back into capability research, but these effects seem weaker: at present, capability advancement is not primarily driven by discovering theorems, and if this situation changes, it would mean we now actually know something about what we're doing, which would be great news in itself.

^{^}Until we become saturated on proof search and the bottlenecks are entirely elsewhere.

Here is a modification of the IBP framework which removes the monotonicity principle, and seems to be more natural in other ways as well.

First, let our notion of "hypothesis" be . The previous framework can be interpreted in terms of hypotheses of this form satisfying the condition

(See Proposition 2.8 in the original article.) In the new framework, we replace it by the weaker condition

This can be roughly interpreted as requiring that (i) whenever the output of a program P determines whether some other program Q will run, program P has to run as well; (ii) whenever programs P and Q are logically equivalent, program P runs iff program Q runs.

The new condition seems to be well-justified, and is also invariant under (i) mixing hypotheses (ii) taking joins/meets of hypotheses. The latter was not the case for the old condition. Moreover, it doesn't imply that is downward closed, and hence there is no longer a monotonicity principle^{[1]}.

The next question is, how do we construct hypotheses satisfying this condition? In the old framework, we could construct hypotheses of the form and then apply the bridge transform. In particular, this allows a relatively straightforward translation of physics theories into IBP language (for example our treatment of quantum theory). Luckily, there is an analogous construction in the new framework as well.

First notice that our new condition on can be reformulated as requiring that

- For any define by . Then, we require .

For any , we also define by

Now, for any , we define the "conservative bridge transform^{[2]}" as the closure of all where is a *maximal* element of . It is then possible to see that is a valid hypothesis if and only if it is of the form for some and .

^{^}I still think the monotonicity principle is saying something about the *learning theory* of IBP which is still true in the new framework. Namely, it is possible to learn that a program is running but not possible to (confidently) learn that a program is *not* running, and this limits the sort of frequentist guarantees we can expect.

^{^}Intuitively, it can be interpreted as a version of the bridge transform where we postulate that a program doesn't run unless contains a reason why it must run.

Sorry, that footnote is just flat wrong, the order actually *doesn't* matter here. Good catch!

There is a related thing which might work, namely taking the downwards closure of the affine subspace w.r.t. some cone which is somewhat *larger* than the cone of measures. For example, if your underlying space has a metric, you might consider the cone of signed measures which have non-negative integral against all positive functions whose logarithm is 1-Lipschitz.

I feel that this post would benefit from having the math spelled out. How is inserting a trader a way to do feedback? Can you phrase classical RL like this?