Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda. Based in Israel. See also LinkedIn.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Wiki Contributions

Comments

Sorted by

Here's the sketch of an AIT toy model theorem that in complex environments without traps, applying selection pressure reliably produces learning agents. I view it as an example of Wentworth's "selection theorem" concept.

Consider any environment  of infinite Kolmogorov complexity (i.e. uncomputable). Fix a computable reward function

Suppose that there exists a policy  of finite Kolmogorov complexity (i.e. computable) that's optimal for  in the slow discount limit. That is,

Then,  cannot be the only environment with this property. Otherwise, this property could be used to define  using a finite number of bits, which is impossible[1]. Since  requires infinitely many more bits to specify than  and , there has to be infinitely many environments with the same property[2]. Therefore,  is a reinforcement learning algorithm for some infinite class of hypothesis.

Moreover, there are natural examples of  as above. For instance, let's construct  as an infinite sequence of finite communicating infra-RDP refinements that converges to an unambiguous (i.e. "not infra") environment. Since each refinement involves some arbitrary choice, "most" such  have infinite Kolmogorov complexity. In this case,  exists: it can be any learning algorithm for finite communicating infra-RDP with arbitrary number of states.

Besides making this a rigorous theorem, there are many additional questions for further investigation:

  • Can we make similar claims that incorporate computational complexity bounds? It seems that it should be possible to at least constraint our algorithms to be PSPACE in some sense, but not obvious how to go beyond that (maybe it would require the frugal universal prior).
  • Can we argue that  must be an infra-Bayesian learning algorithm? Relatedly, can we make a variant where computable/space-bounded policies can only attain some part of the optimal asymptotic reward of ?
  • The setting we described requires that all the traps in  can be described in a finite number of bits. If this is not the case, can we make a similar sort of an argument that implies  is Bayes-optimal for some prior over a large hypothesis class?
  1. ^

    Probably, making this argument rigorous requires replacing the limit with a particular regret bound. I ignore this for the sake of simplifying the core idea.

  2. ^

    There probably is something more precise that can be said about how "large" this family of environment is. For example, maybe it must be uncountable.

Can you explain what's your definition of "accuracy"? (the 87.7% figure)
Does it correspond to some proper scoring rule?

I can see that research into proof assistants might lead to better techniques for combining foundation models with RL. Is there anything more specific that you imagine? Outside of math there are very different problems because there is no easy to way to synthetically generate a lot of labeled data (as opposed to formally verifiable proofs).

While some AI techniques developed for proof assistants might be transferable to other problems, I can easily imagine a responsible actor[1] producing a net positive. Don't disclose your techniques (except maybe very judiciously), don't open your source, maintain information security, maybe only provide access as a service, maybe only provide access to select people/organizations.

  1. ^

    To be clear, I don't consider Alphabet to be a responsible actor.

The recent success of AlphaProof updates me in the direction of "working on AI proof assistants is a good way to reduce AI risk". If these assistants become good enough, it will supercharge agent foundations research[1] and might make the difference between success and failure. It's especially appealing that it leverages AI capability advancement for the purpose of AI alignment in a relatively[2] safe way, thereby the deeper we go into the danger zone the greater the positive impact[3].

EDIT: To be clear, I'm not saying that working on proof assistants in e.g. DeepMind is net positive. I'm saying that a hypothetical safety-conscious project aiming to create proof assistants for agent foundations research, that neither leaks dangerous knowledge nor repurposes it for other goals, would be net positive.

  1. ^

    Of course, agent foundation research doesn't reduce to solving formally stated mathematical problems. A lot of it is searching for the right formalizations. However, obtaining proofs is a critical arc in the loop.

  2. ^

    There are some ways for proof assistants to feed back into capability research, but these effects seem weaker: at present capability advancement is not primarily driven by discovering theorems, and if this situation changes it would mean we now actually know something about what we're doing, which would be great news in itself.

  3. ^

    Until we become saturated on proof search and the bottlenecks are entirely elsewhere.

Here is a modification of the IBP framework which removes the monotonicity principle, and seems to be more natural in other ways as well.

First, let our notion of "hypothesis" be . The previous framework can be interpreted in terms of hypotheses of this form satisfying the condition

(See Proposition 2.8 in the original article.) In the new framework, we replace it by the weaker condition

This can be roughly interpreted as requiring that (i) whenever the output of a program P determines whether some other program Q will run, program P has to run as well (ii) whenever programs P and Q are logically equivalent, program P runs iff program Q runs.

The new condition seems to be well-justified, and is also invariant under (i) mixing hypotheses (ii) taking joins/meets of hypotheses. The latter was not the case for the old condition. Moreover, it doesn't imply that  is downward closed, and hence there is no longer a monotonicity principle[1].

The next question is, how do we construct hypotheses satisfying this condition? In the old framework, we could construct hypotheses of the form  and then apply the bridge transform. In particular, this allows a relatively straightforward translation of physics theories into IBP language (for example our treatment of quantum theory). Luckily, there is an analogous construction in the new framework as well.

First notice that our new condition on  can be reformulated as requiring that

  • For any  define  by . Then, we require .

For any , we also define  by

Now, for any , we define the "conservative bridge transform[2] as the closure of all  where  is a maximal element of  It is then possible to see that  is a valid hypothesis if and only if it is of the form  for some  and .

  1. ^

    I still think the monotonicity principle is saying something about the learning theory of IBP which is still true in the new framework. Namely, it is possible to learn that a program is running but not possible to (confidently) learn that a program is not running, and this limits the sort of frequentist guarantees we can expect.

  2. ^

    Intuitively, it can be interpreted as a version of the bridge transform where we postulate that a program doesn't run unless  contains a reason while it must run.

Sorry, that footnote is just flat wrong, the order actually doesn't matter here. Good catch!

There is a related thing which might work, namely taking the downwards closure of the affine subspace w.r.t. some cone which is somewhat larger than the cone of measures. For example, if your underlying space has a metric, you might consider the cone of signed measures which have non-negative integral with all positive functions whose logarithm is 1-Lipschitz.

Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to "plain" IBRL regret bounds when we consider the core and the envelope as the "inside" of the agent.

Assume that the action and observation sets factor as  and , where  is the interface with the external environment and  is the interface with the envelope.

Let  be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:

  • Marginalizing over . That is, let  and  be the projections. Then, we have the law .
  • Assuming "logical omniscience". That is, let  be the ground truth. Then, we have the law . Here, we use the conditional defined by . It's easy to see this indeed defines a law.

However, requiring low regret w.r.t. neither of these is equivalent to low regret w.r.t :

  • Learning  is typically no less feasible than learning , however it is a much weaker condition. This is because the metacognitive agents can use policies that query the envelope to get higher guaranteed expected utility.
  • Learning  is a much stronger condition than learning , however it is typically infeasible. Requiring it leads to AIXI-like agents.

Therefore, metacognitive regret bounds hit a "sweep spot" of stength vs. feasibility which produces a genuinely more powerful agents than IBRL[1].

  1. ^

    More precisely, more powerful than IBRL with the usual sort of hypothesis classes (e.g. nicely structured crisp infra-RDP). In principle, we can reduce metacognitive regret bounds to IBRL regret bounds using non-crsip laws, since there's a very general theorem for representing desiderata as laws. But, these laws would have a very peculiar form that seems impossible to guess without starting with metacognitive agents.

Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.

In the following, all infradistributions are crisp.

Fix finite action set  and finite observation set .  For any  and , let

be defined by

In other words, this kernel samples a time step  out of the geometric distribution with parameter , and then produces the sequence of length  that appears in the destiny starting at .

For any continuous[1] function , we get a decision rule. Namely, this rule says that, given infra-Bayesian law  and discount parameter , the optimal policy is

The usual maximin is recovered when we have some reward function  and corresponding to it is

Given a set  of laws, it is said to be learnable w.r.t.  when there is a family of policies  such that for any 

For  we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any  we have the learnable decision rule

This is the "mesomism" I taked about before

Also, any monotonically increasing  seems to be learnable, i.e. any  s.t. for  we have . For such decision rules, you can essentially assume that "nature" (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.

On the other hand, decision rules of the form  are not learnable in general, and so are decision rules of the form  for  monotonically increasing.

Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing?

A positive answer to the above would provide interesting generalizations of infra-Bayesianism. A negative answer to the above would provide an interesting novel justification of the maximin. Indeed, learnability is not a criterion that was ever used in axiomatic constructions of decision theory[3], AFAIK.

  1. ^

    We can try considering discontinuous functions as well, but it seems natural to start with continuous. If we want the optimal policy to exist, we usually need  to be at least upper semicontinuous.

  2. ^

    There are weaker conditions than "communicating" that are sufficient, e.g. "resettable" (meaning that the agent can always force returning to the initial state), and some even weaker conditions that I will not spell out here.

  3. ^

    I mean theorems like VNM, Savage etc.

Formalizing the richness of mathematics

Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Godel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.

Here is my proposal for how to formulate a theorem that would make this idea rigorous.

(Wrong) First Attempt

Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis  represents an infradistribution over : the "space of counterpossible computational universes". We can say that  is a "true hypothesis" when there is some  in the credal set  (a distribution over ) s.t. the ground truth  "looks" as if it's sampled from . The latter should be formalizable via something like a computationally bounded version of Marin-Lof randomness.

We can now try to say that  is "rich" if for any true hypothesis , there is a refinemen which is also a true hypothesis and "knows" at least one bit of information that  doesn't, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes . But, it's also completely boring: the required  can be constructed by "hardcoding" an additional fact into . This doesn't look like "discovering interesting structure", but rather just like brute-force memorization.

(Wrong) Second Attempt

What if instead we require that  knows infinitely many bits of information that  doesn't? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be expressed by a finite automaton but their conjunction cannot. For example, maybe the -th bit in the sequence depends only the largest  s.t.  divides , but the dependence on  is already uncomputable (or at least inexpressible by a finite automaton).

However, for our original application, this is entirely insufficient. This is because in the formal language we use to define  (e.g. combinator calculus) has some "easy" equivalence relations. For example, consider the family of programs of the form "if 2+2=4 then output 0, otherwise...". All of those programs would output 0, which is obvious once you know that 2+2=4. Therefore, once your automaton is able to check some such easy equivalence relations, hardcoding a single new fact (in the example, 2+2=4) generates infinitely many "new" bits of information. Once again, we are left with brute-force memorization.

(Less Wrong) Third Attempt

Here's the improved condition: For any true hypothesis , there is a true refinement  s.t. conditioning  on any finite set of observations cannot produce a refinement of .

There is a technicality here, because we're talking about infradistributions, so what is "conditioning" exactly? For credal sets, I think it is sufficient to allow two types of "conditioning":

  • For any given observation  and , we can form .
  • For any given observation  s.t. , we can form .

This rules-out the counterexample from before: the easy equivalence relation can be represented inside , and then the entire sequence of "novel" bits can be generated by a conditioning.

Alright, so does  actually satisfy this condition? I think it's very probable, but I haven't proved it yet. 

Load More