Towards formalizing universality


This post defines the concept of “ascription universality,” which tries to capture the property that a question-answering system A is better-informed than any particular simpler computation C.

I'm getting lost right away here. In the case of "large enough teams of humans can carry out arbitrarily complicated reasoning", what is A and C? Presumably A is the whole team, but I have no idea what C is.

C is an arbitrary computation; to be universal, the humans must be better informed than *any* simple enough computation C.

Also, I'm confused about the practical case.

For example, suppose that C formulates a plan to “trick” A[C]. Then the subjective universality condition implies that we don’t expect C to succeed.

What does "expect" mean here? Probability > .5? Also, can you walk through an example of how C might "trick" A[C] and how subjective dominance implies that we don't expect C to succeed?

We’d like to say that the impoverished perspective is still “good enough” for us to feel safe, despite not being good enough to capture literally everything we know. But now we risk begging the question: how do we evaluate whether the impoverished perspective is good enough? I think this is probably OK, but it’s definitely subtle.

I don't know how to make sense of this. If an impoverished perspective expects C not to be able to "trick" A, it seems kind of obvious that's not good enough for me to think the same? You must have reason to think otherwise but the inferential distance is too far for me to figure out what it is.

What does "expect" mean here? Probability > .5?

We don't think that C has any expected edge in identifying an attack that A does not---if C is up to something, and A can't tell, then we think that the "being up to something" provides no evidence about whether something bad is about to happen. This is much stronger than a fixed probability of failure.

Also, can you walk through an example of how C might "trick" A[C] and how subjective dominance implies that we don't expect C to succeed?

Simple caricatured examples:

- C might propose a design for a computer that has a backdoor that an attacker can use to take over the computer. But if this backdoor will actually be effective, then A[C] will know about it.
- C might propose a design that exploits a predictable flaw in A's reasoning (e.g. overlooking consequences of a certain kind, being overly optimistic about some kinds of activities, incorrectly equating two importantly different quantities...). But then A[C] will know about it, and so if A[C] actually reasons in that way then (in some sense) it is endorsed.

I don't know how to make sense of this. If an impoverished perspective expects C not to be able to "trick" A, it seems kind of obvious that's not good enough for me to think the same?

If an impoverished perspective expects C not to be able to trick A, then either:

- In fact C won't be able to trick A.
- C will trick A, but the perspective is too weak to tell.

I think I don't quite understand what you are saying here, what exactly is obvious?

From a suitably advanced perspective it's obvious that C will be able to trick A sometimes---it will just get "epistemically lucky" and make an assumption that A regards as silly but turns out to be right.

I think I don’t quite understand what you are saying here, what exactly is obvious?

I think I expressed myself badly there. What I mean is that it seems a sensible default to not trust an impoverished perspective relative to oneself, and you haven't stated a reason why we should trust the impoverished perspective. This seems to be at least a big chunk of the formalization of universality that you haven't sketched out yet.

Suppose that I convinced you "if you didn't know much chemistry, you would expect this AI to yield good outcomes." I think you should be pretty happy. It may be that the AI would predictably cause a chemistry-related disaster in a way that would be obvious to you if you knew chemistry, but overall I think you should expect not to have a safety problem.

This feels like an artifact of a deficient definition; I should never end up with a lemma like "if you didn't know much chemistry, you'd expect this AI to yield good outcomes" rather than being able to directly say what we want to say.

That said, I do see some appeal in proving things like "I expect running this AI to be good," and if we are ever going to prove such statements they are probably going to need to be from some impoverished perspective (since it's too hard to bring all of the facts about our actual epistemic state into such a proof), so I don't think it's totally insane.

If we had a system that is ascription universal from some impoverished perspective, you may or may not be OK. I'm not really worrying about it; I expect this definition to change before the point where I literally end up with a system that is ascription universal from some impoverished perspective, and this definition seems good enough to guide next research steps.

So "simpler" in that sentence should be replaced by "simple enough"? In other words, it's not the case that A is better-informed than every computation C that is simpler than A, right? Also, can you give a sense of how much simpler is simple enough?

I'm aiming for things like:

- round debate dominating any fast computation with alternations (including an round debate)
- max-HCH with budget dominating max-HCH with budget for some constant .
- HCH with advice and budget dominating HCH with no advice and budget .

It seems the ascription process is approximately "deduce an agent's beliefs from their outputs". This seems to have the same problem as "deduce an agent's preferences from their outputs", which I showed was not possible in general, even with simplicity.

So when dealing with non-perfectly rational agents, it seems you'll have to put in the irrationality by hand. So it's not so much "ascribing beliefs", but "prescribing beliefs": our interpretation determines what the agent believes. The cases the post mentions ("This procedure wouldn’t capture the beliefs of a native Spanish speaker, or for someone who wasn’t answering questions honestly") are just two examples of a much more universal problem.

(Cross-posted at ai-alignment.com)

The scalability of iterated amplification or debate seems to depend on whether large enough teams of humans can carry out arbitrarily complicated reasoning. Are these schemes “universal,” or are there kinds of reasoning that work but which humans fundamentally can’t understand?

This post defines the concept of “ascription universality,” which tries to capture the property that a question-answering system A is better-informed than any particular simpler computation C.

These parallel posts explain why I believe that the alignment of iterated amplification largely depends on whether HCH is ascription universal. Ultimately I think that the “right” definition will be closely tied to the use we want to make of it, and so we should be refining this definition in parallel with exploring its applications.

I’m using the awkward term “ascription universality” partly to explicitly flag that this is a preliminary definition, and partly to reserve linguistic space for the better definitions that I’m optimistic will follow.

(Thanks to Geoffrey Irving for discussions about many of the ideas in this post.)

I. Definition

We will try to define what it means for a question-answering system A to be “ascription universal.”

1. Ascribing beliefs to A

Fix a language (e.g. English with arbitrarily big compound terms) in which we can represent questions and answers.

To ascribe beliefs to A, we ask it. If A(“are there infinitely many twin primes?”) = “probably, though it’s hard to be sure” then we ascribe that belief about twin primes to A.

This is not a general way of ascribing “belief.” This procedure wouldn’t capture the beliefs of a native Spanish speaker, or for someone who wasn’t answering questions honestly. But it can give us a sufficient condition, and is particularly useful for someone who wants to use A as part of an alignment scheme.

Even in this “straightforward” procedure there is a lot of subtlety. In some cases there are questions that we can’t articulate in our language, but which (when combined with A’s other beliefs) have consequences that we can articulate. In this case, we can infer something about A’s beliefs from its answers to the questions that we can articulate.

2. Ascribing beliefs to arbitrary computations

We are interested in whether A “can understand everything that could be understood by someone.” To clarify this, we need to be more precise about what we mean by “could be understood by someone.”

This will be the most informal step in this post. (Not that any of it is very formal!)

We can imagine various ways of ascribing beliefs to an arbitrary computation C. For example:

- We can give C questions in a particular encoding and assume its answers reflect its beliefs. We can either use those answers directly to infer C’s beliefs (as in the last section), or we can ask what set of beliefs about latent facts would explain C’s answers.
- We can view C as optimizing something and ask what set of beliefs rationalizes that optimization. For example, we can give C a chess board as input, see what move it produces, assume it is trying to win, and infer what it must believe. We might conclude that C believes a particular line of play will be won by black, or that C believes general heuristics like “a pawn is worth 3 tempi,” or so on.
- We can consider how C’s behavior depends on facts about the world, and ask what state of the world is determined by its current behavior. For example, we can observe that C(113327) = 1 but that C(113327) “would have been” 0 if 113327 had been composite, concluding that C “knows” that 113327 is prime. We can extend to probabilistic beliefs, e.g. if C(113327) “probably” would have been 0 if 113327 had been composite, then we might say that C knows that 113327 is “probably prime.” This certainly isn’t a precise definition, since it involves considering logical counterfactuals, and I’m not clear whether it can be made precise. (See also ideas along the lines of “knowledge is freedom”.)
- If part of C can be understood as optimizing the way data is laid out in memory, then we can ascribe beliefs to that computation about the way that data will be used.

Note that these aren’t intended to be efficient procedures that we could actually apply to a given computation C. They are hypothetical procedures that we will use to define what it means for A to be universal.

I’m not going to try to ascribe a single set of beliefs to a given computation; instead, I’ll consider all of the reasonable ascription procedures. For example, I think different procedures would ascribe different beliefs to a particular human, and don’t want to claim there is a unique answer to what a human “really” believes. A universal reasoner needs to have more reasonable beliefs than the beliefs ascribed to that human using any particular method.
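As a rough illustration of the "rationalize the optimization" style of ascription, here is a minimal Python sketch. It swaps the chess example for a much smaller decision problem: the payoff table, the action names, and the helper `rationalizing_beliefs` are all hypothetical and not from the post. Given an observed action, it returns the candidate beliefs under which that action would have been optimal.

```python
from fractions import Fraction

# Hypothetical payoff table (an assumption for illustration): PAYOFF[action][it_rains].
PAYOFF = {
    "umbrella": {True: 0, False: -1},
    "no_umbrella": {True: -10, False: 0},
}


def expected_payoff(action, p_rain):
    """Expected payoff of an action for an agent who believes P(rain) = p_rain."""
    return p_rain * PAYOFF[action][True] + (1 - p_rain) * PAYOFF[action][False]


def rationalizing_beliefs(observed_action, candidate_beliefs):
    """Return the candidate values of P(rain) under which the observed action is optimal."""
    consistent = []
    for p_rain in candidate_beliefs:
        best = max(expected_payoff(action, p_rain) for action in PAYOFF)
        if expected_payoff(observed_action, p_rain) == best:
            consistent.append(p_rain)
    return consistent


CANDIDATES = [Fraction(0), Fraction(1, 20), Fraction(1, 10), Fraction(1, 2), Fraction(1)]
ascribed = rationalizing_beliefs("umbrella", CANDIDATES)
print(ascribed)  # the beliefs about rain we would ascribe to an umbrella-carrier
```

Note that the ascription depends on the assumed goal: a different payoff table would rationalize the same action with different beliefs, which is one reason to consider all reasonable ascription procedures rather than a single one.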

An ascription-universal reasoner needs to compete with any beliefs that can be ascribed to C, so I want to be generous with this definition. For example, given a chess-playing algorithm, we might rationalize it as trying to win a game and infer its beliefs about the rules of chess. Or we might rationalize it as trying to look like a human and infer its beliefs about what a human would do. Or something different altogether. Most of these will be kind of crazy ascriptions, but I want to compete with them anyway (competing with crazier beliefs will turn out to just be easier).

It’s not totally clear what counts as a “reasonable” ascription procedure, and that’s the biggest source of informality. Intuitively, the key property is that the ascription itself isn’t doing the “hard work.” In practice I’m using an informal extensional definition, guided by examples like those in the bulleted list.

3. Comparing beliefs

What does it mean to say that one agent is “better-informed” than another?

It’s natural to try to express this in terms of empirical information about the world, but we are particularly interested in the different inferences that agents are able to draw from the same data. Another natural approach is to compare their “knowledge,” but I have no idea how to define knowledge or justified belief. So I’m reduced to working directly with sets of beliefs.

Consider two sets of beliefs, described by the subjective expectations 𝔼¹ and 𝔼². What does it mean to say that 𝔼¹ is better-informed than 𝔼²?

This framing makes it tempting to try something simple: “for every quantity, 𝔼¹’s belief about that quantity is more accurate.” But this property is totally unachievable. Even if 𝔼¹ is obtained by conditioning 𝔼² on a true fact, it will almost certainly happen to update in the “wrong” direction for some claims.
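To see concretely how conditioning on a true fact can make some estimate worse, here is a small Python sketch. The joint distribution and the "true" world state are made-up numbers chosen for illustration; two binary facts a and b are positively correlated under the prior, and the true state happens to have a = 1 and b = 0.

```python
from fractions import Fraction

# Assumed joint prior over (a, b); a and b are positively correlated.
PRIOR = {
    (0, 0): Fraction(2, 5),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(2, 5),
}
TRUTH = (1, 0)  # the actual state of the world in this toy example


def expectation(dist, quantity):
    """Expected value of quantity(state) under dist."""
    return sum(p * quantity(state) for state, p in dist.items())


def condition(dist, event):
    """Bayesian conditioning of dist on the event (a predicate on states)."""
    mass = sum(p for state, p in dist.items() if event(state))
    return {state: p / mass for state, p in dist.items() if event(state)}


# Learn the true fact "a = 1".
posterior = condition(PRIOR, lambda state: state[0] == 1)

prior_b = expectation(PRIOR, lambda state: state[1])
posterior_b = expectation(posterior, lambda state: state[1])
prior_error = abs(prior_b - TRUTH[1])
posterior_error = abs(posterior_b - TRUTH[1])
print(prior_b, posterior_b)          # belief in b moves from 1/2 to 4/5
print(prior_error, posterior_error)  # and gets further from the truth (b = 0)
```

The updated beliefs are better about a and worse about b, so "more accurate on every quantity" fails even for an agent that simply learned something true.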

We will instead use a subjective definition, i.e. we’ll define this concept from a particular epistemic position represented by another subjective expectation 𝔼.

Then we say that 𝔼¹ *dominates* 𝔼² (w.r.t. 𝔼) if, for every bounded quantity X and for every “nice” property Φ:

𝔼[X | Φ(𝔼¹, 𝔼²)] = 𝔼[𝔼¹[X] | Φ(𝔼¹, 𝔼²)]

(By “nice” I mean something like: simple to define and open in the product topology, viewing 𝔼¹ and 𝔼² as infinite tables of numbers.)

Intuitively, this means that 𝔼 always “trusts” 𝔼¹, even if given arbitrary information about 𝔼¹ and 𝔼². For example, if 𝔼 was told that 𝔼¹[X] ≈ x and 𝔼²[X] ≈ y, then it would expect X to be around x (rather than y). Allowing arbitrary predicates Φ allows us to make stronger inferences, effectively that 𝔼 thinks that 𝔼¹ captures *everything* useful about 𝔼².

I’m not sure if this is exactly the right property, and it becomes particularly tricky if the quantity X is itself related to the behavior of 𝔼¹ or 𝔼² (continuity in the product topology is the minimum plausible condition to avoid a self-referential paradox). But I think it’s at least roughly what we want and it may be exactly what we want.
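The trust condition (conditioned on any property of the two belief sets, 𝔼's expectation of X equals its expectation of 𝔼¹[X]) can be checked by brute force in a finite toy model. The sketch below is only an illustration under strong simplifying assumptions that are not in the post: the world has four equally likely states, each agent's beliefs come from conditioning the shared prior on a partition of the states, there is a single quantity X, and a "property" Φ is any set of possible (𝔼¹[X], 𝔼²[X]) value pairs.

```python
from fractions import Fraction
from itertools import combinations

# Toy world (assumed): four equally likely states and one bounded quantity X.
STATES = [0, 1, 2, 3]
PRIOR = {s: Fraction(1, 4) for s in STATES}
X = {0: Fraction(0), 1: Fraction(1), 2: Fraction(1), 3: Fraction(3)}

FINE = [{0}, {1}, {2, 3}]   # what the better-informed agent can distinguish
COARSE = [{0, 1}, {2, 3}]   # what the worse-informed agent can distinguish


def conditional_expectation(partition):
    """Map each state to E[X | cell of the partition containing that state]."""
    table = {}
    for cell in partition:
        mass = sum(PRIOR[s] for s in cell)
        mean = sum(PRIOR[s] * X[s] for s in cell) / mass
        for s in cell:
            table[s] = mean
    return table


def dominates(e1, e2):
    """Check E[X | Phi] == E[e1 | Phi] for every property Phi of the pair (e1, e2)."""
    signatures = sorted({(e1[s], e2[s]) for s in STATES})
    for size in range(1, len(signatures) + 1):
        for phi in combinations(signatures, size):
            event = [s for s in STATES if (e1[s], e2[s]) in phi]
            mass = sum(PRIOR[s] for s in event)
            lhs = sum(PRIOR[s] * X[s] for s in event) / mass
            rhs = sum(PRIOR[s] * e1[s] for s in event) / mass
            if lhs != rhs:
                return False
    return True


E_FINE = conditional_expectation(FINE)
E_COARSE = conditional_expectation(COARSE)
print(dominates(E_FINE, E_COARSE))  # the finer-grained beliefs are trusted
print(dominates(E_COARSE, E_FINE))  # the coarser beliefs are not
```

In this toy setting the finer partition dominates because everything the coarse agent believes is already determined by what the fine agent knows. The real definition is much more demanding, since it ranges over all bounded quantities and over beliefs that are not produced by simple conditioning.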

Note that dominance is *subjective*, i.e. it depends on the epistemic vantage point 𝔼 used for the outer expectation. This property is a little bit stronger than what we originally asked for, since it also requires 𝔼 to trust 𝔼¹, but this turns out to be implied anyway by our definition of universality so it’s not a big defect.

Note that dominance is a property of the *descriptions* of 𝔼¹ and 𝔼². There could be two different computations that in fact compute the same set of expectations, such that 𝔼 trusts one of them but not the other. Perhaps one computation hard-codes a particular result, while the other does a bunch of work to estimate it. Even if the hard-coded result happened to be correct, such that the two computations had the same outputs, 𝔼 might trust the hard work but not the wild guess.

4. Complexity and parameterization

There are computations with arbitrarily sophisticated beliefs, so no fixed A can hope to dominate everything. To remedy this, rather than comparing to a fixed question-answerer A, we’ll compare to a parameterized family A[C].

I’ll consider two different kinds of potentially-universal reasoners A:

- A[C] depends only on the complexity of C. For example, we might hope that an n-round debate dominates any beliefs that could be ascribed to a fast computation with (n-1) rounds of alternation. In particular, this A[C] is the same for any two computations C of the same complexity.
- A[C] depends on the complexity of C but also uses the computation C as a hint. For example, if C is the training process for a neural net, then we might take A[C] to be a debate in which the debaters are able to share weights and activations with the neural net throughout the entire training process.

I’m generally interested in the case where A[C] is only slightly more powerful than C itself. This mirrors the setting where a universal Turing machine is able to run any other Turing machine with only a modest slowdown.

Putting it all together

We say that a set of beliefs 𝔼ᴬ *epistemically dominates* a computation C (w.r.t. some beliefs 𝔼 and language L) if the beliefs ascribed to A by the “straightforward” procedure, using L, dominate (w.r.t. 𝔼) the beliefs ascribed to C by any reasonable ascription procedure.

We say that a family of question-answering systems A[·] is *ascription universal* (w.r.t. 𝔼 and L) if A[C] epistemically dominates C for every computation C.

II. Discussion

Why is (subjective) dominance sufficient?

This universality condition requires that *we* believe that A[C] is better-informed than C. Naively we might have wanted it to *actually* be the case that A[C] is better-informed than C; the stronger condition is clearly unachievable, but why should we be satisfied with the weaker condition?

In applications of this property, the subjective condition is what we need in order for *us* to believe that A[C] will cope with the challenges posed by C. For example, suppose that C formulates a plan to “trick” A[C]. Then the subjective universality condition implies that we don’t expect C to succeed.

This isn’t as good as actually knowing that C won’t succeed. But I think it should be good enough for us: the reason we are thinking about AI safety is that we are concerned that something bad will happen. If we find a technique that defuses this argument, then we’ve addressed the motivating problem. It may still be the case that bad things happen (and we should still search for additional reasons that bad things might happen), but we don’t particularly expect them to.

Of course if you select over a large number of computations, then you may find one that will succeed in tricking A. But if we are concerned about that, then we can instead apply ascription universality to the entire process including the selection.

Why trust opaque computation?

If C uses some clever heuristics that I don’t understand, then C’s “beliefs” might be excellent, but I might not *expect* them to be excellent. In this sense understanding may seem almost vacuous. If there is some heuristic that I trust, wouldn’t A just use it?

To see why the definition is demanding, consider the special case where C performs an extensive search to find a computation that works well empirically. For example, C might be the following computation:

- Start with a training set of (image, label) pairs.
- Search over simple programs to find one that makes good predictions on that training set.
- Run that simple program on a new image to predict its label.

In this case, we can ascribe beliefs to C about the contents of the new image. And because those beliefs are coming from a simple program that works empirically, I expect them to be accurate (in some respects).

For example, a simple classifier C may “believe” that the new image contains a particular curve that typically appears in images labeled “dog;” or a really sophisticated classifier may perform complex deductions about the contents of the scene, starting from premises that were empirically validated on the training set.

So it’s not OK for A to simply ignore whatever heuristics C is using: if those heuristics have the kind of empirical support that makes us think they actually work, then A needs to be able to understand everything that those heuristics imply about the domain.

Why be so general?

I’ve formulated universality as competing with arbitrary computations C. It seems totally possible that the form of C discussed in the last section (searching for a program that works well in practice and then using it in a new situation) is so central that the definition of universality should focus entirely on it.

One reason to use the broader definition is because sometimes this “selection” process can be embedded in a non-trivial way in a larger computation. For example, if I have a sufficiently large group of humans, I might expect memetic selection to occur and produce systems that could be said to have “beliefs,” and I’d like universal systems to dominate those beliefs as well.

The other reason to use this very general definition is because I don’t see an easy way to simplify the definition by using the additional structural assumption about C. I do think it’s likely there’s a nicer statement out there that someone else can find.

Universal from whose perspective?

Unfortunately, achieving universality depends a lot on the epistemic perspective 𝔼 from which it is being evaluated. For example, if 𝔼 knows any facts, then a universal agent must know all of those facts as well. Thus “a debate judged by Paul” may be universal from Paul’s perspective, but “a debate arbitrated by Alice” cannot be universal from my perspective unless I believe that Alice knows everything I know.

This isn’t necessarily a big problem. It will limit us to conclusions like: Google engineers believe that the AI they’ve built serves the user’s interests reasonably well. The user might not agree with that assessment, if they have different beliefs from Google engineers. This is what you’d expect in any case where Google engineers build a product, however good their intentions.

(Of course Google engineers’ notion of “serving the user’s interests” can involve deferring to the user’s beliefs in cases where they disagree with Google engineers, just as they could defer to the user’s beliefs with other products. That gives us reason to be less concerned about such divergences, but eventually these evaluations do need to bottom out somewhere.)

This property becomes more problematic when we ask questions like: is there a way to seriously limit the inputs and outputs to a human while preserving universality of HCH? This causes trouble because even if limiting the human intuitively preserves universality, it will effectively eliminate some of the human’s knowledge and know-how that can only be accessed on large inputs, and hence violate universality.

So when investigating schemes based on this kind of impoverished human, we would need to evaluate universality from some impoverished epistemic perspective. We’d like to say that the impoverished perspective is still “good enough” for us to feel safe, despite not being good enough to capture literally everything we know. But now we risk begging the question: how do we evaluate whether the impoverished perspective is good enough? I think this is probably OK, but it’s definitely subtle.

I think that defining universality w.r.t. 𝔼 is an artifact of this definition strategy, and I’m optimistic that a better definition wouldn’t have this dependence, probably by directly attacking the notion of “justified” belief (which would likely also be useful for actually establishing universality, and may even be necessary). But that’s a hard problem. Philosophers have thought about very similar problems extensively without making the kind of progress that seems adequate for our purposes, and I don’t see an immediate angle of attack.

III. Which A might be universal?

Two regimes

I’m interested in universality in two distinct regimes:

- A[C] depends only on the computational complexity of C.
- A[C] also uses C as a “hint.” For example, we might consider amplification or debate where the agents are allowed to inspect a transcript of C; or if C is a neural net training process, we might train them jointly with C.

Idealized models

A[C] could be:

- […] C. (Hopefully in such a way that A[A[C]] is only a slightly longer debate than A[C].)
- […] C, ideally such that A[A[C]] does not involve a much larger computation than A[C].
- […] A[C](Q) to be Answer(Q, a*), where a* = argmax Info(a, Q), and the optimization is over advice strings a of length comparable to the description of C.

I don’t think any of these would be universal if instantiated with an open-ended human policy: providing arbitrarily complex inputs to a human is likely to lead to bad behavior. Instead, I imagine restricting the human to some small enough space of inputs that they can behave “reasonably” over the whole space, while still being rich enough to be universal. Even as such, I think additional measures might be needed to achieve security (since a team of humans may behave badly in some exotic corners of the input space, even if individual humans are sheltered from exotic inputs).
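The advice-based construction among the idealized models, A[C](Q) = Answer(Q, a*) with a* = argmax Info(a, Q), can be sketched as an exhaustive search over short advice strings. Everything concrete below is an assumption made for illustration: the binary alphabet, the toy scoring rule inside `info`, and the placeholder `answer` are hypothetical stand-ins for whatever Info and Answer would actually be, and exhaustive enumeration is only workable for tiny lengths.

```python
from itertools import product

ALPHABET = "01"  # assumed advice alphabet


def info(advice, question):
    """Hypothetical stand-in for Info: score how useful an advice string is.

    Toy rule: count positions where the advice agrees with the question's bits.
    """
    return sum(1 for a, q in zip(advice, question) if a == q)


def answer(question, advice):
    """Hypothetical stand-in for Answer: produce an answer using the advice."""
    return f"answer to {question!r} using advice {advice!r}"


def best_advice(question, max_len):
    """Return a* = argmax over all advice strings of length <= max_len of Info(a, Q)."""
    candidates = (
        "".join(bits)
        for length in range(max_len + 1)
        for bits in product(ALPHABET, repeat=length)
    )
    return max(candidates, key=lambda advice: info(advice, question))


def amplified(question, max_len):
    """A[C](Q) = Answer(Q, a*), with max_len standing in for the description length of C."""
    return answer(question, best_advice(question, max_len))


print(amplified("101", 3))
```

The length bound plays the role of "comparable to the description of C": a more complex C licenses a longer advice string, and hence a larger search.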

Practical models

In practice I’m particularly interested in the case where C has the special form described in the “Why trust opaque computation?” section: it trains a neural net on some dataset, and then runs that neural network on new inputs.

A[C] could be:

- […] C, and can make references to individual examples from the dataset.
- […] C, and the overseer can consult individual entries from the dataset.
- […] C is thinking and to relate its behavior to the dataset on which it was trained.

For each of these practical variants, I don’t expect it to be universal unless the corresponding idealized model is. So we can think of dividing the universality question into two pieces:

- […] if the optimization was performed perfectly?

I think none of the suggestions in this section are plausibly universal without some additional mechanisms, and so even if HCH or debate is universal, additional ideas are needed in order to translate those to practical universal systems. I’m still inclined to focus on the idealized case, because I want to better understand what is needed for universality without thinking about algorithmic difficulties.

IV. Which C are hard to epistemically dominate?

The definition in this post was somewhat informal, but the discussion in this appendix will be extremely informal. The purpose is to communicate some of my early thinking about universality and my sense for what a universal reasoner might look like.

The first three examples will consider programs C with the following form:

- Search for a program f that successfully predicts the answers to some dataset of historical questions for which we know the answer.

The different examples explore features of f that might make it difficult to dominate the overall computation.

Deduction

Suppose f takes the following form:

- […] etc.)

We could attribute the statements produced by this process to C as beliefs. We expect this set of beliefs to have useful information because it was selected for making correct predictions.

Fortunately, if HCH is able to see the training dataset then it may be able to arrive at similarly accurate views:

- […] f.
- […] f’s beliefs.

Modeling

Suppose f takes the following form:

- […]

We could say that C believes everything implied by the outcome of this physical simulation. We expect these beliefs to be accurate because f has predicted well in the past.

As in the deduction case, hopefully HCH can compute that P(observations|M is accurate) is high, since it can also compute the consequences of M. Then Bayes implies the dynamics model is accurate, and HCH can use that model to compute physical states.
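The Bayesian step here (observations that are likely if M is accurate raise the probability that M is accurate) can be made concrete with a small calculation. The sketch below is a simplification under assumptions that are not in the post: there are only two hypotheses ("M is accurate" and "M is not"), the observations are treated as independent, and the prior and per-observation likelihoods are made-up numbers.

```python
from fractions import Fraction


def posterior_accurate(prior, p_obs_if_accurate, p_obs_if_not, n_observations):
    """P(M is accurate | n independent observations that M predicted correctly).

    prior: P(M is accurate) before seeing any data.
    p_obs_if_accurate / p_obs_if_not: probability of each observation under
    the two hypotheses (assumed constant across observations).
    """
    likelihood_accurate = p_obs_if_accurate ** n_observations
    likelihood_not = p_obs_if_not ** n_observations
    numerator = prior * likelihood_accurate
    return numerator / (numerator + (1 - prior) * likelihood_not)


# Assumed numbers: a skeptical 1% prior, 90% vs 50% chance of each observation.
PRIOR = Fraction(1, 100)
for n in (0, 5, 10, 20):
    p = posterior_accurate(PRIOR, Fraction(9, 10), Fraction(1, 2), n)
    print(n, float(p))
```

Even from a low prior, a long enough record of successful predictions makes "M is accurate" the dominant hypothesis. The harder part, as the following paragraphs discuss, is getting from trust in the model to the high-level beliefs it implies.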

Inferring all the beliefs from a dynamics model is not trivial though. As an extreme example, if f is performing an atom-by-atom simulation of a room, and that room contains Alice and Bob, then we could ascribe extensive beliefs about Alice and Bob to the computation C.

(Here we run head on into the fuzziness about what counts as a “reasonable” ascription procedure, but for the moment I’ll assume that some reasonable procedure ascribes beliefs about Alice and Bob to the computation.)

To compete with these ascriptions, HCH needs to infer those high-level beliefs about Alice and Bob from the low-level computation involving atoms. One way to do this is to search over possible “bridging” hypotheses that relate low-level physical facts to high-level facts about the environment. If such a hypothesis can explain additional high-level facts, then a Bayesian can learn that it is true. Similarly, if the bridging hypothesis relates facts about the model to constraints we know from the high-level interpretation, then the Bayesian can potentially use that as evidence. (This kind of reasoning will be discussed in a bit more detail in the next section.)

We could further hope that searching for a bridging hypothesis isn’t much harder than performing the original search over low-level physics, given that the low-level physics needed to explain a bunch of high-level facts and so already must encode some part of that correspondence.

(Note that the “deduction” example in the previous case could also involve alien concepts or models, in which case the same kind of work would be needed.)

Alien reasoning

In the previous section we described two styles of reasoning we already understand. But there are probably many kinds of reasoning that work well in practice but that would be more alien, and those might be more challenging. This section will explore one example in some detail to try to help anchor our reasoning about the general phenomenon. It will also elaborate on some of the reasoning about “bridging” hypotheses mentioned in the last section.

Suppose that our predictions are always of the same form (e.g. what is the probability the stock market will go up today), and f works as follows (the details are long but not very important):

- […]
- If the (i, j) entry of A represented the expectation E[X(i)X(j)], then the matrix would necessarily satisfy a bunch of constraints, which we impose on A. For example:
  - If X(i) = 1, then E[X(i)X(j)] = E[X(j)] = E[X(j)²], so A(i,j) = A(j,j).
  - If X(i) and X(j) together imply X(k), then we must have E[X(i)X(j)] ≤ E[X(i)X(k)] and hence A(i,j) ≤ A(i,k).
  - For any constants a, b, …, E[(aX(1) + bX(2) + … )²] ≥ 0, i.e. the matrix A must be PSD.
- […]

The chosen matrix A(opt) corresponds to a set of beliefs about the propositions X(i), and we can ascribe these beliefs to C. Because f predicts well, we again expect these beliefs to say something important about the world.

I chose this procedure f in part because we can give a kind of argument for why the matrix A(opt) should tend to encode accurate beliefs. But I don’t think that a universal reasoner can make use of that argument:

- Finding an argument that f works is an additional problem, beyond finding f itself, which might be much harder.

Instead, I think a universal reasoner needs to be able to infer the efficacy of this reasoning procedure from its empirical success. It’s relatively easy for a Bayesian to learn the regularity “f makes good predictions.” Recovering the rest of the matrix A, and learning how to interpret its entries and whether to trust them, is the hard part.

This is going to require the same kind of bridging/identification we discussed in the last section. Let’s write X(A) for the set of beliefs about the world implied by the “intended” identification. Searching over possible identifications to find X (or something like it) is the only way we can ever relate the rows of A to the quantities X(i). Again, we can hope that it isn’t much harder than finding the original reasoning procedure.

I think that a sufficiently sophisticated Bayesian would probably be able to learn to trust X(A):

- If f is performing well enough that we think it’s more likely to be right in the future, then the Bayesian is going to end up believing some claim like “the predictions of f are good” (since it explains the data so well).
- […] f). The Bayesian is motivated to find an explanation with higher prior probability.

(To the extent that we are uncertain and think A’s beliefs have a non-negligible chance of capturing reality, then hopefully we can capture that by the same mechanism by ending up with a non-degenerate posterior.)

(a) “If you use the constraints implied by correspondence X(A) + positive semidefiniteness, and then optimize log det, you get a matrix A for which X(A) makes good predictions,”

(b) “The actual situation in the real world is described by positive semi-definite matrices with higher log determinant (under the correspondence X).”

[…] f respects the constraints on our beliefs, and why that optimization leads to good predictions. Hopefully this is simpler than making two separate bridging claims, one which explains f as respecting the constraints implied by X, and one which claims that f makes good predictions. Ideally, this 2-for-1 that favors (b) exactly mirrors the underlying reasoning that leads us to actually believe that X(A) is correct, rather than resembling what we know about reality and making good predictions “by coincidence.”

This is a pretty speculative discussion: it’s not very careful, and it’s hard to make it careful in part because I don’t have a formalization of Bayesian reasoning that can even really be applied to this setting. But it seems to match my intuitions about what reasonable Bayesian reasoning “should” do, which gives me a lot more optimism that a careful Bayesian would be able to epistemically dominate C.

Deliberation and self-improvement

Often we expect the computation C to have accurate beliefs because it uses a strategy that appears to work in practice (the last 3 examples have discussed that case). But there are other reasons to trust a computation.

For example, humans often write code and trust it (to some extent) even without extensive empirical testing; instead, we have a reason to think it will work, and need only modest testing to make sure that we haven’t made an error in our implementation or reasoning. If I write an automated mathematician that works by finding proofs that pass a proof checker, I don’t expect it to be correct because of the empirical record. (Empirical data backs up some key assumptions, but isn’t being used to directly establish the correctness of the method.)

Likewise, if we train a powerful agent, that agent might initially use strategies that work well in training, but over time it might use learned reasoning to identify other promising strategies and use those. Reasoning might allow it to totally skip empirical testing, or to adopt the method after much less testing than would have been necessary without the reasoning.

To dominate the beliefs produced by such reasoning, we can’t directly appeal to the kind of statistical inference made in the previous section. But in these cases I think we have access to an even more direct strategy.

Concretely, consider the situation where C contains a process f that designs a new reasoning process g. Then:

- We trust g because we trust f and it trusts g.
- A will dominate f’s beliefs, and in particular if f is justified in thinking that g will work then A will believe that and understand why.
- Given f’s beliefs, dominating g is essentially another instance of the original ascription universality problem, but now from a slightly stronger epistemic state that involves both what 𝔼 knows and what f knows. So unless our original approach to universality was tightly wedded to details of 𝔼, we can probably dominate g.

At the end of the day we’d like to put all of this together into a tight argument for universality, which will need to incorporate both statistical arguments and this kind of dynamic. But I’m tentatively optimistic about achieving universality in light of the prospect of agents designing new agents, and am much more worried about the kind of opaque computations that “just work” described in the last few sections.