**Introduction**

Informally, a universal system is universal with respect to any computation; and it is a universal system with respect to a given computation if it understands every set of beliefs that can be ascribed to the computation. The intuition is that the system can reverse engineer most or all of the computation, in order to monitor it or imitate it. This in turn has important consequences for questions of alignment and competitiveness. Universality is the property that defines a universal system. And it is the point of this post.

Universality tries to capture a property needed for many alignement schemes. It was proposed by Paul Christiano, the mind behind many approaches and ideas in the prosaic AGI space, and a founding member of the safety team at OpenAI. Rohin Shah dedicated a full Alignment Newsletter to covering all 6 posts on Universality. Rohin and Evan Hubinger, two important researchers in this field, consider Universality as one of the most exciting research idea of the last few years.^{[1]} Yet nobody talks about Universality.

Except for the Alignment Newsletter mentioned above and a response post by Evan, nothing in the Alignment Forum addresses this idea. I've seen no great discussion, no debates, no counter-arguments or criticism. The original post on Medium has no comments, and the crossposted version here only has a handful, mostly asking for clarification. And the other posts in the sequence rely on understanding this first.

The simplest solution to this problem is to tell you to read the original post. Unfortunately, it is as dense as Q in R, brimming with ideas, intuitions, semi-formal explanations and the many meanderings that research takes before arriving on solid ground. That is to say, you'll have to work for it. Not everyone who might benefit from an understanding of Universality has the time, the need or the want for such an upfront investment.

This post endeavors to be the next best thing: an unwrapping of the main post on universality, from the perspective of one who already took the time to mine it for its insights. Because I started at the same point as you -- or even below -- our inferential distance is hopefully smaller than the one you have with Christiano. And armed with this clarification, you should also be able to tackle the next posts in his sequence.

Before digging into Universality itself, I present the perspective from which I'm reading Christiano's original post: that it's really about understanding computations, and that the main contribution lies in posing the right question instead of the partial answer proposed. I then follow by an explanation of the intuitions behind universality, notably what it is (a property about epistemic domination), why it matters for AI alignment (competitiveness and forcing honesty), and examples of ways to be universal for concrete classes of computations. Then, and only then, I detail Christiano's proposal as a definition of Universality: Ascription Universality. Finally I conclude by giving open problems raised by the post, and wrap up with a summary of the takeaway ideas.

*Thanks to Paul Christiano, Evan Hubinger, Jérémy Perret and Rohin Shah for feedback.*

**How to read the Universality post**

I first read the original post after watching a Q&A where Rohin praised it as one of the ideas that excited him the most in AI Safety. Although I didn't grasp everything after this read, I thought I had the gist of it: the post talked about this formal property called Ascription Universality, which would ensure that a system with this property would beat other computations at their jobs.

I was wrong.

So that you don't repeat my mistake, let me prime you before explaining the post further: **Christiano's main point is the proposal of an open problem about understanding computations.**

First, the gist of the post lies in the problem, not in the partial solution. This is made harder to see because the problem is not well-defined. It isn't Fermat's Last Theorem, or the relation of P with NP. Instead Universality is what I call an **open theory problem**. It doesn't ask to solve a concrete and well specified problem; instead, it asks us to find a definition, a concept, a theory that captures a list of intuitions. Other examples are Goal-directedness and Abstraction.

So the point of the post is to present the intuitions behind Universality, as well as its value for AI safety. The attempt at a solution shows how one could go about it, points to some problems and makes the discussion more concrete. But it should not be confused with the theme, which is the open theory problem of Universality. As a corollary, it matters more to get the wobbly part about the intuitions than the specific mathematical details of the partial solution. The structure of my explanation reflects this: I present almost everything at the level of Universality itself, before going into the weeds of Ascription Universality at the end.

The second point is that Universality should be seen as "Universal Understanding": understanding how a system or computation works and why and what it will do.

Why Understanding? Because the concept Christiano is aiming at captures the idea of *knowing as much, or more* than a specific computation. Knowledge is power, especially for computations -- but the point is the knowledge. A system is universal for a computation if, for whatever knowledge or beliefs that can be ascribed to this computation in a "reasonable" way, our system already knows about it. In each case, the universal system must know the knowledge encoded in the computation, which implies it can supervise it and outperform it.

**In summary, Christiano's post presents and fleshes out an open theory problem about the ability of some system to completely understand anything useful about some computations.**

My position is that this is the clearest and most useful way to read Christiano's post. I make it explicit here both to prime you and to let you backtrack any disagreement to this initial bias I'm committing to. With that said, the rest of the post will not discuss this choice any further.

**Universality: Intuitions, Value and Examples**

**Intuitions about Universality**

I proposed in the previous section that Universality is an open theory problem. As such, it consists in a set of intuitions for which the unifying formalization is lacking. Let's explore these intuitions.

Imagine that you have an overseer -- a system which looks at computations for signs of trouble. For example a debate about a concrete neural network, or the amplified supervisor of Iterated Amplification. Then a natural requirement is for the overseer to be able to understand everything that the computation does and understands. This would make the overseer universal in a very intuitive way.

What do I mean by understanding a computation? This is another question in need of formalization. What Christiano gives is an intuition and a sort of extensive definition. Understanding a computation means intuitively to understand all beliefs of the computation -- everything that it knows. Examples of such beliefs are:

- The final decision of the computation
- The recognition of a dog’s nose at a layer of a convolutional neural network
- The fact that the computation pretends to act aligned before taking a treacherous turn.

So beliefs in this sense capture all the information inside a computation. This includes both the information that the computation gives us (its output for example) and the information it doesn’t give us (like deceptive intent or any inaccessible information).

Yet what does it mean for information to be hidden inside a computation? Here Christiano doesn’t pretend to extract the correct beliefs of the computation, but instead enlarges his requirement to any reasonable ascription of beliefs to the computation. For any way to ascribe beliefs and knowledge to a specific computation that makes sense and isn’t too strong, this constitutes something that a universal system for this computation must get.

Literary interpretation offers a helpful analogy here. In "The Limits of Interpretation,” Umberto Eco says that any interpretation of a text is valid as long as it survives contact with the text. The interpretative act aims not at finding exactly what the author meant -- usually a hopeless endeavor -- but instead to find interpretations which survive falsification by the text. In the words of Eco himself:

We can thus accept a sort of Popper-like principle according to which if there are not rules that help to ascertain which interpretations are the "best ones," there is at least a rule for ascertaining which ones are "bad." This rule says that the internal coherence of a text must be taken as the parameter for its interpretations.

**A reasonable ascription of beliefs works in the same way: the beliefs should not contradict the actual computation, should explain it and shouldn’t be too strong in a way that is not justified by the computation itself.** For such beliefs, any universal system for this computation needs to understand them.

This is not a definition of a reasonable ascription; instead it is a powerful intuition giving us some way to analyse an ascription of beliefs to a computation. As an example, reasonable doesn't mean that we would have thought of it, or even that it's one way we would solve the problem addressed by the computation. A reasonable ascription is not a common-sense ascription, but an ascription that makes sense. In addition with this intuition, we have examples of reasonable ascriptions (the intuitional stance, neural circuits, ...) and unreasonable ones (ascribing all consequences of known facts as beliefs -- logical omniscience).

To summarize the previous discussion: **a universal system with respect to a given computation is a system that understands every set of beliefs that can be ascribed to the computation by any “reasonable” ascription approach.**

A natural requirement would be to ask for a universal system with respect to any computation. But this is obviously impossible: one can always create more complex systems with more complex goals and beliefs, such that any fixed system is just too basic to understand them. The alternative proposed by Christiano is a parameterized notion of universality. In essence, the algorithm used by the universal system for a computation C will depend explicitly on C.

In my first draft, I assumed that the parametrization meant that the algorithm would change for every C or class of C. But Christiano corrected me in his feedback, clarifying that the parametrization plays a role in the resources used by the universal system.

With this out of the way, there are two natural ways exists to do parameterization:

- An abstract parametrization (which Christiano calls idealized), where the parameter is a whole class of computations. For example, a universal system whose running time grows with the number of layers of a neural network.
- A concrete parametrization (which Christiano calls practical), where the parameter is the specific computation, possibly including how it was created, like a training process. For example, a universal system taking as a parameter the training process and the weights of the specific neural network.

Ideally, we want the most abstract possible version of universality, as it would work for entire classes of computations at a time; yet understanding a computation seems far easier with access to the computation itself, even more if the training process is included. This shows a fundamental tradeoff between the generalizability of a universal system and its ability to tailor itself to a specific computation. In some sense, Universality asks the question of how much we can understand with a specific amount of resources (be it time or information about the specific computation), and what is the minimal amount of resources needed to understand the computation fully.

To summarize the complete intuition: **a universal system is parametrized by the computation it needs to understand, either in a loose way for the idealized version, or in a tight way for the practical version. Understanding a computation means understanding any beliefs that can be reasonably ascribed to the computation -- any beliefs that explain and justify it while not contradicting the computation.**

One point this fails to address is what understanding means -- that depends on the specific solution given to Universality. So I'll get back to this question when exploring Ascription Universality, the partial solution proposed by Christiano to the open problem of Universality. But we can do for now by imagining that understanding means that the universal system can explain to us the set of beliefs.

**Why care about Universality?**

Suppose that we solve the open theory problem of universality: we find a formal definition of a universal system. What does it buy us?

To find out, let's turn to interpretability. Modern research in ML includes many techniques to look through a trained program and extract information from it. Formal methods are now also able to check properties of trained systems, just as they do for more traditional programs. But what are we trying to check, to find out? It's fundamental: interpretability and formal methods without a property to check are just a formal game without any useful consequence.

**Solving Universality gives a property to check for two guarantees in an overseer: its competitiveness and its ability to make computation honest.**

Most alignment schemes add constraints on top of modern methods of optimization and/or ML. It's thus sensible to ask about their cost. The word used in the discussion tends to be competitiveness. Now, there are numerous versions of competitiveness, with classification that disagree with each other, but the only one we care about here is performance competitiveness. Can the approach actually be as good as the non-aligned computations? Universality shifts this question, by asking instead: if we had a (potentially unsafe) computation that is competitive for a given task, can we reverse-engineer why the computation works? If we can, then implementing the same strategy means arriving at the same performance. Thus universality would be strong evidence in favor of competitiveness.

Dealing with honesty, the second use of Universality, shows the value of asking about knowledge and beliefs instead of just about beating the computation. A universal system, by knowing the beliefs and knowledge that can be ascribed to a computation, would be able to understand everything the model knows, not just what it displays. The overseer would thus spot deception, see problems of inner alignment, and in general extract all the inaccessible information hidden in the computation.

Hence Universality captures, in one idea, both the ability to outperform computations and to "see through" them.

**Example of Universality**

Thinking about Universality by itself, without confronting it to specific computations, holds the risk of missing a fundamental intuition. So Christiano gives examples of powerful and interesting classes of computations, with his best guesses on how to be universal for them. Note that every example here assumes an access to the computation itself, and usually even to the training data. This means they fall under the concrete (or practical) parametrization of Universality.

Here is Christiano describing the pattern of his examples.

The first three examples will consider programs C with the following form:

- Iterate over programs to find an f that successfully predicts the answers to some dataset of historical questions for which we know the answer.
- Use that program to predict the answers to future questions.

The difference lies in the inductive bias of the search -- the kind of f this program is looking for. An important assumption here is that f will be competent at its task. Then Christiano considers three inductive biases:

**Deduction**f is a deduction process which receives a question and a context, deduces information about the answer from the context, and then answers in a way coherent with its deductions.**Modeling**f converts a question and a context to a physical situation, which it models and runs until the state corresponding to the answer is reached.**Alien Reasoning**f uses a complex reasoning procedure that we don't really use naturally. In the post, it builds a matrix with specific properties in order to predict the stock exchange.

Christiano uses HCH (Humans Consulting HCH, the limiting behavior of IDA) as his placeholder universal systems, but only requires that it works as a strong bayesian. If it does, then he proposes a similar approach to solve all three case:

- First, the strong bayesian notices that f gives good answers (using for example the training dataset and observations of performance in production)
- Depending on the procedure, it then infers that using it gives good answers, which by bayesian reasoning means that the specific assumptions of f (the axioms for a deduction process, the physical model for a modeling problem) are likely to be true.
- Then the strong bayesian leverages the process used by f to derive the same answers.

For the deduction process, it might be straightforward as deductions could already live at the same level of abstraction than the questions and answers. But for both modeling and alien reasoning, an additional step is probably necessary: finding a bridge hypothesis between the internal state of f and the ascribed beliefs in terms of what it tries to predict. For the modeling, it might mean interpreting the low-level physical system at the level of humans. For this specific alien reasoning, it means inferring the link between the matrix and the quantity to predict. The universal system might be able to simply search over such bridging hypotheses, and infer from how much it explains whether it might hold or not. Since every coherent bridging hypothesis that explains the predictions of f corresponds to a reasonable ascription of beliefs, such a system would dominate all of them.

Christiano posits another example in his post. It's the most interesting one in my opinion: a computation C which contains a process f, such that f designs a new reasoning process g to solve the problem at hand. Such a computation acts similarly to a programmer thinking of some neat solution, testing it a little bit, and then shipping it.

Because of the lack of inference, a strong bayesian cannot just give enough credence to the process of generating the reasoning process on statistical grounds. Instead, the system trusts f because it works (that part is still statistical); then because f uses g, f has has beliefs on g. Which means that assuming the system is already universal for the rest of f, it should be able to understand the beliefs of f about g.

**The Proposal**

Now that we explored the web of intuitions underlying Universality, we can look into details into Christiano's partial solution: Ascription Universality. Remember that the point of the post is to ask the open theory problem, not to defend Ascription Universality as the complete answer.

**A Subjective Perspective**

When giving the intuitions behind Universality, I didn't pin down what understanding the computation's beliefs might mean. This leaves multiple choices when writing an actual definition. The two questions to answer are:

- How to represent the beliefs ascribed to the computation?
- What does it mean for the universal system to understand these beliefs?

Christiano answers the first question by defining beliefs in terms of probabilistic predictions. In the definition of Ascription Universality, They're represented by expectation functions, which give an estimate (the belief) for every random quantity. As for the second question, an intuitive answer would be to make the universal system make more accurate predictions than the computation. That is to say, to have the beliefs of the universal system be more accurate for all variables than those ascribed to the computations.

How do we learn the beliefs of our universal system? Fortunately, Christiano writes in a context where potentially universal systems are question-answerers using a fixed language. We can thus just ask it in English what are its beliefs about something.

So understanding the beliefs of the computation means that the beliefs of the system are more accurate than those of the computations for every variable. There's only one problem: it's impossible! Even a superior system, far more smarter or knowledgeable, is not omniscient. It might make mistakes, and the computation might make lucky guesses.

That is to say, the naive approach to comparing beliefs doesn’t work here. So instead, Christiano proposes a subjective approach: **fixing an epistemic perspective (a set of beliefs), from which to judge the accuracy of the system's beliefs compared to the computation's beliefs.**

Thus a universal system understands the beliefs ascribed to a computation relative to a given epistemic perspective, if that perspective believes the predictions of the universal system to be more accurate than those of the computation.

Of course, a subjective definition raises many questions: isn't it too weak? What guarantees does it provide? I'll get back to these interrogations after the formal definition. But a little teaser: subjective domination (beliefs being more accurate from a fixed subjective perspective) ends up being a stronger condition than it seems at first glance.

**The Formal Definition**

As explained above, beliefs are represented as probabilistic predictions. Christiano considers beliefs as expectation functions, which gives the expected value for any variable in a possibly infinite set. Another way to see these expectations is as infinite tables of expected values (or infinite vectors). Note that this table includes not only pure expectations like , but also every conditional expectation like .

Now, let's formalize the idea of more accurate beliefs. We have two sets of beliefs represented by the expectation functions and . Because of the subjective bent of this definition, we also have an expectation that plays the role of epistemic perspective.

(I changed the names of the expectations in the quote to stay coherent.)

Then we say that

dominates(w.r.t. ) if, for every bounded quantity and for every “nice” property :

(By “nice” I mean something like: simple to define and open in the product topology, viewing and as infinite tables of numbers.)

First, an intuition: dominates if whatever reasonable information is given to about and , still trusts the predictions of more than those of .

That's a lot to unwrap.

**Boundedness of ** This point doesn't matter too much. It's probably just a way to ensure that exists, without using complex conditions on convergence.

**Nice Property ** What is , to start? Intuitively, it represents the information that has on both and . Or said differently, it represents what one could tell to make it change one way or another. Its domain is the product of the expectation function space with itself. For its codomain, Christiano doesn't specify it. We can take with the discrete topology -- as we'll see, any discrete space will do. In that way, is a function of both beliefs, that returns some information (maybe a bit, maybe more) about the two beliefs.

Christiano also writes that must be "nice". I unwrap the formal conditions in the appendix -- spoilers: it takes some work -- but the intuition couldn't be simpler: **a nice property is one which depends only on the expectation of a finite number of variables**. In that sense, truly represents information that could be given to to sway it.

Once we have the domination of beliefs clarified, the rest of the solution is simple: a parameterized system A[.] is **ascription universal** from the perspective of if, for all computations C, the beliefs of A[C] dominates, according to , any reasonable ascription of beliefs to C.

Why is this only a partial solution? First, it is not completely formal: it lacks a definition of reasonable ascription procedures. Even the standard ascription of beliefs to A[C] is hard to manipulate from a formal perspective, despite being more defined. And second, there are subtleties with the subjective perspective that needs to be resolved.

**Discussing the Definition**

There are three points that need to be discussed about Ascription Universality: subjectivity and opacity.

**Subjectivity** I pointed out multiple times that the definition of Ascription Universality proposed by Christiano is subjective. At first, it was bugging me. After all, even if we prove universality according to our perspective, what guarantees does that actually offer? Maybe we are stupid. Or just not smart enough to get it. That would mean it doesn't buy us anything.

Then Christiano writes this:

In applications of this property, the subjective condition is what we need in order for us to believe that A[C] will cope with the challenges posed by C. For example, suppose that C formulates a plan to “trick” A[C]. Then the subjective universality condition implies that we don’t expect C to succeed.

This isn’t as good as actually knowing that C won’t succeed. But I think it should be good enough for us — the reason we are thinking about AI safety is because we are concerned that something bad will happen. If we find a technique that defuses this argument, then we’ve addressed the motivating problem. It may still be the case that bad things happen (and we should still search for additional reasons that bad things might happen), but we don’t particularly expect them to.

I don't know about you, but I first read this as saying "as long as humans can't see the risk, **even if they could by being smarter or more thorough**, then we don't care about the risk". A statement with which I disagree wholeheartedly.

But then Evan Hubinger reminded me that here, Ascription Universality means that humans can't see the risk **whatever finite information is given to them about the beliefs of A[C] and C**. That is far stronger. It means that whatever research we do, we wouldn’t find anything convincing us of the existence of the risk.. I'm still not sure it's enough (it depends on the epistemic perspective), but now it doesn't seem trivially wrong.

Actually, it asks a couple of fascinating questions:

- Given a human being, for which deceptive C is there a allowing the human to find out the deception?
- Given an epistemic perspective , for which deceptive C is there a allowing to find out?

**Opacity** Another criticism that Christiano attempts to nip in the bud is that Universality doesn't require an understanding of the computation. Talking about an example computation which search programs to classify images, he write:

In this case, we can ascribe beliefs to C about the contents of the new image. And because those beliefs are coming from a simple program that works empirically, I expect them to be accurate (in some respects).

For example, a simple classifier C may “believe” that the new image contains a particular curve that typically appears in images labeled “dog;” or a really sophisticated classifier may perform complex deductions about the contents of the scene, starting from premises that were empirically validated on the training set.

So basically, there must be reasons for which the heuristics used by C works. These reasons then translate into beliefs which a universal system must understand, and thus it must understand how the heuristics work.

I'm sympathetic with this intuition. My only caveat is that it relies on a conjecture: that every good heuristic admits a simple enough explanation. I believe it to be true, but I still want to point out the reliance of this argument on it.

**Open Problems**

Last but not least, I promised a list of open problems. Some papers in theoretical computer science (like those of Scott Aaronson) end with a list of the open problems that feel exciting to the authors. I really like that, because it gives me a jumping point to go further and try to push this research direction. So this list extracts all the open problems I could find in this post. I also separated them into open theory problems and open concrete problems, where the latter are what would usually be called open problems about Ascription Universality.

**Open Theory Problems**

- Is there an objective definition for Universality, which captures the intuitions in this post?
- How do we define the set of reasonable ascriptions of beliefs to a computation C?
- Is there a simpler, easier to use definition of Universality leveraging some constraint on the structure of C?
- When is an epistemic perspective for Universality sufficient for AI alignment?
- What is the evidence that we can build universal systems?

**Open Concrete Problems (for Ascription Universality)**

- If an idealized system is universal, can we implement a practical version that stays universal?
- Given a human being, for which deceptive C is there a allowing the human to find out the deception?
- Given an epistemic perspective , for which deceptive C is there a allowing to find out?
- For what formally specified C and A[C] does Ascription Universality hold?
- What other conditions might be equivalent to Ascription Universality in different circumstances?

**Conclusion**

Universality is the sort of problem that guides theory research. It posits that behind our intuitions for beating a computation and forcing it to be honest, there’s a common thread which can be abstracted away. Armed with this property, we could use testing, formal verification, and interpretability to extract guarantees about alignment schemes.

Christiano’s original post (and the concurrent ones) gave this problem to the field. What we need now is people looking into it, toying with it, and unearthing parts of answers.

**Appendix**

Remember that must be "nice" in the definition of Ascription Universality. I wrote above that a nice property is one which depends only on the expectation of a finite number of variables.

In the definition, Christiano asks for to be an open function. Yet I think that instead, he want to be continuous, as written a bit later:

(continuity in the product topology is the minimum plausible condition to avoid a self-referential paradox)

A fundamental topological property of continuous functions is that the preimages (the sets of points whose image by the function is in the given set) of open sets are open. Back in our definition, notice that the domain of is a discrete space, such that and \{1\} are both open. Continuity of then entails that the preimages of and by are open. That is to say, the sets of expectations for which returns a fixed variable are open sets. This put a constraint on them, which explains the intuition behind a nice property.

The last piece of the puzzle is the product topology. Or to be exact, two meanings of the term "product topology": the induced topology on a product space by the topology of the building blocks of the product; and the standard topology on function spaces.

Because the domain of is a product of two function spaces, the obvious topology to apply to it is the product topology: the topology whose open sets are the products of open sets in the two topologies.^{[2]} But what are those topologies of the function spaces?

Now, there are many possible topologies on function spaces. But the one that makes sense here is called... the product topology. How practical. The definition of the product topology for functions from to relies on a subbasis to define all its open sets. A subbasis build all the open set by taking all finite intersections among itself, and then taking all unions among these finite intersections. There's thus a real sense in which a subbasis spans a topology.

The subbasis of the product topology (for functions from to ) has an element for every element of and every open set of : . That is, the set of functions whose value for is contained in . Notably, this definition only constrains at one point, even if is infinite.

Now, recall that to get the set of all open sets (the topology) from a subbasis, one needs to take all finite intersections of elements of the subbasis. Which, given the form of the subbasis, means that these intersections only constrain the functions at a finite number of values. And we get back our initial condition.^{[3]}

So in summary, **must be continuous so that the sets that are sent to** **and** **by it are open, because open in the corresponding topology means only constraining the functions at a finite number of values**.

^{^}Evan in a personal discussion, and Rohin as an answer to a question in a Q&A session for the AI Safety Camp Toronto

^{^}Technically, an open set in the product topology is a product of open sets such that only finitely many of these open sets are not equal to their whole space. But for a product of two spaces, this doesn’t matter

^{^}Because an infinite union of open sets is open, some open sets actually talk about all the values, but they do it in a slightly different way than constraining them all together. You can represent each open set as a conjunction of finitely many constraints. Then the problematic open sets would be infinite disjunctions of these conjunctions. They don't require an infinite number of constraints to hold at the same time, but they might force us to check an infinite number of clauses to see if the function is in the set.

Planned summary for the Alignment Newsletter: