Joint work with Jan Kulveit. Many thanks to the various people who gave me feedback.

This text describes a model of the space of possible AI systems, as we try to navigate it safely towards AGI.[1] Its primary goal is to provide a simple mental model that people with disagreeing opinions about AI can use to identify the different assumptions and intuitions that they have. Ideally, people would agree that you can imagine things using the pictures of the type suggested below, yet productively disagree on how precisely they should look. The text has two main parts: a description of the model and some examples of how to use it. At the end, we point out some of the model’s shortcomings. We suggest first skimming the figures and their captions and then optionally reading the full text for clarifications.

# Visualizing the Space of AI Systems

Figure 1: An abstract visualization of the space of AI systems, with a hypothetical trajectory of AI development. (Disclaimers: The arrangement of the boxes with existing types of AI is not meant to represent reality, nor is the trajectory meant to be historically accurate. The AGI line should be blurry.)

Informally, we can imagine the space of all AI systems as an abstract space where “dissimilar” AI systems are further apart from each other than “similar” ones. (For example, two implementations of the same algorithm will be fairly close, as will be the same code using different amounts of compute. Similarly, AlphaGo will be closer to AlphaZero than to Quicksort, and different iterations of hypothetical IDA would be closer to each other than two random systems.) While our analysis isn’t too sensitive to how precisely we define similarity, we will assume that one of the properties that two systems $$x$$ and $$y$$ need to share to have higher similarity is to be “reachable” from each other in the sense of the following paragraph.

If we currently have some AI system $$x$$, we can ask which systems are reachable from $$x$$, i.e., which other systems we are capable of constructing now that we have $$x$$. On the one hand, if we cannot construct $$y$$, it must be significantly different from $$x$$ - otherwise, how could we construct $$x$$ and not $$y$$? (This assumes that if we have built $$x$$ once, we could always do so again.) On the other hand, we might also be able to construct some systems significantly dissimilar to $$x$$ (and fail for others of medium similarity). Graphically, we can thus imagine the set of “systems reachable from $$x$$” as a, possibly very irregularly shaped, blob around $$x$$.[2]

As we develop new AI systems, we aim for eventually reaching AGI - we can imagine that in the high-dimensional system space, there is some direction in which the systems tend to increase in capability (Figure 1). However, this isn’t meant to literally mean that the space has well-defined axes, one of which is capability.

To each AI system, we can assign its “immediate harmfulness” (IH) - let’s say, a number between 0 and 10, indicating how bad the current negative effects of the system seem to me (no side effects = 0, the world right now = ~2, universe optimized for suffering = 10). A key part of IH is the immediacy: For example, if the world were to inevitably end tomorrow but everybody was happy right now, we would have IH = 0 (despite $${IH}_{tomorrow}$$ being 8+). In general, we could (and perhaps should) consider each of the currently-active AI systems individually. However, this might make it difficult or impossible to identify which negative effects are due to which system. We will, therefore, assume that at any given time, there is only a single system that matters for our considerations (either because there is a single system that matters more than all others taken together, or because we can isolate the effects of individual systems, or because we are viewing all systems jointly as a single super-system). This allows us to translate the 0-10 IH scale into colours and visualize the space of AI systems as a "risk map" (Figure 2).

Figure 2: A hypothetical risk map of various AI systems. (Green = no negative consequences, red = universe optimized for suffering. The cross denotes the current system.)
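To make the translation from the IH scale to colours concrete, here is a minimal sketch. The linear green-to-red interpolation, the function name `ih_to_rgb`, and the small grid of IH values are all illustrative assumptions; the text only fixes the endpoints (green for no negative consequences, red for a universe optimized for suffering).

```python
from typing import List, Tuple


def ih_to_rgb(ih: float) -> Tuple[int, int, int]:
    """Map immediate harmfulness in [0, 10] to an RGB colour.

    Assumption: a plain linear blend from green (IH = 0) to red (IH = 10).
    """
    if not 0 <= ih <= 10:
        raise ValueError("IH must lie in [0, 10]")
    t = ih / 10.0
    return (round(255 * t), round(255 * (1 - t)), 0)


# A made-up 2x3 patch of the risk map; each cell holds the IH of one system.
ih_grid: List[List[float]] = [
    [0.0, 2.0, 4.0],
    [2.0, 7.0, 10.0],
]

# The same patch as colours, ready to be drawn as a "risk map".
coloured = [[ih_to_rgb(ih) for ih in row] for row in ih_grid]
```

Any other monotone colour scale would serve equally well; what matters for the model is only that each point of the space gets a single IH value.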

As we move through the space of AI systems, we will gain information about IH of its various subparts. We will assume that we always have sufficient information about what is going on in the world to determine the IH of the current system. Then there will be some systems that are sufficiently similar to the current one (or simple enough to reason about) so that we also have accurate information about their IH. Some other systems might be more opaque, so we will only have fuzzy predictions about their IH. For others yet, we might have no information at all. We can visualize this as decreasing the visibility in parts of the image, indicating that our information is less likely to correspond to reality (Figure 3).

Figure 3: The true state of a hypothetical AI-system risk map (left) and a depiction of our belief about systems' safety and the confidence in it (right). In practice, we cannot see the whole risk map. Rather, there will be areas where our beliefs are accurate and we are confident in them (bright), areas where we are uncertain and our beliefs might be somewhat wrong (darker; note the circled mistakenly-green part and red blotch whose position is wrong), and areas where we have no information to go with, where information could be completely wrong (black). (This is a simplified model; in reality, confidence and accuracy, unfortunately, aren’t the same.)

Aside from moving through the space by adopting new AI systems, we can also do various types of research in an attempt to gain knowledge about the space. In this sense, research strategies can be viewed as ways of translating effort (e.g., compute and researcher-years) into increased clarity about the space. To simplify the discussion, we will assume that the amount of clarity gained, and which area of the space this happens in, depends on (i) the specific research strategy used, (ii) how much effort goes into it, and (iii) the AI system that is currently available (i.e., the current position in the system space). Typically, but not necessarily, more effort will result in more clarity, with gains being bigger in whatever the research focused on and in the vicinity of the system that we currently have (since these are easier to experiment with). We will later argue that (iv) some parts of the space might also be more or less amenable to understanding. In this post, we will ignore effects other than (i-iv).

When the research method is clear from the context, we can imagine research progress as gradually expanding the visibility of the risk map (Figure 4, top). Alternatively, we can pay attention only to how the “fully visible” area expands as more research gets done. Visually, this amounts to adding contour lines (extending from the current point) to the risk map (Figure 4, bottom).

Figure 4: Top: Research gradually improves our understanding of AI systems’ properties. Bottom left: contours depicting the spread of the “area of accurate understanding”. (Since reality isn’t limited by my drawing skills, the real progress can be irregular, discontinuous, or uncover some disconnected areas.) Bottom right: However, different methods might prioritize different areas of the space. Moreover, some parts of the space might be difficult to understand (even prohibitively so), like the part above x.

Finally, while the effort needed to find out whether some AI-system is harmful or not will depend on the method and on the starting point, we can expect that some types of AI systems will be “universally” more difficult to understand than others. We can visualize this, for example, as a fog lying over the space of AI systems. The fog is thick in some parts and thin in others, and the thicker it is, the more effort it takes to dispel it (Figure 5).

Figure 5: Some parts of the AI-system space might be naturally more difficult to understand than others. (In this simplified model, research effort required to verify whether a system S is safe only depends on two factors: the distance between S and our current system S0 and the “fogginess” of the area between S and S0.)
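One possible way to write down the simplified model from the caption of Figure 5 is sketched below. It rests on assumptions that the text does not make: systems are treated as points in a plane, “the area between S and S0” is taken to be the straight segment connecting them, and effort is the fog density accumulated along that segment. The names `research_effort` and `fog` are hypothetical.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]


def research_effort(s0: Point, s: Point,
                    fog: Callable[[Point], float],
                    steps: int = 100) -> float:
    """Approximate the fog density integrated along the segment s0 -> s.

    Uses the midpoint rule with `steps` samples. With uniform fog of
    density 1, the result is simply the distance between s0 and s.
    """
    distance = math.dist(s0, s)
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps  # midpoint of the i-th sub-segment
        point = tuple(a + t * (b - a) for a, b in zip(s0, s))
        total += fog(point)
    return distance * total / steps


def uniform_fog(point: Point) -> float:
    return 1.0


def patchy_fog(point: Point) -> float:
    # Made-up example: fog is three times thicker to the right of x = 1.
    return 3.0 if point[0] > 1 else 1.0


effort_clear = research_effort((0, 0), (3, 4), uniform_fog)
effort_foggy = research_effort((0, 0), (2, 0), patchy_fog)
```

In this sketch, a nearby system behind thick fog can cost more effort than a distant system in a clear region, which is the qualitative point of the fog picture.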

# Using this Mental Model to Clarify Disagreements

Different people have significantly different underlying assumptions when it comes to the development of AI and its safety. We now give some examples of how the above model can be used to communicate these differences. (To make the exposition clearer, we present extreme views on various issues, without making any claims about which scenarios are more likely to be true. The opinions that people actually hold are typically more nuanced.)

First, while people generally agree that both completely harmless and maximally harmful AI systems are physically possible, their opinions often differ when it comes to finer properties of the corresponding risk map (Figure 6). Examples of the different views that people hold are:

• The changes in IH are always continuous and gradual, so that in a single development step, IH will never shift by more than 1 or 2.
• The changes in IH can be very sudden, or even literally discontinuous. In particular, we could see IH shift from “mostly fine” (2 or less) to “really bad” (7+) in a single step.

Figure 6: Harmfulness of AI systems might change gradually (a,b) or quickly (c,d). This is more important than the difference between harmfulness being continuous (a,c) or discontinuous (b,d).
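The two views listed above can be stated as simple checks on a sequence of IH values recorded along a development trajectory, one value per development step. This is only an illustrative sketch: the function names are hypothetical, the example trajectories are made up, and the thresholds (2 for “mostly fine”, 7 for “really bad”) are the ones used in the text.

```python
from typing import Sequence


def max_step_change(ih_trajectory: Sequence[float]) -> float:
    """Largest change in IH between two consecutive development steps."""
    pairs = zip(ih_trajectory, ih_trajectory[1:])
    return max((abs(after - before) for before, after in pairs), default=0)


def has_sudden_jump(ih_trajectory: Sequence[float],
                    fine: float = 2, bad: float = 7) -> bool:
    """True if IH goes from "mostly fine" to "really bad" in a single step."""
    pairs = zip(ih_trajectory, ih_trajectory[1:])
    return any(before <= fine and after >= bad for before, after in pairs)


gradual_world = [1, 2, 3, 4, 5]  # made-up trajectory consistent with the first view
sudden_world = [1, 2, 8]         # made-up trajectory consistent with the second view

gradual_ok = max_step_change(gradual_world) <= 2
sudden_bad = has_sudden_jump(sudden_world)
```

Note that neither check refers to continuity of the underlying risk map, in line with the caption of Figure 6: what matters is the size of the change per step.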

Secondly, people have various intuitions about the “default visibility” within the space - that is, how much knowledge about the safety of AI systems we are going to have without investing extra effort[3] into it (Figure 7). Some possible views in this direction are:

• We might be able to develop AI systems despite having no information about their safety properties (or having very wrong impressions about them). That is, we might be walking blind.
• We will always have accurate knowledge of IH “at least one step ahead”.
• While we may not always have fully accurate information about the IH of AIs that we could build next, we will get an early warning if we get to more dangerous areas of the space. For example, if we become able to build a “7+ harmful AI”, we will know that such AI is somewhere in the neighbourhood of our present system (although perhaps not exactly where).

Figure 7: Different variants of how much information we might have about safety of systems we can build. (a) true state of affairs, (b) accurate information on all options, (c) inaccurate information about some options, but we realize we are in a dangerous territory, (d) completely wrong information about some uncertain options.

Thirdly, people also disagree on the viability of research into systems that are far more advanced than what we currently have (Figure 8). One extreme view is that there are research methods which might, with enough effort, “uncover a safe path” all the way towards AGI and beyond without requiring us to build AI more advanced than what we have now. The opposite view is that understanding more distant AI systems requires progressively more effort, and quickly becomes practically infeasible.

Figure 8: How practically viable is research now into understanding advanced AI systems? Are we in a world where (a) research has strong diminishing returns as we study systems dissimilar to ours, or in one where (b) safe AGI can already be understood now?

Finally, opinions vary on how amenable different parts of the AI-systems space are to safety analysis (Figure 9). Some relevant claims are:

• Systems based on current ML techniques are, if not less safe, then at least inherently more difficult to understand.
• Doing early research now could give us a better big-picture understanding of which types of AI are harmful vs safe (the risk map) and which are more amenable to analysis (fogginess).

Figure 9: Visualization of the hypothesis that the space of advanced AI systems based on machine learning is more difficult to navigate than other (as of yet non-existent) types of AI (perhaps because those have been designed with interpretability and safety in mind).

# Limitations of the model

Some of the ways in which the model is inaccurate are the following:

• We have more than one system running at a time. (While the model assumes that only one relevant system exists.)
• The model of how research influences our understanding of the space of AI systems is too naive. (It ignores all the research done earlier by assuming that our understanding is only a function of which AI system we currently have and how much research we did while having it.)
• In practice, the world does not stand still even if no new AI systems get deployed. As a result, we should either measure the harmfulness of a system in a more robust manner or stick with immediate harmfulness, but assume that the points in the space correspond to (AI system, environment) pairs (with the environment changing even if the system doesn’t).
• The relation between the topology of the space and “how far we can move from the current point in a single step” is unclear.
• We defined immediate harmfulness in terms of “assuming I understand the current situation fully, how undesirable does it seem to me”. This is too subjective and inconsistent across different situations.
• The model doesn’t explicitly talk about the capability of different systems.

1. A side-note: I think that some people already have a model that looks a bit like this in their head. In this sense, I am not making a claim on novelty, but rather view this text as trying to make it easier to refer to ideas that were already floating around to some extent. ↩︎
2. By “blob”, I mean a neighbourhood of $$x$$ that is possibly non-uniform and non-convex (assuming we are in a vector space with a norm). And possibly one that has disconnected components; but since we are already assuming non-uniformity, this seems like a less important distinction. ↩︎
3. The biggest potential for disagreement might lie in operationalizing this “no extra effort” baseline. ↩︎

Thanks for writing this up! I am a huge fan of diagrams. This is indeed roughly the mental model I had, I think, though maybe this is one of those cases where after reading something you internalize it to the point where you think you always thought that way.

Great post. Reminded me a bit of this one. As a concrete tool for discussing AI risk scenarios, I wonder if it doesn't have too many parameters? Like you have to specify the map (at least locally), how far we can see, what will the impact be of this and that research... I'm not saying I don't see the value of this approach, I'm just wondering where to start when trying to apply this tool. More of a methodological question.

> Similarly, AlphaGo will be closer to AlphaZero than to Quicksort, and different iterations of hypothetical IDA would be closer to each other than two random systems.

Maybe I'm reading too much into the mention of quicksort here, but I deduce that the abstract space is one of algorithms/programs, not AIs. I don't think it breaks anything you say, but that's potentially different -- I don't know anyone who would call quicksort AI.

> If we currently have some AI system $$x$$, we can ask which systems are reachable from $$x$$, i.e., which other systems we are capable of constructing now that we have $$x$$.

What are the constraints on "construction"? Because by definition, if those other systems are algorithms/programs, we can build them. It might be easier or faster to build them using x, but it doesn't make construction possible where it was impossible before. I guess what you're aiming at is something like "x reaches y" as a form of "x is stronger than y", and also "if y is stronger, it should be considered when thinking about x"?

A potentially relevant idea is oracles in complexity theory.

> In general, we could (and perhaps should) consider each of the currently-active AI systems individually. However, this might make it difficult or impossible to identify which negative effects are due to which system. We will, therefore, assume that at any given time, there is only a single system that matters for our considerations (either because there is a single system that matters more than all others taken together, or because we can isolate the effects of individual systems, or because we are viewing all systems jointly as a single super-system). This allows us to translate the 0-10 IH scale into colours and visualize the space of AI systems as a "risk map" (Figure 2).

This part confused me. Are you just saying that to have a color map (a map from the space of AIs to $$\mathbb R$$), you need to integrate all systems into one, or consider all systems but one as irrelevant?

> We will assume that we always have sufficient information about what is going on in the world to determine the IH of the current system. Then there will be some systems that are sufficiently similar to the current one (or simple enough to reason about) so that we also have accurate information about their IH. Some other systems might be more opaque, so we will only have fuzzy predictions about their IH. For others yet, we might have no information at all. We can visualize this as decreasing the visibility in parts of the image, indicating that our information is less likely to correspond to reality (Figure 3).

This seems like a great visualization.

> As a concrete tool for discussing AI risk scenarios, I wonder if it doesn't have too many parameters? Like you have to specify the map (at least locally), how far we can see, what will the impact be of this and that research...

I agree that the model does have quite a few parameters. I think you can get some value out of it already by being aware of what the different parameters are and, in case of a disagreement, identifying the one you disagree about the most.

> If we currently have some AI system x, we can ask which systems are reachable from x, i.e., which other systems we are capable of constructing now that we have x.
>
> What are the constraints on "construction"? Because by definition, if those other systems are algorithms/programs, we can build them. It might be easier or faster to build them using x, but it doesn't make construction possible where it was impossible before. I guess what you're aiming at is something like "x reaches y" as a form of "x is stronger than y", and also "if y is stronger, it should be considered when thinking about x"?

I agree that if we can build x, and then build y (with the help of x), then we can in particular build y. So having/not having x doesn't make a difference on what is possible. I implicitly assumed some kind of constraint on the construction like "what can we build in half a year", "what can we build for \$1M" or, more abstractly, "how far can we get using X units of effort".

> [About the colour map:] This part confused me. Are you just saying that to have a color map (a map from the space of AIs to $$\mathbb R$$), you need to integrate all systems into one, or consider all systems but one as irrelevant?

The goal was more to say that in general, it seems hard to reason about the effects of individual AI systems. But if we make some of the extra assumptions, it will make more sense to treat harmfulness as a function from AIs to $$\mathbb R$$.