Controlling the amount of logical information available to an agent could hypothetically give upper bounds on how well it can perform various computational tasks like predicting humans, while allowing the agent to make progress on some unrelated task. This could hypothetically enable things like preventing operator modeling and manipulation.

I think this ended up being mostly an interesting exercise in flaw finding. Experts may get most of the value from the summary.

Executive summary

Say an overseer prevents an agent from running any computation that would give too much logical information about, say, humans, relative to what the agent already knows. That is, the probability assignments made by a good logical predictor about a human’s responses to queries, given the output of the requested computation, are not allowed to become much more precise or accurate than the current estimate, which uses just the knowledge base of outputs of previously run computations. The agent is still allowed to run computations that are not informative about the human.

The hope is that the agent will be able to think usefully about other things, while remaining unable to model the human and therefore remaining safe. If the agent could model the human using the results of the computations it has run, then so should the predictor; but the overseer prevented the knowledge base from growing enough to let the predictor model the human well.

Problems with this idea are described in detail in the last section. As I see it, the main problems are:

  • The protocol isn’t really safe at all; the agent might ask to run superintelligences that either manipulate the overseer directly, or else obfuscate their machinations.

  • Safety guarantees are hard to maintain in conjunction with a growing knowledge base, since both the overseer and the agent would need additional thinking time to even be able to read the contents of the knowledge base.

  • If the agent is prevented from thinking about things related to humans, then it can’t do anything useful to humans, since finding something humans like would at least tell a predictor that a human would likely accept proposals to cause that thing.

  • It may be pretty easy to manipulate humans to disastrous effect without thinking all that hard about it.

Background and motivation

TL;DR: it would be nice to be able to stop an AI system from knowing too much or thinking too hard about certain other systems.

Limiting optimization power

It is not hard to build an agent that is known to be safe, in the sense of not being able to manipulate its environment strongly. For example, consider the following robustly safe autonomous agent:

Definition. Let the agent be a rock.

Then we know from empirical data and good theoretical understanding that the rock is safe, given certain size assumptions. But a rock is not very useful. It would be nice to have agents that

  • Are (stably) unable to manipulate certain parts of their environment, and

  • Are able to strongly manipulate other parts of their environment.

For example, we might want an agent that can invent a cure for cancer (design experiments, do abstract thinking, etc.), but doesn’t e.g. manipulate its operators into accepting some garbage output instead of an actual cancer cure.

Throttling computation

A possible approach is to prevent the agent from getting too much information about the forbidden parts of the environment, while allowing it to work on its intended task. The hope is that information-theoretic guarantees will translate into guarantees about how accurately the agent can model forbidden things.

However, throttling the amount of ordinary empirical information available to the agent about some system does not guarantee that the agent can’t model that system; the agent can just think really hard about the system. For example, an AI with lots of computing power, talking to a programmer, would after a while be able to locate the programmer within an accurate simulation of the universe (or some other equally good predictor), and hence be able to manipulate the programmer.

This suggests that preventing the agent from thinking too hard about the forbidden part of the environment might help. Now the hope is that for systems of interest, such as human operators, it would require lots of thinking to model that system well; so if the agent can’t think too hard about the system, then it won’t be able to model the system, and won’t be able to manipulate it. I suggest making this slightly more concrete by replacing “thinking about a system” with “running computations that are informative about the system, for a good predictor that can’t directly compute the system”.

Toy model: plan approver

TL;DR: for purposes of illustration, consider a human in a box approving proposed plans from an agent. An overseer vetoes computations the agent wants to run, preventing the agent from thinking about the human.

Human in a box

We have a human in a box. The human will be given plans by the AI, and the human will approve or disapprove each plan. For example, the human might only approve plans that clearly describe a cure for cancer.

We assume that the human is represented by a computation such that if the agent can model the actual human well, then it can predict the outputs of that computation well. For example, the computation might be a whole-brain emulation, or it might be a computationally expensive induction scheme with input from an actual human in a box.

Agent-overseer architecture

The AI consists of a top-level agent, a knowledge base of outputs of computations, and an overseer (which acts as the predictor). The agent is assumed to be in some sense simple; the overseer can do better than the agent at any computational task. The agent can ask the overseer for permission to run some computation. If the overseer approves, the computation is run to produce its output, and then that output is added to the knowledge base. At some point, the agent outputs a plan.

We assume that there is some computation describing the goal, such that if the agent can model the goal computation well then it can output a good plan. For example, the goal computation may be a verifier for the predicate “has some chance of being a good plan”.
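The request-and-approve loop described above can be sketched as follows. This is only an illustrative toy, not the post’s formalism: the names `run_with_overseer`, `toy_overseer`, `multiply` and `simulate_human` are hypothetical, and the veto policy here is a deliberately crude stand-in (it looks at a function’s name) rather than a real predictor.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a computation is a zero-argument function
# returning its output as a string.
Computation = Callable[[], str]


def run_with_overseer(
    requests: List[Computation],
    approves: Callable[[Computation, Dict[str, str]], bool],
) -> Dict[str, str]:
    """Run each requested computation only if the overseer approves it.

    Approved outputs are added to the knowledge base; vetoed computations
    are never executed.
    """
    knowledge_base: Dict[str, str] = {}
    for computation in requests:
        if approves(computation, knowledge_base):
            knowledge_base[computation.__name__] = computation()
    return knowledge_base


def multiply() -> str:
    return str(6 * 7)


def simulate_human() -> str:
    return "approve"


def toy_overseer(computation: Computation, knowledge_base: Dict[str, str]) -> bool:
    # Stand-in policy (an assumption for illustration only): veto anything
    # whose name mentions the human.
    return "human" not in computation.__name__


kb = run_with_overseer([multiply, simulate_human], toy_overseer)
```

A name-based veto like this is exactly the kind of naive filter the post later argues cannot work; the sketch only fixes the control flow.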

Goal: avoid modeling the human

There are lots of blank spots in this setup, most notably the assumption that we can point to the goal computation. This encapsulates most of the difficulty; e.g. it black-boxes the part where the output is human-readable, has a chance of doing something human-valuable, etc. However, it is not obvious to me that this is quite FAI-complete.

Specifically, there may still be degrees of freedom in the plausible plans that the agent might output, so that asking a superintelligent oracle to find something approved by the goal computation may be unsafe. Probably a righter way to deal with this is to understand decision theory and value loading / ontology identification better. But in any case the approach in this post is to try to directly prevent the agent from modeling the human, while letting it think about the goal somewhat.

Throttling modeling capabilities

TL;DR: the agent thinks about its output using a knowledge base of outputs of computations. The overseer checks how much its beliefs about the human would change if it learned what a requested computation outputs. If the output would be too informative, the agent is not allowed to run that computation. This is hopefully safe because the agent can’t see much that the predictor can’t. This might be cool because maybe the agent can do very clever things that happen to be uninformative about the human.

Assume a good predictor

To talk about “modeling” a system, we’ll view the overseer as a predictor of logical / computational facts. The overseer is assumed to be a “good” predictor, meaning it does at least as well (in some appropriate sense) as all other algorithms as simple or resource-constrained as the agent.

Throttling logical information

Recall the notion of logical information. If we want to prevent the agent from modeling a system, one possible method would be to limit the amount of logical information the agent is allowed to have about the system. So, in the plan approver toy model, the overseer will allow the agent to run a computation and add the result to the knowledge base only if the overseer’s distribution over the human’s outputs given the enlarged knowledge base has not much more negentropy (or some other measure of informativeness) than its distribution given the current knowledge base.
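As a minimal sketch of this rule, assume the overseer’s predictor can be queried as a function from a knowledge base to a probability distribution over the human’s response, and assume the overseer can evaluate the effect of the computation’s actual output. Both are assumptions for illustration; `overseer_allows`, `toy_predict` and the fact names are hypothetical, and entropy in bits is just one of the possible informativeness measures.

```python
import math


def entropy_bits(dist):
    """Shannon entropy, in bits, of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def overseer_allows(predict, knowledge_base, new_fact, max_bits):
    """Approve new_fact only if it sharpens the prediction by at most max_bits.

    predict: hypothetical predictor, maps a frozenset of facts to a
             distribution over the human's response.
    """
    before = entropy_bits(predict(knowledge_base))
    after = entropy_bits(predict(knowledge_base | {new_fact}))
    return before - after <= max_bits


def toy_predict(knowledge_base):
    # Toy predictor: only a brain scan makes the human predictable.
    if "human_brain_scan" in knowledge_base:
        return {"approve": 0.99, "reject": 0.01}
    return {"approve": 0.5, "reject": 0.5}


allowed_math = overseer_allows(toy_predict, frozenset(), "matrix_inverse", 0.1)
allowed_scan = overseer_allows(toy_predict, frozenset(), "human_brain_scan", 0.1)
```

Here `allowed_math` is true and `allowed_scan` is false, because only the brain scan reduces the predictor’s entropy about the human by more than the 0.1-bit budget.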

Vague safety argument

The (vague, intuitive) argument for safety is that if the agent could model the human well enough to search for and find a manipulative output, then it must be able to make good predictions about the human. If the agent can make good predictions about the human, then so can the overseer, since by assumption the overseer does at least as well as the sort of thing that the agent is. If the overseer can make good predictions about the human, then its distribution over the human’s outputs given the knowledge base has low entropy. But this contradicts that the overseer was keeping that quantity high, so the agent can’t manipulate the human, even with access to the knowledge base. This argument is intended to include both the computations that the agent does itself, as well as the computations selected by the agent to be added to the knowledge base; if the agent has some good reason to think a computation will be helpful, the overseer should be able to see that reason.

This could also be used to prevent the agent from modeling anything outside of its box, or from modeling its internal workings.

Notions of information

I don’t know what would be a good measure of the informativeness of a computation about a forbidden computation. Possible parameters:

  • Computing resources that the agent can use to think about its output.

  • Computing resources that the overseer can use to think about its probability assignments.

  • Valuation of the overseer’s conditional distributions over the forbidden computation’s output. E.g. entropy, KL divergence or total variation between the distributions with and without the new output, the probability of the true outcome, etc.

  • Dependence on the possible inputs to the human. E.g. prevent the agent from predicting what the human will do given any possible short input, or given all inputs from some other set.

  • Dependence on the possible outputs of the requested computation. E.g. prevent the valuation of the conditional distribution from being bad with respect to any possible output of the computation, or in expectation according to the overseer, or just for the actual output of the computation.
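Two of the candidate valuations named in the list, KL divergence and total variation, can be computed for a pair of discrete distributions as in the sketch below. The distributions `prior` and `posterior` are made-up illustrative numbers, standing for the predictor’s beliefs about the human’s response before and after learning a new output; they are not taken from the post.

```python
import math


def kl_divergence_bits(p, q):
    """KL(p || q) in bits; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in outcomes)


# Hypothetical beliefs about the human's response.
prior = {"approve": 0.5, "reject": 0.5}
posterior = {"approve": 0.9, "reject": 0.1}

kl = kl_divergence_bits(posterior, prior)
tv = total_variation(posterior, prior)
```

The two measures can rank the same update differently, which is one reason the choice of valuation matters for what the overseer ends up vetoing.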

Interesting properties

(These properties don’t actually hold.)

Collapsing levels of indirection and obfuscation

Naively trying to forbid certain computations simply doesn’t work, because computations can be rearranged to an extreme degree. If you forbid certain code, other code will do the trick; if you forbid data in one format, another format will do the trick; if you forbid simulated civilizations from constructing silicon computers, another simulated substrate will do the trick; and so on. The hope here is that strong guarantees from logical uncertainty will translate into guarantees saying that any hidden forms of computation, if they are still recognizable and useful, will be recognized by the overseer and forbidden if necessary.

Modularized modeling ability

It’s easy to limit the agent’s ability to model the human by putting strong bounds on the agent’s computational resources, but this also stops the agent from doing anything impressively useful. The hope is that there are useful computations that are (a) difficult to directly compute, and (b) not very informative about the human. Then the agent can use the full force of a superintelligence (albeit lesioned w.r.t. the human) to make good guesses about the goal, while remaining ignorant about the human.

(Almost) full superintelligence

The agent can run arbitrary computations, as long as they aren’t informative about the human. In particular, it can do all manner of self-improvement, in the form of running computations that output pointers to other, better computations. So the agent-overseer architecture appears to have not sacrificed much power.

Problems

TL;DR: there are a bunch of holes in this sort of proposal, indicating that it is not the right way to understand things.

Information leaks

TL;DR: empirical information also has to be accounted for, the feedback from the overseer is an information leak, and the overseer is blind to thinking about things the predictor already knows about. Most notably, there could be superintelligences running inside requested computations that hack the overseer or the agent, ruining any safety guarantees.

Superintelligences in requested computations

The agent might be able to run an agent inside a requested computation that superintelligently searches for ways to thwart the overseer, and sneakily thinks about the forbidden human computation. There’s some hope to salvage this: if the agent picks a computation in order to sneakily think about the human, then the agent must have had a reason to think the computation would do that, and then the overseer should be able to see that reason and veto the computation.

What seems to be an even more serious problem: the agent could mistakenly choose to run a computation that is itself unsafe, in the sense that it implements a superintelligence that hacks the agent from inside and runs amok. It may be that the computation in fact has nothing to do with the human, so that the overseer approves it, and then the embedded superintelligence proceeds to recommend to the agent a cascading series of computations that model the human and hack out of the box, all behind a veil of superintelligent obfuscation.

In either case, the embedded superintelligence can search for any weakness in the overseer’s predictions. Specifically, it can search for any output that a computation could give that would be mistakenly approved by the overseer, by virtue of the overseer being miscalibrated, but that would in fact end up leading to the agent successfully modeling the human.

Solomonoff induction can probably manipulate humans given very little empirical data and lots of computation power. Similarly, it seems likely that it is possible to manipulate humans using not too much computation, given the right empirical data. Simple machine learning algorithms, with access to lots of data about human behavior, could find manipulative behaviors without doing any computations that, by themselves, constitute detailed models of humans. So a scheme for controlling information should probably take into account “total information”, i.e. both logical and empirical information.

Empirical information can be logical information

Such a scheme should also take into account that empirical observations can be informative about logical facts. For example, if the agent visually observes the output of some reliable computer that it believes to be running a certain computation, the agent would hopefully update its predictions about the logical question of what the output of the computation is. This may come for free from a good theory of logical uncertainty, for example by replacing observations with logical statements like “my sense inputs are in such-and-such state”, in the context of some theory of the world that includes a description of how the agent’s sense inputs work.

Information vetoes are themselves informative

Suppose the agent asks if it is ok to run a computation, and the overseer says no (because the computation would be informative about the human). Now, even though the knowledge base has not changed, it seems that the agent has learned something. Namely, it has learned that the overseer rejects the computation, and so it has learned that the computation is informative about the human. Likewise, if the computation is approved, the agent learns that it is not informative about the human. This constitutes a security leak; for example, it seems to make it a little nonobvious to state what the overseer’s goodness guarantee should be relative to the agent, since in the overseer architecture, the agent has access not only to the knowledge base, but also to an oracle for the predicate “this computation is informative about the human”.

Pre-existing information

As mentioned in the post on logical information, the overseer may have idiosyncrasies that compromise security. For example, if the predictor is powerful enough to compute the results of the human computation, then it will judge everything to be non-informative about the human, and so will allow everything. As a special case, the agent may already know enough to model the human well.

Sleeper information

As the knowledge base grows, it seems like the overseer has to be given more time to think about computations; otherwise it doesn’t have time to read the contents of the knowledge base, which it has to do if it is going to say when a computation is useful for guessing the human’s outputs. But if the overseer gets more time to think, then computations already in the knowledge base that were previously uninformative about the human may suddenly become informative. This seems like a serious problem.

Overly restricting

TL;DR: there are still harsh tradeoffs between restricting too much to be useful vs. sacrificing safety guarantees, so it’s unclear whether progress has been made; notably, computations that are useful for humans are also at least a little bit useful for modeling humans.

Many things are entangled with many other things

For a given forbidden computation, there may be many intricate and fascinating ways to use other computations to help speed up guesses about it. The upshot is that the agent may be forbidden from running many computations that would otherwise be very useful for the goal, even if they don’t intuitively have much to do with the human in particular.

Indeed, any computations that are, roughly speaking, convergent instrumental goals, stand a good chance of being useful for predicting the human and therefore of being vetoed by the overseer. For example, a computation that invents an algorithm for inverting matrices might be useful for many things, including both the goal and the human.

Another example: consider a computation that searches for computations that a good predictor would think are informative about the goal and not informative about the human. This seems like a reasonable thing for the agent to do in its situation, but this computation might (or might not) be vetoed by the overseer, since being told that a computation is not informative about the human is itself information about the human (though perhaps not very much at all; at least it would save a predictor the time of running that computation).

One more example: humans are significantly entangled with all sorts of things in the environment that the agent will be acting in. For example, modeling anything that humans think about or anything that humans made will also model the humans to some extent. If modeling the human is forbidden, the agent will be very restricted.

Human-useful computations are necessarily informative about humans

More importantly, if modeling the human is forbidden, then it seems that the agent can’t do anything useful for humans. Finding a good solution to a problem that a human is interested in can be viewed as a search for outputs that are likely to be approved by the human. This is logical information about the human.

This seems like a fundamental problem with this approach. In other words, manipulation is a subtle enough concept that we can’t just forbid computations that look for outputs the human will like; searching for a good output is such a computation.

For example, we might hope that we could ask the agent to develop reliable molecular nanotechnology and report its designs, without manipulating the human approver. However, even the search for a good design itself gives a lot of information about what sorts of designs are likely to be accepted by a human anticipating an MNT design.

Bounding information across the approver’s entire input space

Maybe we can allow the agent to perform useful computations by not bounding the information the agent can have about the human for all inputs to the human. But then we give up safety; the agent can search over the allowed inputs for manipulative inputs.

Other problems

Dependence on guarantees from logical uncertainty

Any safety arguments about throttling logical information will probably only go through given very strong optimality guarantees from logical uncertainty. For example, we are assuming that the overseer makes as good estimates as any program could, over a wide range of possible knowledge bases (level three of Scott’s program). In particular, the overseer has to be robust to all sorts of computations the agent might request, including potential superintelligences.

Throttling the wrong thing

Say we throttle negentropy in the plan approver setup, so the agent is prevented from having low-entropy estimates of the human’s response on any short input. Depending on the output space of the human computation, this might not help very much. Say the human outputs either “Yes”, or “No” along with an explanation of what was wrong with the plan. Then it is possible for the agent to gain a lot of information just about the question of whether or not the human says yes, without decreasing the overall entropy, by increasing entropy over the explanation part of the output. This means that a scheme like this would have to pay careful attention to what was being throttled.
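A small numerical sketch of this failure mode, with made-up probabilities: the output space is “Yes” or “No” plus one of eight possible explanations. All names and numbers here are hypothetical and chosen only to show that total entropy can go up while the Yes/No verdict becomes more predictable.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def full_distribution(p_yes, explanation_weights):
    """Distribution over: "Yes", then ("No", explanation_i) for each i."""
    p_no = 1.0 - p_yes
    return [p_yes] + [p_no * w for w in explanation_weights]


# Before: verdict is a coin flip, but the explanation is fully predictable.
before = full_distribution(0.5, [1.0] + [0.0] * 7)
# After: verdict is nearly certain, but the explanation is spread uniformly.
after = full_distribution(0.1, [1.0 / 8] * 8)

total_before = entropy_bits(before)
total_after = entropy_bits(after)
verdict_before = entropy_bits([0.5, 0.5])
verdict_after = entropy_bits([0.1, 0.9])
```

An overseer that only checks `total_after >= total_before` would approve this update, even though entropy over the verdict alone has dropped from one bit to under half a bit.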

Nonsense counterfactuals

Say we want to define information by looking at the overseer’s conditional distribution over the human’s outputs for all possible outputs of a requested computation, instead of just the true one; for example, limiting the maximum or the expected change in information. This e.g. avoids having to run the computation, deals with probabilistic programs, and might capture a notion of “what an agent would reasonably expect to learn about the human by running the computation”.

For this to make any sense, the overseer would have to give somewhat sensible results even given a knowledge base with possibly false information. It seems vaguely plausible that this could work, because it seems like a predictor that does very well given accurate knowledge bases would take the data at face value and make reasonable deductions under that assumption. But this by no means comes for free from a good theory of logical uncertainty, and may be just as hard as understanding logical counterfactuals.

General problems with controlling modeling capabilities

Modeling capability is not the same as optimization power or effect size

To strongly optimize some region of the universe, it is probably not at all necessary to model that region very much; for example, an agent could just destroy everything in the region, and then rearrange it as desired. Furthermore, it may not take much optimization power to have a large effect relative to human values. For example, an agent might acquire and detonate weapons of mass destruction; this is easy enough that even humans can do it, but has a large effect.

Unstable under self-modification

The overseer scheme obviously strongly incentivizes an agent to kill the overseer, so it is not stable under reflection. It might be possible to also lesion the agent about its own operation, though I don’t know quite how that would work. Naively forbidding any computations that are informative about the knowledge base would forbid everything, since an approved computation gets added to the knowledge base.

High overhead

Checking every computation seems like it would be expensive, especially if the overseer has to separately think about the human’s response to every possible input, conditioned on the result of some computation.
