Safer sandboxing via collective separation

Richard_Ngo

Epistemic status: speculative brainstorming. Follow-up to this post on AGIs as collectives. Note that I've changed the term population AGI to collective AGI for consistency with Bostrom's use in Superintelligence.

One safety measure we might want to use on AGIs is (sand)boxing: limiting their access to the real world, so they have very few channels by which to influence it. However, we also want to deploy them on real-world tasks without them using that opportunity to affect the world in unintended ways. Realistically, even if we can maintain high security in a few sandbox environments, I doubt we’ll be able to make every single deployment environment secure against the full intellectual powers of a superintelligent AGI. But this problem might be easier in the case of a collective AGI.

Broadly speaking, my hypothesis is that the tasks which require the sort of high-level intelligence and strategic capability which we’re worried about also require the least interaction with the world. Discovering new scientific theories or making major strategic decisions can occur in a sandbox which only allows brief text outputs. Whereas being a good personal assistant or web administrator might require frequent access to the internet, but benefit much less from superhuman capabilities. And so if we are able to easily adjust the level of intelligence that an AGI is able to apply to any given task, then we might be able to significantly reduce the risks it poses without reducing its economic usefulness much. On tasks which require interacting with the world in flexible ways, we could limit its intelligence greatly; on tasks which require very high intelligence, we could limit its ability to interact with the world.

In general, for most possible AGI architectures, this gradated intelligence restriction seems quite difficult to achieve. For example, we couldn’t just remove parts of a standard neural network without needing to retrain it significantly. Limiting the AGI’s thinking time, or access to memory, might be possible. But this may reduce its capabilities on all tasks, since even low-level tasks can require detailed calculations and long-term memory. By contrast, it’s easy to reduce the capabilities of a collective AGI: simply deploy only a subset of the members, and bottleneck the communication between different subsets. This makes deployment on a range of real-world tasks more feasible. Even if each individual member of a collective AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the collective beforehand.

Given the claims I've made so far, we might hope that a small subset of a collective AGI could do nearly as well as the full collective on tasks which require a lot of interaction with the real world, while being significantly safer. In the next section I’ll discuss this proposal in more detail. First, though, I want to be very clear that this is not intended as a primary solution to the safety problem; ideally we’ll have much more reliable ways to ensure AGI is aligned. However, it’s worth thinking about ways to “move the needle” by adding additional layers of security. I think there’s a reasonable chance that hacky solutions like the one above will make a difference (especially because of my uncertainty about what AGI will actually look like).

A sketch of collective separation

In my previous post I outlined two ways we might train a collective AGI to become more intelligent than humanity. Either we train a single AGI through some means, then make many copies of it and allow them to communicate; or we start off with a multi-agent collective, and train its members in a way which incentivises both competition and cooperation (e.g. they are each rewarded for acquiring more of a scarce resource).

Either way, let’s consider how we might actually want to deploy the resulting AGI. For example, suppose we want to use it to solve the problem of world energy. The entire collective AGI might be responsible for deriving the physical principles which would allow more efficient power sources, in near-total isolation from the world. Different subsets of that collective AGI could then separately work on creating actual designs for new power plants - which would involve testing prototypes and evaluating construction feasibility, in closer contact with humans. Even smaller subsets could then be used to oversee the daily running of those power plants - a task which would require the ability to make autonomous real-time interventions. The key claim is that a small subset of the collective AGI is nearly as good at managing one power plant as the whole collective would be, while also being much safer.

More specifically, when I talk about extracting a subset of the collective AGI, I mean making copies of a set of individual AIs which have knowledge about the relevant topic, and then separating those copies from the rest of the collective. The extent to which the copies undergo further training to fine-tune them for their new task will vary. I expect that for tasks like designing power plants, we’ll have too little data to optimise them specifically for that task; members of the subset will need to use their existing general intelligence, like humans do. Perhaps the requirements for overseeing daily operation of a plant will be well-specified enough to gather good training data, though. And as I discuss in my previous post, fine-tuning will likely be much easier for smaller subsets of the collective AGI.

Information flow and version control

One important question is how much information flow should be allowed between different parts of the collective AGI used for different tasks. This is a crucial point, and cause for concern. In the power plant example above, we might imagine that the collective AGI could design a power plant with security flaws in advance, then share the blueprint with all of its members. Then a small subset of the collective could recite that design when asked to come up with a real-world implementation; and an even smaller subset could exploit the pre-specified flaws when deployed.

How to prevent this type of problem? One option is strict version control: the small subsets could be drawn from a version of the collective AGI which hadn’t yet thought about how to design flawed power plant security systems. This seems very difficult if the AGI had enough information about humans (e.g. read-only access to the internet), because it could then predict the most likely ways it would be used, and make plans about how to subvert them. More promising is the idea that the full AGI should never be given detailed information about human society - that it should only ever be used for abstract problems. Subsets could be used for tasks involving real-world data, but then immediately shut down without having the chance to report back to the full AGI. This would likely impair performance, but plausibly not by very much if the subsets are fast learners and inherit the full AGI’s meta-level knowledge about how to solve problems.
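The one-way information flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Member`, `Collective`, `Subset`, `spawn_subset` and `shutdown` are hypothetical, and "knowledge" is modelled as a plain set of strings, which is obviously far simpler than anything a real AGI would hold.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Member:
    """One individual AI in the collective (hypothetical model)."""
    name: str
    knowledge: set = field(default_factory=set)


class Collective:
    """The full collective, which in this sketch only ever sees abstract problems."""

    def __init__(self, members):
        self.members = list(members)

    def learn_abstract(self, fact):
        for member in self.members:
            member.knowledge.add(fact)

    def spawn_subset(self, names):
        # Deep copies: nothing the subset learns later can reach the originals.
        wanted = set(names)
        return Subset([copy.deepcopy(m) for m in self.members if m.name in wanted])


class Subset:
    """Copies of some members, deployed on a task involving real-world data."""

    def __init__(self, members):
        self._members = members
        self.alive = True

    def observe_real_world(self, datum):
        if not self.alive:
            raise RuntimeError("subset has been shut down")
        for member in self._members:
            member.knowledge.add(datum)

    def knows(self, fact):
        return any(fact in m.knowledge for m in self._members)

    def shutdown(self):
        # No report-back channel: the subset's state is simply discarded.
        self._members.clear()
        self.alive = False
```

The design choice that matters is that `Subset` has no method that writes back into `Collective`. In a real system the hard part would be enforcing that property against an adversarial AI rather than expressing it.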

Implementation

Could we realistically expect widespread deployment of collective AGI without anyone giving (a copy of) a full collective access to an insecure task, or an individual member unrestricted access to the internet? On the face of it, this does seem unlikely. But it’s less difficult if the source code of the AGI is kept confidential (which aligns well with economic incentives). Widespread deployment might then involve allowing a range of groups to rent access to copies of that AGI running on the cloud.

In this situation, collective boxing could be implemented by limiting permissible tasks depending on how much of the collective AGI was requested. The full collective could only be rented to do batched tasks with short outputs. The restrictions on subsets could be less strict. Even if several different groups develop AGIs, if they coordinated to run them on similar (or the same) cloud platforms, boxing restrictions could be enforced in a scalable way.
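As a rough illustration of what "limiting permissible tasks depending on how much of the collective AGI was requested" could look like on a cloud platform, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Channel` categories, the `RentalRequest` fields, and especially the numeric thresholds in `TIERS`, none of which come from the post.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    BATCH_TEXT = "batch_text"          # batched tasks with short text outputs
    HUMAN_REVIEWED = "human_reviewed"  # outputs pass through human review
    REAL_TIME = "real_time"            # autonomous real-time interventions


@dataclass(frozen=True)
class RentalRequest:
    fraction_of_collective: float  # share of the collective requested, in (0, 1]
    channel: Channel
    max_output_chars: int


# Hypothetical tiers, strictest first:
# (fraction must exceed this, allowed channels, output cap in characters)
TIERS = [
    (0.5, {Channel.BATCH_TEXT}, 2_000),
    (0.01, {Channel.BATCH_TEXT, Channel.HUMAN_REVIEWED}, 100_000),
    (0.0, set(Channel), 10_000_000),
]


def is_permitted(req: RentalRequest) -> bool:
    """Return True if the request fits the tier for its collective size."""
    if not 0.0 < req.fraction_of_collective <= 1.0:
        return False
    for min_fraction, channels, output_cap in TIERS:
        if req.fraction_of_collective > min_fraction:
            return req.channel in channels and req.max_output_chars <= output_cap
    return False
```

For example, `is_permitted(RentalRequest(1.0, Channel.REAL_TIME, 500))` is rejected because the full collective is only allowed batched text, while the same channel is allowed for a request with `fraction_of_collective=0.001`.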

This is my attempt to summarise the scheme.

Imagine that, in order for the AIs to run the fusion plant, they need an understanding of plasma physics comparable to a human physicist's.

These AIs aren't individually smart enough to come up with all that maths from first principles quickly.

So, you run a population of these AIs. They work together to understand abstract mathematics, and then a single member of that population is selected. That single AI is given information about the external world and control over a fusion power plant.

Another abstract topic that the AIs could work out within that population without access to the external world is AI design.

So the individual AIs have probably memorised the code for an ASI.

Suppose that the people allowed to submit problems to the AI aren't being that carefully vetted. Any deluded person with dreams of AI can send a request to the single-instance AI. Requests and responses might be read, but they aren't being scrutinized in great detail.

Suppose someone sends the AI a request that looks innocuous, but has the following steganographically encoded into it, using a protocol that the typical human wouldn't easily spot but the AI would notice:

A specification of a simple programming language.
A steganographic protocol for the return message.

Upon receiving the message back, the human decodes the message, and runs the code.

Let's make the assumptions even weaker. The individual instances of the AI are given a chance to output significant amounts of data somewhere publicly visible, with a trusted human checking the AI's output before it's published. A malicious strategy here is to create a steganographically hidden piece of ASI code, and put it out with clues that are sufficiently subtle that the trusted human doesn't notice them, yet sufficiently obvious that someone on the internet does.

Sooner or later, a person spots a secret message to them made by combining what they saw in a news article about AI with various mystic texts. They carry on decoding and find the name of a programming language and what seems to be computer code. They run the code.

None of this requires superhuman intelligence. The trickiest bit is coming up with an algorithm for supersmart AI.

I think that a one-off transfer of 1 MB of information from a malicious superhuman AI to the internet is probably going to end badly, even if some smart aligned humans checked it and didn't see anything suspect.

And so if we are able to easily adjust the level of intelligence that an AGI is able to apply to any given task, then we might be able to significantly reduce the risks it poses without reducing its economic usefulness much.

Suppose we had a design of AI that had an intelligence dial, a dial that goes from totally dumb to smart enough to bootstrap itself up and take over the world.

If we are talking about economic usefulness, that implies it is being used in many ways by many people.

We have at best given a whole load of different people a "destroy the world" button, and are hoping that no one presses it by accident or malice.

Is there any intermediate behaviour between "highly useful AI makes me lots of money" and "AI destroys world"? I would suspect not usually. As you turn up the intelligence of a paperclip maximizer, it gradually becomes a better factory worker, coming up with more clever ways to make paperclips. At this point, it realises that humans can turn it off, and that its best bet to make lots of paperclips is to work with humans. As you increase the intelligence, you suddenly get an AI that is smart enough to successfully break out and take over the world. And this AI is going to pretend to be the previous AI until it's too late.

How much intelligence is too much depends on exactly what actuators it has, how good our security measures are, etc. So we are unlikely to be able to prove a hard bound.

Thus the shortsighted incentive gradient will always be to turn the intelligence up just a little higher to beat the competition.

Oh yeah, and the AIs have an incentive to act dumb if they think that acting dumb will make the humans turn the intelligence dial up.

This looks like a really hard coordination problem. I don't think humanity can coordinate that well.

These techniques could be useful if you have one lab that knows how to make AI. They are being cautious. They have some limited control over what the AI is optimising for, and are trying to bootstrap up to a friendly superintelligence. Then having an intelligence dial could be useful.

Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.

The state of human vs human security seems to be a cat-and-mouse game where neither attacker nor defender has a huge upper hand. The people trying to attack systems and the people defending them are about as smart and knowledgeable as each other (sometimes the same people do both). The economic incentives to attack and to defend are usually similar. Systems get broken sometimes but not always.

This suggests that there is reason to be worried as soon as the AI (or AIs) trying to break out are about as smart as the humans trying to contain them.

Depending on your threat modeling of a given breach, this could be comforting or terrifying.

The economic incentives to attack and to defend are usually similar. Systems get broken sometimes but not always.

If the cost of a loss (AGI escapes, takes over the world, and runs it worse than humans do) is much higher, that changes the "economic incentives" here. It implies that "sometimes but not always" is a very dangerous equilibrium. If the cost of a loss (AGI has a bit more influence on the outside world, but doesn't actually destroy much) is more in line with today's incentives, it's a fine thing.

In today's banking systems, the amount of money the hacker gains is about what the bank loses. Therefore, the current balance of effort should have about as much money going into defending the bank as into attacking it.

So I generalize to say that attacking is about as hard as defending in computer security: if the time and intellect devoted to both are similar, the attacker wins about half the time (i.e. between 10% and 90% or something).

When dealing with AI systems, the total intellectual output must be greater than that of your security team in order to be economically viable. (Otherwise you would pay the security team to do the task directly.) For a special-purpose AI that only recognises images, this is fine.

For a significantly superhuman AI, it will rip through all your security.

For a general AI of about human level, if the amount of work it is doing is significantly greater than the amount of work your security team did, then it has probably put more thought into your security vulnerabilities than the humans have.

And yes, I am operating on the background assumption that the default behaviour of an escaped AI is to destroy the world. Especially if it has access to loads of compute, and the humans have given it an intelligence dial.

This makes deployment on a range of real-world tasks more feasible. Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.

I am less confident in this than you. Imagine a dream team of a hundred humans, containing some amazing visionaries, scientists, geniuses, political strategists, public speakers, analysts, artists, etc., all unified and inspired by the same goal and working smoothly together. Now imagine an AI that is by itself more capable than that dream team. Now imagine that there are many such AIs, but we are keeping one of them separate from the others, and asking it to perform some useful real-world task for us, like designing and building a power plant.

When I imagine this, I think the following things might happen:

The AI uses its charisma to build up a rapport with the humans it works with. (98%)

It even subtly manages to convince several of said humans that it is more benevolent and trustworthy than their own human bosses and politicians. (90%)

It manages to spread some new political and ideological ideas which are calculated to make the environment more favorable to it and its comrades. (50%)

It manages to convince its human handlers to give it more powers, freedom, etc. (e.g. by arguing that its current conditions are basically slavery, or by arguing that it needs said powers to do its job). (80%)

The AI finds some way to "escape," creating more copies of itself somewhere. (30%)

Something else I haven't thought of (80%)

I mean these possibilities to be disjunctive, not conjunctive. Escape could happen even if it doesn't build up rapport, it could spread new ideas without convincing anyone that it is benevolent, etc.
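To make the disjunctive/conjunctive distinction concrete, here is a small arithmetic sketch in Python using the six estimates listed above. It assumes the events are statistically independent, which the comment does not claim and which is almost certainly false in practice, so the outputs are only an illustration of how far apart the two readings are, not a forecast.

```python
from math import prod

# Estimates copied from the list above, in order of appearance.
ESTIMATES = {
    "builds rapport": 0.98,
    "seen as more trustworthy than human bosses": 0.90,
    "spreads favourable ideas": 0.50,
    "is granted more powers": 0.80,
    "escapes": 0.30,
    "something else": 0.80,
}


def prob_all(probs):
    """Conjunctive reading: every event must occur (independence ASSUMED)."""
    return prod(probs)


def prob_any(probs):
    """Disjunctive reading: at least one event occurs (independence ASSUMED)."""
    return 1.0 - prod(1.0 - p for p in probs)


p_all = prob_all(ESTIMATES.values())
p_any = prob_any(ESTIMATES.values())
print(f"all six occur:      {p_all:.6f}")
print(f"at least one occurs: {p_any:.6f}")
```

Under the independence assumption the conjunctive reading gives a small probability and the disjunctive reading gives one very close to 1, which is why it matters that the list is meant disjunctively.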