*A putative new idea for AI control; index here.*

This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:

#. Would not effectively aid M(u), a u-maximiser.
#. Would not effectively resist M(-u), a u-minimiser.

So satisficers make poor allies and weak enemies.

# The design, the designer and the verifier

Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware as to what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).

However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) about everything except u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.

So it seems that S(u) must:

#. Set u close to its maximal value (as this is "easy to approach").
#. Not increase v to any significant extent (or else M(u-v) would not design it).
#. Not decrease v to any significant extent (or else M(εu+v) would veto it).
#. Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
#. Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility functions whose upper bounds are easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
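To make the "soft minimum rather than expectation" remark concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration, not part of the design above: the names `soft_min`, `impact_scores` and `acceptable`, the tolerances, the temperature, and the idea of scoring each candidate v by the negated absolute change in E[v].

```python
import math


def soft_min(values, temperature=0.1):
    """Smooth stand-in for min(values); tends to the true min as temperature -> 0."""
    lowest = min(values)
    # Shift by the true minimum so every exponent is <= 0 (numerically stable).
    mean_weight = sum(math.exp(-(x - lowest) / temperature) for x in values) / len(values)
    return lowest - temperature * math.log(mean_weight)


def impact_scores(v_changes):
    """Score each candidate v by how little S(u) moves E[v]; 0 is best, more negative is worse."""
    return [-abs(change) for change in v_changes]


def acceptable(u_value, u_max, v_changes, u_tol=0.01, impact_tol=0.1):
    """Hypothetical acceptance test: u near its bound and low worst-case impact across the v's."""
    near_max = u_value >= u_max - u_tol
    return near_max and soft_min(impact_scores(v_changes)) >= -impact_tol


# 99 candidate v's untouched, one changed by 5: the plain average hides the outlier.
changes = [0.0] * 99 + [5.0]
scores = impact_scores(changes)
mean_score = sum(scores) / len(scores)  # expectation across all v
soft_score = soft_min(scores)           # soft minimum across all v
```

In this toy, the average score stays close to zero while the soft minimum is dominated by the single heavily affected v. That is the kind of edge case an expectation across all v could miss.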

For the moment, this does seem like it would produce a successful satisficer...

Interesting idea! To clarify: M(u−v) does not know u or v, and further does not know u−v, but S will know u and v.

It seems simplest to assume that each M(ϵu+v) uses CDT, so it accepts S(u) iff the existence of S(u) is good according to ϵu+v. So M(u−v) will maximize the probability that S(u) will barely change E[v], while approaching the maximum u.

The question is whether it is safe to maximize u subject to the constraints that E[v] does not change much for almost all v in our prior. My intuition is that if no v in our prior encodes something very close to human values, then it's possible to destroy human values without changing any E[v]. But perhaps this could be used if we already have a prior that assigns some non-negligible probability to human values. This could plausibly be the case if human values have low relative Kolmogorov complexity (relative to the AI's observations).

A similar thing that might be simpler is: have M(u) know u and maximize u by designing S(u). Then select v randomly and have either M(ϵu+v) or M(ϵu−v) get to veto S(u). Alternatively, both could be given veto power.
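As a rough illustration of what giving both verifiers veto power would require, here is a minimal sketch. It assumes each verifier accepts iff building S(u) does not lower its own expected utility, and it summarises S(u) by two hypothetical numbers, `delta_u` and `delta_v`, the changes in E[u] and E[v] that S(u) causes. The function names and the weak inequality are illustrative choices, not something specified above.

```python
def accepts(delta_u, delta_v, epsilon, sign):
    """Verifier with utility epsilon*u + sign*v accepts iff S(u) does not lower that utility."""
    return epsilon * delta_u + sign * delta_v >= 0


def passes_both_vetoes(delta_u, delta_v, epsilon):
    """S(u) survives only if neither M(eps*u + v) nor M(eps*u - v) vetoes it."""
    return accepts(delta_u, delta_v, epsilon, +1) and accepts(delta_u, delta_v, epsilon, -1)


# A small shift in v is tolerated because the epsilon*u term outweighs it.
small_shift_ok = passes_both_vetoes(delta_u=1.0, delta_v=0.005, epsilon=0.01)
# Raising v by 1 is vetoed by M(eps*u - v); lowering it by 1 is vetoed by M(eps*u + v).
raise_v_ok = passes_both_vetoes(delta_u=1.0, delta_v=1.0, epsilon=0.01)
lower_v_ok = passes_both_vetoes(delta_u=1.0, delta_v=-1.0, epsilon=0.01)
```

Under these assumptions the two checks together amount to |delta_v| ≤ ϵ·delta_u, so with small ϵ the design may only move E[v] by a small amount in either direction.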

Now that I think about it more, we might be able to drop the assumption that the maximum u is easy to approach by adding a large penalty for S(u) getting vetoed.

I wanted it to be ignorant of u so that we could use S() as a generic satisficing design.

This seems similar to giving a resource gathering agent veto power.

My thought process is: if you want to destroy human values on an appreciable scale, you have to destroy (or neutralise) humans. Doing so seems hard, if you have to avoid making the task easier (or harder) for an M(v) with unknown v.

Suppose a resource gathering agent has a 50% chance of having a goal v and 50% chance of −v. The resource gathering agent is completely indifferent to anything that is not causally connected to the source of information about v or −v. So unless it expects to exist in the future, it will be completely indifferent to the existence of S(u).

I think this gets into ontology identification issues. If your ontology doesn't contain high-level mental concepts, then it's possible to e.g. lobotomize everyone (or do something similar without visibly changing the brain) without any goal expressed in this ontology caring. I think something like this satisficer design could work to conservatively do useful things when we already have the right ontology (or assign significant probability to it).

No. The agent will still want to gather resources (that's the point). For instance, if you expect to discover tomorrow whether you hate puppies or love them, you would still want to get a lot of money, become world dictator, etc... And the agent does expect (counterfactually or hypothetically or whatever you want to call the false miracle approach) to exist in the future.

I'm trying to design it so that it doesn't. If for example you kill everyone (or lobotomize them or whatever) then a putative M(v) (or M(ϵu+v)) agent will find it a lot easier to maximise v, as there is less human opposition to overcome. I'm trying to say that "lobotomizing everyone has a large expected impact" rather than defining what lobotomizing means.

I see. I still think this is different from getting vetoed by both M(ϵu+v) and M(ϵu−v). Suppose S(u) will take some action that will increase v by one, and will otherwise not affect v at all. This will be vetoed by M(ϵu−v). However, a resource gathering agent will not care. This makes sense given that getting vetoed by both M(ϵu−v) and M(ϵu+v) imposes two constraints, while getting vetoed by only a resource gathering agent imposes one.

That makes sense. I actually didn't understand the part in the original post where M(ϵu+v) expects to exist in the future, and assumed that it had no abilities other than veto power.

Still, I think it might be possible to fix E[v] for almost all v while destroying human value. Let f be the distribution over v that M(u−v) uses. Suppose M(u−v) can compute the default E[v] values for each v (e.g. because it has a distribution over which value function a future singleton will have). Then it will construct a distribution consisting of a mixture of ϵu+v and ϵu−v for almost all v in the support of f, sample from this distribution, and allow the resulting utility function to take over the world. To construct the distribution such that each E[v] stays almost the same for almost all v, it can adjust the probabilities assigned to ϵu+v and ϵu−v appropriately for almost all v in the support of f. It might be necessary to include convex combinations ϵu+θv1+(1−θ)v2 as well sometimes, but this doesn't significantly change things. Now with high probability, ϵu±v will take over the world for some v that is not close to human values.
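The probability adjustment in this construction can be shown for a single v. This is a toy sketch under strong assumptions that are not in the comment above: v takes a known value `v_if_maximised` if ϵu+v takes over and a known value `v_if_minimised` if ϵu−v takes over, and cross-effects between different v's are ignored. All names are hypothetical.

```python
def calibrated_probability(default_ev, v_if_maximised, v_if_minimised):
    """Probability p of releasing the v-maximiser such that
    p * v_if_maximised + (1 - p) * v_if_minimised equals the default E[v]."""
    if not (v_if_minimised <= default_ev <= v_if_maximised):
        raise ValueError("default E[v] must lie between the two extremes")
    spread = v_if_maximised - v_if_minimised
    if spread == 0:
        return 0.5  # v is constant here, so any probability works
    return (default_ev - v_if_minimised) / spread


def expected_v(p, v_if_maximised, v_if_minimised):
    """E[v] under the two-point mixture over who takes over."""
    return p * v_if_maximised + (1 - p) * v_if_minimised


# Default E[v] = 2 with extremes -2 and 10: pick p so the mixture reproduces it.
p = calibrated_probability(default_ev=2.0, v_if_maximised=10.0, v_if_minimised=-2.0)
mixture_ev = expected_v(p, v_if_maximised=10.0, v_if_minimised=-2.0)
```

In this toy, E[v] under the mixture matches its default value, so a veto rule that only looks at E[v] would not object, even though the realised outcome is one of the two extremes.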

Perhaps this leads us to the conclusion that we should set f to the distribution over the future singleton's values, and trust that human values are probable enough that these will be taken into account. But it seems like you are saying we can apply this even when we haven't constructed a distribution that assigns non-negligible probability to human values.