Personal Blog

A putative new idea for AI control; index here.

This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:

#. Would not effectively aid M(u), a u-maximiser.
#. Would not effectively resist M(-u), a u-minimiser.

So satisficers make poor allies and weak enemies.

The design, the designer and the verifier

Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware as to what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).

However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) about everything except u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.
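
As a rough illustration of this design-then-veto step, here is a toy sketch in Python. Everything in it is an assumption made for illustration: each candidate design is summarised by just two numbers (its gain in u and its change in v, both relative to a world where nothing is built), the names `Design`, `designer_wants`, `verifier_allows` and `surviving` are hypothetical, and ε is passed in explicitly even though it is unknown to M(u-v) in the setup above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Hypothetical summary of a candidate S(u), relative to building nothing."""
    name: str
    u_gain: float  # increase in u (u is bounded, so this is at most 1 here)
    dv: float      # change in v caused by the design


def designer_wants(d: Design) -> bool:
    # M(u-v) only proposes designs that raise u - v above the do-nothing baseline.
    return d.u_gain - d.dv > 0


def verifier_allows(d: Design, eps: float) -> bool:
    # M(eps*u + v) vetoes any design that would lower eps*u + v.
    return eps * d.u_gain + d.dv >= 0


def surviving(designs, eps):
    """Designs that are both proposed by the designer and not vetoed."""
    return [d for d in designs if designer_wants(d) and verifier_allows(d, eps)]


candidates = [
    Design("quiet", u_gain=0.95, dv=0.0),
    Design("boosts_v", u_gain=1.0, dv=5.0),
    Design("crushes_v", u_gain=1.0, dv=-5.0),
    Design("slight_dip", u_gain=0.9, dv=-0.001),
]

ok = surviving(candidates, eps=0.01)
best = max(ok, key=lambda d: d.u_gain - d.dv)
print([d.name for d in ok], best.name)
```

In this toy, the design that raises v a lot is never proposed, the design that lowers v a lot is vetoed, and only the designs with a near-zero effect on v survive both agents. The small ε gives a correspondingly small tolerance for decreases in v.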

So it seems that S(u) must:

#. Set u close to its maximal value (as this is "easy to approach").
#. Not increase v to any significant extent (or else M(u-v) would not design it).
#. Not decrease v to any significant extent (or else M(εu+v) would veto it).
#. Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
#. Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility function whose upper bound is easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).
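
To make the soft-minimum remark in the list above a little more concrete, here is a small Python comparison. The particular soft minimum used (a negative log-mean-exp with a sharpness parameter `beta`) and the per-v acceptability scores are illustrative assumptions, not something the design above specifies.

```python
import math


def expectation(scores):
    return sum(scores) / len(scores)


def soft_min(scores, beta=5.0):
    """Smooth lower envelope: -(1/beta) * log(mean(exp(-beta * x))).

    Approaches min(scores) as beta grows and the plain mean as beta -> 0.
    """
    m = min(scores)
    # Shift by the minimum so the exponentials stay in a safe range.
    total = sum(math.exp(-beta * (x - m)) for x in scores)
    return m - math.log(total / len(scores)) / beta


# Hypothetical scores: how acceptable a design is under each candidate v
# (0 means "v unchanged", negative means "v badly disturbed").
scores = [0.0] * 99 + [-10.0]

print(expectation(scores))  # the single bad v is mostly averaged away
print(soft_min(scores))     # the single bad v dominates the score
```

With an expectation, one badly disturbed v among a hundred barely moves the score, which is the kind of edge case the soft minimum is meant to catch.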

For the moment, this does seem like it would produce a successful satisficer...

Interesting idea! To clarify: M(u-v) does not know u or v, and further does not know ε, but M(εu+v) will know u and v.

It seems simplest to assume that each M uses CDT, so M(εu+v) accepts iff the existence of S(u) is good according to εu+v. So M(u-v) will maximize the probability that S(u) will barely change v, while approaching the maximum u.

The question is whether it is safe to maximize u subject to the constraints that v does not change much for almost all v in our prior. My intuition is that if no v in our prior encodes something very close to human values, then it's possible to destroy human values without changing any v. But perhaps this could be used if we already have a prior that assigns some non-negligible probability to human values. This could plausibly be the case if human values have low relative Kolmogorov complexity (relative to the AI's observations).

A similar thing that might be simpler is: have M(u) know u and maximize u by designing S(u). Then select v randomly and have either M(v) or M(-v) get to veto S(u). Alternatively, both could be given veto power.

Now that I think about it more, we might be able to drop the assumption that the maximum is easy to approach by adding a large penalty for getting vetoed.

have M(u) know u and maximize u by designing S(u).

I wanted it to be ignorant of u so that we could use S() as a generic satisficing design.

both could be given veto power.

This seems similar to giving a resource gathering agent veto power.

then it's possible to destroy human values without changing any v.

My thought process is: if you want to destroy human values on an appreciable scale, you have to destroy (or neutralise) humans. Doing so seems hard, if you have to avoid making the task easier (or harder) for an M(v) with unknown v.

This seems similar to giving a resource gathering agent veto power.

Suppose a resource gathering agent has a 50% chance of having a goal v and 50% chance of −v. The resource gathering agent is completely indifferent to anything that is not causally connected to the source of information about v or −v. So unless it expects to exist in the future, it will be completely indifferent to the existence of S(u).

if you want to destroy human values on an appreciable scale, you have to destroy (or neutralise) humans

I think this gets into ontology identification issues. If your ontology doesn't contain high-level mental concepts, then it's possible to e.g. lobotomize everyone (or do something similar without visibly changing the brain) without any goal expressed in this ontology caring. I think something like this satisficer design could work to conservatively do useful things when we already have the right ontology (or assign significant probability to it).

The resource gathering agent is completely indifferent to anything that is not causally connected to the source of information about v or −v.

No. The agent will still want to gather resources (that's the point). For instance, if you expect to discover tomorrow whether you hate puppies or love them, you would still want to get a lot of money, become world dictator, etc... And the agent does expect (counterfactually or hypothetically or whatever you want to call the false miracle approach) to exist in the future.

I think this gets into ontology identification issues.

I'm trying to design it so that it doesn't. If for example you kill everyone (or lobotomize them or whatever) then a putative M(v) (or M(εu+v)) agent will find it a lot easier to maximise v, as there is less human opposition to overcome. I'm trying to say that "lobotomizing everyone has a large expected impact" rather than defining what lobotomizing means.

And the agent does expect (counterfactually or hypothetically or whatever you want to call the false miracle approach) to exist in the future.

I see. I still think this is different from getting vetoed by both M(v) and M(-v). Suppose S(u) will take some action that will increase v by one, and will otherwise not affect v at all. This will be vetoed by M(-v). However, a resource gathering agent will not care. This makes sense given that getting vetoed by both M(v) and M(-v) imposes two constraints, while getting vetoed by only a resource gathering agent imposes one.
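
The arithmetic behind this example can be sketched in a few lines of Python. This assumes, purely for illustration, that each agent judges an action only by the change `dv` it causes in v, and that the resource gathering agent weighs the goals v and −v at 50% each; the function names are hypothetical.

```python
def vetoed_by_both(dv: float) -> bool:
    # M(v) objects if v goes down, M(-v) objects if v goes up,
    # so together they reject any nonzero change in v.
    m_v_vetoes = dv < 0
    m_minus_v_vetoes = dv > 0
    return m_v_vetoes or m_minus_v_vetoes


def resource_gatherer_expected_change(dv: float, p_v: float = 0.5) -> float:
    # Expected change in the agent's eventual goal: v with probability p_v,
    # -v otherwise. Effects on resources are assumed to be zero here.
    return p_v * dv + (1 - p_v) * (-dv)


dv = 1.0  # the action raises v by one and does nothing else
print(vetoed_by_both(dv))                     # True: M(-v) vetoes
print(resource_gatherer_expected_change(dv))  # 0.0: indifferent
```

The pair of vetoes pins v from both sides, while the single resource gathering agent only sees the two effects cancel in expectation.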

If for example you kill everyone (or lobotomize them or whatever) then a putative M(v) (or M(ϵu+v)) agent will find it a lot easier to maximise v, as there is less human opposition to overcome.

That makes sense. I actually didn't understand the part in the original post where M(εu+v) expects to exist in the future, and assumed that it had no abilities other than veto power.

Still, I think it might be possible to fix v for almost all v while destroying human value. Consider the distribution over v that M(u-v) uses. Suppose S(u) can compute the default values for each v (e.g. because it has a distribution over which value function a future singleton will have). Then it will construct a distribution consisting of a mixture of v and −v for almost all v in the support of that prior, sample from this distribution, and allow the resulting utility function to take over the world. To construct the distribution such that each v stays almost the same for almost all v, it can adjust the probabilities assigned to v and −v appropriately for almost all v in the support of the prior. It might be necessary to include convex combinations as well sometimes, but this doesn't significantly change things. Now with high probability, the world will be taken over on behalf of some v that is not close to human values.

Perhaps this leads us to the conclusion that we should set the prior over v to the distribution over the future singleton's values, and trust that human values are probable enough that these will be taken into account. But it seems like you are saying we can apply this even when we haven't constructed a distribution that assigns non-negligible probability to human values.