Creating a satisficer

[-]jessicata11y00

Interesting idea! To clarify: $M (u - v)$ does not know $u$ or $v$ , and further does not know $u - v$ , but $S$ will know $u$ and $v$ .

It seems simplest to assume that each $M (ϵ u + v)$ uses CDT, so it accepts $S (u)$ iff the existence of $S (U)$ is good according to $\epsilon u + v $. So $M (u - v)$ will maximize the probability that $S (u)$ will barely change $E [v]$ , while approaching the maximum $u$ .

The question is whether it is safe to maximize $u$ subject to the constraints that $E [v]$ does not change much for almost all $v$ in our prior. My intuition is that if no $v$ in our prior encodes something very close to human values, then it's possible to destroy human values without changing any $E [v]$ . But perhaps this could be used if we already have a prior that assigns some non-negligible probability to human values. This could plausibly be the case if human values have low relative Kolmogorov complexity (relative to the AI's observations).

A similar thing that might be simpler is: have $M (u)$ know $u$ and maximize $u$ by designing $S (u)$ . Then select $v$ randomly and have either $M (ϵ u + v)$ or $M (ϵ u - v)$ get to veto $S (u)$ . Alternatively, both could be given veto power.

Now that I think about it more, we might be able to drop the assumption that the maximum $u$ is easy to approach by adding a large penalty for $S (u)$ getting vetoed.

[-]Stuart_Armstrong11y00

have $M (u)$ know $u$ and maximize $u$ by designing $S (u)$ .

I wanted it to be ignorant of $u$ so that we could use $S ()$ as a generic satisficing design.

both could be given veto power.

This seems similar to giving a resource gathering agent veto power.

then it’s possible to destroy human values without changing any $E [v]$ .

My thought process is: if you want to destroy human values on an appreciable scale, you have to destroy (or neutralise) humans. Doing so seems hard, if you have to avoid making the task easier (or harder) for a $M (v)$ with unknown $v$ .

[-]jessicata11y00

This seems similar to giving a resource gathering agent veto power.

Suppose a resource gathering agent has a 50% chance of having a goal $v$ and 50% chance of $- v$ . The resource gathering agent is completely indifferent to anything that is not causally connected to the source of information about $v$ or $- v$ . So unless it expects to exist in the future, it will be completely indifferent to the existence of $S (u)$ .

if you want to destroy human values on an appreciable scale, you have to destroy (or neutralise) humans

I think this gets into ontology identification issues. If your ontology doesn't contain high-level mental concepts, then it's possible to e.g. lobotomize everyone (or do something similar without visibly changing the brain) without any goal expressed in this ontology caring. I think something like this satisficer design could work to conservatively do useful things when we already have the right ontology (or assign significant probability to it).

[-]Stuart_Armstrong11y00

The resource gathering agent is completely indifferent to anything that is not causally connected to the source of information about v or −v.

No. The agent will still want to gather resources (that's the point). For instance, if you expect to discover tomorrow whether you hate puppies or love them, you would still want to get a lot of money, become world dictator, etc... And the agent does expect (counterfactually or hypothetically or whatever you want to call the false miracle approach) to exist in the future.

I think this gets into ontology identification issues.

I'm trying to design it so that it doesn't. If for example you kill everyone (or lobotomize them or whatever) then a putative M(v) (or M( $ϵ$ u+v)) agent will find it a lot easier to maximise v, as there is less human opposition to overcome. I'm trying to say that "lobotomizing everyone has a large expected impact" rather than defining what lobotomizing means.

[-]jessicata11y00

And the agent does expect (counterfactually or hypothetically or whatever you want to call the false miracle approach) to exist in the future.

I see. I still think this is different from getting vetoed by both $M (ϵ u + v)$ and $M (ϵ u - v)$ . Suppose $S (u)$ will take some action that will increase $v$ by one, and will otherwise not affect $v$ at all. This will be vetoed by $M (ϵ u - v)$ . However, a resource gathering agent will not care. This makes sense given that getting vetoed by both $M (ϵ u - v)$ and $M (ϵ u + v)$ imposes two constraints, while getting vetoed by only a resource gathering agent imposes one.

If for example you kill everyone (or lobotomize them or whatever) then a putative M(v) (or M(ϵu+v)) agent will find it a lot easier to maximise v, as there is less human opposition to overcome.

That makes sense. I actually didn't understand the part in the original post where $M (ϵ u + v)$ expects to exist in the future, and assumed that it had no abilities other than veto power.

Still, I think it might be possible to fix $E [v]$ for almost all $v$ while destroying human value. Let $f$ be the distribution over $v$ that $M (u - v)$ uses. Suppose $M (u - v)$ can compute the default $E [v]$ values for each $v$ (e.g. because it has a distribution over which value function a future singleton will have). Then it will construct a distribution consisting of a mixture of $ϵ u + v$ and $ϵ u - v$ for almost all $v$ in the support of $f$ , sample from this distribution, and allow the resulting utility function to take over the world. To construct the distribution such that each $E [v]$ stays almost the same for almost all $v$ , it can adjust the probabilities assigned to $ϵ u + v$ and $ϵ u - v$ appropriately for almost all $v$ in the support of $f$ . It might be necessary to include convex combinations $ϵ u + θ v_{1} + (1 - θ) v_{2}$ as well sometimes, but this doesn't significantly change things. Now with high probability, $ϵ u \pm v$ will take over the world for some $v$ that is not close to human values.

Perhaps this leads us to the conclusion that we should set $f$ to the distribution over the future singleton's values, and trust that human values are probable enough that these will be taken into account. But it seems like you are saying we can apply this even when we haven't constructed a distribution that assigns non-negligible probability to human values.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

1

1

The design, the designer and the verifier