The best value indifference method (so far)

If we apply this to the shutdown problem, is it acceptable to say:

$\hat{P}(\cdot \mid h_{t}) = 100\%\ U_{N}$ if the button has not been pressed in $h_{t}$

$\hat{P}(\cdot \mid h_{t}) = 100\%\ U_{S}$ otherwise

If not, what would you set $\hat{P}$ to? (I'm treating $U_{N}$ and $U_{S}$ as reward functions here, which seems fine.)
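As a minimal sketch of the proposal above, the indicator-style $\hat{P}$ could be written as follows. The list-of-strings history and the `"button_pressed"` marker are hypothetical, chosen only for illustration.

```python
def proposed_p_hat(history):
    """Weights on U_N and U_S given a history h_t.

    `history` is a hypothetical list of event labels; the label
    "button_pressed" is an assumed marker for the button press.
    """
    if "button_pressed" in history:
        # Button pressed somewhere in h_t: all weight on U_S.
        return {"U_N": 0.0, "U_S": 1.0}
    # Button not pressed in h_t: all weight on U_N.
    return {"U_N": 1.0, "U_S": 0.0}
```

For example, `proposed_p_hat(["a1", "o1"])` puts all weight on $U_{N}$, while any history containing `"button_pressed"` puts all weight on $U_{S}$.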

For policies/actions that don't affect the probability of humans pressing the button, $\hat{P} = P$.

For actions that do affect the probability a little bit, the effect of $\hat{P}$ will be to undo this, by, for instance, slightly increasing the probability of $U_{S}$ given the button was pressed.

I'm not completely sure what multiple actions with large changes of probability would lead to (in expectation, nothing, but in actual fact...)

jessicata (9y)

Hmm... I'm finding that I'm unable to write down a simple shutdown problem in this framework (e.g. an environment where it should switch between maximizing paperclips and shutting down) to analyze what this algorithm does. To know what the algorithm does, I need to know what $P$ and $\hat{P}$ are (since these are parameters of the algorithm). From those I can derive $P'$ and $\hat{P}'$ to determine the agent's action. But at the moment I have no way of proceeding, since I don't know what $P$ and $\hat{P}$ are. Can you get me unstuck?

Stuart_Armstrong (9y)

Suppose the humans have already decided whether to press the shutdown button or order the AI to maximise paperclips. If $o_{s}$ is the observation of the shutdown command and $o_{p}$ the observation of the paperclip maximising command, and $u_{s}$ and $u_{p}$ the relevant utilities, then $P$ can be defined as $P(u_{s} \mid h_{m-1} o_{s}) = 1$ and $P(u_{p} \mid h_{m-1} o_{p}) = 1$, for all histories $h_{m-1}$.

Then define $\hat{P}$ as the probability of $o_{s}$ versus $o_{p}$, conditional on the fact that the agent follows a particular deterministic policy $\pi^{0}$.

If the agent does indeed follow $\pi^{0}$, then $\hat{P} = \hat{P}'$. If it varies from this policy, then $\hat{P}'$ is altered in proportion to the expected change in $\hat{P}$ caused by choosing a different action.
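To make the setup concrete, here is a rough sketch of a one-step toy version. The action names and probabilities are invented for illustration. The sketch only computes $P$, $\hat{P}$ and the expected shift caused by deviating from $\pi^{0}$; it does not attempt the $\hat{P}'$ update itself, whose exact form is not spelled out in this comment.

```python
from fractions import Fraction

# Hypothetical toy numbers: probability that the humans issue the shutdown
# command o_s after each of the agent's possible actions.
PROB_SHUTDOWN_GIVEN_ACTION = {
    "wait": Fraction(1, 2),
    "hide_button": Fraction(1, 10),
}

# Stands in for the deterministic default policy pi^0 (an assumption).
DEFAULT_ACTION = "wait"


def posterior_p(observation):
    """P(u | h o): weight 1 on u_s after o_s, weight 1 on u_p after o_p."""
    if observation == "o_s":
        return {"u_s": Fraction(1), "u_p": Fraction(0)}
    if observation == "o_p":
        return {"u_s": Fraction(0), "u_p": Fraction(1)}
    raise ValueError(f"unknown observation: {observation!r}")


def p_hat():
    """P-hat: probability of o_s versus o_p, assuming the agent follows pi^0."""
    p_s = PROB_SHUTDOWN_GIVEN_ACTION[DEFAULT_ACTION]
    return {"u_s": p_s, "u_p": 1 - p_s}


def expected_shift(action):
    """Expected change in the weight on u_s from taking `action` instead of pi^0."""
    return (PROB_SHUTDOWN_GIVEN_ACTION[action]
            - PROB_SHUTDOWN_GIVEN_ACTION[DEFAULT_ACTION])
```

Here `expected_shift("wait")` is zero, matching the case where the agent follows $\pi^{0}$, while `expected_shift("hide_button")` is negative, which is the kind of change the correction would have to account for.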

jessicata (9y)

It seems like you're taking $P (\cdot | h_{t})$ to be a real number. Is $\cdot$ a specific event?

jessicata (9y)

> Because of the specific time restriction, there is no way to randomise the outcome ahead of time. And because it’s assumed tied to a specific physical event, there is no way to influence it at all. The whole physical definition and apparatus serve the purpose of making biasing the only way to affect the result.

Hmm, I don't understand. Of course it is possible to influence the button push without biasing it (e.g. create a robot that flips a coin and then pushes or doesn't push the button). And of course it's not possible to influence the quantum event in any way (including by biasing it). So I don't see any event that can't be influenced in any way except by biasing it.

Stuart_Armstrong (9y)

The way I'm using the term, unbiased influence involves replacing the stochastic event with another one that has the same mean. But since (or if) the quantum event is specifically defined in the process, this can't be done.
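A small sketch of this usage of "unbiased", assuming the event is a 0/1 variable and the replacement is drawn independently of it. The distributions are made-up examples, and the function names are hypothetical.

```python
from fractions import Fraction


def mean(dist):
    """Expected value of a distribution given as {outcome: probability}."""
    return sum(outcome * prob for outcome, prob in dist.items())


def is_unbiased_replacement(original, candidate):
    """True if swapping `original` for `candidate` leaves the mean unchanged."""
    return mean(original) == mean(candidate)


def disagreement_probability(original, candidate):
    """Chance the two events differ, assuming they are independent."""
    return sum(p * q
               for x, p in original.items()
               for y, q in candidate.items()
               if x != y)


# Made-up example distributions over {0, 1}.
quantum_event = {1: Fraction(1, 2), 0: Fraction(1, 2)}
robot_coin = {1: Fraction(1, 2), 0: Fraction(1, 2)}   # same mean
biased_coin = {1: Fraction(9, 10), 0: Fraction(1, 10)}  # different mean
```

In this toy case the robot's coin counts as an unbiased replacement (same mean) even though it disagrees with the original event half the time, whereas the biased coin changes the mean.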

jessicata (9y)

Let me ask a more specific question. In your setup with the quantum event $Q$ and the button $B$, can you define the event $E$ such that:

The agent can influence $E$ by biasing $E$ .
The agent can't influence $E$ without biasing $E$ .

Clearly, $E \neq Q$ and $E \neq B$, so I don't know what $E$ is. (I interpreted you as saying there is such an $E$; let me know if this is incorrect.)

Stuart_Armstrong (9y)

You are correct and I'm wrong. The causal counterfactual is unbiased and uninfluenceable. The evidential counterfactual is both biased and influenceable. I'll correct the post.

AI ALIGNMENT FORUM
AF

