Ian Televan — AI Alignment Forum

While I agree that the algorithm might output 5, I don't share the intuition that it's something that wasn't 'supposed' to happen, so I'm not sure what problem it was meant to demonstrate. I thought of a few ways to interpret it, but I'm not sure which one, if any, was the intended interpretation:

a) The algorithm is defined to compute argmax, but it doesn't output argmax because of false antecedents.

- but I would say that it's not actually defined to compute argmax, therefore the fact that it doesn't output argmax is not a problem.

b) Regardless of the output, the algorithm uses reasoning from false antecedents, which seems nonsensical from the perspective of someone who uses intuitive conditionals, which impedes its reasoning.

- it may indeed seem nonsensical, but if 'seeming nonsensical' doesn't actually impede its ability to select actions wich highest utility (when it's actually defined to compute argmax), then I would say that it's also not a problem. Furthermore, wouldn't MUDT be perfectly satisfied with the tuple ? It also uses 'nonsensical' reasoning 'A()=5 => U()=0' but still outputs action with highest utility.

c) Even when the use of false antecedents doesn't impede its reasoning, the way it arrives at its conclusions is counterintuitive to humans, which means that we're more likely to make a catastrophic mistake when reasoning about how the agent reasons.

- Maybe? I don't have access to other people's intuitions, but when I read the example, I didn't have any intuitive feeling of what the algorithm would do, so instead I just calculated all assignments $(x, y) \in {0, 5, 10}^{2}$ , eliminated all inconsistent ones and proceeded from there. And this issue wouldn't be unique to false antecedents, there are other perfectly valid pieces of logic that might nonetheless seem counterintuitive to humans, for example the puzzle with islanders and blue eyes.

Yet, we reason informally from false antecedents all the time, EG thinking about what would happen if

When I try to examine my own reasoning, I find that when I do so, I'm just selectively blind to certain details and so don't notice any problems. For example: suppose the environment calculates "U=10 if action = A; U=0 if action = B" and I, being a utility maximizer, am deciding between actions A and B. Then I might imagine something like "I chose A and got 10 utils", and "I chose B and got 0 utils" - ergo, I should choose A.

But actually, if I had thought deeper about the second case, I would also think "hm, because I'm determined to choose the action with highest reward I would not choose B. And yet I chose B. This is logically impossible! OH NO THIS TIMELINE IS INCONSISTENT!" - so I couldn't actually coherently reason about what could happen if I chose B. And yet, I would still be left with the only consistent timeline where I choose A, which I would promptly follow, and get my maximum of 10 utils.

The problem is also "solved" if the agent thinks only about the environment, ignoring its knowledge about its own source code.

The idea with reversing the outputs and taking the assignment that is valid for both versions of the algorithm seemed to me to be closer to the notion "but what would actually happen if you actually acted differently", i.e. avoiding seemingly nonsensical reasoning while preserving self-reflection. But I'm not sure when, if ever, this principle can be generalized.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments