These existential crises also muddle our impact algorithm. This isn't what you'd see if impact were primarily about the world state.

Appendix: We Asked a Wrong Question

How did we go wrong?

When you are faced with an unanswerable question—a question to which it seems impossible to even imagine an answer—there is a simple trick that can turn the question solvable.
Asking “Why do I have free will?” or “Do I have free will?” sends you off thinking about tiny details of the laws of physics, so distant from the macroscopic level that you couldn’t begin to see them with the naked eye. And you’re asking “Why is X the case?” when X may not be coherent, let alone the case.
“Why do I think I have free will?,” in contrast, is guaranteed answerable. You do, in fact, believe you have free will. This belief seems far more solid and graspable than the ephemerality of free will. And there is, in fact, some nice solid chain of cognitive cause and effect leading up to this belief.
~ Righting a Wrong Question

I think what gets you is asking the question "what things are impactful?" instead of "why do I think things are impactful?". Then, you substitute the easier-feeling question of "how different are these world states?". Your fate is sealed; you've anchored yourself on a Wrong Question.

At least, that's what I did.

Exercise: someone says that impact is closely related to change in object identities.

Find at least two scenarios which score as low impact by this rule but as high impact by your intuition, or vice versa.

You have 3 minutes.

Gee, let's see... Losing your keys, the torture of humans on Iniron, being locked in a room, flunking a critical test in college, losing a significant portion of your episodic memory, ingesting a pill which makes you think murder is OK, changing your discounting to be completely myopic, having your heart broken, getting really dizzy, losing your sight.

That's three minutes for me, at least (the list's length reflects how long I spent coming up with ways I had been wrong).

Appendix: Avoiding Side Effects

Some plans feel like they have unnecessary side effects.

We only talk about side effects when they affect our attainable utility (otherwise we don't notice them), and identifying them requires both a goal (to mark an effect as a "side" effect) and an ontology (to carve the world into discrete "effects").

Accounting for impact this way misses the point.

Yes, we can think about effects and facilitate academic communication more easily via this frame, but we should be careful not to guide research from that frame. This is why I avoided vase examples early on – their prevalence seems like a symptom of an incorrect frame.

(Of course, I certainly did my part to make them more prevalent, what with my first post about impact being called Worrying about the Vase: Whitelisting...)


Notes

  • Your ontology can't be ridiculous ("everything is a single state"), but as long as it lets you represent what you care about, it's fine by AU theory.
  • Read more about ontological crises at Rescuing the utility function.
  • Obviously, something has to be physically different for events to feel impactful, but not all differences are impactful. Necessary, but not sufficient.
  • AU theory avoids the mind projection fallacy; impact is subjectively objective because probability is subjectively objective.
  • I'm not aware of others explicitly trying to deduce our native algorithm for impact. No one was claiming the ontological theories explain our intuitions, and they didn't have the same "is this a big deal?" question in mind. However, we need to actually understand the problem we're solving, and providing that understanding is one responsibility of an impact measure! Understanding our own intuitions is crucial not just for producing nice equations, but also for getting an intuition for what a "low-impact" Frank would do.
Comments

I was just looking back through 2019 posts, and I think there's some interesting crosstalk between this post and an insight I recently had (summarized here).

In general, utility maximizers have the form "maximize E[u(X)|blah]", where u is the utility function and X is a (tuple of) random variables in the agent's world-model. Implication: utility is a function of random variables in a world model, not a function of world-states. This creates ontology problems because variables in a world-model need not correspond to anything in the real world. For instance, some people earnestly believe in ghosts; ghosts are variables in their world model, and their utility function can depend on how happy the ghosts are.
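To make that concrete, here's a minimal Python sketch, with all names (money, ghost_happiness) invented for illustration: utility is computed from variables in the agent's world-model, and nothing stops one of those variables from lacking a real-world referent.

```python
import random

# Toy world-model (all names hypothetical): a distribution over the variables
# the agent believes in. "ghost_happiness" has no real-world referent.
def sample_world_model():
    return {
        "money": random.gauss(50_000, 10_000),
        "ghost_happiness": random.gauss(0.0, 1.0),  # exists only in the model
    }

# Utility is a function of world-model variables X, not of world-states.
def u(x):
    return 0.001 * x["money"] + 5.0 * x["ghost_happiness"]

# The agent picks actions to maximize E[u(X) | beliefs]; here the expectation
# is estimated by Monte Carlo over the agent's own world-model.
def expected_utility(n_samples=10_000):
    return sum(u(sample_world_model()) for _ in range(n_samples)) / n_samples

print(expected_utility())  # roughly 50.0 = 0.001 * 50_000 + 5.0 * 0.0
```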

If we accept the "attainable utility" formulation of impact, this brings up some tricky issues. Do we want to conserve attainable values of E[u(X)], or of u(X) directly? The former leads directly to deceit: if there are actions a human can take which will make them think that u(X) is high, then AU is high under the E[u(X)] formulation, even if there is actually nothing corresponding to u(X). (Example: a human has available actions which will make them think that many ghosts are happy, even though there are no actual ghosts, leading to high AU under the E[u(X)] formulation.) On the other hand, if we try to make attainable values u(X) high directly, then there's a question of what that even means when there's no real-world thing corresponding to X. What actions in the real world do or do not conserve the attainable levels of happiness of ghosts?
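A small follow-on sketch of that divergence, again with invented names: an action that changes only the human's beliefs raises attainable E[u(X)], while "conserve attainable u(X) directly" has no defined value at all, since there is no real-world quantity for ghost_happiness to refer to.

```python
# Toy illustration of conserving E[u(X)] vs. u(X) (all names hypothetical).

def u(x):
    # Utility depends on a world-model variable with no real-world referent.
    return x["ghost_happiness"]

# The human's beliefs: point estimates for each world-model variable,
# standing in for a full probability distribution.
beliefs = {"ghost_happiness": 0.0}

def expected_u(beliefs):
    return u(beliefs)

def staged_seance(beliefs):
    # An action that changes only beliefs: the human comes away convinced
    # the ghosts are thriving, though nothing outside their head changed.
    updated = dict(beliefs)
    updated["ghost_happiness"] = 10.0
    return updated

print(expected_u(beliefs))                 # 0.0 before
print(expected_u(staged_seance(beliefs)))  # 10.0 after: attainable E[u(X)] went up
# u(X) "directly", by contrast, has no defined value here -- there is no
# actual ghost_happiness in the world to evaluate, let alone conserve.
```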

Right. Another E[u(X)] problem: the smart AI realizes that if the dumber human keeps thinking, they'll realize they're about to drive off a cliff, which would negatively impact their attainable utility estimate. Therefore, distract them.

I forgot to mention this in the sequence, but as you say - the formalisms aren't quite right enough to use as an explicit objective due to confusions about adjacent areas of agency. AUP-the-method attempts to get around that by penalizing catastrophically disempowering behavior, such that the low-impact AI doesn't obstruct our ability to get what we want (even though it isn't going out of its way to empower us, either). We'd be trying to make the agent impact/de facto non-obstructive, even though it isn't going to be intent non-obstructive.

I'm not aware of others explicitly trying to deduce our native algorithm for impact. No one was claiming the ontological theories explain our intuitions, and they didn't have the same "is this a big deal?" question in mind. However, we need to actually understand the problem we're solving, and providing that understanding is one responsibility of an impact measure! Understanding our own intuitions is crucial not just for producing nice equations, but also for getting an intuition for what a "low-impact" Frank would do.

I wish you'd expanded on this point a bit more. To me, it seems like to come up with "low-impact" AI, you should be pretty grounded in situations where your AI system might behave in an undesirably "high-impact" way, and generalise the commonalities between those situations into some neat theory (and maybe do some philosophy about which commonalities you think are important to generalise vs accidental), rather than doing analytic philosophy on what the English word "impact" means. Could you say more about why the test-case-driven approach is less compelling to you? Or is this just a matter of the method of exposition you've chosen for this sequence?

Most of the reason is indeed exposition: our intuitions about AU-impact are surprisingly clear-cut and lead naturally to the thing we want "low impact" AIs to do (not be incentivized to catastrophically decrease our attainable utilities, yet still execute decent plans). If our intuitions about impact were garbage and misleading, then I would have taken a different (and perhaps test-case-driven) approach. Plus, I already know that the chain of reasoning leads to a compact understanding of the test cases anyways.

I've also found that test-case based discussion (without first knowing what we want) can lead to a blending of concerns, where someone might think the low-impact agent should do X because agents who generally do X are safer (and they don't see a way around that), or where someone might secretly have a different conception of the problems that low-impact agency should solve, etc.

Thoughts I have at this point in the sequence

  • This style is extremely nice and pleasant and fun to read. I saw that the first post was like that months ago; I didn't expect the entire sequence to be like that. I recall what you said about being unable to type without feeling pain. Did this not extend to handwriting?
  • The message so far seems clearly true in the sense that measuring impact by something that isn't ethical stuff is a bad idea, and making that case is probably really good.
  • I do have the suspicion that quantifying impact properly is impossible without formalizing qualia (and I don't expect the sequence to go there), but I'm very willing to be proven wrong.

Thank you! I poured a lot into this sequence, and I'm glad you're enjoying it. :) Looking forward to what you think of the rest!

I recall what you said about being unable to type without feeling pain. Did this not extend to handwriting?

Handwriting was easier, but I still had to be careful not to do more than ~1 hour / day.