People care about each other even though they have imperfect motivational pointers?

Bunthut (3y)

"This prediction seems flatly wrong: I wouldn't bring about an outcome like that. Why do I believe that? Because I have reasonably high-fidelity access to my own policy, via imagining myself in the relevant situations."

This seems like you're confusing two things here, because the thing you would want is not knowable by introspection. What I think you're introspecting is that if you'd noticed that the-thing-you-pursued-so-far was different from what your brother actually wants, you'd do what he actually wants. But the-thing-you-pursued-so-far doesn't play the role of "your utility function" in the Goodhart argument. All of you plays into that. If Goodharting were to play out, your detector for differences between the-thing-you-pursued-so-far and what-your-brother-actually-wants would simply fail to warn you that it was happening, because it too can only use a proxy measure for the real thing.

TurnTrout (3y)

I want to know whether, as a matter of falsifiable fact, I would enact good outcomes by my brother's values were I very powerful and smart. You seem to be sympathetic to the falsifiable-in-principle prediction that, no, I would not. (Is that true?)

Anyways, I don't really buy this counterargument, but we can consider the following variant (from footnote 2):

We can also swap out "I bring about a good future for my brother" with "my brother brings about a good future for me, and I think that he will do a good job of it, even though he presumably doesn't contain a 'perfect' motivational pointer to my true values."

"True" values: My own (which I have access to)

"Proxy" values: My brother's model of my values (I have a model of his model of my values, as part of the package deal by which I have a model of him)

I still predict that he would bring about a good future by my values. Unless you think my predictive model is wrong? I could ask him to introspect on this scenario and get evidence about what he would do?

Noosphere89 (3y)

My short answer: Violations of the IID assumption are the likeliest problem in trying to generalize your values, and I see this as the key flaw underlying the post.

[This comment is no longer endorsed by its author]

TurnTrout (3y)

What does that mean? Can you give an example to help me follow?

Raemon (3y)

I tagged this "Pointers Problem" but am not 100% sure it's getting at the same thing. Curious if there's a different tag that feels more appropriate.

[1] When I first wrote this dialogue, I may have swept difficulties like "augmenting intelligence may be hard for biological humans to do while preserving their values" under the rug. I think the main point should still stand.

[2] We can also swap out "I bring about a good future for my brother" with "my brother brings about a good future for me, and I think that he will do a good job of it, even though he presumably doesn't contain a 'perfect' motivational pointer to my true values."

AI ALIGNMENT FORUM