Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual "H follows σ" to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn't run).
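Written out symbolically (a sketch; the notation Π_H for the set of H-policies and C_σ(Θ) for the infra-Bayesian counterfactual of Θ on "H follows σ" is introduced here for illustration, not taken from the original):

```latex
H \text{ is a precursor of } G \text{ in } \Theta
\quad\iff\quad
\exists\, \sigma \in \Pi_H :\;
C_\sigma(\Theta) \models \text{``the source code of } G \text{ does not run''}
```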

A possible complication is: what if Θ implies that H creates G / doesn't interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it.

Can you plea...


The problem is that if Θ implies that H creates G but you consider a
counterfactual in which H doesn't create G then you get an inconsistent
hypothesis, i.e. a HUC (homogeneous ultracontribution) which contains only 0. It is not clear what to do with
that. In other words, the usual way of defining counterfactuals in IB (I
tentatively named it "hard counterfactuals") only makes sense when the condition
you're counterfactualizing on is something you have Knightian uncertainty about
(which seems safe to assume if this condition is about your own future action
but not safe to assume in general). In a child post
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=fdeMdyAdTfFN8Rs7N]
I suggested solving this by defining "soft counterfactuals" where you consider
coarsenings of Θ in addition to Θ itself.
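To make the failure mode concrete, here is a toy sketch (my own simplification, not the actual IB formalism): a hypothesis Θ is modeled as a plain credal set, i.e. a list of probability distributions over finitely many worlds, rather than a full ultracontribution. A hard counterfactual conditions each distribution on the counterfactualized event; when Θ implies the event is false, every distribution is dropped and we are left with the inconsistent hypothesis.

```python
# Toy model (assumed simplification): Theta is a credal set, i.e. a list of
# probability distributions over worlds. Each world is a pair of booleans
# (h_creates_g, g_runs).

def hard_counterfactual(theta, condition):
    """Condition each distribution on `condition`, renormalizing; drop any
    distribution that gives the condition probability zero. An empty result
    plays the role of the inconsistent hypothesis (the 'HUC which contains
    only 0')."""
    result = []
    for dist in theta:
        mass = sum(p for world, p in dist.items() if condition(world))
        if mass > 0:
            result.append({w: p / mass for w, p in dist.items() if condition(w)})
    return result

# Case 1: Theta implies that H creates G (every world in every support has
# h_creates_g == True); counterfactualizing on "H doesn't create G" leaves
# nothing.
theta_implies_creation = [
    {(True, True): 1.0},
    {(True, True): 0.6, (True, False): 0.4},
]
print(hard_counterfactual(theta_implies_creation, lambda w: not w[0]))  # []

# Case 2: genuine Knightian uncertainty about whether H creates G; the hard
# counterfactual is well-behaved.
theta_uncertain = [
    {(True, True): 1.0},
    {(False, False): 0.5, (True, True): 0.5},
]
print(hard_counterfactual(theta_uncertain, lambda w: not w[0]))
# [{(False, False): 1.0}]
```

A soft counterfactual would instead also consider coarsenings of Θ (e.g. merging worlds) under which the condition regains positive mass; that part is not sketched here.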


So, let me try to summarize and check my understanding. In the first part of the post, you show that most random reward functions are not corrigible. This looks correct.

In the second part, you want to prove that VNM-coherence is incompatible with corrigibility in this universe, and I don't think I follow. So, suppose that R(A_blue), R(B_blue), R(C_blue) > max(R(A_red), R(B_red), R(C_red)). Now we change the dynamics so that the human will not correct the agent by default, but can be manipulated into it. Then we need to add states A_black and C_black, and arro...


Although I didn't make this explicit, one problem is that manipulation is still
weakly optimal—as you say. That wouldn't fit the spirit of strict corrigibility,
as defined in the post.
Note that AUP doesn't have this problem.

Thank you for explaining this! But then how can this framework be used to model humans as agents? People can easily imagine outcomes worse than death or destruction of the universe.