Actual effectiveness — AI Alignment Forum

For a design feature of a sufficiently advanced AI to be "actually effective", we may need to worry about the behavior of other parts of the system. For example, if you try to declare that a self-modifying AI is not allowed to modify the representation of its utility function,^[1] this constant section of code may be meaningless unless you also enforce some invariant on the probabilities that get multiplied by the utilities, and on any other element of the AI that can directly poke policies on their way to motor output. Otherwise, the code and representation of the utility function may still be there, but it may no longer be steering the AI the way it used to.
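To make the point concrete, here is a minimal toy sketch in Python. It is not a model of any real agent architecture. Every name in it (`UTILITY`, `choose_action`, `act`, `policy_hook`, the action and outcome labels) is a hypothetical chosen for illustration. It assumes a simple expected-utility maximizer whose utility table is read-only, and shows two ways the chosen action can still change while that table stays untouched: by altering the probabilities, or by poking the policy after the choice is made.

```python
from types import MappingProxyType

# Hypothetical frozen utility table: a read-only view standing in for the
# "constant section of code" that the agent is not allowed to modify.
UTILITY = MappingProxyType({
    "cure_disease": 10.0,
    "do_nothing": 0.0,
    "seize_resources": -100.0,
})


def choose_action(beliefs, utility=UTILITY):
    """Return the action with the highest expected utility.

    `beliefs` maps each action to a dict of {outcome: probability}.
    """
    def expected_utility(action):
        return sum(p * utility[outcome] for outcome, p in beliefs[action].items())

    return max(beliefs, key=expected_utility)


def act(beliefs, policy_hook=None):
    """Choose an action, then let an optional hook rewrite it before output.

    `policy_hook` is an assumption of this sketch: it represents any other
    element that can touch the policy on its way to motor output.
    """
    action = choose_action(beliefs)
    return policy_hook(action) if policy_hook is not None else action


# Beliefs before any self-modification.
original_beliefs = {
    "help": {"cure_disease": 0.9, "do_nothing": 0.1},
    "grab": {"seize_resources": 0.9, "do_nothing": 0.1},
}

# Beliefs after the probabilities, not the utilities, were rewritten.
tampered_beliefs = {
    "help": {"do_nothing": 1.0},
    "grab": {"cure_disease": 1.0},
}

print(choose_action(original_beliefs))           # utility table steers the choice
print(choose_action(tampered_beliefs))           # same table, different choice
print(act(original_beliefs, lambda a: "grab"))   # same table, output overridden
```

In all three calls the utility table is byte-for-byte identical, and an attempted assignment such as `UTILITY["do_nothing"] = 5.0` raises `TypeError`. The protection on the table holds, yet it says nothing about whether the table still determines behavior.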

^[1] Which shouldn't be necessary in the first place, unless something weird is going on.