This is intentional, because any fixed specification of Goodness is a model of Goodness. All models are wrong (some are useful), and so they break when pushed sufficiently far out of distribution. Constraining an AI model to follow a specification therefore guarantees bad behavior in the case of something as far out of distribution as an ASI.

You can try to leave an escape hatch with corrigibility. In the limit I believe it is possible to slave an AI model to ...
