You are completely correct. This approach cannot possibly create an AI that matches a fixed specification.
This is intentional, because any fixed specification of Goodness is a model of Goodness. All models are wrong (some are useful), so they break when pushed sufficiently far out of distribution. Constraining an AI to follow a specification therefore guarantees bad behavior in the case of something as far out of distribution as an ASI.
You can try to leave an escape hatch with corrigibility. In the limit I believe it is possible to slave an AI model to ...