This is very much not how I would go about getting corrigibility.

It is hard to summarize how I would go about things, because there would be lots of steps and lots of iterative processes.

Prior to plausible AGI/FOOM I would box it in really carefully, and I would only interact with it in ways where its expressivity is severely restricted.

I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first the council would consist only of the initial AGI-system, but I would use that system to generate new systems for the "council".
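To make the convergence check concrete, here is a toy sketch. Everything in it is an assumption made for illustration: the `Oracle` type, the `council_answer` function, and the `min_agreement` parameter are hypothetical names, and comparing answers by exact string equality is a stand-in for whatever notion of "converged" one would actually want.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical stand-in: a council member is just a function that maps
# a request string to an answer string.
Oracle = Callable[[str], str]


def council_answer(
    council: Sequence[Oracle],
    request: str,
    min_agreement: float = 1.0,
) -> Optional[str]:
    """Return the most common answer if enough members agree, else None.

    min_agreement is the fraction of members that must give the same
    answer; the default of 1.0 demands unanimity.
    """
    if not council:
        return None
    # Each member is queried independently with the same request.
    answers = [member(request) for member in council]
    top_answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return top_answer
    # No convergence: treat the request as unanswered.
    return None
```

With the default threshold, a single dissenting member is enough for the request to come back as unanswered, which is the conservative choice if disagreement is meant to be a warning sign.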

I would...
