AI ALIGNMENT FORUM

Tor Barstad
Let's See You Write That Corrigibility Tag
Tor Barstad3y*11

This feels to me like very much not how I would go about getting corrigibility.

It is hard to summarize how I would go about things, because there would be lots of steps, and lots of processes that are iterative.

Prior to plausible AGI/FOOM I would box the system in really carefully, and I would only interact with it in ways where its expressivity is severely restricted.

I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first the council would consist of only the initial AGI-system, but I would use that system to generate new systems for the "council".
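To make the convergence check a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the `Oracle` type, the `converged_answer` function, and the `min_agreement` threshold are hypothetical names, and a real setup would need a much richer notion of "the same answer" than exact string equality.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical stand-in: each council member is just a function
# mapping a request to an answer string.
Oracle = Callable[[str], str]


def converged_answer(
    council: Sequence[Oracle],
    request: str,
    min_agreement: float = 1.0,
) -> Optional[str]:
    """Return the most common answer if enough members agree, else None.

    min_agreement is the fraction of members that must give the same
    answer (1.0 means unanimity). Agreement here is exact equality,
    which is a simplifying assumption.
    """
    if not council:
        return None
    answers = [member(request) for member in council]
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return answer
    return None


# Toy usage with three hand-written "members".
council = [lambda q: "42", lambda q: "42", lambda q: "41"]
unanimous = converged_answer(council, "some request")
majority = converged_answer(council, "some request", min_agreement=0.6)
```

With the toy council above, the unanimity check returns `None` because one member disagrees, while the looser 0.6 threshold accepts the majority answer. Agreement on its own does not show an answer is correct, so this would only be one signal among several.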

I would make heavy use of techniques centered around verifiability, since for some pieces of work it’s possible to set things up so that it would be very hard for the system to "pretend" that it’s doing what I want without actually doing it. There are several techniques I would use to achieve this. One of them is that I would often ask it to provide a narrow/specialized/interpretable "result-generator" instead of giving the result directly, and sometimes even result-generator-generators (pieces of code that produce results, and that have architectures that make their behavior easy to understand and verify). For example, when getting it to generate simulations, I would get from it a simulation-generator (or simulation-generator-generator), and I would test its accuracy against real-world data.
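As a rough illustration of the "result-generator" idea, here is a toy sketch in which the thing being handed over is a small, inspectable simulator rather than a bare result, and the simulator is then checked against observations. The names `make_falling_simulator` and `passes_real_world_check` are hypothetical, the physics is deliberately trivial, and the observation values are made-up illustrative numbers, not real measurements.

```python
from typing import Callable, Sequence, Tuple

# A simulator maps elapsed time (seconds) to distance fallen (meters).
Simulator = Callable[[float], float]


def make_falling_simulator(g: float) -> Simulator:
    """Hypothetical 'simulation-generator': returns a tiny simulator
    whose behavior is easy to read and verify."""
    def simulate(t: float) -> float:
        # Distance fallen from rest under constant acceleration g.
        return 0.5 * g * t * t
    return simulate


def passes_real_world_check(
    sim: Simulator,
    observations: Sequence[Tuple[float, float]],
    tolerance: float,
) -> bool:
    """True if every prediction is within tolerance of the observed value."""
    return all(abs(sim(t) - observed) <= tolerance for t, observed in observations)


# Illustrative (time, distance) pairs; NOT real measurements.
observations = [(1.0, 4.9), (2.0, 19.6)]

good = passes_real_world_check(make_falling_simulator(9.81), observations, 0.1)
bad = passes_real_world_check(make_falling_simulator(5.0), observations, 0.1)
```

Here the simulator built with `g = 9.81` passes the check and the one built with `g = 5.0` fails it. The point of the design is that the generator's output can be tested on data the system did not choose, which is harder to fake than a single stated result.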

Here is a draft of a text where I try to explain myself in more detail, but it's not finished yet: https://docs.google.com/document/d/1INu33PIiRZbOBYjRul6zsCwF98z0l25V3pzMRJsYC_I/edit
