Even better than "Getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand" would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information by doing internet searches that are recorded and readable by the overseer instead of using knowledge opaquely measured in the weights, that seems like a step in the right direction.
One easy way to make people who can't solve the task for sandwiching is to take people who could solve the task and then give them insufficient time to solve it, or have them be uninformed of some relevant facts about the specific task they are trying to solve.
A simpler way to measure whether you are making progress towards sandwiching if you can't go there directly is to look at whether you can get people to provide better supervision with your tool than without your tool, that is accomplishing more on the task.
Both of these approaches feel like they aren't quite solving the whole problem, because ultimately we want systems that help humans supervise tasks where they haven't developed the right concepts, or couldn't understand them even with years of study.
If the High-Rated Sentence Producer was restricted to output only single steps of a mathematical proof and the single steps were evaluated independently, with the human unable to look at previous steps, then I wouldn't expect this kind of reward hacking to occur. In math proofs, we can build proofs for more complex questions out of individual steps that don't need to increase in complexity.
As I see it, debate on arbitrary questions could work if we figured out how to do something similar, having arguments split into single steps and evaluated independently (as in the recent OpenAI debate work), such that the debate AI can tackle more complicated questions with steps that are restricted to the complexity that humans can currently work with. Hard to know if this is possible, but still seems worth trying to work on.
For the preference learning skepticism, does this extend to the research direction (that isn't yet a research area) of modelling long term preferences/preferences on reflection? This is more along the lines of the "AI-assisted deliberation" direction from ARCHES.To me it seems like AI alignment that can capture preferences on reflection could be used to find solutions to many of other problems. Though there are good reasons to expect that we'd still want to do other work (because we might need theoretical understanding and okay solutions before AI reaches the point where it can help on research, because we want to do work ourselves to be able to check solutions that AIs reach, etc.)It also seems like areas like FairML and Computational Social Choice will require preference learning as components - my guess is that people's exact preferences about fairness won't have a simple mathematical formulation, and will instead need to be learned. I could buy the position that the necessary progress in preference learning will happen by default because of other incentives.
One thing I'd like to see are some more fleshed out examples of the kinds of governance demands that you think might be important in the future and would be bottlenecked on research progress in these areas.
Okay, makes more sense now, now my understanding is that for question X, answer from ML system Y, amplification system A, verification in your quote is asking the A to answer "Would A(Z) output answer Y to question X?", as opposed to asking A to answer "X", and then checking if it equals "Y". This can at most be as hard as running the original system, and maybe could be much more efficient.
Right, but in the post the implicitly represented Z is used by an amplification or debate system, because it contains more information than a human can quickly read and use (so are you assuming it's simple to verify the results of amplification/debate systems?)
for extremely large Z which are represented only implicitly as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth and instead just ask the human to verify the answer given Z
I'm not sure what "just ask the human to verify the answer given Z" looks like, for implicitly represented Z
Seems like it could be helpful if people who've thought about this would also predict on the question of what the survey value would be today. (e.g. via elicit snapshots)
Google's Meena (2.6 billion parameters, February 2020) creates original puns in "Conversation D", and I think "Cross-turn Repetition Example 2" is absurdist comedy, but maybe more as a result of the way the model fails.
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.