PhD student at the University of Toronto, studying machine learning and working on AI safety problems.
Okay, makes more sense now, now my understanding is that for question X, answer from ML system Y, amplification system A, verification in your quote is asking the A to answer "Would A(Z) output answer Y to question X?", as opposed to asking A to answer "X", and then checking if it equals "Y". This can at most be as hard as running the original system, and maybe could be much more efficient.
Right, but in the post the implicitly represented Z is used by an amplification or debate system, because it contains more information than a human can quickly read and use (so are you assuming it's simple to verify the results of amplification/debate systems?)
for extremely large Z which are represented only implicitly as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth and instead just ask the human to verify the answer given Z
I'm not sure what "just ask the human to verify the answer given Z" looks like, for implicitly represented Z
Seems like it could be helpful if people who've thought about this would also predict on the question of what the survey value would be today. (e.g. via elicit snapshots)
Google's Meena (2.6 billion parameters, February 2020) creates original puns in "Conversation D", and I think "Cross-turn Repetition Example 2" is absurdist comedy, but maybe more as a result of the way the model fails.
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.
The worry I'd have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaning 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?
I'm talking about an imitation version where the human you're imitating is allowed to do anything they want, including instatiting a search over all possible outputs X and taking that one that maximizes the score of "How good is answer X to Y?" to try to find X*. So I'm more pointing out that this behaviour is available in imitation by default. We could try to rule it out by instructing the human to only do limited searches, but that might be hard to do along with maintaining capabilities of the system, and we need to figure out what "safe limited search" actually looks like.
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (gets a high reward relative to other answers without actually being a good answer), which causes the update step to increase the probability of M1 giving that output, and eventually causes M1 to give that output with high probability?
I agree, and think that this problem occurs both in imitation IA and RL IA
For example is the plan to make sure M2 has no such robustness problems (if so how)?
I believe the answer is yes, and I think this is something that would need to be worked out/demonstrated. I think there is one hope that if M2 can increase the amount computing/evaluation power it uses for each new sample X as we take more samples, then you can keep taking more samples without ever accepting an adversarial one (This assumes something like for any adversarial example, all M2 with at least some finite amount of computing power will reject it). There's maybe another hope that you could make M2 robust if you're allowed to reject many plausibly good X in order to avoid false positives. I think both of these hopes are in the IOU status, and maybe Paul has a different way to put this picture that makes more sense.
Overall, I think imitative amplification seems safer, but I maybe don't think the distinction is as clear cut as my impression of this post gives.
if you can instruct them not to do things like instantiate arbitrary Turing machines
I think this and "instruct them not to search over arbitrary text strings for the text string that gives the most approval", and similar things, are the kind of details that would need to be filled out to make the thing you are talking about actually be in a distinct class from approval-based amplification and debate (My post on imitation and RL amplification was intended to argue that without further restrictions, imitation amplification is in the same class as approval-based amplification, which I think we'd agree on). I also think that specifying these restrictions in a way that still lets you build a highly capable system could require significant additional alignment work (as in the Overseer's Manual scenario here)
Conversely, I also think there are ways that you can limit approval-based amplification or debate - you can have automated checks, for example, that discard possible answers that are outside of a certain defined safe class (e.g. debate where each move can only be from either a fixed library of strings that humans produced in advance or single direct quotes from a human-produced text). I'd also hope that you could do something like have a skeptical human judge that quickly discards anything they don't understand + an ML imitation of the human judge that discards anything outside of the training distribution (don't have a detailed model of this, so maybe it would fail in some obvious way)
I think I do believe that for problems where there is a imitative amplification decomposition that solves the problem without doing search, that's more likely to be safe by default than approval-based amplification or debate. So I'd want to use imitative amplification as much as possible, falling back to approval only if needed. On imitative amplification, I'm more worried that there are many problems it can't solve without doing approval-maximizing search, which brings the old problems back in again. (e.g. I'm not sure how to use imitative amplification at the meta-level to produce better decomposition strategies than humans use without using approval-based search)
Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)