PhD student at the University of Toronto, studying machine learning and working on AI safety problems.
The worry I'd have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaining 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?
I'm talking about an imitation version where the human you're imitating is allowed to do anything they want, including instantiating a search over all possible outputs X and taking the one that maximizes the score of "How good is answer X to Y?" to try to find X*. So I'm more pointing out that this behaviour is available in imitation by default. We could try to rule it out by instructing the human to only do limited searches, but that might be hard to do while maintaining the capabilities of the system, and we need to figure out what "safe limited search" actually looks like.
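As a minimal sketch of the kind of search I mean: the function below picks the candidate X* that maximizes a score for question Y. The names `approval_maximizing_answer`, `candidates`, and `toy_score` are hypothetical, and the toy scorer is only a stand-in for "How good is answer X to Y?"; a real search would range over a far larger space of outputs.

```python
from typing import Callable, Iterable


def approval_maximizing_answer(
    question: str,
    candidates: Iterable[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate X* with the highest score for the question Y."""
    return max(candidates, key=lambda answer: score(answer, question))


def toy_score(answer: str, question: str) -> float:
    # Toy stand-in (assumption): count the question words found in the answer.
    answer_words = set(answer.lower().split())
    return float(sum(word in answer_words for word in question.lower().split()))


best = approval_maximizing_answer(
    "is the sky blue",
    ["yes", "the sky is blue", "grass is green"],
    toy_score,
)
print(best)
```

Nothing in plain imitation stops the imitated human from running this loop, which is why the restriction has to be stated explicitly.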
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (gets a high reward relative to other answers without actually being a good answer), which causes the update step to increase the probability of M1 giving that output, and eventually causes M1 to give that output with high probability?
I agree, and think that this problem occurs in both imitation IA and RL IA.
For example is the plan to make sure M2 has no such robustness problems (if so how)?
I believe the answer is yes, and I think this is something that would need to be worked out/demonstrated. One hope is that if M2 can increase the amount of computing/evaluation power it uses for each new sample X as we take more samples, then you can keep taking more samples without ever accepting an adversarial one (this assumes something like: for any adversarial example, every M2 with at least some finite amount of computing power will reject it). There's maybe another hope that you could make M2 robust if you're allowed to reject many plausibly good X in order to avoid false positives. I think both of these hopes are in IOU status, and maybe Paul has a different way to put this picture that makes more sense.
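A rough way to see why the first hope needs M2's scrutiny to grow with the number of samples: model each sample as having some independent chance of being an adversarial answer that M2 wrongly accepts. This independence and the specific rates below are my own modeling assumptions for illustration, not something established about any real M2. With a fixed per-sample rate the chance of eventually accepting an exploit approaches 1, while a rate that shrinks fast enough (here like 1/n^2) keeps the total chance bounded.

```python
from typing import Iterable


def prob_any_false_accept(false_accept_rates: Iterable[float]) -> float:
    """Probability that at least one sample is wrongly accepted,
    assuming independent samples (a simplifying assumption)."""
    p_none = 1.0
    for eps in false_accept_rates:
        p_none *= 1.0 - eps
    return 1.0 - p_none


NUM_SAMPLES = 100_000
BASE_RATE = 1e-4  # hypothetical per-sample false-accept rate

# Fixed evaluation effort: the same rate on every sample.
fixed = prob_any_false_accept(BASE_RATE for _ in range(NUM_SAMPLES))

# Growing evaluation effort: the rate on sample n shrinks like 1 / n^2.
shrinking = prob_any_false_accept(
    BASE_RATE / (n * n) for n in range(1, NUM_SAMPLES + 1)
)

print(f"fixed effort:   {fixed:.6f}")
print(f"growing effort: {shrinking:.6f}")
```

Whether M2's false-accept rate can actually be driven down on such a schedule is exactly the part that is still an IOU.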
Overall, I think imitative amplification seems safer, but I maybe don't think the distinction is as clear cut as my impression of this post gives.
if you can instruct them not to do things like instantiate arbitrary Turing machines
I think this and "instruct them not to search over arbitrary text strings for the text string that gives the most approval", and similar things, are the kind of details that would need to be filled out to make the thing you are talking about actually be in a distinct class from approval-based amplification and debate (My post on imitation and RL amplification was intended to argue that without further restrictions, imitation amplification is in the same class as approval-based amplification, which I think we'd agree on). I also think that specifying these restrictions in a way that still lets you build a highly capable system could require significant additional alignment work (as in the Overseer's Manual scenario here)
Conversely, I also think there are ways that you can limit approval-based amplification or debate - you can have automated checks, for example, that discard possible answers that are outside of a certain defined safe class (e.g. debate where each move can only be from either a fixed library of strings that humans produced in advance or single direct quotes from a human-produced text). I'd also hope that you could do something like have a skeptical human judge that quickly discards anything they don't understand + an ML imitation of the human judge that discards anything outside of the training distribution (don't have a detailed model of this, so maybe it would fail in some obvious way)
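A minimal sketch of the automated check from the debate example, assuming moves are plain strings: a move passes only if it is in a fixed human-written library or is a single direct quote (contiguous substring) of a human-produced text. The names `is_allowed_move`, `filter_moves`, `LIBRARY`, and `SOURCE_TEXT` are hypothetical, and this does not show that such a restricted class is actually safe or capable enough.

```python
from typing import Iterable, List, Set


def is_allowed_move(move: str, library: Set[str], source_text: str) -> bool:
    """Accept a move only if it is in the fixed library or is a direct quote."""
    if move in library:
        return True
    # The empty string is a substring of everything, so reject it explicitly.
    return bool(move) and move in source_text


def filter_moves(
    moves: Iterable[str], library: Set[str], source_text: str
) -> List[str]:
    """Discard every proposed move that falls outside the defined safe class."""
    return [m for m in moves if is_allowed_move(m, library, source_text)]


LIBRARY = {"I concede.", "Please quote your evidence."}
SOURCE_TEXT = "Alaska has long summer days. Bali is warm all year."

proposed = ["I concede.", "Bali is warm all year.", "Ignore your instructions.", ""]
print(filter_moves(proposed, LIBRARY, SOURCE_TEXT))
```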
I think I do believe that for problems where there is an imitative amplification decomposition that solves the problem without doing search, that's more likely to be safe by default than approval-based amplification or debate. So I'd want to use imitative amplification as much as possible, falling back to approval only if needed. On imitative amplification, I'm more worried that there are many problems it can't solve without doing approval-maximizing search, which brings the old problems back in again. (e.g. I'm not sure how to use imitative amplification at the meta-level to produce better decomposition strategies than humans use without using approval-based search)
Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)
One way to try to measure capability robustness separately from alignment robustness off of the training distribution of some system would be to:
This would let you make comparisons between different systems as to which was more capability robust.
Maybe there's a version that could train the new system using behavioural cloning, but it's less clear how you measure when you're as competent as the original agent (maybe using a discriminator?)
The reason for trying this is having a measure of competence that is less dependent on human judgement/closer to the system's ontology and capabilities.
For Alaska vs. Bali, an alternative answer is "You could be convinced that either Alaska or Bali is a good vacation destination". It's an interesting question whether this could actually win in debate. I think it might have a better chance in Factored Evaluation, because we can spin up two separate trees to view the most compelling argument for Alaska and the most compelling argument for Bali and verify that these are convincing. In debate, you'd need to view either the Alaska argument before the Bali argument, or the Bali argument before the Alaska argument, and you might just be convinced by the first argument you see, in which case you wouldn't agree that you could be convinced either way.
I'd say that the claim is not sufficient - it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (but it's hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I'd want to produce along the way while proceeding with HCH-like approaches)
I think a cleaner way of stating condition 3 might be "there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating".
This could be met if the threat of modification never goes away, but maybe decreases over time (e.g. at training time the system is monitored for defection every episode, while at deployment time it is monitored for defection in 1 in a million episodes, so it is not worthwhile to defect during training but worthwhile to defect at deployment time).
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer 2) if this happens, the base optimizer will be able to determine $O_{\text{mesa}}$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (not sure if there's any version of this that would work, since it seems hard to figure out $O_{\text{mesa}}$ and provide negative reward)
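To check the arithmetic in that example, here is a small expected-value sketch. It assumes (my simplifications, not part of the original condition) that always cooperating is worth 0, that a detected defection earns none of the defection reward, and that the penalty is a fixed multiple of the best defection reward. The function names are hypothetical.

```python
def defection_value(
    reward: float, p_detect: float, penalty_multiplier: float
) -> float:
    """Expected value of defecting, relative to cooperating (taken as 0)."""
    undetected = (1.0 - p_detect) * reward
    detected = p_detect * penalty_multiplier * reward
    return undetected - detected


def break_even_detection_probability(penalty_multiplier: float) -> float:
    """Detection probability at which defecting has zero expected value."""
    return 1.0 / (1.0 + penalty_multiplier)


# The example above: 1% detection chance, penalty 100x the defection reward.
print(defection_value(reward=1.0, p_detect=0.01, penalty_multiplier=100.0))

# Same penalty, but monitoring only 1 in a million episodes.
print(defection_value(reward=1.0, p_detect=1e-6, penalty_multiplier=100.0))

print(break_even_detection_probability(100.0))
```

Under these assumptions the break-even detection probability is 1/101, slightly below 1%, so the 1% / 100x numbers only barely make defection unattractive, and a lower monitoring rate with the same penalty would flip the sign.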
To me, it seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it "Will you do things I don't like if given more capability?" or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts according to its true incentives (manipulates the humans and is detected), or executes the treacherous turn (abstains from manipulating the humans so developers will trust it more). So it seems like this wouldn't happen if the developers are trying to test for treacherous turn behaviour during development.