PhD student at the University of Toronto, studying machine learning and working on AI safety problems.
Possible source for optimization-as-a-layer: SATNet (differentiable SAT solver)
One way to try to measure capability robustness separately from alignment robustness off of the training distribution of some system would be to:
This would let you make comparisons between different systems as to which was more capability robust.
Maybe there's a version that could train the new system using behavioural cloning, but it's less clear how you measure when you're as competent as the original agent (maybe using a discriminator?)
The reason for trying this is to have a measure of competence that is less dependent on human judgement and closer to the system's ontology and capabilities.
For Alaska vs. Bali, an alternative answer is "You could be convinced that either Alaska or Bali is a good vacation destination". It's an interesting question whether this could actually win in debate. I think it might have a better chance in Factored Evaluation, because we can spin up two separate trees to view the most compelling argument for Alaska and the most compelling argument for Bali and verify that these are convincing. In debate, you'd need to view either the Alaska argument before the Bali argument, or the Bali argument before the Alaska argument, and you might just be convinced by the first argument you see, in which case you wouldn't agree that you could be convinced either way.
I'd say that the claim is not sufficient: it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (but it's hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I'd want to produce along the way while proceeding with HCH-like approaches).
I think a cleaner way of stating condition 3 might be "there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating".
This could be met if the threat of modification never goes away, but decreases over time (e.g. at training time the system is monitored for defection every episode, while at deployment time it is monitored for defection in 1 out of a million episodes, so it is not worthwhile to defect during training but worthwhile to defect at deployment time).
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer, and 2) if this happens, the base optimizer will be able to determine $O_{mesa}$ and give it a large negative reward, 100x larger than the possible reward from the best defecting policy. (I'm not sure if there's any version of this that would work, since it seems hard to figure out $O_{mesa}$ and provide negative reward.)
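The arithmetic behind these two cases can be sketched as a toy expected-value calculation. This is only an illustration under assumptions that are not in the comment above: the value of always cooperating is normalised to 0, a detected defection forfeits the reward and incurs the penalty, monitoring in a given episode always catches a defection, and the 100x penalty is reused for the training/deployment case. The function name `defection_value` is hypothetical.

```python
def defection_value(p_detect, reward, penalty):
    """Expected value of a single defection, relative to always cooperating (taken as 0).

    Assumption: if detected, the mesa-optimizer gets -penalty and no reward.
    """
    return (1 - p_detect) * reward - p_detect * penalty


reward = 1.0             # best-case payoff from defecting (arbitrary units)
penalty = 100 * reward   # "100x more than the possible reward"

# Case where condition 3 could fail: 1% chance of detection, 100x penalty.
deterred = defection_value(p_detect=0.01, reward=reward, penalty=penalty)

# Case where condition 3 could be met: monitoring thins out over time.
training = defection_value(p_detect=1.0, reward=reward, penalty=penalty)     # monitored every episode
deployment = defection_value(p_detect=1e-6, reward=reward, penalty=penalty)  # monitored 1 in a million

print(deterred < 0, training < 0, deployment > 0)
```

Under these assumptions, defection has negative expected value in the 1% detection case and during training, but positive expected value at deployment, which is the sense in which a shrinking threat of modification can still satisfy the condition.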
To me, it seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it "Will you do things I don't like if given more capability?" or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts according to its true incentives (manipulates the humans and is detected), or executes the treacherous turn (abstains from manipulating the humans so developers will trust it more). So it seems like this wouldn't happen if the developers are trying to test for treacherous turn behaviour during development.
Are you interested in protocols involving multiple episodic questions (where you ask one question, wait for it to resolve, then ask another question)?
Submission: low-bandwidth oracle
Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought for finding problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both. Then judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.
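One episode of this protocol can be sketched as follows. This is a rough illustration, not a specification: `run_episode`, `oracle_pick`, and `judge` are hypothetical names, the reward is assumed to be 1 or 0, and the human deliberation over time period X is stood in for by a toy judge that compares made-up usefulness scores.

```python
import random


def run_episode(lines_of_thought, oracle_pick, judge, rng):
    """Run one plan-criticism episode and return the oracle's reward (assumed 0 or 1)."""
    # Low-bandwidth answer: the oracle only selects an item from a fixed list.
    suggested = oracle_pick(lines_of_thought)
    # Pick a different line of thought uniformly at random as the comparison.
    others = [line for line in lines_of_thought if line != suggested]
    baseline = rng.choice(others)
    # Stand-in for humans spending period X on both and judging which was more useful.
    winner = judge(suggested, baseline)
    return 1 if winner == suggested else 0


# Toy data: usefulness scores are invented purely for the example.
usefulness = {"open questions": 2, "failure modes": 5, "criticisms": 3}
lines = list(usefulness)


def toy_judge(a, b):
    return a if usefulness[a] >= usefulness[b] else b


rng = random.Random(0)
good_reward = run_episode(lines, lambda ls: max(ls, key=usefulness.get), toy_judge, rng)
bad_reward = run_episode(lines, lambda ls: min(ls, key=usefulness.get), toy_judge, rng)
print(good_reward, bad_reward)
```

Because the oracle is only rewarded when its pick beats a randomly chosen alternative, its incentive in this sketch is to select whichever listed line of thought the judges will find most useful.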