AI ALIGNMENT FORUM

Beth Barnes

Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/

Comments

Prover-Estimator Debate: A New Scalable Oversight Protocol
Beth Barnes · 3mo · 80

I can write more later, but here's a relevant doc I wrote as part of discussion with Geoffrey + others. Maybe the key point from there is that I don't think this protocol solves the examples given in the original post describing obfuscated arguments. But yeah, I was always thinking this was a completeness problem (the original post poses the problem as distinguishing a certain class of honest arguments from dishonest obfuscated arguments - not claiming you can't get soundness by just ignoring any arguments that are plausibly obfuscated.)

Prover-Estimator Debate: A New Scalable Oversight Protocol
Beth Barnes · 3mo · 50

Yep, happy to chat!

Prover-Estimator Debate: A New Scalable Oversight Protocol
Beth Barnes · 3mo · 56

Yep. For empirical work I'm in favor of experiments with more informed + well-trained human judges who engage deeply etc., and having a high standard for efficacy (e.g. "did it get the correct answer with very high reliability") as opposed to "did it outperform a baseline by a statistically significant margin", where you then end up needing high n and therefore each example needs to be cheap/shallow.
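To gesture at the sample-size point, here's a rough back-of-the-envelope sketch (my own illustration with made-up accuracies, not numbers from any actual experiment):

```python
# Back-of-the-envelope sketch (hypothetical accuracies): how many judged debates
# you'd need to show a modest gain over a baseline at p < 0.05 with 80% power,
# versus checking a "very high reliability" bar directly.
from math import asin, sqrt
from scipy.stats import norm

baseline_acc, protocol_acc = 0.70, 0.75                        # assumed, for illustration
h = 2 * (asin(sqrt(protocol_acc)) - asin(sqrt(baseline_acc)))  # Cohen's effect size h
z_alpha, z_beta = norm.ppf(0.975), norm.ppf(0.80)              # two-sided 5% test, 80% power
n_per_arm = ((z_alpha + z_beta) / h) ** 2
print(f"~{n_per_arm:.0f} judged debates per arm")              # on the order of 600

# By contrast, a reliability bar like ">= 49/50 correct" can be assessed on a few
# dozen deeply-judged (and therefore expensive) examples.
```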

Prover-Estimator Debate: A New Scalable Oversight Protocol
Beth Barnes · 3mo · 50

IMO the requirements are a combination of stability and compactness - these trade off against each other, and the important thing is the rate at which you get evidence for which debater is dishonest while exploring the tree.

IIUC, the stability definition used here is pretty strong - it says that the error in the parent is smaller than the largest error across the children. So any argument structure where errors can accumulate (like a conjunctive argument, or a proof which requires all the steps to be correct) is out.
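Schematically (my paraphrase, so the exact quantifiers in the paper may differ):

```latex
% Stability, roughly: the parent's estimation error is bounded by the worst child's error,
% so errors cannot grow as you move up the tree. A conjunctive claim C = C_1 AND ... AND C_n
% breaks this: per-child errors eps_i can add, giving a parent error of order sum_i eps_i,
% which can approach n times the largest child error.
\[
\lvert \varepsilon_{\mathrm{parent}} \rvert \;\le\; \max_i \lvert \varepsilon_{\mathrm{child}_i} \rvert
\qquad \text{vs.} \qquad
\varepsilon_{C_1 \wedge \dots \wedge C_n} \;\approx\; \sum_{i=1}^{n} \varepsilon_i \;\le\; n \max_i \varepsilon_i .
\]
```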

Prover-Estimator Debate: A New Scalable Oversight Protocol
Beth Barnes · 3mo* · 355

I was really impressed by the technical work in this paper. Getting to a formalization of the problem setup and the protocol that allows you to prove meaningful things is a big accomplishment. 

However, as the authors mention above, I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight. (I think the discussion in this post is fine, but I think the title of the paper "avoiding obfuscation with prover-estimator debate" is a bit misleading. I believe the authors are going to change this in v2.) 

I'm excited about more empirical work on making debate protocols work in practice. I feel a bit less excited about pure theory work, but I think work that mixes experimentation and theory could be helpful.

I think there are broadly two classes of hope about obfuscated arguments:

(1.) In practice, obfuscated argument problems rarely come up, due to one of:

  1. It’s difficult in practice to construct obfuscated arguments for arbitrary propositions.
     • It’s definitely plausible to me that the algorithm sketch I gave here for constructing obfuscated arguments for arbitrary propositions doesn’t work, given some relatively weak assumptions about the structure of the debaters’ knowledge.
  2. For anything the debaters know, there’s usually a sufficiently compact and stable honest argument such that the honest debater can win by giving this argument.
     • It seems pretty easy to give counterexamples here - e.g. Paul gives the example of unsupervised translation. But maybe these are relatively rare. It's plausible that the honest debaters in the human experiments I did where we ran into obfuscation / instability problems weren't using the right tactics (i.e. they could have constructed more stable arguments that used independent lines of evidence more).

(2.) We can create a protocol that distinguishes between cases where:

  • (not obfuscatable) the debate mirrors the structure of the debaters’ understanding such that they would be able to easily identify which subtree contains a flaw if one was present   
  • (obfuscatable) they don't or wouldn't know which subtree contains the flaw. 

This is the way (apart from argument size) in which the primality test example differs from the obfuscated factorization example: the debaters have some high-level mathematical concepts which allow them to quickly determine whether some proposed lemma is correct.

This wouldn't get us to full ELK (because maybe models still know things they have no human-understandable arguments for), but it would at least expand the class of honest arguments that we can trust to include ones that are large + unstable in human-understandable form but where the debaters do have a faster way of identifying which subtree to go down.
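As a toy illustration of that distinction (my own sketch, not anything from the paper): in the primality-style case the debaters have a fast local check for any proposed lemma, so an honest debater can always point at the flawed subtree; in the obfuscated-factoring-style case there is no sub-claim that either debater can check faster than solving the whole problem.

```python
# Toy contrast (illustrative only): a fast local check means the honest debater
# always knows which subtree is flawed; without one, neither debater does.
from sympy import isprime  # fast primality check

def locate_flawed_lemma(claimed_primes: list[int]) -> int | None:
    """Return the index of a false 'q is prime' lemma, or None if all hold.

    Because isprime() is cheap, the debater can immediately name the bad subtree.
    """
    for i, q in enumerate(claimed_primes):
        if not isprime(q):
            return i
    return None

print(locate_flawed_lemma([2, 3, 91, 97]))  # -> 2, since 91 = 7 * 13

# Obfuscatable analogue (schematic): a claim like "N is a product of two primes with
# property P", decomposed into sub-claims about the unknown factors. Checking any
# single sub-claim is as hard as factoring N, so neither debater can say which
# subtree (if any) contains the flaw.
```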

Debate, Oracles, and Obfuscated Arguments
Beth Barnes · 1y · 66

This is a great post, very happy it exists :)

Quick rambling thoughts:

I have some instinct that a promising direction might be showing that it's only possible to construct obfuscated arguments under particular circumstances, and that we can identify those circumstances. The structure of the obfuscated argument is quite constrained - it needs to spread out the error probability over many different leaves. This happens to be easy in the prime case, but it seems plausible it's often knowably hard. Potentially an interesting case to explore would be trying to construct an obfuscated argument for primality testing and seeing if there's a reason that's difficult. OTOH, as you point out, "I learnt this from many relevant examples in the training data" seems like a central sort of argument. Though even if I think of some of the worst offenders here (e.g. translating after learning on a monolingual corpus), it does seem like constructing a lie that isn't going to contradict itself pretty quickly might be pretty hard.
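To spell out that constraint a bit (my own toy model, not from the post): the single flaw has to end up in one leaf that neither debater can locate, so after d binary splits any particular leaf claim is wrong with probability only about 2^-d, and a verifier spot-checking one path almost never lands on it.

```python
# Toy model (illustrative only): the dishonest debater pushes its one hidden error
# into a random child at each split, so after `depth` splits the error sits in a
# uniformly random leaf and each individual leaf claim is very probably fine.
import random

def push_error_down(depth: int) -> int:
    """Index of the flawed leaf after routing the error into a random child each split."""
    leaf = 0
    for _ in range(depth):
        leaf = 2 * leaf + random.randint(0, 1)
    return leaf

depth = 20
flawed = push_error_down(depth)
inspected = random.randrange(2 ** depth)      # verifier descends to one leaf
print(f"per-leaf error probability ≈ {2 ** -depth:.1e}; flaw caught: {inspected == flawed}")
```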

ryan_greenblatt's Shortform
Beth Barnes · 1y · 77

I'd be surprised if this was employee-level access. I'm aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.

Preventing model exfiltration with upload limits
Beth Barnes · 1y · 114

If anyone wants to work on this or knows people who might, I'd be interested in funding work on this (or helping secure funding - I expect that to be pretty easy to do).

Bounty: Diverse hard tasks for LLM agents
Beth Barnes · 2y · 10

Good point!
Hmm I think it's fine to use OpenAI/Anthropic APIs for now. If it becomes an issue we can set up our own Llama or whatever to serve all the tasks that need another model. It should be easy to switch out one model for another.
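If it helps, here's a minimal sketch of what "easy to switch out one model for another" could look like (class and function names here are hypothetical, not part of the task setup; assumes the openai>=1.0 Python client, and that a self-hosted server exposes an OpenAI-compatible endpoint, as e.g. vLLM does):

```python
# Minimal sketch (hypothetical names): keep the auxiliary model behind a small
# interface, so swapping a hosted API for a self-hosted Llama is a config change.
import os
from openai import OpenAI

class AuxiliaryModel:
    """Chat-completion wrapper that tasks call without caring who serves the model."""

    def __init__(self, model: str, base_url: str | None = None):
        self.model = model
        self.client = OpenAI(
            base_url=base_url,  # None -> api.openai.com; else a local/compatible server
            api_key=os.environ.get("OPENAI_API_KEY", "not-needed-for-local"),
        )

    def complete(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

# Hosted model today, self-hosted tomorrow -- only the constructor args change:
# aux = AuxiliaryModel("gpt-4o")
# aux = AuxiliaryModel("llama-3-70b-instruct", base_url="http://localhost:8000/v1")
```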

Bounty: Diverse hard tasks for LLM agents
Beth Barnes · 2y · 10

Yep, that's right. And we also need it to be possible to check the solutions without needing to run physical experiments etc.!

Posts

Clarifying METR's Auditing Role · 61 points · 1y · 0 comments
Bounty: Diverse hard tasks for LLM agents · 18 points · 2y · 29 comments
Send us example gnarly bugs · 32 points · 2y · 0 comments
Managing risks of our own work · 33 points · 2y · 0 comments
ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks · 70 points · 2y · 4 comments
More information about the dangerous capability evaluations we did with GPT-4 and Claude. · 84 points · 2y · 13 comments
'simulator' framing and confusions about LLMs · 53 points · 3y · 1 comment
Reflection Mechanisms as an Alignment target: A follow-up survey · 8 points · 3y · 0 comments
Evaluations project @ ARC is hiring a researcher and a webdev/engineer · 45 points · 3y · 6 comments
Help ARC evaluate capabilities of current language models (still need people) · 39 points · 3y · 3 comments
Beth Barnes's Shortform · 4 points · 4y · 3 comments