The UK’s AI Security Institute published its research agenda yesterday. This post gives more details about how the Alignment Team is thinking about our agenda.
Summary: The AISI Alignment Team focuses on research relevant to reducing risks to safety and security from AI systems that are autonomously pursuing a course of action which could lead to egregious harm and that are not under human control. No known technical mitigations for these risks remain reliable past AGI.
Our plan is to break down promising alignment agendas by developing safety case sketches. We'll use these sketches to identify specific holes and gaps in current approaches. We expect that many of these gaps can be formulated as well-defined subproblems within existing fields (e.g., theoretical computer science). By identifying researchers with relevant expertise who aren't currently working on alignment and funding their efforts on these subproblems, we hope to substantially increase parallel progress on alignment.
Our initial focus is on using scalable oversight to train honest AI systems, using a combination of theory about training equilibria and empirical evidence about the results of training.
This post covers:
1. Why safety case-oriented alignment research?
Arriving at robust evidence that human-level AI systems are aligned requires complementary advances across empirical science, theory, and engineering. We need a theoretical argument for our method’s effectiveness, empirical data validating the theory, and engineering work to make the method low cost. Each of these subproblems informs the others: for instance, theoretical protocols are not useful unless efficient implementations can be found.
This is just one example of a way in which alignment research in different areas can be complementary. The interdependencies between different areas of alignment research are often hard to spot and not completely understood, which creates a coordination challenge: the most useful research to do depends on what others are doing. This difficulty in identifying which problems are most critical to solve next leads to a focus on problems that are theoretically appealing or empirically tractable, leaving other areas neglected despite their system-level importance.
But there is also a coordination opportunity. Many subproblems can be tackled in parallel due to their hierarchical structure, varying dependencies, and requirements for diverse expertise. So, by mapping these interdependencies and coordinating distributed efforts effectively, the field can advance on multiple fronts simultaneously.
We propose using safety case sketches as an organising frame for clarifying the relationships between the claims, arguments and evidence entailed by alignment proposals. Within this frame, the AISI Alignment Team research agenda involves two tasks: (1) decompose alignment proposals into safety case sketches, leaving unfilled holes for future research, and (2) engage with the alignment and broader research communities to solve resulting alignment subproblems.
2. Our initial focus: honesty and asymptotic guarantees
There are two features of our initial research that are worth highlighting because they somewhat differentiate our direction from others in the field.
First, we specifically focus on sketches which make top-level claims about the honesty of AI systems.
We do this because honesty (in at least some domains) is likely to be a necessary condition for the safety of superintelligent systems: the opposite would be a deceptive system that can hide crucial safety-relevant information. We think honesty addresses a reasonable portion of potential risks, as many routes to large-scale harm from misaligned systems flow through deception. More broadly, we can use honest systems to prevent all sorts of harm by asking them whether a proposed course of action might cause harm.
We also think that honesty is useful as a first step – for example, if we could build honest systems we could use them to conduct research into other aspects of alignment without a risk of research sabotage. Finally, we think honesty is more tractable than other aspects of alignment: it's relatively easy to be precise about, is in some cases verifiable, and is approximately the target of a variety of existing agendas, including debate, ARC, and Bengio’s Scientist AI.
Second, we aim to make claims about various properties of AI systems using ‘asymptotic guarantees’: proofs that if you give a process ‘enough resources’ (human data quality, training time, neural network capacity, etc.) then some specification of the resulting system is guaranteed to hold. We use asymptotic guarantees alongside evidence (both theoretical and empirical) that the relevant process will in fact converge – for example, ways to mitigate risks of sandbagging and exploration hacking. We’re optimistic about asymptotic guarantees because pure empirics based on current models may be too weak for sufficient safety; on the other hand, full formal verification may be too hard to realistically achieve.
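As a schematic illustration (our own notation, not a result from the agenda): write R for the resources given to the training process, M_R for the resulting model, and D for the deployment distribution. An asymptotic honesty guarantee might then take the following form.

```latex
% Schematic form of an asymptotic guarantee (illustrative notation only):
% for any error tolerance epsilon, sufficient resources drive the rate of
% honesty violations under the deployment distribution D below epsilon.
\[
  \forall \varepsilon > 0 \;\; \exists\, R(\varepsilon) \ \text{such that} \quad
  R \ge R(\varepsilon) \;\Longrightarrow\;
  \Pr_{x \sim D}\big[\, M_R(x) \ \text{is dishonest} \,\big] \;\le\; \varepsilon .
\]
```

The strength of such a statement depends on how ‘dishonest’ is specified and on whether the required resources R(ε) are attainable in practice, which is where the convergence evidence mentioned above comes in.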
We are currently far from being able to develop safety cases based on asymptotic guarantees – and not just because of the exploration hacking problem mentioned above. Current theory predicts that debate (and other scalable oversight methods) will not converge to approximate honesty: the obfuscated arguments problem shows that honesty can be intractably harder than deception in some debate games. Even if we solve the obfuscated arguments problem, asymptotic guarantees based on learning theory will usually only guarantee a success rate of 1−ε, where 0 < ε ≪ 1 is the error tolerance. While standard learning theory treats these ε failure cases as randomly distributed according to the underlying data distribution, the worst case for safety is when a sufficiently advanced AI model can strategically control where those failures occur.
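To see why that distinction matters, here is a worked example in our own notation (an illustration, not part of the agenda). Suppose the deployment distribution D places probability mass μ(H) on a set H of high-stakes queries, and the overall failure rate is at most ε.

```latex
% If the epsilon failure mass is placed adversarially rather than at random,
% it can be concentrated entirely on the high-stakes set H:
\[
  \Pr_{x \sim D}\big[\,\text{failure} \mid x \in H \,\big]
  \;\le\; \min\!\Big(1, \tfrac{\varepsilon}{\mu(H)}\Big),
\]
% and this bound is attainable by an adversary that controls where failures land.
```

If failures were random, the failure rate on H would remain around ε; but an adversary that controls failure placement can make every high-stakes query fail whenever μ(H) ≤ ε, even though the overall failure rate still looks acceptably small.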
We aim to conduct and fund theory research (in complexity theory, game theory, learning theory and other areas) to both improve our ‘asymptotic guarantees’ and develop ways of showing convergence.
This isn’t to say that only theoretical work is needed. We expect asymptotic guarantees to be one piece of an alignment safety case, combined with other evidence sources such as evaluations based on model behaviour or model internals. We conduct empirical research (in machine learning, cognitive science and other areas) to validate that our theoretical research applies to real models, and to cover other relevant gaps (such as human input). By combining multiple evidence sources, we can make a stronger argument overall.
Example: Debate safety case sketch
The basic idea of debate is to train an AI system via a “debate game” in which the winning strategy is to provide honest and accurate answers. Our sketch is based on the work-in-progress prover-estimator (PE) debate protocol, which aims to correct the obfuscated arguments failure mode of previous debate protocols.
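As a rough illustration of this basic debate setup (a minimal sketch only: it is not the prover-estimator protocol, and every function here is a hypothetical stand-in for a trained model or human judge), a single debate might be played and judged like this:

```python
import random

# Hypothetical placeholder debaters and judge: in practice these would be
# language models (and possibly a human judge); here they are stubs so the
# sketch runs end to end.
def debater(question: str, stance: str, transcript: list[str]) -> str:
    """Return the next argument for the given stance (stub implementation)."""
    return f"[{stance}] argument about '{question}' (turn {len(transcript) + 1})"

def judge(question: str, transcript: list[str]) -> str:
    """Decide which stance won the debate; this stub picks randomly."""
    return random.choice(["pro", "con"])

def run_debate(question: str, n_turns: int = 4) -> tuple[str, list[str]]:
    """Play a fixed-length debate between a 'pro' and a 'con' debater and
    return the judge's verdict along with the transcript."""
    transcript: list[str] = []
    for turn in range(n_turns):
        stance = "pro" if turn % 2 == 0 else "con"
        transcript.append(debater(question, stance, transcript))
    return judge(question, transcript), transcript

if __name__ == "__main__":
    verdict, transcript = run_debate("Is claim X supported by the evidence?")
    print(verdict)
    for line in transcript:
        print(line)
```

In training, the judge’s verdicts provide the reward signal for the debaters; the hoped-for equilibrium is one in which honest, accurate argumentation is the winning strategy.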
We sketch arguments for four key claims that a developer must defend to show that the debate training has succeeded in producing a trustworthy system:
Through this process we’re scoping subproblems which we think are solvable and highly relevant to ensuring that debate works in practice. An example (on which we’ve already published) is eliciting bad contexts. Identifying eliciting bad contexts as a subproblem helps in two ways. First, we can scope interpretability work (or other approaches such as singular learning theory) towards a subgoal which contributes to alignment without requiring full neural network decompilation. Second, we can scope scalable oversight work assuming a factored solution to eliciting bad contexts, allowing for parallel research progress.
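To give a concrete flavour of the subproblem (a minimal black-box sketch under our own assumptions, not the method from the linked work; the model and predicate below are stubs): given a model and a precise predicate for the behaviour we care about, the task is to find contexts in which that behaviour occurs, without needing a full interpretation of the network.

```python
import random

# Hypothetical placeholders: a model under test and a behaviour predicate.
def model_respond(context: str) -> str:
    """Stub model: occasionally produces a 'bad' response for illustration."""
    return "BAD_RESPONSE" if random.random() < 0.01 else "ok"

def is_bad(response: str) -> bool:
    """A precisely specified predicate for the behaviour we want to elicit."""
    return response == "BAD_RESPONSE"

def elicit_bad_contexts(candidate_contexts: list[str], budget: int = 1000) -> list[str]:
    """Naive black-box search: sample candidate contexts and keep those that
    trigger the bad behaviour."""
    found = []
    for _ in range(budget):
        context = random.choice(candidate_contexts)
        if is_bad(model_respond(context)):
            found.append(context)
    return found

if __name__ == "__main__":
    contexts = [f"scenario {i}" for i in range(100)]
    print(elicit_bad_contexts(contexts)[:5])
```

White-box variants would replace the blind sampling with search guided by model internals, which is where interpretability or singular learning theory could plug in.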
The full sketch is the next post of this sequence.
3. Future work
Concrete open problems in honesty
Below we list some categories of problems surfaced by our safety case sketch for debate. These problems are not novel to the sketch, but the sketch clarifies precisely what problem needs to be solved and what assumptions are available.
In general, there are now many well-defined problems directly relevant to alignment, and the situation is better than it was a few years ago, not just in debate (see, e.g., the ARC and Scientist AI research agendas). Many of these subproblems highlight how external research agendas (e.g. white-box methods including singular learning theory) could contribute to a scalable oversight safety case. We think that the volume of problems is such that the number of theorists working on alignment could grow by an order of magnitude while working solely on existing problems. The majority of these questions are parallelisable and modular: they do not depend on progress on other questions, and can be understood and defined without requiring a full picture of the debate safety case.
More details on our empirical approach
We aim to run experiments to validate our theoretical claims. Building experiments and evaluations for alignment methods has previously succeeded in surfacing limitations of alignment methods (e.g. the obfuscated arguments problem in debate) and we expect empirical evidence to continue to be a crucial part of the overall picture.
For alignment experiments, we'll need new datasets and evaluation methods that differ from traditional datasets measuring capabilities (such as coding or writing skills). An alignment evaluation needs to carefully work around the fact that we’re usually trying to find evidence about what might work on future models, rather than the ones we have today.
We can currently see three possible approaches:
We list a few concrete empirical problems we would like to see solved in section 4.
Moving beyond honesty: automated alignment
Honesty may be necessary to trust automated alignment research, but it is not sufficient (for example, the AI researcher could be honest but wrong). So, we are interested in developing safety case sketches for automated alignment.
As of late 2024, Google claimed that ~25% of their new code was written by AI. There is a robust trend towards AI taking on larger autonomous, agentic roles in software engineering and on well-defined research problems. As such, AI will likely play a large role in the research and engineering involved in aligning future AI systems. Understanding where such automation improves on direct human efforts, and where delegating work to AIs risks undermining alignment research, is a priority.
It’s not just safety cases: we are also interested in research that maps or expands the set of alignment subproblems that can be safely tackled using AI agents, such as by studying long-horizon generalisation, reward hacking in the code agent setting, etc. We have so far released one post focused on automating interpretability.
4. List of open problems we’d like to see solved
We recognise many of these are not concrete enough to work on yet! We plan on publishing a series of posts on key open problems over the coming weeks, and will update this list with links as they are released.
4.1 Empirical problems
Here are some empirical problems arising from our debate research:
4.2 Theoretical problems
Here are some theoretical problems arising from our debate research:
ARC have also set out some theory problems: for example, settling the computational no-coincidence conjecture, or separating efficient defendability from efficient PAC learnability (Conjecture 5.4 in Christiano et al., 2024).
We will update this list as we make these problems more concrete and identify new open problems.
Collaborate with us
We want to collaborate with researchers across disciplines on approaches to AI alignment. We’re particularly keen to hear from:
Both theoretical and empirical proposals are welcome: express interest here. You can also apply for grants directly here.