Peter Favaloro — AI Alignment Forum

Research directions Open Phil wants to fund in technical AI safety

This is a linkpost for https://www.openphilanthropy.org/tais-rfp-research-areas/

Open Philanthropy has just launched a large new Request for Proposals for technical AI safety research. Here we're sharing a reference guide, created as part of that RFP, which describes what projects we'd like to see across 21 research directions in technical AI safety.

This guide provides an opinionated overview of recent work and open problems across areas like adversarial testing, model transparency, and theoretical approaches to AI alignment. We link to hundreds of papers and blog posts and offer approximately a hundred different example projects. We hope this is a useful resource for technical people getting started in alignment research. We'd also welcome feedback from the LW community on our prioritization within or across research areas.

For each research area, we include:

Discussion of key technical problems and why

...


Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas

jake_mendel, maxnadeau, Peter Favaloro

This is a linkpost for https://www.openphilanthropy.org/request-for-proposals-technical-ai-safety-research/

Open Philanthropy is launching a big new Request for Proposals for technical AI safety research, with plans to fund roughly $40M in grants over the next 5 months, and available funding for substantially more depending on application quality.

Applications start with a simple 300-word expression of interest and are open until April 15, 2025.

Overview

We're seeking proposals across 21 different research areas, organized into five broad categories:

Adversarial Machine Learning
- *Jailbreaks and unintentional misalignment
- *Control evaluations
- *Backdoors and other alignment stress tests
- *Alternatives to adversarial training
- Robust unlearning
Exploring sophisticated misbehavior of LLMs
- *Experiments on alignment faking
- *Encoded reasoning in CoT and inter-model communication
- Black-box LLM psychology
- Evaluating whether models can hide dangerous behaviors
- Reward hacking of human oversight
Model transparency
- Applications of white-box techniques
- Activation monitoring
- Finding feature representations
- Toy models for interpretability
- Externalizing reasoning
- Interpretability benchmarks
- More transparent architectures
Trust from first principles
- White-box estimation of rare misbehavior
- Theoretical

...
