Open Philanthropy is launching a big new Request for Proposals for technical AI safety research. We plan to fund roughly $40M in grants over the next 5 months, with funding available for substantially more depending on application quality.
Applications (here) start with a simple 300-word expression of interest and are open until April 15, 2025.
Overview
We're seeking proposals across 21 different research areas, organized into five broad categories:
- Adversarial Machine Learning
  - Jailbreaks and unintentional misalignment
  - Control evaluations
  - Backdoors and other alignment stress tests
  - Alternatives to adversarial training
  - Robust unlearning
- Exploring sophisticated misbehavior of LLMs
  - Experiments on alignment faking
  - Encoded reasoning in CoT and inter-model communication
  - Black-box LLM psychology
  - Evaluating whether models can hide dangerous behaviors
  - Reward hacking of human oversight
- Model transparency
  - Applications of white-box techniques
  - Activation monitoring
  - Finding feature