AI ALIGNMENT FORUM

janos — AI Alignment Forum

239

3

55

17y

On scalable oversight with weak LLMs judging strong LLMs

Abstract: Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a human judge; consultancy, where a single AI tries to convince a human judge who asks questions; and compare to a baseline of direct question-answering,...

Jul 8, 2024 • 49

Power-seeking can be probable and predictive for trained agents

Power-seeking is a major source of risk from advanced AI and a key element of most threat models in alignment. Some theoretical results show that most reward functions incentivize reinforcement learning agents to take power-seeking actions. This is concerning, but does not immediately imply that the agents we train will...

Feb 28, 2023 • 56