TL;DR: There has been a lot of discussion on LessWrong about deceptive AI, much of it philosophical. We have now written a paper that proves that deception is one of two failure modes when using RLHF improperly. It's called “When Your AIs Deceive You: Challenges with...
This is the appendix to Natural Abstractions: Key Claims, Theorems, and Critiques. It contains additional details that we expect to be relevant only to some readers. We also have a PDF with more mathematical details, containing the proofs of the Telephone and generalized KPD theorems, which is different content than...
TL;DR: We distill John Wentworth’s Natural Abstractions agenda by summarizing its key claims: the Natural Abstraction Hypothesis—many cognitive systems learn to use similar abstractions—and the Redundant Information Hypothesis—a particular mathematical description of natural abstractions. We also formalize proofs for several of its theoretical results. Finally, we critique the agenda’s progress...
Preface: Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Erik Jenner, who explained to me the basic intuition for why an advanced RL agent may evade the discussed corrigibility measure. I also thank Alex Turner, Magdalena Wache, and Walter Laurito for...
Introduction: Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Magdalena Wache for giving feedback on a recent version, and to Alex Turner for giving feedback on an early version of this article. When thinking about shard theory, I noticed that my...
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. The following is a short Slack dialogue between Leon Lang, Quintin Pope, and Peli Grietzer that emerged as part of the SERI-MATS stream on shard theory. Alex Turner encouraged us to share it. To follow...
Preface: The following text is my submission for the AI Safety Public Materials contest. In it, I try to lay out the importance of AI Safety Research to people who, according to the winning conditions of the contest, have not yet engaged with AI Safety, LessWrong, or effective altruism. As...