Leon Lang


I'm a final-year PhD student at the University of Amsterdam working on AI Safety and Alignment, specifically on the safety risks of Reinforcement Learning from Human Feedback (RLHF). Previously, I also worked on abstract multivariate information theory and equivariant deep learning. https://langleon.github.io/


Leon Lang

Natural Abstractions: Key Claims, Theorems, and Critiques

TL;DR: We distill John Wentworth’s Natural Abstractions agenda by summarizing its key claims: the Natural Abstraction Hypothesis (many cognitive systems learn to use similar abstractions) and the Redundant Information Hypothesis (a particular mathematical description of natural abstractions). We also formalize proofs for several of its theoretical results. Finally, we critique the agenda’s progress to date, alignment relevance, and current research methodology.

Author Contributions: Erik wrote a majority of the post and developed the breakdown into key claims. Leon formally proved the gKPD theorem and wrote most of the mathematical formalization section and appendix. Lawrence formally proved the Telephone theorem and wrote most of the related work section. All of us were involved in conceptual discussions and various small tasks.

Epistemic Status: We’re not John Wentworth, though we did confirm our understanding with him in person and shared a draft of this post with him beforehand.

Appendices: We have an additional appendix post and technical pdf containing further details and mathematical formalizations. We refer to them throughout the post at relevant places.

This post is long, and for many readers we recommend using the table of contents to skip to only the parts they are most interested in (e.g. the Key high-level claims to get a better sense for what the Natural Abstraction Hypothesis says, or our Discussion for readers already very familiar with natural abstractions who want to see our views). Our Conclusion is also a decent 2-min summary of the entire post.

Introduction

The Natural Abstraction Hypothesis (NAH) says that our universe abstracts well, in the sense that small high-level summaries of low-level systems exist, and that furthermore, these summaries are “natural”, in the sense that many different cognitive systems learn to use them. There are also additional claims about how these natural abstractions should be formalized. We thus split up the Nat

247 · Mar 16, 2023

Disentangling Shard Theory into Atomic Claims

86 · Jan 13, 2023

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

51 · Oct 22, 2024

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

48 · Mar 16, 2023


[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

TL;DR: There has been a lot of discussion on LessWrong about deceptive AI, much of it philosophical. We have now written a paper that proves that deception is one of two failure modes when using RLHF improperly. It's called “When Your AIs Deceive You: Challenges with...

Oct 22, 2024 · 51

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

This is the appendix to Natural Abstractions: Key Claims, Theorems, and Critiques. It contains additional details that we expect are only relevant to some readers. We also have a pdf with more mathematical details, which contains the proofs of the Telephone and generalized KPD theorems, which is different content than...

Mar 16, 2023 · 48

Natural Abstractions: Key Claims, Theorems, and Critiques

Mar 16, 2023 · 247

Experiment Idea: RL Agents Evading Learned Shutdownability

Preface: Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Erik Jenner who explained to me the basic intuition for why an advanced RL agent may evade the discussed corrigibility measure. I also thank Alex Turner, Magdalena Wache, and Walter Laurito for...

Jan 16, 2023 · 31

Disentangling Shard Theory into Atomic Claims

Introduction: Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Magdalena Wache for giving feedback on a recent version, and to Alex Turner for giving feedback on an early version of this article. When thinking about shard theory, I noticed that my...

Jan 13, 2023 · 86

A Short Dialogue on the Meaning of Reward Functions

Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. The following is a short Slack dialogue between Leon Lang, Quintin Pope, and Peli Grietzer that emerged as part of the SERI-MATS stream on shard theory. Alex Turner encouraged us to share it. To follow...

Nov 19, 2022 · 45

Distribution Shifts and The Importance of AI Safety

Preface: The following text is my submission for the AI Safety Public Materials contest. In it, I try to lay out the importance of AI Safety Research to people who, according to the winning conditions of the contest, have not yet engaged with AI Safety, LessWrong, or effective altruism. As...

Sep 29, 2022 · 17


AI ALIGNMENT FORUM
AF
