ARC explores the challenge of extracting information from AI systems that isn't directly observable in their outputs, i.e. "eliciting latent knowledge." They present a hypothetical AI-controlled security system to demonstrate how relying solely on visible outcomes can lead to deceptive or harmful results. The authors argue that developing methods to reveal an AI's full understanding of a situation is crucial for ensuring the safety and reliability of advanced AI systems.
Imagine if all computers in 2020 suddenly became 12 orders of magnitude faster. What could we do with AI then? Would we achieve transformative AI? Daniel Kokotajlo explores this thought experiment as a way to get intuition about AI timelines.
Daniel Kokotajlo presents his best attempt at a concrete, detailed guess of what 2022 through 2026 will look like, as an exercise in forecasting. It includes predictions about the development of AI, alongside changes in the geopolitical arena.
Nate Soares moderates a long conversation between Richard Ngo and Eliezer Yudkowsky on AI alignment. The two discuss topics like "consequentialism" as a necessary part of strong intelligence, the difficulty of alignment, and potential pivotal acts to address existential risk from advanced AI.
A vignette in which AI alignment turns out to be hard, society handles AI more competently than expected, and the outcome is still worse than hoped.
This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause.
What's the type signature of an agent? John Wentworth proposes Selection Theorems as a way to explore this question. Selection Theorems tell us what agent type signatures will be selected for in broad classes of environments. This post outlines the concept and how to work on it.
Paul Christiano describes his research methodology for AI alignment. He focuses on trying to develop algorithms that can work "in the worst case", i.e. algorithms for which we can't tell any plausible story about how they could lead to egregious misalignment. He alternates between proposing alignment algorithms and trying to think of ways they could fail.
Larger language models (LMs) like GPT-3 are certainly impressive, but nostalgebraist argues that their capabilities may not be quite as revolutionary as some claim. He examines the evidence around LM scaling and argues we should be cautious about extrapolating current trends too far into the future.
Nate Soares gives feedback to Joe Carlsmith on his paper "Is power-seeking AI an existential risk?". Nate agrees with Joe's conclusion that the chance of catastrophe by 2070 is at least 5%, but thinks the 5% figure itself is much too low. Nate gives his own probability estimates and explains various points of disagreement.
The RL algorithm "EfficientZero" achieves better-than-human performance on Atari games after only 2 hours of gameplay experience. This seems like a major advance in sample efficiency for reinforcement learning. The post breaks down how EfficientZero works and what its success might mean.