Paul Christiano paints a vivid and disturbing picture of how AI could go wrong, not with sudden violent takeover, but through a gradual loss of human control as AI systems optimize for the wrong things and develop influence-seeking behaviors.
The original draft of Ajeya Cotra's report on biological anchors for AI timelines. The report includes quantitative models and forecasts, though the specific numbers were still in flux at the time. Ajeya cautions against sharing specific conclusions widely, as they don't yet reflect Open Philanthropy's official stance.
A story in nine parts about someone creating an AI that predicts the future, and multiple people who wonder about the implications. What happens when the predictions influence what future happens?
Will AGI progress gradually or rapidly? I think the disagreement is mostly about what happens before we build powerful AGI.
I think weaker AI systems will already have radically transformed the world. This is strategically relevant because I'm imagining AGI strategies playing out in a world where everything is already going crazy, while other people are imagining AGI strategies playing out in a world that looks kind of like 2018 except that someone is about to get a decisive strategic advantage.
Historically, people worried about extinction risk from artificial intelligence have not seriously considered deliberately slowing down AI progress as a solution. Katja Grace argues that this strategy should be considered more seriously, and that common objections to it are incorrect or exaggerated.
A "good project" in AGI research needs: (1) trustworthy command, (2) research closure, (3) strong operational security, (4) commitment to the common good, (5) an alignment mindset, and (6) requisite resource levels.
The post goes into detail on what minimal, adequate, and good performance looks like.
A fictional story about an AI researcher who leaves an experiment running overnight.
Imagine if all computers in 2020 suddenly became 12 orders of magnitude faster. What could we do with AI then? Would we achieve transformative AI? Daniel Kokotajlo explores this thought experiment as a way to get intuition about AI timelines.
Ajeya Cotra, Daniel Kokotajlo, and Ege Erdil discuss their differing AI forecasts. Key topics include the importance of transfer learning, AI's potential to accelerate R&D, and the expected trajectory of AI capabilities. They explore concrete scenarios and how observations might update their views.
Daniel Kokotajlo presents his best attempt at a concrete, detailed guess of what 2022 through 2026 will look like, as an exercise in forecasting. It includes predictions about the development of AI, alongside changes in the geopolitical arena.
"Human feedback on diverse tasks" could lead to transformative AI while requiring little innovation beyond current techniques. But the natural course of this path seems likely to lead to full-blown AI takeover.
Tom Davidson analyzes AI takeoff speeds – how quickly AI capabilities might improve as they approach human-level AI. He puts ~25% probability on takeoff lasting less than 1 year, and ~50% on it lasting less than 3 years. But he also argues we should assign some probability to takeoff lasting more than 5 years.
Richard Ngo lays out the core argument for why AGI could be an existential threat: we might build AIs that are much smarter than humans and act autonomously to pursue large-scale goals; if those goals conflict with ours, such AIs could take control of humanity's future. He aims to defend this argument in detail from first principles.
In the span of a few years, some minor European explorers (later known as the conquistadors) encountered, conquered, and enslaved the peoples of several huge regions of the world. Daniel Kokotajlo argues this shows the plausibility of a small AI system rapidly taking over the world, even without overwhelming technological superiority.
In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability.
The goal of this post is to try to give that hat to more people.
A vignette in which AI alignment turns out to be hard, society handles AI more competently than expected, and the outcome is still worse than hoped.
Katja Grace provides a list of counterarguments to the basic case for existential risk from superhuman AI systems. She examines potential gaps in arguments about AI goal-directedness, AI goals being harmful, and AI superiority over humans. While she sees these as serious concerns, she doesn't find the case for overwhelming likelihood of existential risk convincing based on current arguments.
Eric Drexler's CAIS model suggests that before we get to a world with monolithic AGI agents, we will already have seen an intelligence explosion due to automated R&D. This reframes the problems of AI safety and has implications for what technical safety researchers should be doing. Rohin reviews and summarizes the model.
GDP isn't a great metric for AI timelines or takeoff speed because the relevant events (like AI alignment failure or progress towards self-improving AI) could happen before GDP growth accelerates visibly. Instead, we should focus on things like warning shots, heterogeneity of AI systems, risk awareness, multipolarity, and overall "craziness" of the world.
This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause.
The field of AI alignment is growing rapidly, attracting more resources and mindshare each year. As it grows, more people will be incentivized to misleadingly portray themselves or their projects as more alignment-friendly than they are. Adam proposes "safetywashing" as the term for this practice.
Larger language models (LMs) like GPT-3 are certainly impressive, but nostalgebraist argues that their capabilities may not be quite as revolutionary as some claim. He examines the evidence around LM scaling and argues we should be cautious about extrapolating current trends too far into the future.
AI safety researchers have different ideas of what success would look like. This post explores five different AI safety "success stories" that researchers might be aiming for and compares them along several dimensions.
Nate Soares gives feedback to Joe Carlsmith on his paper "Is power-seeking AI an existential risk?". Nate agrees with Joe's conclusion of at least a 5% chance of catastrophe by 2070, but thinks this number is much too low. Nate gives his own probability estimates and explains various points of disagreement.
The most important date is not the day the AI takes over; instead, it's the point of no return: the day we AI risk reducers lose the ability to significantly reduce AI risk. This might happen years before classic milestones like "World GWP doubles in four years" and "Superhuman AGI is deployed."
Eliezer Yudkowsky recently criticized the OpenPhil draft report on AI timelines. Holden Karnofsky thinks Eliezer misunderstood the report in important ways, and defends the report's usefulness as a tool for informing (not determining) AI timelines.
The practice of extrapolating AI timelines from biological analogies has a long history of not working. Eliezer argues that this is because the relevant resource gets consumed differently, so base-rate arguments from resource consumption end up quite unhelpful in real life.
Timelines are inherently very difficult to predict accurately, until we are much closer to AGI.