Key points: 0. Advertisement: We have an IMO-nice position paper which argues that AI Testing Should Account for Sophisticated Strategic Behaviour, and that we should think about evaluation (also) through a game-theoretic lens. (See this footnote for an example: [1].) Tell your friends! On framing evaluation as one of many tools:...
Join us at the fifth Human-Aligned AI Summer School in Prague from 22nd to 25th July 2025! Update: We have now confirmed our line-up of excellent speakers -- see below! Update: As of late June, we still have capacity for more excellent participants. Please help us spread the word...
Abstract: In an effort to inform the discussion surrounding existential risks from AI, we formulate Extinction-level Goodhart's Law as "Virtually any goal specification, pursued to the extreme, will result in the extinction[1] of humanity", and we aim to understand which formal models are suitable for investigating this hypothesis. Note that...
Summary: Formally defining Extinction-level Goodhart's Law is tricky, because formal environments don't contain any actual humans that could go extinct. But we can do it using the notion of an interpretation mapping, which sends outcomes in the abstract environment to outcomes in the real world. We can then evaluate the truth...
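As a rough illustration of how such a mapping could be set up (a minimal sketch; the symbols ι, Ω, and E below are our own notation for this teaser, not necessarily the paper's):

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Minimal sketch of the interpretation-mapping idea (notation ours,
% not necessarily the paper's).
An interpretation mapping sends abstract outcomes to real-world outcomes:
\[
  \iota : \Omega_{\mathrm{model}} \longrightarrow \Omega_{\mathrm{real}}.
\]
Writing $E \subseteq \Omega_{\mathrm{real}}$ for the set of real-world
outcomes in which humanity is extinct, the abstract environment can be said
to satisfy Extinction-level Goodhart's Law (relative to $\iota$) if, for
virtually any goal specification $G$, the $G$-optimal policy $\pi^{*}_{G}$
produces an outcome $\omega(\pi^{*}_{G})$ with
\[
  \iota\bigl(\omega(\pi^{*}_{G})\bigr) \in E.
\]
\end{document}
```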
This post overlaps with our recent paper Extinction Risk from AI: Invisible to Science?. tl;dr: In AI safety, we are worried about certain problems with using powerful AI. (For example, the difficulty of value specification, instrumental convergence, and the possibility that a misaligned AI will come up with takeover strategies...
This post overlaps with our recent paper Extinction Risk from AI: Invisible to Science?. Summary: If you want to use a mathematical model to evaluate an argument, that model needs to allow for dynamics that are crucial to the argument. For example, if you want to evaluate the argument that...
This post overlaps with our recent paper Extinction Risk from AI: Invisible to Science?. tl;dr: With claims such as "optimisation towards a misspecified goal will cause human extinction", we should be more explicit about the order of quantifiers (and the quantities) of the underlying concepts. For example, do we mean...
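To make the quantifier-order issue concrete, here is a hedged sketch (the predicate Ext and the variables G and p are ours, chosen purely to illustrate the point, not taken from the post):

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Two readings that differ only in quantifier order (predicates ours, for
% illustration). Let $\mathrm{Ext}(G, p)$ mean ``optimising the misspecified
% goal specification $G$ with optimisation power $p$ causes human extinction''.
The strong reading: every misspecified goal becomes lethal past some
goal-dependent level of optimisation power,
\[
  \forall G \;\exists p_{0} \;\forall p \geq p_{0} :\; \mathrm{Ext}(G, p);
\]
the weak reading: some misspecified goal is lethal past some level of
optimisation power,
\[
  \exists G \;\exists p_{0} \;\forall p \geq p_{0} :\; \mathrm{Ext}(G, p).
\]
\end{document}
```

The two formulas make very different claims, which is exactly why the order of quantifiers needs to be stated explicitly before the argument can be evaluated.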