Let’s say I know how to build / train a human-level (more specifically, John von Neumann level) AGI. And let’s say that we (and/or the AGI itself) have already spent a few years on making the algorithm work better and more efficiently.
Question: How much compute will it take to run this AGI?
(NB: I said “running” an AGI, not training / programming an AGI. I’ll talk a bit about “training compute” at the very end.)
Answer: I don’t know. But that doesn’t seem to be stopping me from writing this post. ¯\_(ツ)_/¯ My current feeling—which I can easily imagine changing after discussion (which is a major reason I'm writing this!)—seems to be:
Work done with Ramana Kumar, Sebastian Farquhar (Oxford), Jonathan Richens, Matt MacDermott (Imperial) and Tom Everitt.
Our DeepMind Alignment team researches ways to avoid AGI systems that knowingly act against the wishes of their designers. We’re particularly concerned about agents which may be pursuing a goal that is not what their designers want.
These types of safety concerns motivate developing a formal theory of agents, both to improve our understanding of their properties and to avoid designs that pose a safety risk. Causal influence diagrams (CIDs) aim to be a unified theory of how design decisions create incentives that shape agent behaviour. The goal is to illuminate potential risks before an agent is trained and to inspire better agent designs with more appealing alignment properties.
Our new paper, Discovering Agents, introduces new ways of tackling these issues, including:
In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.
I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress."
Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:
This post is part of the work done at Conjecture.
Thanks to Eric Winsor, Daniel Braun, Andrea Miotti and Connor Leahy for helpful comments and feedback on the draft versions of this post.
There has been a lot of debate and discussion recently in the AI safety community about whether AGI will likely optimize for fixed goals or be a wrapper mind. The term wrapper mind is largely a restatement of the old idea of a utility maximizer, with AIXI as a canonical example. The fundamental idea is that there is an agent with some fixed utility function, which it maximizes with no feedback of any kind that could change that utility function.
Rather, the optimization process is assumed to be 'wrapped around' some core and unchanging utility function. The capabilities core...
This is a lightly edited transcript of a chatroom conversation between Scott Alexander and Eliezer Yudkowsky last year, following up on the Late 2021 MIRI Conversations. Questions discussed include "How hard is it to get the right goals into AGI systems?" and "In what contexts do AI systems exhibit 'consequentialism'?".
@ScottAlexander ready when you are
Okay, how do you want to do this?
If you have an agenda of Things To Ask, you can follow it; otherwise I can start by posing a probing question or you can?
We've been very much winging it on these and that has worked... as well as you have seen it working!
Okay. I'll post from my agenda. I'm assuming we both have the right to edit logs
In this note I will discuss some computations and observations that I have seen in other posts about "basin broadness/flatness". I am mostly working off the content of the posts Information Loss --> Basin flatness and Basin broadness depends on the size and number of orthogonal features. I will attempt to give one rigorous and unified narrative for the core mathematical parts of these posts, and I will also attempt to explain my reservations about some aspects of these approaches. This post started out as a series of comments that I had already made on the posts, but I felt it might be worthwhile to spell out my position and give my own explanations.
Work completed while the author was a SERI MATS scholar under the mentorship of...