Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.
This doesn't seem like it belongs on a "list of good heuristics", though!
I helped make this list in 2016 for a post by Nate, partly because I was dissatisfied with Scott's list (which includes people like Richard Sutton, who thinks worrying about AI risk is carbon chauvinism):
Stuart Russell’s Cambridge talk is an excellent introduction to long-term AI risk. Other leading AI researchers who have expressed these kinds of concerns about general AI include Francesca Rossi (IBM), Shane Legg (Google DeepMind), Eric Horvitz (Microsoft), Bart Selman (Cornell), Ilya Sutskever (OpenAI), Andrew Davison (Imperial College London), David McAllester (TTIC), and Jürgen Schmidhuber (IDSIA).
These days I'd probably make a different list, including people like Yoshua Bengio. AI risk stuff is also sufficiently in the Overton window that I care more about researchers' specific views than about "does the alignment problem seem nontrivial to you?". Even if we're just asking the latter question, I think it's more useful to list the specific views and arguments of individuals (e.g., note that Rossi is more optimistic about the alignment problem than Russell), list the views and arguments of the similarly prominent CS people who think worrying about AGI is silly, and let people eyeball which people they think tend to produce better reasons.
One of the main explanations of the AI alignment problem I link people to.
When I read this post, it struck me as a remarkably good introduction to logical induction, and the whole discussion seemed very core to the formal-epistemology projects on LW and AIAF.
My intuition is that it'd probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, mis-aligned superhuman AIs, and had cheap, robust methods to tell if a particular AI was misaligned.
This sounds different from how I model the situation; my views agree here with Nate's (emphasis added):
I would rephrase 3 as "There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I’d compare these mistakes to the “small” decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running. Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious ways that makes it easy for developers to recognize that deployment is a very bad idea. The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are "your team runs into a capabilities roadblock and can't achieve AGI" or "your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time."
My current model of 'the default outcome if the first project to develop AGI is highly safety-conscious, is focusing on alignment, and has a multi-year lead over less safety-conscious competitors' is that the project still fails, because their systems keep failing their tests but they don't know how to fix the deep underlying problems (and may need to toss out years of work and start from scratch in order to have a real chance at fixing them). Then they either (a) lose their lead, and some other project destroys the world; (b) decide they have to ignore some of their tests, and move ahead anyway; or (c) continue applying local patches without understanding or fixing the underlying generator of the test failures, until they or their system find a loophole in the tests and sneak by.
I don't think any of this is inevitable or impossible to avoid; it's just the default way I currently visualize things going wrong for AGI developers with a strong interest in safety and alignment.
Possibly you'd want to rule out (c) with your stipulation that the tests are "robust"? But I'm not sure you can get tests that robust. Even in the best-case scenario where developers are in a great position to build aligned AGI and successfully do so, I'm not imagining post-hoc tests that are robust to a superintelligence trying to game them. I'm imagining that the developers have a prior confidence from their knowledge of how the system works that every part of the system either lacks the optimization power to game any relevant tests, or will definitely not apply any optimization to trying to game them.
Or do you think the discontinuity will be more in the realm of embedded agency style concerns (and how does this make it less safe, instead of just dysfunctional?)
This in particular doesn't match my model. Quoting some relevant bits from Embedded Agency:
So I'm not talking about agents who know their own actions because I think there's going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.
But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.
What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.
This is also the topic of The Rocket Alignment Problem.
One option that's smaller than link posts might be to mention in the AF/LW version of the newsletter which entries are new to AIAF/LW as far as you know; or make comment threads in the newsletter for those entries. I don't know how useful these would be either, but it'd be one way to create common knowledge 'this is currently the one and only place to discuss these things on LW/AIAF'.
Agents need to consider multiple actions and choose the one that has the best outcome. But we're supposing that the code representing the agent's decision only has one possible output. E.g., perhaps an agent is going to choose between action A and action B, and will end up choosing A. Then a sufficiently close examination of the agent's source code will reveal that the scenario "the agent chooses B" is logically inconsistent. But then it's not clear how the agent can reason about the desirability of "the agent chooses B" while evaluating its outcomes, if not via some mechanism for nontrivially reasoning about outcomes of logically inconsistent situations.
The comment starting "The main datapoint that Rob left out..." is actually by Nate Soares. I cross-posted it to LW from an email conversation.
The above is the full Embedded Agency sequence, cross-posted from the MIRI website so that it's easier to find the text version on AIAF/LW (via search, sequences, author pages, etc.).
Scott and Abram have added a new section on self-reference to the sequence since it was first posted, and slightly expanded the subsequent section on logical uncertainty and the start of the robust delegation section.