Implications of Quantum Computing for Artificial Intelligence Alignment Research

re: importance of oversight

I do not think we really disagree on this point. I also believe that inspecting the state of the computer is less important than understanding how the program is going to operate and how to shape its incentives.

Maybe this could be better emphasized, but the way I think about this article is that it shows that even the strongest case for looking at the intersection of quantum computing and AI alignment does not look very promising.


re: How quantum computing will affect ML

I basically agree that the most plausible way QC can affect AI alignment is by providing computational speedups - but I think this mostly changes the timelines rather than violating any specific assumptions of usual AI alignment research.

Relatedly, I am skeptical that we will see better-than-quadratic speedups (i.e., beyond Grover's quadratic speedup) - to get better-than-quadratic speedups you need to overcome many challenges that, right now, it is not clear can be overcome outside of very contrived problem setups [REF].

In fact I think that the speedups will not even be quadratic because you "lose" the quadratic speedup when parallelizing quantum computing (in the sense that the speedup does not scale quadratically with the number of cores).
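A back-of-the-envelope way to see the parallelization point: classically, searching N items with p cores takes time ~N/p, while partitioning Grover search over p quantum cores takes time ~sqrt(N/p), so the quantum advantage is only ~sqrt(N/p) and shrinks as p grows. The numbers below are my own illustrative choices, not from the comment:

```python
# Toy arithmetic (illustrative numbers of my own choosing): how Grover's
# quadratic speedup degrades under naive parallelization.
# Classical search over N items with p cores: time ~ N / p.
# Quantum search partitioned over p machines:  time ~ sqrt(N / p).
import math

N = 10**12  # size of the search space (arbitrary)

for p in [1, 10**2, 10**4, 10**6]:
    classical = N / p
    quantum = math.sqrt(N / p)
    # The advantage is classical/quantum = sqrt(N/p): it shrinks with p.
    print(f"p = {p:>9,}: quantum advantage ~ {classical / quantum:>12,.0f}x")
```

With a single core the advantage is the familiar sqrt(N) = 1,000,000x, but at a million cores it has collapsed to 1,000x - the speedup does not scale with the number of cores.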

Suggestions of posts on the AF to review

Suggestion 1: Utility != reward, by Vladimir Mikulik. This post attempts to distill the core ideas of mesa-alignment. This kind of distillation increases the surface area of AI alignment, which addresses one of the key bottlenecks of the field (that is, getting people familiarized with the field, motivated to work on it, and equipped with some open questions to work on). I would like an in-depth review because it might help us learn how to do this better!

Suggestion 2: my coauthor Pablo Moreno and I would be interested in feedback on our post about quantum computing and AI alignment. We do not think that the ideas of the paper are useful in the sense of getting us closer to AI alignment, but I think it is useful to have signposts explaining why avenues that might seem attractive to people coming into the field are not worth exploring, while introducing them to the field in a familiar way (in this case, our audience is quantum computing experts). One thing that confuses me is that some people have approached me after we published the post asking why I think that quantum computing is useful for AI alignment, so I'd be interested in feedback on what went wrong in the communication process, given the deflationary nature of the article.

AGI safety from first principles: Goals and Agency

I think this helped me understand you quite a bit better - thank you!

Let me try paraphrasing this:

> Humans are our best example of a sort-of-general intelligence. And humans have a lazy, satisficing, 'small-scale' kind of reasoning that is mostly only well suited for activities close to their 'training regime'. Hence AGIs may be the same - and in particular, if AGIs are trained with Reinforcement Learning and heavily rewarded for following human intentions, this may be a likely outcome.

Is that pointing in the direction you intended?

AGI safety from first principles: Goals and Agency

Let me try to paraphrase this: 

In the first paragraph you are saying that "seeking influence" is not something that a system will learn to do if it was not a possible strategy in the training regime (but couldn't it appear as an emergent property? Humans were certainly not trained to launch rockets - but they nevertheless did).

In the second paragraph you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn't they need to have common sense in the first place, which kind of assumes that the AI is already aligned?).

In the third paragraph it seems to me that you are saying that humans have some goals with a built-in override mechanism - e.g. in general humans have a goal of eating delicious cake, but they will forego this goal in the interest of seeking water if they are about to die of dehydration (but doesn't this seem to be a consequence of these goals being just instrumental proxies for the complex thing that humans actually care about?).

I think I am confused because I do not understand your overall point, so the three paragraphs seem to be saying wildly different things to me.

AGI safety from first principles: Goals and Agency

I notice I am surprised that you write

> However, the link from instrumentally convergent goals to dangerous influence-seeking is only applicable to agents which have final goals large-scale enough to benefit from these instrumental goals

without addressing the "Riemann disaster" or "paperclip maximizer" examples [1]:

  • Riemann hypothesis catastrophe. An AI, given the final goal of evaluating the Riemann hypothesis, pursues this goal by transforming the Solar System into “computronium” (physical resources arranged in a way that is optimized for computation)— including the atoms in the bodies of whomever once cared about the answer.
  • Paperclip AI. An AI, designed to manage production in a factory, is given the final goal of maximizing the manufacture of paperclips, and proceeds by converting first the Earth and then increasingly large chunks of the observable universe into paperclips.

Do you think that the argument motivating these examples is invalid?

Do you disagree with the claim that even systems with very modest and specific goals will have incentives to seek influence to perform their tasks better? 

Prisoners' Dilemma with Costs to Modeling

I have been thinking about this research direction for ~4 days.

No interesting results, though it was a good exercise to calibrate how much I enjoy researching this type of stuff.

In case somebody else wants to dive into it, here are some thoughts I had and resources I used:


  • The definition of depth given in the post seems rather unnatural to me. This is because I expected it would be easy to relate the depth of two agents to the rank of the world in a Kripke chain at which the fixed points representing their behavior stabilize. Looking at Zachary Gleit's proof of the fixed point theorem (see The Logic of Provability, chapter 8, by G. Boolos), we can relate the modal degree of a fixed point to the number of modal operators that appear in the modalized formula to be fixed. I thought I could go through Gleit's proof counting the number of boxes that appear in the fixed points, and then combine that with my proof of the generalized fixed point theorem to derive the relationship between the number of boxes appearing in the definitions of two agents and the modal degree of the fixed points that appear during a match. This ended up being harder than I anticipated, because naively counting the number of boxes that appear in Gleit's proof produces very fast-growing formulas, and it is hard to combine them through the induction in the proof of the generalized theorem.


  • The Logic of Provability, by G. Boolos. Has pretty much everything you need to know about modal logic. Recommended reading: chapters 1, 3, 4, 5, 6, 7 and 8.
  • Fixed point theorem of provability logic, by J. Sevilla. An in-depth explanation I wrote on Arbital some years ago.
  • Modal Logic in the Wolfram Language, by J. Sevilla. A working implementation of modal combat, with some added utilities. It is hugely inefficient, and Wolfram is not a good choice because of license issues, but it may be useful to somebody who wants to compute the result of a couple of combats or read about modal combat at an introductory level. You can open the attached notebook in the Wolfram Programming Lab.
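To make the modal combat setup concrete, here is a minimal sketch in Python (not the Wolfram implementation above) that evaluates modal agents via the standard Kripke semantics for GL: Box(f) holds at rank k iff f holds at every rank j < k, and the outcome of a match is the stabilized truth value at a sufficiently deep rank. The formula encoding and agent names are my own illustrative choices:

```python
# Minimal modal combat sketch. A formula is a nested tuple:
#   ("const", b) - constant action (True = cooperate)
#   ("other",)   - "the opponent cooperates against me"
#   ("box", f)   - the provability modality
#   ("not", f), ("and", f, g) - boolean connectives

def evaluate(formula, me, opponent, rank):
    """Truth of `formula` at `rank`; `me`/`opponent` are full agent formulas."""
    tag = formula[0]
    if tag == "const":
        return formula[1]
    if tag == "other":
        # Opponent's action at the same rank; this recursion terminates
        # only because modal agents keep "other" strictly under a box.
        return evaluate(opponent, opponent, me, rank)
    if tag == "box":
        # GL on a finite chain: provable = true at all strictly lower ranks.
        return all(evaluate(formula[1], me, opponent, j) for j in range(rank))
    if tag == "not":
        return not evaluate(formula[1], me, opponent, rank)
    if tag == "and":
        return (evaluate(formula[1], me, opponent, rank)
                and evaluate(formula[2], me, opponent, rank))
    raise ValueError(f"unknown tag: {tag}")

COOPERATE_BOT = ("const", True)
DEFECT_BOT = ("const", False)
FAIR_BOT = ("box", ("other",))  # cooperate iff provably the opponent cooperates

def outcome(a, b, depth=10):
    # Fixed points stabilize once the rank exceeds the modal degree,
    # so any sufficiently deep rank gives the limiting behavior.
    return evaluate(a, a, b, depth), evaluate(b, b, a, depth)

print(outcome(FAIR_BOT, FAIR_BOT))    # (True, True): Löbian cooperation
print(outcome(FAIR_BOT, DEFECT_BOT))  # (False, False)
```

Note how FairBot vs FairBot cooperates because Box(other) is vacuously true at rank 0 and then true at every rank by induction - the usual Löbian handshake. The "depth" of agents in the post's sense would show up here as how large `depth` must be before these values stop changing.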

Thank you Scott for writing this post, it has been useful to get a glimpse of how to do research.