Suggestion 1: Utility != reward by Vladimir Mikulik. This post attempts to distill the core ideas of mesa alignment. This kind of distillation increases the surface area of AI Alignment, addressing one of the key bottlenecks of the field (that is, getting people familiarized with the field, motivated to work on it, and equipped with some open questions to work on). I would like an in-depth review because it might help us learn how to do this kind of distillation better!
Suggestion 2: my coauthor Pablo Moreno and I would be interested in feedback on our post about quantum computing and AI alignment. We do not think that the ideas of the paper bring us closer to AI alignment, but I think it is useful to have signposts explaining why avenues that might seem attractive to people coming into the field are not worth exploring, while introducing them to the field in a familiar way (in this case our audience is quantum computing experts). One thing that confuses me is that some people have approached me after the post was published asking why I think quantum computing is useful for AI alignment, so I'd be interested in feedback on what went wrong in the communication process, given the deflationary nature of the article.
I think this helped me understand you a bit better - thank you
Let me try paraphrasing this:
> Humans are our best example of a sort-of-general intelligence. And humans have a lazy, satisficing, 'small-scale' kind of reasoning that is mostly only well suited for activities close to their 'training regime'. Hence AGIs may be the same - in particular, if AGIs are trained with reinforcement learning and heavily rewarded for following human intentions, this may be a likely outcome.
Is that pointing in the direction you intended?
Let me try to paraphrase this:
In the first paragraph you are saying that "seeking influence" is not something that a system will learn to do if it was not a possible strategy in the training regime. (But couldn't it appear as an emergent property? Certainly humans were not trained to launch rockets, yet they nevertheless did.)
In the second paragraph you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn't they need to have common sense in the first place, which kind of assumes that the AI is already aligned?)
In the third paragraph it seems to me that you are saying that humans have some goals with a built-in override mechanism - e.g. in general humans have a goal of eating delicious cake, but they will forgo this goal in the interest of seeking water if they are about to die of dehydration (but doesn't this seem to be a consequence of these goals being just instrumental proxies for the complex thing that humans actually care about?)
I think I am confused because I do not understand your overall point, so the three paragraphs seem to be saying wildly different things to me.
I notice I am surprised that you write:
> However, the link from instrumentally convergent goals to dangerous influence-seeking is only applicable to agents which have final goals large-scale enough to benefit from these instrumental goals
and do not address the "Riemann disaster" or "Paperclip maximizer" examples [1]:
- Riemann hypothesis catastrophe. An AI, given the final goal of evaluating the Riemann hypothesis, pursues this goal by transforming the Solar System into “computronium” (physical resources arranged in a way that is optimized for computation)— including the atoms in the bodies of whomever once cared about the answer.
- Paperclip AI. An AI, designed to manage production in a factory, is given the final goal of maximizing the manufacture of paperclips, and proceeds by converting first the Earth and then increasingly large chunks of the observable universe into paperclips.
Do you think that the argument motivating these examples is invalid?
Do you disagree with the claim that even systems with very modest and specific goals will have incentives to seek influence to perform their tasks better?
I have been thinking about this research direction for ~4 days.
No interesting results, though it was a good exercise to calibrate how much I enjoy researching this type of stuff.
In case somebody else wants to dive into it, here are some thoughts I had and resources I used:
Thoughts:
Resources:
Thank you, Scott, for writing this post; it has been useful to get a glimpse of how to do research.
re: importance of oversight
I do not think we really disagree on this point. I also believe that looking at the state of the computer is not as important as having an understanding of how the program is going to operate and how to shape its incentives.
Maybe this could be better emphasized, but I think of this article as showing that even the strongest case for looking at the intersection of quantum computing and AI alignment does not look very promising.
re: How quantum computing will affect ML
I basically agree that the most plausible way QC can affect AI alignment is by providing computational speedups - but I think this mostly changes the timelines rather than violating any specific assumptions in usual AI alignment research.
Relatedly, I am skeptical that we will see better-than-quadratic speedups (i.e. beyond Grover) - to get better-than-quadratic speedups you need to overcome many challenges that it is currently unclear can be overcome outside of very contrived problem setups [REF].
In fact I think that the speedups will not even be quadratic, because you "lose" part of the quadratic speedup when parallelizing quantum computing (the quantum runtime improves only with the square root of the number of cores, so the advantage over an equally parallel classical implementation shrinks as you add cores).
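To spell out that parallelization point, here is a rough back-of-the-envelope sketch (my own illustration, not from the post), assuming idealized unstructured search over N items split evenly across P processors, with each quantum processor running Grover on its own chunk:

```latex
% Idealized parallel unstructured search over N items with P processors.
% N and P are illustrative symbols; this assumes the textbook Grover bound.
\begin{align*}
  T_{\mathrm{classical}}(N,P) &= \Theta\!\left(N/P\right)
    && \text{each classical processor scans its } N/P \text{ items} \\
  T_{\mathrm{quantum}}(N,P)   &= \Theta\!\left(\sqrt{N/P}\right)
    && \text{each quantum processor runs Grover on an } N/P \text{ chunk} \\
  \frac{T_{\mathrm{classical}}}{T_{\mathrm{quantum}}}
    &= \Theta\!\left(\sqrt{N/P}\right)
    && \text{the advantage over classical shrinks as } P \text{ grows}
\end{align*}
```

So a single quantum processor enjoys the full quadratic advantage, but in a heavily parallel setting each doubling of P halves the classical runtime while only shaving a factor of sqrt(2) off the quantum one.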