I didn't like this post. At the time, I wrote a mildly critical comment (which is currently the top-voted comment, somewhat to my surprise), but I didn't actually engage with the idea very much. So it seems like a good idea to say something now.
The main argument that this is valuable seems to be: this captures a common crux in AI safety. I don't think it's my crux, and I think other people who think it is their crux are probably mistaken. So from my perspective it's a straw-man of the view it...
I think that strictly speaking this post (or at least the main thrust) is true, and proven in the first section. The title is arguably less true: I think of 'coherence arguments' as including things like 'it's not possible for you to agree to give me a limitless number of dollars in return for nothing', which does imply some degree of 'goal-direction'.
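To make the 'limitless number of dollars' example concrete, here is a minimal sketch of a money pump. It is an illustration only, not something taken from the post: the function name `run_money_pump`, the items A/B/C, and the fee are all hypothetical, and it assumes an agent that pays a fixed fee whenever it is offered an item it strictly prefers to the one it holds.

```python
def run_money_pump(prefers, start, offers, fee, rounds):
    """Repeatedly offer trades to an agent and track how much it pays.

    `prefers` is a set of (x, y) pairs meaning "x is strictly preferred to y".
    The agent accepts an offer, and pays `fee`, whenever it prefers the
    offered item to the item it currently holds. (Hypothetical toy model.)
    """
    holding, paid = start, 0
    for i in range(rounds):
        offered = offers[i % len(offers)]
        if (offered, holding) in prefers:
            holding, paid = offered, paid + fee
    return holding, paid


# Cyclic preferences: A > B, B > C, C > A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
# Transitive preferences: A > B > C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}

offers = ["B", "A", "C"]  # the trader cycles through these offers

cyclic_result = run_money_pump(cyclic, "C", offers, fee=1, rounds=9)
transitive_result = run_money_pump(transitive, "C", offers, fee=1, rounds=9)
print(cyclic_result)      # ('C', 9): back where it started, 9 dollars poorer
print(transitive_result)  # ('A', 2): stops trading once it holds its top item
```

With cyclic preferences the amount paid grows with the number of rounds, while the transitive agent's loss is bounded. That is the limited sense in which this kind of coherence argument rules out some behaviour.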
I think the post is important, because it constrains the types of valid arguments that can be given for 'freaking out about goal-directedness', for lack of a better term. In my mind, it provokes various follo
In this essay, ricraz argues that we shouldn't expect a clean mathematical theory of rationality and intelligence to exist. I have debated em about this, and I continue to endorse more or less everything I said in that debate. Here I want to restate some of my (critical) position by building it from the ground up, instead of responding to ricraz point by point.
When should we expect a domain to be "clean" or "messy"? Let's look at everything we know about science. The "cleanest" domains are mathematics and fundamental physics. There, we have crisply defined
I've been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it's fun, standardized, up-to-date, comprehensive, and collaborative.
Some of the appeal is that it's fun to read about AI cheating at tasks in unexpected ways (I...
A year later, I continue to agree with this post; I still think its primary argument is sound and important. I'm somewhat sad that I still think it is important; I thought this was an obvious-once-pointed-out point, but I do not think the community actually believes it yet.
I particularly agree with this sentence of Daniel's review:
"I think the post is important, because it constrains the types of valid arguments that can be given for 'freaking out about goal-directedness', for lack of a better term."
"Constraining the types of v...
In this essay Paul Christiano proposes a definition of "AI alignment" which is narrower than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be to help the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.
Rohin Shah's comment on the essay (which I believe is endorsed
This is my post.
I've spent much of the last year thinking about the pedagogical mistakes I made here, and am writing the Reframing Impact sequence to fix them. While this post recorded my 2018-thinking on impact measurement, I don't think it communicated the key insights well. Of course, I'm glad it seems to have nonetheless proven useful and exciting to some people!
If I were to update this post, it would probably turn into a rehash of Reframing Impact. Instead, I'll just briefly state the argument as I would present it today.
In this essay, Rohin sets out to debunk what ey perceive as a prevalent but erroneous idea in the AI alignment community, namely: "VNM and similar theorems imply goal-directed behavior". This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: "coherence arguments" imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximiz
I hadn't realized this post was nominated, partially because of my comment, so here's a late review. I basically continue to agree with everything I wrote then, and I continue to like this post for those reasons, and so I support including it in the LW Review.
Since writing the comment, I've come across another argument for thinking about intent alignment -- it seems like a "generalization" of assistance games / CIRL, which itself seems like a formalization of an aligned agent in a toy setting. In assistance games, the agent explici...
Review by the author:
I continue to endorse the contents of this post.
I don't really think about the post that much, but the post expresses a worldview that shapes how I do my research - that agency is a mechanical fact about the workings of a system.
To me, the main contribution of the post is setting up a question: what's a good definition of optimisation that avoids the counterexamples of the post? Ideally, this definition would refer or correspond to the mechanistic properties of the system, so that people could somehow statically determine whether a giv
Daniel Filan's bottle cap example was featured prominently in "Risks from Learned Optimization" for good reason. I think it is a really clear and useful example of why you might want to care about the internals of an optimization algorithm and not just its behavior, and helped motivate that framing in the "Risks from Learned Optimization" paper.
Reading Alex Zhu's Paul agenda FAQ was the first time I felt like I understood Paul's agenda in its entirety as opposed to only understanding individual bits and pieces. I think this FAQ was a major contributing factor in me eventually coming to work on Paul's agenda.
I actually have some understanding of what MIRI's Agent Foundations work is about
Robustness to scale is still one of my primary explanations for why MIRI-style alignment research is useful, and why alignment work in general should be front-loaded. I am less sure about this specific post as an introduction to the concept (since I had it before the post, and don't know if anyone got it from this post), but think that the distillation of concepts floating around meatspace to clear reference works is one of the important functions of LW.
This post, and TurnTrout's work in general, have taken the impact measure approach far beyond what I thought was possible, which turned out to be both a valuable lesson for me in being less confident about my opinions around AI Alignment, and valuable in that it helped me clarify and think much better about a significant fraction of the AI Alignment problem.
I've since discussed TurnTrout's approach to impact measures with many people.
I think this post, together with Abram's other post "Towards a new technical explanation", actually convinced me that a Bayesian approach to epistemology can't work in an embedded context, which was a really big shift for me.
Abram's writing and illustrations often distill technical insights into accessible, fun adventures. I've come to appreciate the importance and value of this expository style more and more over the last year, and this post is what first put me on this track. While more rigorous communication certainly has its place, clearly communicating the key conceptual insights behind a piece of work makes those insights available to the entire community.
I think that the development of Attainable Utility Preservation was significantly more progress on impact measures than (at the time) I thought would ever be possible (though RR also deserves some credit here). I also think it significantly clarified my thoughts on what impact is and how instrumental convergence works.
This post actually got me to understand how logical induction works, and also caused me to eventually give up on Bayesianism as the foundation of epistemology in embedded contexts (together with Abram's other post on the untrollable mathematician).
This post (and the rest of the sequence) was the first time I had ever read something about AI alignment and thought that it was actually asking the right questions. It is not about a sub-problem, it is not about marginal improvements. Its goal is a gears-level understanding of agents, and it directly explains why that's hard. It's a list of everything which needs to be figured out in order to remove all the black boxes and Cartesian boundaries, and understand agents as well as we understand refrigerators.