In this essay, Paul Christiano proposes a definition of "AI alignment" that is narrower than other commonly used definitions. Specifically, Paul suggests defining alignment in terms of the agent's motivation (which should be to help the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.
Rohin Shah's comment on the essay (which I believe is endorsed
I think it was important to have something like this post exist. However, I now think it's not fit for purpose. In this discussion thread, rohinmshah, abramdemski and I end up spilling a lot of ink about a disagreement that ended up being at least partially because we took 'realism about rationality' to mean different things. rohinmshah thought that irrealism would mean that the theory of rationality was about as real as the theory of liberalism, abramdemski thought that irrealism would mean that the theory of rationality would be about as real as the theo
In this essay, ricraz argues that we shouldn't expect a clean mathematical theory of rationality and intelligence to exist. I have debated em about this, and I continue to endorse more or less everything I said in that debate. Here I want to restate some of my (critical) position by building it from the ground up, instead of responding to ricraz point by point.
When should we expect a domain to be "clean" or "messy"? Let's look at everything we know about science. The "cleanest" domains are mathematics and fundamental physics. There, we have crisply defined
I didn't like this post. At the time, I didn't engage with it very much. I wrote a mildly critical comment (which is currently the top-voted comment, somewhat to my surprise) but I didn't actually engage with the idea very much. So it seems like a good idea to say something now.
The main argument that this is valuable seems to be: this captures a common crux in AI safety. I don't think it's my crux, and I think other people who think it is their crux are probably mistaken. So from my perspective it's a straw-man of the view it...
I hadn't realized this post was nominated, partially because of my comment, so here's a late review. I basically continue to agree with everything I wrote then, and I continue to like this post for those reasons, and so I support including it in the LW Review.
Since writing the comment, I've come across another argument for thinking about intent alignment -- it seems like a "generalization" of assistance games / CIRL, which itself seems like a formalization of an aligned agent in a toy setting. In assistance games, the agent explici...
Note: this is on balance a negative review of the post, at least regarding the question of whether it should be included in a "Best of LessWrong 2018" compilation. I feel somewhat bad about writing it given that the author has already written a review that I regard as negative. That being said, I think that reviews of posts by people other than the author are important for readers looking to judge posts, since authors may well have distorted views of their own works.
I've been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when it was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it's fun, standardized, up-to-date, comprehensive, and collaborative.
Some of the appeal is that it's fun to read about AI cheating at tasks in unexpected ways (I'...
See next year's post here.
In this essay, Rohin sets out to debunk what ey perceive as a prevalent but erroneous idea in the AI alignment community, namely: "VNM and similar theorems imply goal-directed behavior". This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: "coherence arguments" imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximiz
A year later, I continue to agree with this post; I still think its primary argument is sound and important. I'm somewhat sad that I still think it is important; I thought this was an obvious-once-pointed-out point, but I do not think the community actually believes it yet.
I particularly agree with this sentence of Daniel's review:
"I think the post is important, because it constrains the types of valid arguments that can be given for 'freaking out about goal-directedness', for lack of a better term."
"Constraining the types of valid arguments" is exactly the...
Daniel Filan's bottle cap example was featured prominently in "Risks from Learned Optimization" for good reason. I think it is a really clear and useful example of why you might want to care about the internals of an optimization algorithm and not just its behavior, and helped motivate that framing in the "Risks from Learned Optimization" paper.
Reading Alex Zhu's Paul agenda FAQ was the first time I felt like I understood Paul's agenda in its entirety as opposed to only understanding individual bits and pieces. I think this FAQ was a major contributing factor in me eventually coming to work on Paul's agenda.
I actually have some understanding of what MIRI's Agent Foundations work is about
I think this post, together with Abram's other post "Towards a new technical explanation", actually convinced me that a Bayesian approach to epistemology can't work in an embedded context, which was a really big shift for me.
Robustness to scale is still one of my primary explanations for why MIRI-style alignment research is useful, and why alignment work in general should be front-loaded. I am less sure about this specific post as an introduction to the concept (since I had it before the post, and don't know if anyone got it from this post), but think that the distillation of concepts floating around meatspace to clear reference works is one of the important functions of LW.
This post, and TurnTrout's work in general, have taken the impact measure approach far beyond what I thought was possible, which turned out to be both a valuable lesson for me in being less confident about my opinions around AI Alignment, and valuable in that it helped me clarify and think much better about a significant fraction of the AI Alignment problem.
I've since discussed TurnTrout's approach to impact measures with many people.
Abram's writing and illustrations often distill technical insights into accessible, fun adventures. I've come to appreciate the importance and value of this expository style more and more over the last year, and this post is what first put me on this track. While more rigorous communication certainly has its place, clearly communicating the key conceptual insights behind a piece of work makes those insights available to the entire community.
I think that the development of Attainable Utility Preservation was significantly more progress on impact measures than (at the time) I thought would ever be possible (though RR also deserves some credit here). I also think it significantly clarified my thoughts on what impact is and how instrumental convergence works.
This post (and the rest of the sequence) was the first time I had ever read something about AI alignment and thought that it was actually asking the right questions. It is not about a sub-problem, it is not about marginal improvements. Its goal is a gears-level understanding of agents, and it directly explains why that's hard. It's a list of everything which needs to be figured out in order to remove all the black boxes and Cartesian boundaries, and understand agents as well as we understand refrigerators.
This post actually got me to understand how logical induction works, and also caused me to eventually give up on Bayesianism as the foundation of epistemology in embedded contexts (together with Abram's other post on the untrollable mathematician).