Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.
Maybe changing the title would make people less likely to land on the wrong interpretation? E.g., to 'Coherence arguments require that the system care about something'.
Even just 'Coherence arguments do not entail goal-directed behavior' might help, since colloquial "imply" tends to be probabilistic, but you mean math/logic "imply" instead. Or 'Coherence theorems do not entail goal-directed behavior on their own'.
Some skepticism from Eliezer here: https://twitter.com/ESRogs/status/1337869362678571008
I've copied over comments by MIRI's Evan Hubinger and Eliezer Yudkowsky on a slightly earlier draft of Ajeya's post — as a separate post, since it's a lot of text.
Previously linked here: https://www.alignmentforum.org/posts/wsBpJn7HWEPCJxYai/excerpt-from-arbital-solomonoff-induction-dialogue
Two examples of MIRI talking about orthogonality, instrumental convergence, etc.: "Five Theses, Two Lemmas, and a Couple of Strategic Implications" (2013) and "So Far: Unfriendly AI Edition" (2016). The latter is closer to how I'd start a discussion with a random computer scientist today, if they thought AGI alignment wasn't important to work on and I wanted to figure out where the disagreement lies.
I think "Five Theses..." is basically a list of 'here are the five things Ray Kurzweil is wrong about'. A lot of people interested in AGI early on held Kurzweilian views: humans will self-improve to keep up with AI; sufficiently smart AGI will do good stuff by default; value isn't fragile; etc. 'AGI built with no concern for safety or alignment' is modeled like a person from a foreign culture, or like a sci-fi alien race with bizarre but beautiful cosmopolitan values — not like the moral equivalent of a paperclip maximizer.
I think orthogonality, instrumental convergence, etc. are also the key premises Eliezer needed to learn. Eliezer initially dismissed the importance of alignment research because he thought moral truths were inherently motivating, so any AGI smart enough to learn what's moral would end up promoting good outcomes. Visualizing human values as just one possible goal in a vast space of possibilities, noticing that there's no architecture-invariant causal mechanism forcing modeled goals to leak out into held goals, and thinking about obviously bad goals like "just keep making paperclips" all help undo that specific confusion.
I agree that a fair number of people in the early days over-updated based on "other people are wrong" logic.
After seeing this post last month, Eliezer mentioned to me that he likes your recent posts, and would want to spend money to make more posts like this exist, if that were an option.
(I've poked Richard about this over email already, but wanted to share the Eliezer-praise here too.)
I agree with this post.
May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).
I don't think the discussion stands great on its own, but it may be helpful for:
Seems like a good starting point for discussion. Researchers need to have some picture of what AI alignment is "for," in order to think about what research directions look most promising.
I want to see more attempts to answer this question. Also related to another post I nominated: https://www.lesswrong.com/posts/PKy8NuNPknenkDY74/soft-takeoff-can-still-lead-to-decisive-strategic-advantage