Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.

Comments

Coherence arguments imply a force for goal-directed behavior

Maybe changing the title would do less to prime people toward the wrong interpretation? E.g., changing it to 'Coherence arguments require that the system care about something'.

Even just 'Coherence arguments do not entail goal-directed behavior' might help, since the colloquial "imply" tends to be probabilistic, whereas you mean the mathematical/logical "imply". Or 'Coherence theorems do not entail goal-directed behavior on their own'.

The case for aligning narrowly superhuman models

I've copied over comments by MIRI's Evan Hubinger and Eliezer Yudkowsky on a slightly earlier draft of Ajeya's post — as a separate post, since it's a lot of text.

Distinguishing claims about training vs deployment

Two examples of MIRI talking about orthogonality, instrumental convergence, etc.: "Five Theses, Two Lemmas, and a Couple of Strategic Implications" (2013) and "So Far: Unfriendly AI Edition" (2016). The latter is closer to how I'd start a discussion with a random computer scientist today, if they thought AGI alignment isn't important to work on and I wanted to figure out where the disagreement lies.

I think "Five Theses..." is basically a list of 'here are the five things Ray Kurzweil is wrong about'. A lot of people interested in AGI early on held Kurzweilian views: humans will self-improve to keep up with AI; sufficiently smart AGI will do good stuff by default; value isn't fragile; etc. 'AGI built with no concern for safety or alignment' is modeled like a person from a foreign culture, or like a sci-fi alien race with bizarre but beautiful cosmopolitan values — not like the moral equivalent of a paperclip maximizer.

I think orthogonality, instrumental convergence, etc. are also the key premises Eliezer needed to learn. Eliezer initially dismissed the importance of alignment research because he thought moral truths were inherently motivating, so any AGI smart enough to learn what's moral would end up promoting good outcomes. Visualizing human values as just one possible goal in a vast space of possibilities, noticing that there's no architecture-invariant causal mechanism forcing modeled goals to leak out into held goals, and thinking about obviously bad goals like "just keep making paperclips" helps undo that specific confusion.

I agree that a fair number of people in the early days over-updated based on "other people are wrong" logic.

Commentary on AGI Safety from First Principles

After seeing this post last month, Eliezer mentioned to me that he likes your recent posts, and would want to spend money to make more posts like this exist, if that were an option.

(I've poked Richard about this over email already, but wanted to share the Eliezer-praise here too.)

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).

I don't think the discussion stands especially well on its own, but it may be helpful for:

  • people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
  • people new to AI alignment who want to use the views of leaders in the field to help them orient.

AI Safety "Success Stories"

Seems like a good starting point for discussion. Researchers need to have some picture of what AI alignment is "for," in order to think about what research directions look most promising.
