Yeah, I think OpenAI tried to do some empirical work, but in my current model of the world it approximately just produced capability progress (though I also think the incentive environment there was particularly bad). I feel confused about the "learning to summarize from human feedback" work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).
I think Rohin Shah doesn't think of himself as having produced empirical work that helps with AI Alignment, but only as having produced empirical work that might help convince others of the importance of AI Alignment. That is still valuable, but I think it should be evaluated on a different dimension.
I haven't gotten much out of work by Geoffrey Irving or Jan Leike (and don't think I know many other people who have; at least I haven't really heard a good story for how their work actually helps). I would actually be interested if someone could give some examples of how this research helped them.
Hmm, there might be some mismatch of words here. Like, most of the work so far on the problem has been theoretical. I am confused how you could not be excited about the theoretical work that established the whole problem, the arguments for why it's hard, and that helped us figure out at least some of the basic parameters of the problem. Given that (I think) you currently think AI Alignment is among the global priorities, you presumably think the work that allowed you to come to believe that (and that allowed others to do the same) was very valuable and important.
My guess is you are somehow thinking of work like Superintelligence, or Eliezer's original work, or Evan's work on inner optimization as something different than "theoretical work"?
I mostly disagreed with bullet point two. The primary result of "empirical AI Alignment research" that I've seen in the last 5 years has been a lot of capabilities gain, with approximately zero in terms of progress on any AI Alignment problems. I agree more with the "in the long run there will be a lot of empirical work to be done", but right now on the margin, we have approximately zero traction on useful empirical work, as far as I can tell (outside of transparency research).
FWIW, I had a mildly negative reaction to this title. I agree with you, but I feel like the term "PSA" should be reserved for things that are really very straightforward and non-controversial, and I feel like it's a bit of a bad rhetorical technique to frame your arguments as a PSA. I like the overall content of the post, but feel like a self-summarizing post title like "Most AI Alignment research is not parallelizable" would be better.
I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.
If Australia was pursuing a strategy of "lock down irrespective of cost", then I don't think it makes sense to describe the initial response as competent. It just happened to be right in this case, but in order for the overall response to be helpful, it has to be adaptive to the actual costs. I agree that the early response on its own would have indicated a potentially competent decision-making algorithm, but the later followup showed that the algorithm seems to have mostly been right by accident, not on purpose.
I do appreciate the link to the GDP cost article. I would have to look into the methodology more to comment on that, but it certainly seems like an interesting analysis and a suggestive result.
Australia seems to have suffered a lot more from the pandemic than the U.S., paying far more in lockdown costs than even a relatively conservative worst-case estimate of the costs of an uncontrolled COVID pandemic. I don't know about the others, but given that you put Australia on this list, I don't currently trust the others to have acted sensibly.
(Mod note: I turned on two-factor voting. @Evan: Feel free to ask me to turn it off.)
Ok, but why isn't it better to have Godzilla fighting Mega-Godzilla instead of leaving Mega-Godzilla unchallenged?
Yeah, OK, I think this distinction makes sense, and I do feel like this distinction is important.
Having settled this, my primary response is:
Sure, I guess it's the most prototypical catastrophic action until we have solved it. But even if we solve it, we haven't solved the problem where the AI does actually get a lot smarter than humans, takes a substantially more "positive-sum" action, and kills approximately everyone with the use of a bioweapon, or launches all the nukes, or develops nanotechnology. We do have to solve this problem first. But the hard problem is the part where it seems hard to stop further AI development without having a system that is also capable of killing all (or approximately all) the humans, so calling this easy problem the "prototypical catastrophic action" feels wrong to me. Solving this problem is necessary, but not sufficient, for solving AI Alignment; and while it is this stage and earlier stages where I expect most worlds to end, I expect most worlds that make it past this stage to not survive either.
I think given this belief, I would consider your new title more wrong than the current title (I mean, maybe it's "mostly", because we are going to die in a low-dignity way, as Eliezer would say, but it's not obvious that this is where most of the difficulty lies).