Oliver Habryka

Coding day in and out on LessWrong 2.0. You can reach me at habryka@lesswrong.com


Comments

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Yeah, I think OpenAI tried to do some empirical work, but in my current model of the world it approximately just produced capabilities progress (though I also think the incentive environment there was particularly bad). I feel confused about the "learning to summarize from human feedback" work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).

I think Rohin Shah doesn't think of himself as having produced empirical work that helps with AI Alignment, but only as having produced empirical work that might help convince others of the importance of AI Alignment. That is still valuable, but I think it should be evaluated on a different dimension.

I haven't gotten much out of the work by Geoffrey Irving or Jan Leike (and I don't think I know many other people who have, or at least I haven't heard a good story for how their work actually helps). I would actually be interested if someone could give some examples of how this research helped them.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Hmm, there might be some mismatch of words here. Like, most of the work so far on the problem has been theoretical. I am confused how you could not be excited about the theoretical work that established the whole problem, laid out the arguments for why it's hard, and helped us figure out at least some of the basic parameters of the problem. Given that (I think) you currently think AI Alignment is among the top global priorities, you presumably think the work that allowed you to come to believe that (and that allowed others to do the same) was very valuable and important.

My guess is you are somehow thinking of work like Superintelligence, or Eliezer's original work, or Evan's work on inner optimization as something different than "theoretical work"?

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

I mostly disagreed with bullet point two. The primary result of "empirical AI Alignment research" that I've seen in the last 5 years has been a lot of capabilities gain, with approximately zero progress on any AI Alignment problems. I agree more with the claim that "in the long run there will be a lot of empirical work to be done", but right now, on the margin, we have approximately zero traction on useful empirical work, as far as I can tell (outside of transparency research).

A note about differential technological development

FWIW, I had a mildly negative reaction to this title. I agree with you, but I feel like the term "PSA" should be reserved for things that are really very straightforward and non-controversial, and I feel like it's a bit of a bad rhetorical technique to frame your arguments as a PSA. I like the overall content of the post, but feel like a self-summarizing post title like "Most AI Alignment research is not parallelizable" would be better.

Let's See You Write That Corrigibility Tag

I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.

Where I agree and disagree with Eliezer

If Australia was pursuing a strategy of "lock down irrespective of cost", then I don't think it makes sense to describe the initial response as competent. It just happened to be right in this case, but in order for the overall response to be helpful, it has to be adaptive to the actual costs. I agree that the early response on its own would have indicated a potentially competent decision-making algorithm, but the later followup showed that the algorithm seems to have mostly been correct by accident, not on purpose.

I do appreciate the link to the GDP cost article. I would have to look into the methodology more to comment on that, but it certainly seems like an interesting analysis and a suggestive result.

Where I agree and disagree with Eliezer

Australia seems to have suffered a lot more from the pandemic than the U.S., paying much more in lockdown costs than even a relatively conservative worst-case estimate of the costs of an uncontrolled COVID pandemic. I don't know about the others, but given that you put Australia on this list, I don't currently trust the others to have acted sensibly.

A transparency and interpretability tech tree

(Mod note: I turned on two-factor voting. @Evan: Feel free to ask me to turn it off.)

Godzilla Strategies

Ok, but why isn't it better to have Godzilla fighting Mega-Godzilla instead of leaving Mega-Godzilla unchallenged?

The prototypical catastrophic AI action is getting root access to its datacenter

Yeah, OK, I think this distinction makes sense, and I do feel like it's important.

Having settled this, my primary response is:

Sure, I guess it's the most prototypical catastrophic action until we have solved it. But even if we solve it, we haven't solved the problem where the AI actually gets a lot smarter than humans, takes a substantially more "positive-sum" action, and kills approximately everyone with a bioweapon, or launches all the nukes, or develops nanotechnology. We do have to solve this problem first, but the hard part is that it seems hard to stop further AI development without having a system that is also capable of killing all (or approximately all) the humans, so calling this easier problem the "prototypical catastrophic action" feels wrong to me. Solving this problem is necessary but not sufficient for solving AI Alignment, and while I expect most worlds to end at this stage or earlier, I expect most worlds that make it past this stage not to survive either.

Given this belief, I would think your new title is more wrong than the current title (I mean, maybe it's "mostly", because we are going to die in a low-dignity way, as Eliezer would say, but it's not obviously where most of the difficulty lies).
