I'm an admin of this site; I work full-time on trying to help people on LessWrong refine the art of human rationality.
Longer bio: www.lesswrong.com/posts/aG74jJkiPccqdkK3c/the-lesswrong-team-page-under-construction#Ben_Pace___Benito
- the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.
I am naively more scared about such an AI. That AI sounds more like if I say "you're not being helpful, please stop" that it will respond "actually I thought about it, I disagree, I'm going to continue doing what I think is helpful".
And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.
I brainstormed some possible answers. This list is a bit long. I'm publishing this comment because it's not worth the half hour to make it concise, yet it seems worth trying the exercise before reading the post and possibly others will find it worth seeing my quick attempt.
I think the last two bullets are probably my best guesses. Nonetheless here is my list:
After writing this, I am broadly unclear whether I am showing how deception is still a problem, or showing how other problems still exist if you solve the obvious problems of deception.
Added: Wow, this post is so much richer than my guesses. I think I was on some okay lines, but I suspect it would take like 2 months to 2 years of actively trying before I would be able to write something this detailed. Not to mention that ~50% of the work is knowing which question to ask in the first place, and I did not generate this question.
Added2: The point made in Footnote 3 is pretty similar to my last two bullets.
Well, part of the semantic nuance is that we don't care as much about the coherence theorems that do exist if they will fail to apply to current and future machines
The correct response to learning that some theorems do not apply as much to reality as you thought, surely mustn't be to change language so as to deny those theorems' existence. Insofar as this is what's going on, these are pretty bad norms of language in my opinion.
As part of my work at Lightcone I manage an office space with an application for visiting or becoming a member, and indeed many of these points commonly apply to rejection emails I send to people, especially "Most applications just don’t contain that much information" and "Not all relevant skills show up on paper".
I try to include some similar things to the post in the rejection emails we send. In case it's of interest or you have any thoughts, here's the standard paragraph that I include:
Our application process is fairly lightweight and so I don't think a no is a strong judgment about a person's work. If you end up in the future working on new projects that you think are a good fit for Lightcone Offices, you're welcome to apply again. Also if you're ever collaborating on a project with a member of the Lightcone Offices, you can visit with them to work together. Good luck in finding traction on improving the trajectory of human civilization.
Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.
Can you expand on sexual recombinant hill-climbing search vs. gradient descent relative to a loss function, keeping in mind that I'm very weak on my understanding of these kinds of algorithms and you might have to explain exactly why they're different in this way?
It's about the size of the information bottleneck. [followed by a 6 paragraph explanation]
It's sections like this that show me how many levels above me Eliezer is. When I read Scott's question I thought "I can see that these two algorithms are quite different but I don't have a good answer for how they're different", and then Eliezer not only had an answer, but a fully fleshed out mechanistic model of the crucial differences between the two that he could immediately explain clearly, succinctly, and persuasively, in 6 paragraphs. And he only spent 4 minutes writing it.
Curated! This is a good description of a self-contained problem for a general class of algorithms that aim to train aligned and useful ML systems, and you've put a bunch of work put into explaining reasons why it may be hard, with a clear and well-defined example for conveying the problem (i.e. that Carmichael numbers fool Fermi's Primality Test).
The fun bit for me is talking about how if this problem goes one way (where we cannot efficiently distinguish different mechanisms) this invalidates many prior ideas, and if it doesn't then we can be more optimistic that we're close to a good alignment algorithm, but you're honestly not sure! (You give it a 20% chance of success.) And you also go through a list of next-steps if it doesn't work out. Great contribution.
I am tempted to say something about how the writing seems to me much clearer than previous years of your writing, but I think this is also in part due to me (a) understanding what you are trying to do better and (b) having stronger basic intuitions for thinking about machine learning models. Still, I think the writing is notably clearer, which is another reason to curate.
Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's the best way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and gets detailed on just one idea for 10x the space thus communicating less of the big picture research goal). I feel I can understand all the arguments made in this post. I think this should be mandatory reading before reading Eliciting Latent Knowledge.
Overview of why:
As with a number of other posts I have reviewed and given a +9 to, I just realized that I already wrote a curation notice back when it came out. I should check just how predictive my curation notices are (and the false-positive rate), it's interesting if I knew most of my favorite posts the moment they came out.
In case it's of any interest, I'll mention that when I "pump this intuition", I find myself thinking it essentially impossible to expect we could ever build a general agent that didn't notice that the world was round, and I'm unsure why (if I recall correctly) I sometimes I read Nate or Eliezer write that they think it's quite doable in-principle, just much harder than the effort we'll be giving it.
This perspective leaves me inclined to think that we ought to only build very narrow intelligences and give up on general ones, rather than attempt to build a fully general intelligence but with a bunch of reliably self-incorrecting beliefs about the existence or usefulness of deception (and/or other things).
(I say this in case perhaps Nate has a succinct and motivating explanation of why he thinks a solution does exist and is not actually that impossibly difficult to find in theory, even while humans-on-earth may never do so.)