Ben Pace

I'm an admin of this site; I work full-time on trying to help people on LessWrong refine the art of human rationality.



AI Alignment Writing Day 2019
AI Alignment Writing Day 2018



And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.

In case it's of any interest, I'll mention that when I "pump this intuition", I find myself thinking it essentially impossible to expect we could ever build a general agent that didn't notice that the world was round, and I'm unsure why (if I recall correctly) I sometimes read Nate or Eliezer write that they think it's quite doable in principle, just much harder than the effort we'll be giving it.

This perspective leaves me inclined to think that we ought to only build very narrow intelligences and give up on general ones, rather than attempt to build a fully general intelligence but with a bunch of reliably self-incorrecting beliefs about the existence or usefulness of deception (and/or other things).

(I say this in case perhaps Nate has a succinct and motivating explanation of why he thinks a solution does exist and is not actually that impossibly difficult to find in theory, even while humans-on-earth may never do so.)

  • the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

I am naively more scared of such an AI. It sounds like an AI that, if I say "you're not being helpful, please stop", will respond "actually I thought about it, I disagree, I'm going to continue doing what I think is helpful".

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.

I brainstormed some possible answers. This list is a bit long. I'm publishing this comment because it's not worth the half hour to make it concise, yet it seems worth trying the exercise before reading the post and possibly others will find it worth seeing my quick attempt.

I think the last two bullets are probably my best guesses. Nonetheless here is my list:

  • Just because an AI isn't consciously deceptive, doesn't mean it won't deceive you, and doesn't mean it won't be adversarial against you. There are many types of goodhart, and many types of adversarial behavior.
  • It might have a heuristic to gather resources for itself, and it's not even illegible, it's not adversarial, and it's not deceptive, and then someday that impulse kills you.
  • There is the boring problem of "the AI just stops working", either because it's turning down its human-modeling component generally, or because it has to do human modeling and so training it not to do deception is super duper expensive: you have to repeatedly train against loads and loads of specific edge cases where thinking about humans turns into deception.
  • The AI stops thinking deceptive thoughts about humans, but still does catastrophic things. For example, an AI thinking about nanotech may still build nanobots that kill everyone, and you just weren't smart enough to train it not to / ask the right questions.
  • The AI does things you just don't understand. For example it manipulates the market in strange ways but at the end your profits go up, so you let it go, even though it's not doing anything deceptive. Just because it's not understandably adversarial doesn't mean it isn't taking adversarial actions. "What are you doing?" "I'm gathering resources for the company's profits to go up." "Are you lying to me right now?" "No, I'm making the company's profits go up." "How does this work?" "I can't explain, too complicated." "...well, my profits are going up, so alright then."
  • Like humans for whom deception is punished, it may simply self-deceive, in ways that aren't conscious.
  • I think there's a broad class of "just because code isn't consciously deceptive, doesn't mean it isn't adversarial, and doesn't mean you won't be deceived by it".
  • A little bit of code that reads <after power level == 1 million execute this other bit of code> doesn't involve any human modeling at all, and it still could kill you. 
  • For instance, if the AI thinks "If I ever get enough power to do more training runs on myself, do so, this will probably help me raise profits for the human" then you're just dead because it will not pro-actively train deception out of itself. Like, it has to notice in the first place that it might become deceptive in that situation, which is a super out-of-distribution thing to think. It has to build a whole model of reflection and cognition and being adversarial to notice that this isn't something humans want.
  • I think there's a general problem where once we have superintelligences taking agentic action, to make sure they don't screw up (e.g. training themselves, training new agents, etc) they actually have to build a whole model of the alignment problem themselves, and maybe even solve it, in order to themselves continue to count as 'aligned', which is way way more complex than just training out legibly deceptive thoughts. Making sure an AI does not later become deceptive via some reasonable agentic action requires it to model the alignment problem in some detail.

After writing this, I am broadly unclear whether I am showing how deception is still a problem, or showing how other problems still exist if you solve the obvious problems of deception.

Added: Wow, this post is so much richer than my guesses. I think I was on some okay lines, but I suspect it would take like 2 months to 2 years of actively trying before I would be able to write something this detailed. Not to mention that ~50% of the work is knowing which question to ask in the first place, and I did not generate this question.

Added2: The point made in Footnote 3 is pretty similar to my last two bullets.

Well, part of the semantic nuance is that we don't care as much about the coherence theorems that do exist if they will fail to apply to current and future machines

The correct response to learning that some theorems do not apply as much to reality as you thought, surely mustn't be to change language so as to deny those theorems' existence. Insofar as this is what's going on, these are pretty bad norms of language in my opinion.

As part of my work at Lightcone I manage an office space with an application for visiting or becoming a member, and indeed many of these points commonly apply to rejection emails I send to people, especially "Most applications just don’t contain that much information" and "Not all relevant skills show up on paper".

I try to include some similar things to the post in the rejection emails we send. In case it's of interest or you have any thoughts, here's the standard paragraph that I include:

Our application process is fairly lightweight and so I don't think a no is a strong judgment about a person's work. If you end up in the future working on new projects that you think are a good fit for Lightcone Offices, you're welcome to apply again. Also if you're ever collaborating on a project with a member of the Lightcone Offices, you can visit with them to work together. Good luck in finding traction on improving the trajectory of human civilization.

Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.


Can you expand on sexual recombinant hill-climbing search vs. gradient descent relative to a loss function, keeping in mind that I'm very weak on my understanding of these kinds of algorithms and you might have to explain exactly why they're different in this way?


It's about the size of the information bottleneck. [followed by a 6 paragraph explanation]

It's sections like this that show me how many levels above me Eliezer is. When I read Scott's question I thought "I can see that these two algorithms are quite different but I don't have a good answer for how they're different", and then Eliezer not only had an answer, but a fully fleshed out mechanistic model of the crucial differences between the two that he could immediately explain clearly, succinctly, and persuasively, in 6 paragraphs. And he only spent 4 minutes writing it.

Curated! This is a good description of a self-contained problem for a general class of algorithms that aim to train aligned and useful ML systems, and you've put a bunch of work into explaining reasons why it may be hard, with a clear and well-defined example for conveying the problem (i.e. that Carmichael numbers fool Fermat's Primality Test).
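For anyone who wants to see the Carmichael example concretely, here is a minimal sketch (my own illustration, not code from the post). The Fermat test accepts n for a base a when a^(n-1) ≡ 1 (mod n); a Carmichael number is a composite that passes for every base coprime to it. The function names `is_composite`, `passes_fermat` and `fools_fermat` are hypothetical names chosen for this sketch.

```python
from math import gcd


def is_composite(n: int) -> bool:
    """Ground truth by trial division: True if n has a nontrivial factor."""
    if n < 4:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return True
        d += 1
    return False


def passes_fermat(n: int, a: int) -> bool:
    """Fermat test for a single base a: does a^(n-1) ≡ 1 (mod n) hold?"""
    return pow(a, n - 1, n) == 1


def fools_fermat(n: int) -> bool:
    """True if n is composite yet passes the Fermat test for every
    base in [2, n-1] that is coprime to n (i.e. n is a Carmichael number)."""
    if not is_composite(n):
        return False
    return all(passes_fermat(n, a) for a in range(2, n) if gcd(a, n) == 1)


# The composites below 2000 that the test cannot distinguish from primes
# (using only bases coprime to n).
carmichaels = [n for n in range(2, 2000) if fools_fermat(n)]
print(carmichaels)
```

The point of the analogy, as I read it, is that the test's output is the same for two different underlying mechanisms (n is prime vs. n is a Carmichael number), so looking only at the output cannot tell them apart.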

The fun bit for me is talking about how if this problem goes one way (where we cannot efficiently distinguish different mechanisms) this invalidates many prior ideas, and if it doesn't then we can be more optimistic that we're close to a good alignment algorithm, but you're honestly not sure! (You give it a 20% chance of success.) And you also go through a list of next-steps if it doesn't work out. Great contribution.

I am tempted to say something about how the writing seems to me much clearer than previous years of your writing, but I think this is also in part due to me (a) understanding what you are trying to do better and (b) having stronger basic intuitions for thinking about machine learning models. Still, I think the writing is notably clearer, which is another reason to curate.

Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's a better way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and gets detailed on just one idea for 10x the space, thus communicating less of the big-picture research goal). I feel I can understand all the arguments made in this post. I think this should be mandatory reading before reading Eliciting Latent Knowledge.

Overview of why:

  • The motivations behind most of the proposals Paul has spent a lot of time on (iterated amplification, imitative generalization) are explained clearly and succinctly.
  • For a quick summary, this involves 
    • A proposal for useful ML-systems designed with human feedback
    • An argument that the human-feedback ML-systems will have flaws that kill you
    • A proposal for using ML assistants to debug the original ML system
    • An argument that the ML assistants will not be able to understand the original human-feedback ML-systems
    • A proposal for training the human-feedback ML-systems in a way that requires understandability
    • An argument that this proposal is uncompetitive
    • ??? (I believe the proposals in the ELK document are the next step here)
  • A key problem when evaluating very high-IQ, impressive, technical work is that it is unclear which parts of the work you do not understand because you do not understand an abstract technical concept, and which parts are simply judgment calls based on the originator of the idea. This post shows very clearly which is which: many of the examples and discussions are technical, but the standards for "plausible failure story" and "sufficiently precise algorithm" and "sufficiently doomed" are all judgment calls, as are the proposed solutions. I'm not even sure I get on the bus at step 1, that the right next step is to consider ML assistants. It seems plausible this is already begging the question of how to have aligned assistants, and maybe the whole system will recurse later, but that's not something I'd personally ever gamble on. I still find the ensuing line of attack interesting.
  • The post has a Q&A addressing natural and common questions, which is super helpful.

As with a number of other posts I have reviewed and given a +9 to, I just realized that I already wrote a curation notice back when it came out. I should check just how predictive my curation notices are (and the false-positive rate); it would be interesting if I knew most of my favorite posts the moment they came out.
