Ben Pace

I'm an admin of this site; I work full-time on trying to help people on LessWrong refine the art of human rationality. (Longer bio.)


I don't want to double the comment count I submit to Recent Discussion, so I'll just update this comment with the things I've cut.

12/06/2023 Comment on Originality vs. Correctness

It's fun to take the wins of one culture and apply them to the other; people are very shocked that you found some hidden value to be had (though it often isn't competitive value / legible to the culture). And if you manage to avoid some terrible decision, people speak about how wise you are to have noticed.

(Those are the best cases, often of course people are like "this is odd, I'm going to pretend I didn't see this" and then move on.)

For too long, I have erred on the side of writing too much. 

The first reason I write is in order to find out what I think.

This often leaves my writing long and not very defensible.

However, editing the whole thing is so much extra work after I already did all the work figuring out what I think.

Sometimes it goes well if I just scrap the whole thing and concisely write my conclusion.

But typically I don't want to spend the marginal time.

Another reason my writing is too long is because I have extra thoughts I know most people won't find useful. 

But I've picked up a heuristic that says it's good to share actual thinking because sometimes some people find it surprisingly useful, so I hit publish anyway.

Nonetheless, I endeavor to write shorter.

So I think I shall experiment with cutting the bits off of comments that represent me thinking aloud, but aren't worth the space in the local conversation.

And I will put them here, as the dregs of my cognition. I shall hopefully gather data over the next month or two and find out whether they are in fact worthwhile.

I just gave this a re-read, I forgot what a trip it is to read the thoughts of Eliezer Yudkowsky. It continues to be some of my favorite stuff in recent years written on LessWrong.

It's hard to relate to the world with the level of mastery over basic ideas that Eliezer has. I don't mean by this to vouch that his perspective is certainly correct, but I believe it is at least possible, and so I think he aspires to a knowledge of reality that I rarely if ever aspire to. Reading it inspires me to really think about how the world works, and really figure out what I know and what I don't. +9

(And the smart people dialoguing with him here are good sports for keeping up their side of the argument.)

They are not being treated worse than foot soldiers, because they do not have an enemy army attempting to murder them during the job. (Unless 'foot soldiers' is itself more commonly used as a metaphor for 'grunt work' and I'm not aware of that.)

I am surprised to see the Open Philanthropy network taking all of the powerful roles here.

The initial Trustees are:

In case it's not apparent:

  • Jason Matheny started an org with over $40M of OpenPhil funding (link to the biggest grant)
  • Kanika Bahl runs Evidence Action, an org that OpenPhil has donated over $100M to over the past 7 years
  • Neil Buddy Shah was the Managing Director at GiveWell for 2.5 years. GiveWell is the organization that OpenPhil grew out of between 2011 and 2014, and the two continued to maintain close ties over the following decade (e.g. sharing an office for ~5 more years, deferring substantially to GiveWell recommendations for somewhere in the neighborhood of $500M of Global Health grants, etc.).
  • Paul Christiano is technically the least institutionally connected to OpenPhil. Most commenters here are quite aware of Paul's relationship to Open Philanthropy, but to state some obvious legible things for those who are not: Paul is married to Ajeya Cotra, who leads the AI funding at OpenPhil; he has been a technical advisor to OpenPhil in some capacity since 2012 (see him mentioned in the following pages: 1, 2, 3, 4, 5, 6); and Holden Karnofsky has collaborated with / advised Paul's organization ARC (collaboration mentioned here and shown as an advisor here).
  • Zach Robinson worked at OpenPhil for 3 years, over half of which was spent as Chief of Staff.

And, as a reminder for those who've forgotten, Holden Karnofsky's wife Daniela Amodei is President of Anthropic (and so Holden is the brother-in-law of Dario Amodei).

I think the argument here basically implies that language models will not, in the next 3 years, produce any novel, useful concepts that get substantial adoption (e.g. >10% of people in an industry using the concept, or a widely cited paper) in any existing industry or research field, and that if a model did do this, then the end would be nigh (or much nigher).

To be clear, you might get new concepts from language models about language if you nail some Chris Olah style transparency work, but the language model itself will not output ones that aren't about language in the text.

It's... possible this is actually the single best example of a public doublecrux writeup that I know of?

This sentence was confusing to me given that the post does not mention 'double crux', but I mentioned it to someone and they said to think of it as the mental motion and not the explicit format, and that makes more sense to me.

And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.

In case it's of any interest, I'll mention that when I "pump this intuition", I find myself thinking it essentially impossible that we could ever build a general agent that didn't notice that the world was round, and I'm unsure why (if I recall correctly) I sometimes read Nate or Eliezer write that they think it's quite doable in principle, just much harder than the effort we'll be giving it.

This perspective leaves me inclined to think that we ought to only build very narrow intelligences and give up on general ones, rather than attempt to build a fully general intelligence but with a bunch of reliably self-incorrecting beliefs about the existence or usefulness of deception (and/or other things).

(I say this in case perhaps Nate has a succinct and motivating explanation of why he thinks a solution does exist and is not actually that impossibly difficult to find in theory, even while humans-on-earth may never do so.)

  • the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

I am naively more scared about such an AI. That AI sounds more like if I say "you're not being helpful, please stop" that it will respond "actually I thought about it, I disagree, I'm going to continue doing what I think is helpful".

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.

I brainstormed some possible answers. This list is a bit long. I'm publishing this comment because it's not worth the half hour to make it concise, yet it seemed worth trying the exercise before reading the post, and possibly others will find it worth seeing my quick attempt.

I think the last two bullets are probably my best guesses. Nonetheless here is my list:

  • Just because an AI isn't consciously deceptive, doesn't mean it won't deceive you, and doesn't mean it won't be adversarial against you. There are many types of goodhart, and many types of adversarial behavior.
  • It might have a heuristic to gather resources for itself, and it's not even illegible, it's not adversarial, and it's not deceptive, and then someday that impulse kills you.
  • There is the boring problem of "the AI just stops working", because it's turning down its human-modeling component generally, or because it has to do human modeling and so training it not to do deception is super duper expensive, because you have to repeatedly train against loads and loads and loads of specific edge cases where thinking about humans turns into deception.
  • The AI stops thinking deceptive thoughts about humans, but still does catastrophic things. For example, an AI thinking about nanotech, may still build nanobots that kill everyone, and you just weren't smart enough to train it not to / ask the right questions.
  • The AI does things you just don't understand. For example it manipulates the market in strange ways but at the end your profits go up, so you let it go, even though it's not doing anything deceptive. Just because it's not understandably adversarial doesn't mean it isn't doing adversarial action. "What are you doing?" "I'm gathering resources for the company's profits to go up." "Are you lying to me right now?" "No, I'm making the company's profits go up." "How does this work?" "I can't explain, too complicated." "...well, my profits are going up, so alright then."
  • Like humans for whom deception is punished, it may simply self-deceive, in ways that aren't conscious.
  • I think there's a broad class of "just because code isn't consciously deceptive, doesn't mean it isn't adversarial, and doesn't mean you won't be deceived by it".
  • A little bit of code that reads <after power level == 1 million execute this other bit of code> doesn't involve any human modeling at all, and it still could kill you. 
  • For instance, if the AI thinks "If I ever get enough power to do more training runs on myself, do so, this will probably help me raise profits for the human" then you're just dead because it will not pro-actively train deception out of itself. Like, it has to notice in the first place that it might become deceptive in that situation, which is a super out-of-distribution thing to think. It has to build a whole model of reflection and cognition and being adversarial to notice that this isn't something humans want.
  • I think there's a general problem where once we have superintelligences taking agentic action, to make sure they don't screw up (e.g. training themselves, training new agents, etc) they actually have to build a whole model of the alignment problem themselves, and maybe even solve it, in order to themselves continue to count as 'aligned', which is way way more complex than just training out legibly deceptive thoughts. Making sure an AI does not later become deceptive via some reasonable agentic action requires it to model the alignment problem in some detail.

After writing this, I am broadly unclear whether I am showing how deception is still a problem, or showing how other problems still exist if you solve the obvious problems of deception.

Added: Wow, this post is so much richer than my guesses. I think I was on some okay lines, but I suspect it would take like 2 months to 2 years of actively trying before I would be able to write something this detailed. Not to mention that ~50% of the work is knowing which question to ask in the first place, and I did not generate this question.

Added2: The point made in Footnote 3 is pretty similar to my last two bullets.
