Ben Pace

I'm an admin of this site; I work full-time on trying to help people on LessWrong refine the art of human rationality.

This sentence is an exception, but there aren't a lot of naturally occurring examples.

No strong claim either way, but as a datapoint I do somewhat often use the phrase "I hereby invite you to <event>" or "I hereby <request> something of you" to help move from 'describing the world' to 'issuing an invitation/command/etc'.

Thanks for the link; I'll aim to give that podcast a listen, as it's relevant to a bunch of my current thinking.

I quite appreciate that you're up front about the distinction between "we shipped things we like" and "we made progress on alignment". Some of the stuff I've read (janus's posts, infohazard policy) has been pretty thoughtful (even while I disagreed with some of it) and I was definitely worried that the work would be much worse. I recall my main recommendation to you when you visited the Bay (and my offices) was "publish your ideas so I and others can evaluate them", and I'm definitely glad you've done a bunch of that. And I agree that you can do even more :)

You publishing products sounds potentially like the beginning of the end, though I'm not sure. Nobody should really be contributing to AI hype or proving how profitable ML can be. Maybe there are ways to do it ethically if your product line isn't branded as ML at all and doesn't have the potential to revolutionize a whole field (e.g. kickstarting dozens of competitors over the next two decades and creating tens of billions of dollars), just a small market that you can own and that gets you the funding you need and no more. Also, it seems plausible it will take something like 3-5 years of focused effort to become clearly profitable, killing your organizational focus on alignment for that period. I am not actually sure what I would do here, if rich folks like OpenPhil aren't into funding you for whatever reason (I don't know).

I'll mention here that, from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like "come on in, the water is fine; don't worry, you won't end up with people criticizing you for maybe ending civilization, or for self-deceiving along the way, or calling you unethical". While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trades, I care more about being frank and honest in my interactions, and a route must be found where communicating such things (insofar as that's what someone believes) won't destroy or end the coordination/trade agreement. If you suspect you are the only people in a position to speak the truth about what is going on, you must not trade that away, or else nobody will be able to say what is going on, and any remaining hope of successfully reacting to what is actually happening is likely lost.

In order to signal that I am not trying to just be critical, I will end on a true positive note, which is that overall, from my current vantage point, your team has been far more competent and focused on alignment than I expected. It's like my... 88th percentile outcome, in that I assigned 87% to you being less competent on the alignment problem. I am positively surprised :)

Curated. All of these examples together point quite clearly at a change in how language models behave when they're trained with RLHF, away from the "accurately predict text" story and toward something else with a very different set of biases. I am interested to read your potential follow-up with your own hypotheses. Plus, the post is really fun to read.

I think this post and your prior post both overstate the case in some ways, but they're still great additions to my and I expect many others' thinking on this subject. I broadly feel like I've been 'needing' posts like these on LW about current ML projects, giving a grounded conceptual account of how to think about what's happening, and nobody else has been writing them, so I'm very grateful.

Does GPT-3 have no idea what letters look like?

I think there's an implication in this section that davinci will accurately describe what letters look like, or at least much more commonly/accurately than the false answers from text-davinci-002. Anybody know if that's true?

Added: I just tried, but couldn't get it to try to answer the question, it would just give more questions (completing it as though my prompt was but one item on a questionnaire).

A brain-like AGI - modeled after our one working example of efficient general intelligence - would naturally have an interpretable inner monologue we could monitor.

This doesn’t have much to do with whether a mind is understandable. Most of my cognition is not found in the verbal transcript of my inner monologue, partly because I'm not that verbal a thinker, but mostly because most of my cognition happens in my nonverbal System 1.

I was trying to figure out whether someone who is only here for the month of November should apply. I think the answer is no, but I'm broadly a bit confused about what period this is a commitment for.

Also, are people going through as cohorts or will they start with the training week whenever they show up, not necessarily in-sync with anyone else? 

Also, is the idea to be doing self-directed research by default, or research in collaboration with Redwood staff by default? I don't know what my default action is day-to-day during this program. Do I have to come in with a bunch of research plans already?

Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.

Hm, no strong hunches here. Bad ideas babble:

  • It may somehow learn about the world I'm in, learn I'm in a bad negotiation position (e.g. because my rival AI company is about to release their paperclip maximizer), and precommit to only giving me at most 0.00001% of the universe, a bad deal that I will grudgingly accept.
  • I mean, I don't know if this counts, but perhaps you've only understood it well enough to legibly understand that it will trade with you given certain constraints, but if its ontology shifts, or other universes become accessible via acausal trade, or even if the trade it gives you is N galaxies and then later on much more of the universe becomes available... what I'm saying is that there's many ways to mess up this trade in the details.
  • It may have designed itself to avoid thinking about something that it can use to its advantage later, such as other copies of itself or other agents, such that it will build paperclip maximizers later, and then they will kill it and just optimize the universe for paperclips. (This is similar to the previous bullet point.)
  • I guess my other thought is forms of 'hackability' that aren't the central case of being hacked, but the fact is that I'm a human which is more like a "mess" than it is like a "clean agent" and so sometimes I will make trades that at other times I would not make, and it will make a trade that at the time I like but does not represent my CEV at all. Like, I have to figure out what I actually want to trade with it. Probably this is easy but quite possibly I would mess this up extremely badly (e.g. if I picked hedonium).

My money is on roughly the first idea being what Nate will talk about next: that it is simply a better negotiator than me even with no communication, because I'm otherwise in a bad position.

  • Like, if I have no time-pressure, then I get to just wait until I've done more friendly AI research, and I needn't let this paperclip maximizer out of the box. But if I do have time pressure, then that's a worse negotiation position on my end, and all paperclippers I invent can each notice this and all agree with each other to only offer a certain minimum amount of value.
  • I do note that in a competitive market, many buyers raise the price, and if I'm repeatedly able to re-roll on who I've got in the box (roll one is a paperclipper, roll two is a diamond maximizer, roll three is a smiley-face maximizer, etc.), they have some reason to outbid each other in how much of the universe I get, and potentially I can get the upper hand. But if they're superintelligences, there's likely some Schelling fence they can calculate mathematically that they all hit on.
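The bidding intuition in that last bullet can be sketched as a toy ascending auction, where each boxed AI drops out once the price exceeds its valuation and the winner pays roughly the second-highest valuation. All numbers here are made up purely for illustration:

```python
def winning_price(valuations, increment=1):
    """Toy open ascending auction: bidders stay in while the price
    is at or below their valuation; the price stops rising once at
    most one bidder remains."""
    price = 0
    active = list(valuations)
    while len(active) > 1:
        price += increment
        active = [v for v in active if v >= price]
    return price

# One boxed AI: no competition, so the price never rises above my
# reservation point (here, 0).
print(winning_price([10]))        # 0

# Three boxed AIs with different valuations: competition drives the
# price up to just past the second-highest valuation.
print(winning_price([10, 7, 3]))  # 8
```

The point of the sketch is just that the re-roll option matters: with one counterparty I capture nothing, while with several I capture up to the runner-up's valuation, unless they coordinate on a floor as in the previous bullet.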

K, I will stop rambling now.

Individuals who shape the world are often those who have ended up being optimizers.

It sounds like you find that claim disturbing, but I don't think it's all bad.

I'm interested in a clearer sense of what mistake you think people are making, because I think caring about something strongly enough to change who you are around it can be a very positive force in the world.
