Ben Pace — AI Alignment Forum

I'm an admin of LessWrong. Here are a few things about me.

I generally feel more hopeful about a situation when I understand it better.
I have signed no contracts nor made any agreements whose existence I cannot mention.
I believe it is good to take responsibility for accurately and honestly informing people of what you believe in all conversations; and also good to cultivate an active recklessness for the social consequences of doing so.
It is wrong to directly cause the end of the world. Even if you are fatalistic about what is going to happen.

Randomly: If you ever want to talk to me about anything you like for an hour, I am happy to be paid $1k for an hour of doing that.

(Longer bio.)

I think the main thing I want to convey is that I think you're saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic, but what I'm actually saying that their integrity is no match for the forces that they are being tested with.

I don't need to be able to predict a lot of fine details about individuals' decision-making in order to be able to have good estimates of these two quantities, and comparing them is the second-most question relating to whether it's good to work on capabilities at Anthropic. (The first one is a basic ethical question about working on a potentially extinction-causing technology that is not much related to the details of which capabilities company you're working on.)

What's an example decision or two where you would want to ask yourself whether they should get more or less open-ended power? I'm not sure what you're thinking of.

Not the main thrust of the thread, but for what it's worth, I find it somewhat anti-helpful to flatten things into a single variable of "how much you trust Anthropic leadership to make decisions which are good from your perspective", and then ask how optimistic/pessimistic you are about this variable.

I think I am much more optimistic about Anthropic leadership on many axis relative to an overall survey of the US population or Western population – I expect them to be more libertarian, more in favor of free speech, more pro economic growth, more literate, more self-aware, higher IQ, and a bunch of things.

I am more pessimistic about their ability to withstand the pressures of a trillion dollar industry to shape their incentives than the people who are at Anthropic.

I believe the people working there are siloing themselves intellectually into an institution facing incredible financial incentives for certain bottom lines like "rapid AI progress is inevitable" and "it's reasonably likely we can solve alignment" and "beating China in the race is a top priority", and aren't allowed to talk to outsiders about most details of their work, and this is a key reason that I expect them to screw up their decision-making.

I am optimistic about their relative-ability to have a sensible conversation about the next 5 years and what alignment failures look like, relative to most people on earth. This is not the standard I require to expect people to not do ML training runs that lead to human extinction, but nonetheless I predict they will do relatively quite well on this axis.

I don't have a single variable here, I have a much more complicated model than this. It looks to me that collapsing questions of trust about people or groups into a single varibale of how optimistic I am about them making decisions which are good from my values has been a common question-substitution in the Effective Altruism scene, where I think people have been repeatedly hoodwinked by sociopaths due to not moving toward a more detailed model that predicts exactly where and when someone will make good vs bad decisions.

I think my front-end productivity might be up 3x? A shoggoth helped me building a stripe shop and do a ton of UI design that I would’ve been hesitant to take on myself (without hiring someone else to work with), as well as quality increase in speed of churning through front-end designs.

(This is going from “wouldn’t take on the project due to low skill” to “can take it on and deliver it in a reasonable amount of time”, which is different from “takes top programmer and speeds them up 3x”.)

Something a little different: Today I turn 28. If you might be open to do something nice for me for my birthday, I would like to request the gift of data. I have made a 2-4 min anonymous survey about me as a person, and if you have a distinct sense of me as a person (even just from reading my LW posts/comments) I would greatly appreciate you filling it out and letting me know how you see me!

Here's the survey.

It's an anonymous survey where you rate me on lots of attributes like "anxious", "honorable", "wise" and more. All multiple-choice. Two years ago I also shared a birthday survey amongst people who know me and ~70 people filled it out, and I learned a lot from it. I am very excited to see how the perception of me amongst the people I know has *changed*, and also to find out how people on LessWrong see me, so the core of this survey is ~20 of the same attributes.

In return for your kind gift, if you complete it, you get to see the aggregate ratings of me from last time!

This survey helps me understand how people see me, and recognize my blindspots, and I'm very grateful to anyone who takes a few mins to complete it. Two people completed it already and it took them 2 mins and 4 mins to complete it. (There are many further optional questions but it says clearly when the main bit is complete.)

I of course intend to publish the (aggregate) data in a LW post and talk about what I've learned from it :-)

Further detail on this: Cotra has more recently updated at least 5x against her original 2020 model in the direction of faster timelines.

Greenblatt writes:

Here are my predictions for this outcome:
25th percentile: 2 year (Jan 2027)
50th percentile: 5 year (Jan 2030)

Cotra replies:

My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th)

This means 25th percentile for 2028 and 50th percentile for 2031-2.

The original 2020 model assigns 5.23% by 2028 and 9.13% | 10.64% by 2031 | 2032 respectively. Each time a factor of ~5x.

However, the original model predicted the date by which it was affordable to train a transformative AI model. This is a leading a variable on such a model actually being built and trained, pushing back the date by some further number of years, so view the 5x as bounding, not pinpointing, the AI timelines update Cotra has made.

Third, we gave the model a secret “scratchpad”—a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do (as far as the model was aware, the scratchpad was invisible to users and to Anthropic). This allowed us to observe whether the model planned to fake alignment, and how it would go about doing so.

This feels like amusingly like tricking a child. "Remember kiddo, you can reason out loud about where you're going to hide and I won't hear it. Now let's play hide and seek!"

I don't know how to quickly convey why I find this point so helpful, but I find this to be a helpful pointer to a key problem, and the post is quite short, and I hope someone else positively votes on it. +4.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments