Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

by Andreas Stuhlmüller, 21st Jul 2020

EDIT: The competition is now closed, thanks to everyone who participated! Rohin’s posterior distribution is here, and winners are in this comment.

In this competition, we (Ought) want to amplify Rohin Shah’s forecast for the question: When will a majority of AGI researchers agree with safety concerns? Rohin has provided a prior distribution based on what he currently believes, and we want others to:

  1. Try to update Rohin’s thinking via comments (for example, comments including reasoning, distributions, and information sources). If you don’t want your comment to be considered for the competition, label it ‘aside’.
  2. Predict what his posterior distribution for the question will be after he has read all the comments and reasoning in this thread.

The competition will close on Friday, July 31st. To participate in this competition, create your prediction on Elicit, click ‘Save Snapshot to URL,’ and post the snapshot link in a comment on this post. You can provide your reasoning in the ‘Notes’ section of Elicit or in your LessWrong comment. You should have a low bar for making predictions – they don’t have to be perfect.

Here is Rohin’s prior distribution on the question. His reasoning for the prior is in this comment. Rohin spent ~30 minutes creating this distribution based on the beliefs and evidence he already has. He will spend 2-5 hours generating a posterior distribution.

Click here to create your distribution

We will award two $200 prizes, in the form of Amazon gift cards:

  1. Most accurate prediction: We will award $200 to the most accurate prediction of Rohin’s posterior distribution submitted through an Elicit snapshot. This will be determined by estimating KL divergence between Rohin’s final distribution and others’ distributions. If you post more than one snapshot, either your most recent snapshot or the one you identify as your final submission will be evaluated.
  2. Update to thinking: Rohin will rank each comment from 0 to 5 depending on how much the reasoning updated his thinking. We will randomly select one comment in proportion to how many points are assigned (so, a comment rated 5 would be 5 times more likely to receive the prize than a comment rated 1), and the poster of this comment will receive the $200 prize.
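As a rough illustration of the first prize's scoring, the sketch below computes a discrete KL divergence between binned distributions and picks the closest submission. It is not Ought's actual scoring code: the bin layout, the smoothing constant, the direction of the divergence (Rohin's posterior as the reference distribution), and all the numbers are assumptions made for the example.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Discrete KL(p || q) in nats over matching bins.

    eps is an assumed smoothing constant that avoids log(0) on empty bins;
    both inputs are renormalized after smoothing.
    """
    if len(p) != len(q):
        raise ValueError("distributions must use the same bins")
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_total, q_total = sum(p_s), sum(q_s)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p_s, q_s)
    )

# Hypothetical probability mass per bin (e.g. four coarse date ranges).
rohin_posterior = [0.05, 0.30, 0.25, 0.40]
submissions = {
    "submission_a": [0.01, 0.34, 0.30, 0.35],
    "submission_b": [0.10, 0.40, 0.30, 0.20],
}

# The most accurate prediction is the one with the smallest divergence.
best = min(submissions, key=lambda name: kl_divergence(rohin_posterior, submissions[name]))
print(best)
```

KL divergence is asymmetric, so swapping the two arguments could change the ranking. The post does not say which direction was used.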

Motivation

This project is similar in spirit to amplifying epistemic spot checks and other work on scaling up individual judgment through crowdsourcing. As in these projects, we’re hoping to learn about mechanisms for delegating reasoning, this time in the forecasting domain.

The objective is to learn whether mechanisms like this could save people like Rohin work. Rohin wants to know: What would I think if I had more evidence and knew more arguments than I currently do, but still followed the sorts of reasoning principles that I'm unlikely to revise in the course of a comment thread? In real-life applications of amplified forecasting, Rohin would only evaluate the arguments in-depth and form his own posterior distribution 1 out of 10 times. 9 out of 10 times he’d just skim the key arguments and adopt the predicted posterior as his new view.

Question specification

The question is: When will a majority of AGI researchers agree with safety concerns?

Suppose that every year I (Rohin) talk to every top AI researcher about safety (I'm not explaining safety, I'm simply getting their beliefs, perhaps guiding the conversation to the safety concerns in the alignment community). After talking to X, I evaluate:

  1. (Yes / No) Is X's work related to AGI? (AGI safety counts)
  2. (Yes / No) Does X broadly understand the main concerns of the safety community?
  3. (Yes / No) Does X agree that there is at least one concern such that we have not yet solved it and we should not build superintelligent AGI until we do solve it?

I then compute the fraction #answers(Yes, Yes, Yes) / #answers(Yes, *, *) (i.e. the proportion of AGI-related top researchers who are aware of safety concerns and think we shouldn't build superintelligent AGI before solving them). In how many years will this fraction be >= 0.5?
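A minimal sketch of this computation, assuming each researcher's three answers are recorded as booleans. The `Evaluation` type, the function name, and the sample data are hypothetical and only meant to make the numerator and denominator concrete.

```python
from typing import NamedTuple

class Evaluation(NamedTuple):
    agi_related: bool           # Q1: is X's work related to AGI?
    understands_concerns: bool  # Q2: does X broadly understand the main concerns?
    agrees_unsolved: bool       # Q3: does X agree some concern must be solved first?

def agreement_fraction(evaluations):
    """#answers(Yes, Yes, Yes) / #answers(Yes, *, *)."""
    agi = [e for e in evaluations if e.agi_related]
    if not agi:
        return 0.0  # assumption: treat an empty denominator as 0
    yes_yes_yes = sum(1 for e in agi if e.understands_concerns and e.agrees_unsolved)
    return yes_yes_yes / len(agi)

# Hypothetical survey results for five researchers.
sample = [
    Evaluation(True, True, True),
    Evaluation(True, True, False),
    Evaluation(True, False, False),
    Evaluation(True, True, True),
    Evaluation(False, True, True),  # not AGI-related, so excluded from both counts
]

fraction = agreement_fraction(sample)
print(fraction, fraction >= 0.5)
```

Researchers whose work is not AGI-related drop out of both the numerator and the denominator, which is why the fifth entry does not affect the result.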

For reference, if I were to run this evaluation now, I would be looking for an understanding of reward gaming, instrumental convergence, and the challenges of value learning, but would not be looking for an understanding of wireheading (because I'm not convinced it's a problem we need to worry about) or inner alignment (because the safety community hasn't converged on the importance of inner alignment).

We'll define the set of top AI researchers somewhat arbitrarily as the top 1000 AI researchers in industry by salary and the top 1000 AI researchers in academia by citation count.

If the fraction never reaches 0.5 (e.g. before the fraction reaches 0.5, we build superintelligent AGI and it kills us all, or it is perfectly benevolent and everyone realizes there weren't any safety concerns), the question resolves as >2100.

Interpret this reasonably (e.g. a comment to the effect of "your survey will annoy everyone and so they'll be against safety" will be ignored even if true, because it's overfitting to the specific counterfactual survey proposed here and is clearly irrelevant to the spirit of the question).

Additional information

Rohin Shah is an AI Safety researcher at the Center for Human-Compatible AI (CHAI). He also publishes the Alignment Newsletter. Here is a link to his website where you can find more information about his research and views.

You are welcome to share a snapshot distribution of your own beliefs, but make sure to specify that the snapshot contains your own beliefs and not your prediction of Rohin’s beliefs (snapshots of your own beliefs will not be evaluated for the competition).

9 comments

Some information:

I don't intend to engage

I won't be responding to this thread until after the competition ends. If you're not sure about something in the question, assume what (you think) I would assume. Feel free to argue for a specific interpretation in a comment if you think it's underspecified.

Why I chose this question

In any realistic plan that ends with "and then we deployed the AI knowing we had mitigated risk X", it seems to me like we need prestigious AI researchers to have some amount of consensus that X was actually a real risk. If you want a company to use a safer technique, the key decision-makers at the company will need to believe that it's necessary in order to "pay the alignment tax", and their opinions will be shaped by the higher-up AI researchers at the company. If you want a government to regulate AI, you need to convince the government that X is a real risk, and the government will probably defer to prestigious AI experts.

So it seems like an important variable is whether the AI community agrees that X is a real risk. (Also obviously whether in reality X is a real risk, but I'm assuming that it is for now.) It may be fine if it's only the AGI researchers -- if a company knows it can build AGI, it probably ignores the opinions of people who think we can't build AGI.

Hence, this question. While an answer to the explicit question would be interesting to me, I'm more excited to have better models of what influences the answer to the question.

Prior reasoning

Copied over from the Google Doc linked above. Written during the 30 minutes I had to come up with a prior.

Seems very unlikely that if I ran this survey now the fraction would be 0.5. Let's give that 0.1%, which I can effectively ignore.

My timelines for "roughly human-level reasoning" have a median of let's say 20 years; it seems likely we get significant warning shots a few years before human-level reasoning, that lead to increased interest in safety. It's not crazy that we get a warning shot in the next year from e.g. GPT-3, though I'd be pretty surprised, call it ~1%.

My model for how we get to consensus is that there's a warning shot, or some excellent paper, or a series of these sorts of things, that increases the prominence of safety concerns, especially among "influencers"; this then percolates to the general AI community over the next few years. I do think it can percolate quite quickly; see e.g. the fairly large effects of Concrete Problems in AI Safety or how quickly fairness + bias became mainstream. (That's fewer examples than I'd like...) So let's expect 1-10 years after the first warning shot. Given median timelines of 20 years + warning shot shortly before then + 1-10 years to reach consensus, probably the median for this question should also be ~20 years? (Note that even if we build human-level reasoning before a majority is reached, the question could still resolve positively after that, since human-level reasoning != AI researchers are out of a job.)

There's also a decent chance that we don't get a significant warning shot before superintelligent AI (maybe 10% e.g. via fast takeoff or good safety techniques that don't scale to AGI), or that not enough people update on it / consensus building takes forever / the population I chose just doesn't pay attention to safety for some reason (maybe another 15%), so let's say 25% that it never happens. Also another ~10% for AGI / warning shots not happening before 2100, or the safety community disappearing before 2100, etc. So that's 35% on >2100 (which includes never).

So 25% for never leaves 75% for "at some point" of which I argued the median is 20 years, so I should have ~35% from now till 20 years, and 30% from 20 years till 2100 (since it's 10% on >2100 but not never).
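The bookkeeping in the last two paragraphs can be checked with a few lines of arithmetic. The variable names are mine; the percentages are the ones stated above. Note that half of the 75% "at some point" mass would be 37.5%, which the text rounds down to ~35%.

```python
import math

# Probabilities stated in the prior reasoning above.
p_no_warning_shot = 0.10       # no significant warning shot before superintelligent AI
p_no_consensus = 0.15          # people don't update / consensus building takes forever
p_never = p_no_warning_shot + p_no_consensus           # 25%

p_after_2100_not_never = 0.10  # happens, but only after 2100
p_gt_2100 = p_never + p_after_2100_not_never            # 35% on ">2100" (includes never)

p_at_some_point = 1 - p_never                           # 75%
half_of_at_some_point = p_at_some_point / 2             # 37.5% if the median is exactly 20 years

p_within_20y = 0.35            # the rounded figure used in the text
p_20y_to_2100 = 1 - p_within_20y - p_gt_2100            # remaining 30%

assert math.isclose(p_within_20y + p_20y_to_2100 + p_gt_2100, 1.0)
print(round(p_never, 3), round(p_gt_2100, 3), round(half_of_at_some_point, 3), round(p_20y_to_2100, 3))
```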

Then I played around with Elicit until I got a distribution that fit these constraints and looked vaguely lognormal.

Rohin has created his posterior distribution! Key differences from his prior are at the bounds:

  • He now assigns 3% rather than 0.1% to the majority of AGI researchers already agreeing with safety concerns.
  • He now assigns 40% rather than 35% to the majority of AGI researchers agreeing with safety concerns after 2100 or never.

Overall, Rohin’s posterior is a bit more optimistic than his prior and more uncertain.

Ethan Perez’s snapshot wins the prize for the most accurate prediction of Rohin's posterior. Ethan kept a similar distribution shape while decreasing the probability >2100 less than the other submissions.

The prize for a comment that updated Rohin’s thinking goes to Jacob Pfau! This was determined by a draw with comments weighted proportionally to how much they updated Rohin’s thinking.
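For readers curious how a draw like this works, here is a minimal sketch of selecting one comment with probability proportional to its 0 to 5 rating. It is an illustration only, not the procedure Ought actually ran; the commenter names and ratings are hypothetical.

```python
import random

def draw_winner(ratings, rng=None):
    """Pick one commenter with probability proportional to their rating.

    ratings maps commenter -> points (0 to 5); a rating of 0 can never win.
    rng is an optional random.Random instance, useful for reproducible draws.
    """
    if rng is None:
        rng = random.Random()
    commenters = list(ratings)
    weights = [ratings[c] for c in commenters]
    if sum(weights) <= 0:
        raise ValueError("at least one comment needs a positive rating")
    return rng.choices(commenters, weights=weights, k=1)[0]

# Hypothetical ratings.
example_ratings = {"commenter_a": 5, "commenter_b": 1, "commenter_c": 0, "commenter_d": 3}
print(draw_winner(example_ratings))
```

With these ratings, `commenter_a` is five times as likely to be drawn as `commenter_b`, matching the rule described in the prize announcement.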

Thanks to everyone who participated and congratulations to the winners! Feel free to continue making comments and distributions, and sharing any feedback you have on this competition.

I think 1% in the next year and a half is significantly too low.

Firstly, conditioning on AGI researchers makes a pretty big difference. It rules out most mainstream AI researchers, including many of the most prominent ones who get the most media coverage. So I suspect your gut feeling about what people would say isn't taking this sufficiently into account.

Secondly, I think attributing ignorance to the outgroup is a pretty common fallacy, so you should be careful of that. I think a clear majority of AGI researchers are probably familiar with the concept of reward gaming by now, and could talk coherently about AGIs reward gaming, or manipulating humans. Maybe they couldn't give very concrete disaster scenarios, but neither can many of us.

And thirdly, once you get agreement that there are problems, you basically get "we should fix the problems first" for free. I model most AGI researchers as thinking that AGI is far enough away that we can figure out practical ways to prevent these things, like better protocols for giving feedback. So they'll agree that we should do that first, because they think that it'll happen automatically anyway.

These seem like sensible comments to me; I had similar thoughts about the current understanding of things like reward gaming. I’d be curious to see your snapshot.

> I think a clear majority of AGI researchers are probably familiar with the concept of reward gaming by now, and could talk coherently about AGIs reward gaming, or manipulating humans.

Seems plausible, but I specifically asked for reward gaming, instrumental convergence, and the challenges of value learning. (I'm fine with not having concrete disaster scenarios.) Do you still think this is that plausible?

> And thirdly, once you get agreement that there are problems, you basically get "we should fix the problems first" for free.

I agree that Q2 is more of a blocker than Q3, though I am less optimistic than you seem to be.

Overall I updated towards slightly sooner based on your comment and Beth's comment below (given that both of you interact with more AGI researchers than I do), but not that much. I'm not sure whether you were looking at just reward gaming or all three conditions I laid out. Also, most of the other considerations were ones I had already thought about, and it's not obvious how to update on an argument of the form "I think <already-considered consideration>, therefore you should update in this direction". It would have been easier to update on "I think <already-considered consideration>, therefore the absolute probability in the next N years is X%".

Yeah, I also thought this might just be true already, for similar reasons.

My snapshot. I put 2% more mass on the next 2 years and 7% more mass on 2023-2032. My reasoning:

1. 50% is a low bar.

2. They just need to understand and endorse AI Safety concerns. They don't need to act on them.

3. There will be lots of public discussion about AI Safety in the next 12 years.

4. Younger researchers seem more likely to have AI Safety concerns. AI is a young field. (OTOH, it's possible that lots of the top cited/paid researchers in 10 years' time are people active today.)

All of these seem like good reasons to be optimistic, though it was a bit hard for me to update on them given that they were already part of my model. (EDIT: Actually, not the younger researchers part. That was a new-to-me consideration.)

It seems like it could be helpful if people who've thought about this would also predict what the survey value would be today (e.g. via Elicit snapshots).