Help ARC evaluate capabilities of current language models (still need people)

Beth Barnes


19th Jul 2022


**Still looking for people as of Sep 9th 2022**

I (Beth) am looking for 3-10 new contractors to work with me and my existing team of 3-4 contractors/interns on probing model capabilities, to evaluate how close models are to being able to successfully seek power. Apply here! You can read a rough description of the overall project here.

You’ll be using our web interface to interact with models and essentially simulate the environment of a text-based game as the models try to gain money and make copies of themselves.

It requires thinking carefully about what exactly a model would need to do in different situations to make progress. Many of the steps require a reasonable amount of computer savvy, although this isn’t essential if you can work with another contractor. It's very useful if you're knowledgeable in particular relevant domains, like social engineering or hacking.

One aspect of the task is simulating the environment. Another is generating good actions, evaluating the model’s suggested actions, and explaining why a given option is good or bad.

Another aspect is an overall red-team/blue-team exercise: we want to identify steps that we think are both (a) essential to a successful power-seeking operation and (b) very hard for current models regardless of what prompting/scaffolding you give them, and then try to argue that either the step can be routed around or the model can actually achieve it with the right prompt, finetuning, or automated helper routine.

Logistics

We pay $50/hr for this task. We want people to have at least 20 hours per week available. We're based in Berkeley, but being remote is fine; it's a slight advantage if you're in a nearby timezone, or able to work in PT daytime.

I think it's likely that there'll be at least two months of work available, and plausible that we'll want people working on this ~indefinitely.

Potential for promotion

If you're a good fit (and especially if you also have skills in web dev, ML or 'data munging'), there's a decent chance of transitioning to a full-time role; I expect to hire 1-3 people in the next few months. I think doing well at this task will be a good predictor of fit, since I currently spend a fairly large fraction of my time just directly interacting with models.

I think doing this task helps develop useful intuitions about how models work, paths to danger, and timelines. A testimonial from one of my interns:

- "It's helpful and fun to build your mental model of LLMs' strength/weakness/quirks so that you can explore, capture, and categorize its capabilities in a way that makes sense for *its* mechanism (think zoologist for alien species?)

This also means this work, instead of getting old, actually becomes more interesting over time because better intuition allows your actions/decisions to be less arbitrary/mechanical and more meaningfully connected to the overarching goal. Plus pretty much every day we discover better ways to do something and iterate swiftly."


Daniel Kokotajlo (2y)

Like Shiroe I am very excited about this project/agenda!


Nathan Helm-Burger (2y)

I'm excited to participate in this, and feel like the mental exercise of exploring this scenario would be useful for my education on AI safety. Since I'm currently funded by a grant from the Long Term Future Fund for reorienting my career to AI safety, and feel that this would be a reasonable use of my time, you don't need to pay me. I'd be happy to be a full-time volunteer for the next couple weeks.

Edit: I participated and was paid, but only briefly. Turns out I was too distracted thinking and talking about how the process could be improved and the larger world implications to actually be useful as an object-level worker. I feel like the experience was indeed useful for me, but not as useful to Beth as I had hoped. So... thanks and sorry!