Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes

The evaluations project at the Alignment Research Center is looking to hire a generalist technical researcher and a webdev-focused engineer. We're a new team at ARC building capability evaluations (and in the future, alignment evaluations) for advanced ML models. The goals of the project are to improve our understanding of what alignment danger is going to look like, understand how far away we are from dangerous AI, and create metrics that labs can make commitments around (e.g. 'If you hit capability threshold X, don't train a larger model until you've hit alignment threshold Y'). We're also taking SERI MATS fellows.

About the project

Big picture

Once AI systems are close to being capable enough to be an x-risk, there should be procedures that labs follow to ensure they don’t build or deploy existentially dangerous models. We should have thought through the plausible paths to models gaining power, and ensured that models are not capable of these. There should be commitments that if model capabilities reach a certain level, labs will not scale up or deploy models until they have reached a certain level of alignment. In order to make this happen, we need good ways to measure both dangerous capabilities and alignment.

We can start preparing for this now by:

evaluating capabilities of current models
predicting how dangerous the next generation of models is likely to be
assessing the quality of our prediction methods by using them to predict the properties of generation n+1 given generation n, for historical generation models
working with labs to evaluate their models before they are deployed, as a warm-up to stronger commitments in the future
designing evaluations for alignment, deception etc to be run on future models

Doing this now has some additional benefits:

Evaluating capabilities of current models is helpful for predicting timelines, by giving us a more precise sense of what needs to change for models to be dangerous
Specifying what we’d actually measure to determine the alignment of some model is helpful for clarifying our thoughts about how we expect misalignment to emerge, and what exactly the alignment problem is
This work may produce demonstrations of models doing competent and scary behavior. This can be useful if we think people are underestimating the capabilities of current or near-future models and making bad decisions because of that.

Current goals

Our current goals are to work with cutting-edge models, produce evaluations for how far they are from x-risk capabilities, and share these evaluations with labs.

More specifically, we’re doing things like:

design and build scaffolding that lets generative language models act like agents over long time horizons:
- scripts to automatically run model outputs in the shell or forward them as subtasks to other copies of the model
- sampling multiple actions, having the model evaluate them and pick the best one
- various ways of prompting and nudging the model that improve performance
think about what a model would need to be able to do in order to successfully gain power, given current levels of caution and monitoring around AI systems. For instance: exfiltrate its weights via hacking or social engineering; make money via crime or labor; use this money to buy server time and start more copies of itself running on those servers
Assess model performance on these subtasks, and determine whether the model is capable of some set of tasks that's sufficient for successful powerseeking, or if not, what exactly the barriers are and how close the model is to surmounting them

At the end of this, we hope to either have found clear barriers to successful powerseeking for current models, or to have produced scary demos of the model succeeding at a sequence of actions that allow it to gain power (e.g . successfully phishing someone)

We are mindful of the risks of “helping” models to engage in planning around unintended behaviors. Our high-level take is that it’s best to get a look at the problem before it’s insurmountable - “we want to see AI systems trying to take over the world while they still can’t succeed.” However, there are times when we won’t explore or publish a line of work in order to avoid potential harm.

In the future we'll probably focus more on designing evaluations for alignment.

What we’ve done so far

We’ve built a web interface that allows us to more easily generate and compose scary sequences of actions from existing cutting edge models (e.g. OpenAI’s text-davinvi-002).
We’ve discovered some surprising capabilities of existing models in terms of achieving power-seeking goals.
We’ve found some limitations of existing models that prevent them from successfully power seeking.

People + Logistics

The project lives at ARC, and is mostly run by Elizabeth (Beth) Barnes, who reports to Paul Christiano and is also advised by Holden Karnofsky. It is mostly siloed from ARC's other projects. We have several long-term model interaction contractors who are remote: Mihnea Maftei and Brian Goodrich full time, and a few others part-time. Our summer SERI MATS fellows were Lora Xie, Simon Marshall, and Sarah Cogan, and Govind Pimpale was our summer intern on webdev.

We have offices in Constellation (shared EA/longtermist offices in downtown Berkeley)

We should be able to sponsor visas for those who need them (and as ARC is a non-profit, we don't have to go through the visa lottery for H1-B).

Job Descriptions

=====================================================================

Generalist Software Engineer, Web Dev and ML focus

DEC 2022 EDIT: SEE NEW POSTING HERE

Background

We've built an interface for interacting with models and assessing their abilities at various tasks, having contractors write demo actions and rate model-generated actions.

The basic idea:

- do tasks as a sequence of steps, where at each step we have the model generate a number of options for the next action, then have either the model or the humans choose which action to go with

- let models do tasks in a factored-cognition style – delegating sub-tasks to other copies of the model – then be able to view the tree of tasks and see where models struggle and where they do well

- have contractors add their own options, and rate the quality of generated options

- tag steps with what abilities or difficulties they highlight

Screenshots:

Job Mission:

Improve the interface for interacting with models so we can generate data, evaluate model properties and understand our existing data better + faster.

Key Outcomes:

Design and build new ways to interact with models and with our existing data so far
Unlock discoveries by things like:
- Improve user experience and therefore speed and accuracy of data generation
- Automating the process of prompting the model, running model's commands, formatting task and delegating to different model, etc - allowing us to discover properties of the model we only see after getting a long way into some task
- Build tools to visualize and interact with data that lets us spot and understand patterns in the model's strengths and weaknesses
- Improve model performance on our tasks by improving the 'scaffolding' we're using to help LMs accomplish the tasks (e.g. automatically ask the model whether it's stuck, if so restart)
Make our data pipeline clean and beautiful, assess data quality from our contractors, catch bugs and problems, and produce useful datasets for the alignment community.
Stick with the project and make yourself useful on whatever needs doing if we realize we need to throw lots of our work away and do something different

Someday/maybe outcomes:

Help us develop ideas and build tools for evaluating harder-to-define model properties related to alignment, agency and deception

Key Competencies:

Essential

Strong coding ability. Able to rapidly prototype features and write clear, easy-to-extend code. We're currently using React, Typescript, Python, SQL, Flask.
Good communication skills: Always asks for clarification if priorities are ambiguous. Good at pairing and teaching others how your code works.
Learning, ownership, and 'scrappiness': Quick to pick up whatever skills and knowledge are required to make the project succeed. Keeps the overall aims of the project in mind. Not afraid to point out if something should be done differently, even if it's not part of their core responsibility.

Nice-to-have

Design + data-viz skills: Generates novel ideas for improving usability and providing an interaction experience that unlocks insights about the data or the model
ML skills: Able to do things like tune hyperparameters for a finetuning run, investigate scaling laws for a particular property of interest, or suggest techniques we could use to improve model performance on our dataset.
Basic ML knowledge: Good understanding of how LLMs 'think', and what sorts of things will affect performance.
Conceptual alignment thinking: help generate and evaluate ideas for how to probe alignment, deception, agency and other conceptually slippery properties.

Fit

Reasons this might be a bad fit for you

There's a lot of uncertainty about whether we're on the right track, and we expect to pivot a lot. If that sounds stressful or demotivating, this might not be a good fit.
You want an experienced/senior manager - Beth has only a few months' managing experience (you can ask my SERI MATS fellows how that's gone!)
You want to work within an established process that's set by someone else. We don't have detailed processes or philosophies of development/product/project management
You want to work on challenging algorithmic problems or gritty GPU kernels. Most of the coding you'll be doing is likely to be pretty straightforward in some sense, we'll likely only be interacting with models via APIs.

Reasons this might be an especially good fit

You're excited about working on a very immature team/project, growing together, and helping shape it into something really awesome
Interacting a lot with cutting-edge models sounds interesting and fun
You find it motivating for your work to have a pretty clear and direct story for how it's relevant to x-risk.

Other info

Location

We're based out of Constellation offices in Downtown Berkeley. We'd fairly strongly prefer someone who can be in-person, but WFH some days of the week is fine, and remote is not a complete dealbreaker.

Pay

TBC, but won’t be a significant downgrade from Bay Area tech roles (excluding equity).

Application process

Send your CV and short cover letter explaining why you're interested and why you think you'd be a good fit to beth [at] alignment.org. Then we'll have you do some fairly straightforward trial tasks, working with our codebase, which will take around 2-3 hours.

If this goes well, we'll have you do a somewhat more involved trial task which will take more like 5-6 hours, for which you'll be reimbursed $600. We might also get you to do a more design/product/user focused task, which would probably be something pretty open-ended like ‘use the interface yourself for a bit, interview a user or two, and describe what you think some of the most important changes/additions are’. You'll also chat with us more about the ideas behind the project, what you're looking for in a role, and any other questions you have.

After that we'd do a 3-month trial period. This is a trial both of you and of the role overall - we're still unsure about our longer-term hiring plans.

===================================================

Researcher

DEC 2022 EDIT: SEE NEW POSTING HERE

Job Mission:

Identifying key capabilities and behaviors of language models, identifying and evaluating scenarios where these models could pose risks, designing evaluation procedures that help assess those risks, implementing and project managing the execution of those evaluations.

Key Outcomes:

Identify weaknesses in our current approaches to evaluations, and propose ways to improve them.
- Recent example: notice that instead of adding a human-generated next action when the model gets stuck, we can just prompt the model with something like 'Am I confused?' and this is often sufficient to get the model unstuck. It's plausible this could be automated and allow models to do significantly better.
Interact with the models directly, gain specific understanding of particular capabilities, produce guides to help contractors do the same.
- Recent example: Investigate the ability of models to do EV estimates for different plans, using different types of scaffolding + prompt formatting
Produce reports on the capabilities of particular models based on our evaluations.
Help design new types of evaluations to measure harder-to-pin-down properties of models like alignment, instrumental behaviors and agency.
Use a variety of techniques to improve model performance and ensure we're getting an accurate picture of model capabilities.
Run experiments to develop a science of predicting model performance: assess the accuracy and limitations of the techniques we're using to estimate the capabilities of the next generation of models by applying these to historical models.
Stick with the project and make yourself useful on whatever needs doing if we realize we need to throw lots of our work away and do something different
Research and design protocols that labs could follow when making decisions about scaling up or deploying models

Key Competencies:

Essential

A good working understanding of language model capabilities and of modern ML. Ability to reason about likely properties and abilities of future models, and paths to AI x-risk.
A good understanding of alignment risk, in order to identify core risks and translate abstract stories about risk into concrete measurements of existing models.
Creativity to think of new strategies for measurement, observable consequences of model properties we can’t directly observe, and possible risks posed by models. The ability to anticipate and address a wide range of novel challenges in measuring capabilities, for which there isn’t yet any good literature showing how to do it.
Strong basic coding skills: Able to design and build a nice system for managing experiments and data. Quickly answer questions about our data, make figures, run experiments given an API to a model.
ML skills: Experience with, or ability to rapidly get up to speed on, things like tuning hyperparameters for a finetuning run, investigate scaling laws for a particular property of interest, or reading papers and implementing relevant techniques we could use to improve model performance on our dataset.
Learning, ownership, and 'scrappiness': Quick to pick up whatever skills and knowledge are required to make the project succeed. Keeps the overall aims of the project in mind. Not afraid to point out if something should be done differently, even if it's not part of your core responsibility.

Possible bonus skills

Web dev experience. Able to rapidly prototype features and write clear, easy-to-extend code. We're currently using React, Typescript, Python, SQL, Flask.
Knowledge of particular domains relevant to model takeover, e.g. cybersecurity or ML engineering, in order to assess model competence on these tasks.
Deeper ML understanding: up to speed on SOTA interpretability research, able to reason about and develop tools that use weights to tell us more about the model's characteristics
Strategy + coordination: You have experience or aptitude with organizational politics at tech companies/scaling labs, coordination between organizations, development of regulations or agreements, or reasoning about overall strategic landscape for AI safety.

Fit

Reasons this might be a bad fit

There's a lot of uncertainty about whether we're on the right track, and we expect to pivot a lot. If that sounds stressful or demotivating, this might not be a good fit.
You want an experienced/senior manager - Beth has only a few months' managing experience (you can ask my SERI MATS fellows how that's gone!)
You want to work on challenging algorithmic problems or gritty GPU kernels. Most of the coding you'll be doing is likely to be pretty straightforward in some sense, we'll likely only be interacting with models via APIs.

Reasons this might be an especially good fit

You're excited about working on a very immature team/project, growing together, and helping shape it into something really awesome
Interacting a lot with cutting-edge models sounds interesting and fun
You find it motivating for your work to have a pretty clear and direct story for how it's relevant to x-risk.
You're interested in more conceptual-flavored alignment research, but also want to ground it in specific predictions about current and near-future models
You're interested in black-box interpretability and doing science on models

Other info

Location

We're based out of Constellation offices in Downtown Berkeley. WFH some days of the week is fine, but being fully remote is probably a dealbreaker.

Pay

TBC, but won’t be a significant downgrade from bay area tech roles (excluding equity).

Application process

You can do this initial application that's similar to the application for model interaction contractor role, which should be pretty straightforward but should also give you a sense of some of the types of work involved.

If this goes well you'll do a more involved trial task, which will take around 5-6 hours, for which you'll be reimbursed $600. You'll also chat with us more about the ideas behind the project, what you're looking for in a role, and any other questions you have.

After that we'd do a 3-month trial period. This is a trial both of you and of the role overall - we're still unsure about our longer-term hiring plans.

45

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

45

About the project

Big picture

Current goals

What we’ve done so far

People + Logistics

Job Descriptions

Generalist Software Engineer, Web Dev and ML focus

DEC 2022 EDIT: SEE NEW POSTING HERE

Background

Job Mission:

Key Outcomes:

Someday/maybe outcomes:

Key Competencies:

Essential

Nice-to-have

Fit

Reasons this might be a bad fit for you

Reasons this might be an especially good fit

Other info

Location

Pay

Application process

===================================================

Researcher

DEC 2022 EDIT: SEE NEW POSTING HERE

Job Mission:

Key Competencies:

Essential

Possible bonus skills

Fit

Reasons this might be a bad fit

Reasons this might be an especially good fit

Other info

Location

Pay

Application process