The Alignment Research Center’s Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here.

Update September 2023: we are currently accepting applications on a rolling basis. Our capacity for processing applications is likely to fluctuate, and it may take us longer to get back to you than during our official hiring rounds.

What is ARC’s Theory team?

The Alignment Research Center (ARC) is a non-profit whose mission is to align future machine learning systems with human interests. The high-level agenda of the Theory team (not to be confused with the Evals team) is described by the report on Eliciting Latent Knowledge (ELK): roughly speaking, we’re trying to design ML training objectives that incentivize systems to honestly report their internal beliefs.

For the last year or so, we’ve mostly been focused on an approach to ELK based on formalizing a kind of heuristic reasoning that could be used to analyze neural network behavior, as laid out in our paper on Formalizing the presumption of independence. Our research has reached a stage where we’re coming up against concrete problems in mathematics and theoretical computer science, and so we’re particularly excited about hiring researchers with relevant background, regardless of whether they have worked on AI alignment before. See below for further discussion of ARC’s current theoretical research directions.

Who is ARC looking to hire?

Compared to our last hiring round, we have more of a need for people with a strong theoretical background (in math, physics or computer science, for example), but we remain open to anyone who is excited about getting involved in AI alignment, even if they do not have an existing research record.

Ultimately, we are excited to hire people who could contribute to our research agenda. The best way to figure out whether you might be able to contribute is to take a look at some of our recent research problems and directions:

  • Some of our research problems are purely mathematical, such as these matrix completion problems – although note that these are unusually difficult, self-contained and well-posed (making them more appropriate for prizes).
  • Some of our other research is more informal, as described in some of our recent blog posts such as Finding gliders in the game of life.
  • A lot of our research occupies a middle ground between fully-formalized problems and more informal questions, such as fixing the problems with cumulant propagation described in Appendix D of Formalizing the presumption of independence.

What is working on ARC’s Theory team like?

ARC’s Theory team is led by Paul Christiano and currently has 2 other permanent team members, Mark Xu and Jacob Hilton, alongside a varying number of temporary team members (recently anywhere from 0–3).

Most of the time, team members work on research problems independently, with frequent check-ins with their research advisor (e.g., twice weekly). The problems described above give a rough indication of the kind of research problems involved, which we would typically break down into smaller, more manageable subproblems. This work is often somewhat similar to academic research in pure math or theoretical computer science.

In addition to this, we also allocate a significant portion of our time to higher-level questions surrounding research prioritization, which we often discuss at our weekly group meeting. Since the team is still small, we are keen for new team members to help with this process of shaping and defining our research.

ARC shares an office with several other groups working on AI alignment such as Redwood Research, so even though the Theory team is small, the office is lively with lots of AI alignment-related discussion.

What are ARC’s current theoretical research directions?

ARC’s main theoretical focus over the last year or so has been on preparing the paper Formalizing the presumption of independence and on follow-up work to that. Roughly speaking, we’re trying to develop a framework for “formal heuristic arguments” that can be used to reason about the behavior of neural networks. This framework can be thought of as a confluence of two existing approaches:

| | Human understandable | Machine verifiable |
| --- | --- | --- |
| Confident and final | | Formal proof |
| Uncertain and defeasible | Mechanistic interpretability | Formal heuristic argument |

This research direction can be framed in a couple of different ways:

  • As a formalization of mechanistic interpretability: Mechanistic interpretability is a research field seeking to reverse-engineer the weights of neural networks into human-understandable programs. A number of the field's central concepts, such as a “feature”, are currently defined informally. Putting the field onto more of a formal footing could bring clarity to the methods and goals of the field, remove the need to have humans or human-like systems in the loop, and elucidate how interpretability could be applied to solve downstream problems.
  • As a way of dealing with out-of-distribution generalization failures: We think that a formal heuristic argument that explains a neural network’s training set performance could be used to flag new datapoints that trigger unusual behavior inside the model. We have been calling this approach “mechanistic anomaly detection”, since it can be thought of as a way to detect anomalies in the model’s internal activations at inference time. Further details are given in this blog post.
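To make the idea of flagging unusual internal activations concrete, here is a minimal sketch. It is not a formal heuristic argument and is not ARC's method: it swaps in a plain statistical detector (per-unit z-scores of activations relative to the training set) purely to illustrate what "detecting anomalies in the model's internal activations at inference time" could look like. The toy network, the function names, and the threshold of 3 are all assumptions made for illustration.

```python
import math
import random

# Hypothetical toy network: three ReLU units over a two-dimensional input.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]

def activations(x):
    """Toy 'model internals': the hidden-layer ReLU activations for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in WEIGHTS]

def fit_activation_stats(train_inputs):
    """Record the per-unit mean and standard deviation over the training set."""
    columns = list(zip(*(activations(x) for x in train_inputs)))
    n = len(train_inputs)
    means = [sum(col) / n for col in columns]
    # Fall back to a tiny constant if a unit never varies, to avoid dividing by zero.
    stds = [math.sqrt(sum((a - m) ** 2 for a in col) / n) or 1e-8
            for col, m in zip(columns, means)]
    return means, stds

def anomaly_score(x, means, stds):
    """Largest per-unit z-score of the activations produced by a new input."""
    return max(abs(a - m) / s for a, m, s in zip(activations(x), means, stds))

rng = random.Random(0)
train_inputs = [[rng.random(), rng.random()] for _ in range(200)]
means, stds = fit_activation_stats(train_inputs)

typical_score = anomaly_score([0.5, 0.5], means, stds)     # resembles training data
unusual_score = anomaly_score([10.0, -10.0], means, stds)  # far outside training data
```

An input would be flagged when its score exceeds a chosen threshold (say 3). A detector like this only looks at marginal statistics; the hope expressed above is that a formal heuristic argument would instead flag inputs whose behavior is not covered by the explanation of training set performance.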

Over the coming weeks we'll post an update on our progress and current focus, as well as an AMA with research staff.

Hiring process

Our current interview process involves:

  • 3-hour take-home test involving math and computer science puzzles
  • 30-minute non-technical phone call
  • 1-day onsite interview

We will compensate candidates for their time when this is logistically possible.

We will keep applications open until at least the end of August 2023, and will aim to get a final decision back within 6 weeks of receiving an application.

Employment details

ARC is based in Berkeley, California, and we would prefer people who can work full-time from our office, but we are open to discussing remote or part-time arrangements in some circumstances. We can sponsor visas and are H-1B cap-exempt.

We are accepting applications for both visiting researcher (1–3 months) and full-time positions. The intention of the visiting researcher position is to assess potential fit for a full-time role, and we expect to invite around half of visiting researchers to join full-time. We are also able to offer straight-to-full-time positions, but we anticipate that we will only be able to do this for people with a legible research track-record.

Salaries are in the $150k–400k range for most people depending on experience.

Further information

If you have any questions about anything in this post, please ask in the comments section, email us, or stay tuned for our future posts and AMA.

4 comments

I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work).  But I decided maybe it's best to comment in a way that gives a better signal than silence. 

I've generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff about that. 

Disclaimers: a) I have not spent a lot of time trying to understand everything in the paper, and b) as is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that, if they were addressed, might cause me to update to the point where I would more strongly recommend that people apply.

I suppose the tldr is that the main contribution of the paper claims to be the framing of a set of open problems, but I did not find the paper able to convince me that the problems are useful ones or that they would be interesting to answer.

I can try to explain a little more: It seemed odd that the "potential" applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered - before writing the paper - questions like "OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?" Some of what was said in the paper was fairly vague stuff like: 

If successful, it may also help improve our ability to verify reasoning about complex questions, like those emerging in modern machine learning, for which we expect formal proof to be impossible. 

In my opinion, it's also important to bear in mind that the criterion of a problem being 'open' is a poor proxy for things like usefulness or interestingness (obviously those famous number theory problems are open, but so are loads of random mathematical statements). The usefulness or interestingness of course comes because people recognize various other valuable things too, such as: that the solution would seem to require new insights into X and therefore a proof would 'have to be' deeply interesting in its own right; that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC's stating of these open problems about heuristic estimators is an interesting contribution in itself?

To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I'm saying:

Neither of these applications [to avoiding catastrophic failures or to ELK] is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal.

But practically it means that when I ask myself something like: 'Why would I drop whatever else I'm working on and work on this stuff?' I find it quite hard to answer in a way that's not basically just all deference to some 'vision' that is currently undeclared (or as the paper says "mostly defer[red]" to "future articles").

Having said all this, I'll reiterate that there are lots of clear pros to a job like this, and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the presumption of Independence and in this post.


I think this is a reasonable perception and opinion. We’ve written a little bit about how heuristic estimators might help with ELK (MAD and ELK and finding gliders), but that writing is not particularly clear and doesn’t present a complete picture.

We’ve mostly been focused on finding heuristic estimators, because I am fairly convinced they would be helpful and think that designing them is our key technical risk. But now that we are hiring again I think it’s important for us to explain publicly why they would be valuable, and to generally motivate and situate our work.

I hope to write up a reasonable pitch sometime over the next few weeks.

In the original document we also mention a non-ELK application, namely using a heuristic estimator for adversarial training, which is significantly more straightforward. I think this is helpful for validating the intuitive story that heuristic estimators would overcome limitations of black box training, and in some sense I think that ELK and adversarial training together are the two halves of the alignment problem and so solving both is very exciting. That said, I've considered this in less detail than the ELK application. I'll try to give a bit more detail on this in the child comment.

Discussing the application of heuristic estimators to adversarial training:

Suppose you have a trusted specification C for catastrophic behavior, and you want to train a model that never violates C. For simplicity, I’ll assume that your model M maps observations x to actions y, and C takes an (x, y) pair and determines whether it’s catastrophic. So we want a model for which C(x, M(x)) is very rarely true on the deployment distribution.

You could care about this if you have a trusted weak model which can check for catastrophic behavior given enough time (but which is too expensive to continuously oversee your model at inference time), as in iterated amplification. You might also care if you’ve gotten a specification by solving ELK, but want to train a model that never does something bad (rather than merely detecting bad behavior at runtime, when pausing could be catastrophic). In general I think that "we can detect bad behavior but the model does a treacherous turn anyway" is a plausible failure mode to address.

A heuristic estimator lets you assess the probability that a given model M violates C for any distribution D, i.e. $\Pr_{x \sim D}[C(x, M(x))]$. You can produce estimates even when (i) the probability is very small, or (ii) you can’t efficiently draw samples from D.
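To illustrate why estimating from structure can succeed where sampling fails, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration: the model M is a one-dimensional linear map, C flags actions above a hypothetical threshold, D is a Gaussian, and the "estimator" is simply the exact Gaussian tail formula rather than a real heuristic estimator.

```python
import math
import random

THRESHOLD = 6.0  # hypothetical cutoff defining "catastrophic" actions

def model(x, w=1.0, b=0.0):
    """Toy model M: a one-dimensional linear map from observation to action."""
    return w * x + b

def catastrophe(x, y):
    """Toy specification C: an action is catastrophic if it exceeds THRESHOLD."""
    return y > THRESHOLD

def sampling_estimate(n_samples, mu=0.0, sigma=1.0, seed=0):
    """Estimate P[C(x, M(x))] by drawing x ~ N(mu, sigma^2) and counting hits."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)
        hits += catastrophe(x, model(x))
    return hits / n_samples

def structural_estimate(mu=0.0, sigma=1.0, w=1.0, b=0.0):
    """Estimate the same probability from the model's structure, without sampling.

    For w != 0, M(x) ~ N(w*mu + b, (w*sigma)^2), so the answer is a Gaussian tail.
    """
    z = (THRESHOLD - b - w * mu) / (abs(w) * sigma)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

With the defaults, `structural_estimate()` is on the order of 1e-9, while `sampling_estimate(10_000)` will almost certainly return 0, because no sample lands that far in the tail. The sampling estimate cannot resolve probabilities much below 1 / (number of samples), whereas the structural estimate never needs to draw from D.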

So we search for a distribution D on which M is particularly likely to behave catastrophically, estimate the log of the catastrophe probability under D, and then take a gradient descent step on M to reduce that probability (assuming that the estimator is differentiable, which is true for examples we have considered and appears to follow from natural coherence properties). This approach can hope to achieve error probabilities much lower than (1 / training set size), and we can ask about how M would behave given observations that we can recognize but can’t synthesize in the lab (which would otherwise be natural triggers for takeover attempts). In theory this overcomes the big limitations for adversarial training.
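The loop described above can be sketched on a toy problem. This is only an illustration under strong assumptions: the "model" is a single scalar weight w with M(x) = w * x, the adversary chooses D by picking a Gaussian mean from a fixed grid, and the heuristic estimator is replaced by an exact closed-form Gaussian tail, differentiated by finite differences. All names and constants are hypothetical.

```python
import math

THRESHOLD = 6.0                                     # hypothetical catastrophe cutoff
CANDIDATE_MEANS = [i / 10 for i in range(-20, 21)]  # distributions the adversary may pick

def log_catastrophe_prob(w, mu, sigma=1.0):
    """Stand-in estimator: log P[w * x > THRESHOLD] for x ~ N(mu, sigma^2), w != 0."""
    z = (THRESHOLD - w * mu) / (abs(w) * sigma)
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return math.log(max(p, 1e-300))  # floor guards against underflow to zero

def worst_case_mu(w, candidates=CANDIDATE_MEANS):
    """Adversary: pick the distribution on which M is most likely to be catastrophic."""
    return max(candidates, key=lambda mu: log_catastrophe_prob(w, mu))

def train(w=1.0, steps=8, lr=0.001, eps=1e-4):
    """Alternate between the adversary's search and a gradient step on the model."""
    history = []
    for _ in range(steps):
        mu = worst_case_mu(w)
        history.append(log_catastrophe_prob(w, mu))
        # Central finite difference in place of a differentiable estimator.
        grad = (log_catastrophe_prob(w + eps, mu)
                - log_catastrophe_prob(w - eps, mu)) / (2 * eps)
        w -= lr * grad  # step to reduce the estimated log-probability
    return w, history

w_final, history = train()
```

Here `history` records the worst-case log-probability before each step, and it falls well below what sampling could measure. The sketch optimizes only the catastrophe term, so w simply shrinks; a realistic setup would combine this with the ordinary training objective, and the step count and learning rate are tuned to keep this toy stable.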

If you actually had a heuristic estimator you could immediately test this application. Detecting small catastrophe probabilities is particularly straightforward. In this setting D can be a neural net adversary: you can initialize with an LM asked to produce cases where M behaves badly, and then fine-tune D to optimize the catastrophe probability.

What’s the difficulty level of the take home? Are you looking for math Olympiad level theorists, or just folks who can handle graduate level math/cs?