Stephen McAleese

Computer science master's student interested in AI and AI safety.

Wiki Contributions


I think this section of the post is slightly overstating the opportunity cost of doing a PhD. PhD students typically spend most of their time on research so ideally, they should be doing AI safety research during the PhD (e.g. like Stephen Casper). If the PhD is in an unrelated field or for the sake of upskilling then there is a more significant opportunity cost relative to working directly for an AI safety organization.

Thank you for explaining PPO. In the context of AI alignment, it may be worth understanding in detail because it's the core algorithm at the heart of RLHF. I wonder if any of the specific implementation details of PPO or how it's different from other RL algorithms have implications for AI alignment. To learn more about PPO and RLHF, I recommend reading this paper: Secrets of RLHF in Large Language Models Part I: PPO.

LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data. The AI Safety Chatbot partially solves this problem using retrieval-augmented generation (RAG) on a database of articles from There also seem to be plans to fine-tune it on a dataset of alignment articles.

Nice post! The part I found most striking was how you were able to use the mean difference between outputs on harmful and harmless prompts to steer the model into refusing or not. I also like the refusal metric which is simple to calculate but still very informative.

TL;DR: Private AI companies such as Anthropic which have revenue-generating products and also invest heavily in AI safety seem like the best type of organization for doing AI safety research today. This is not the best option in an ideal world and maybe not in the future but right now I think it is.

I appreciate the idealism and I'm sure there is some possible universe where shutting down these labs would make sense but I'm quite unsure about whether doing so would actually be net-beneficial in our world and I think there's a good chance it would be net-negative in reality.

The most glaring constraint is finances. AI safety is funding-constrained so this is worth mentioning. Companies like DeepMind and OpenAI spend hundreds of millions of dollars per year on staff and compute and I doubt that would be possible in a non-profit. Most of the non-profits working on AI safety (e.g. Redwood Research) are small with just a handful of people. OpenAI changed their company from a non-profit to a capped for-profit because they realized that being a non-profit would have been insufficient for scaling their company and spending. OpenAI now generates $1 billion in revenue and I think it's pretty implausible that a non-profit could generate that amount of income.

The other alternative apart from for-profit companies and philanthropic donations is government funding. It is true that governments fund a lot of science. For example, the US government funds 40% of basic science research. And a lot of successful big science projects such as CERN and the ITER fusion project seem to be mostly government-funded. However, I would expect a lot of government-funded academic AI safety grants to be wasted by professors skilled at putting "AI safety" in their grant applications so that they can fund whatever they were going to work on anyway. Also, the fact that the US government has secured voluntary commitments from AI labs to build AI safely gives me the impression that governments are either unwilling or incapable of working on AI safety and instead would prefer to delegate it to private companies. On the other hand, the UK has a new AI safety institute and a language model task force.

Another key point is research quality. In my opinion, the best AI safety research is done by the big labs. For example, Anthropic created constitutional AI and they also seem to be a leader in interpretability research. I think empirical AI safety work and AI capabilities work involve very similar skills (coding etc.) and therefore it's not surprising that leading AI labs also do the best empirical AI safety work. There are several other reasons for explaining why big AI labs do the best empirical AI safety work. One is talent. Top labs have the money to pay high salaries which attracts top talent. Work in big labs also seems more collaborative than in academia which seems important for large projects. Many top projects have dozens of authors (e.g. the Llama 2 paper). Finally, there is compute. Right now, only big labs have the infrastructure necessary to do experiments on leading models. Doing experiments such as fine-tuning large models requires a lot of money and hardware. For example, this paper by DeepMind on reducing sycophancy apparently involved fine-tuning the 540B PaLM model which is probably not possible for most independent and academic researchers right now and consequently, they usually have to work with smaller models such as Llama-2-7b. However, the UK is investing in some new public AI supercomputers which hopefully will level the playing field somewhat. If you think theoretical work (e.g. agent foundations) is more important than empirical work then big labs have less of an advantage. Though DeepMind is doing some of that too.

Thanks for the post! I think it does a good job of describing key challenges in AI field-building and funding.

The talent gap section describes a lack of positions in industry organizations and independent research groups such as SERI MATS. However, there doesn't seem to be much content on the state of academic AI safety research groups. So I'd like to emphasize the current and potential importance of academia for doing AI safety research and absorbing talent. The 80,000 Hours AI risk page says that there are several academic groups working on AI safety including the Algorithmic Alignment Group at MIT, CHAI in Berkeley, the NYU Alignment Research Group, and David Krueger's group in Cambridge.

The AI field as a whole is already much larger than the AI safety field so I think analyzing the AI field is useful from a field-building perspective. For example, about 60,000 researchers attended AI conferences worldwide in 2022. There's an excellent report on the state of AI research called Measuring Trends in Artificial Intelligence. The report says that most AI publications come from the 'education' sector which is probably mostly universities. 75% of AI publications come from the education sector and the rest are published by non-profits, industry, and governments. Surprisingly, the top 9 institutions by annual AI publication count are all Chinese universities and MIT is in 10th place. Though the US and industry are still far ahead in 'significant' or state-of-the-art ML systems such as PaLM and GPT-4.

What about the demographics of AI conference attendees? At NeurIPS 2021, the top institutions by publication count were Google, Stanford, MIT, CMU, UC Berkeley, and Microsoft which shows that both industry and academia play a large role in publishing papers at AI conferences.

Another way to get an idea of where people work in the AI field is to find out where AI PhD students go after graduating in the US. The number of AI PhD students going to industry jobs has increased over the past several years and 65% of PhD students now go into industry but 28% still go into academic jobs.

Only a few academic groups seem to be working on AI safety and many of the groups working on it are at highly selective universities but AI safety could become more popular in academia in the near future. And if the breakdown of contributions and demographics of AI safety will be like AI in general, then we should expect academia to play a major role in AI safety in the future. Long-term AI safety may actually be more academic than AI since universities are the largest contributor to basic research whereas industry is the largest contributor to applied research.

So in addition to founding an industry org or facilitating independent research, another path to field-building is to increase the representation of AI safety in academia by founding a new research group though this path may only be tractable for professors.

I agree that the difficulty of the alignment problem can be thought of as a diagonal line on the 2D chart above as you described.

This model may make having two axes instead of one unnecessary. If capabilities and alignment scale together predictably, then high alignment difficulty is associated with high capabilities, and therefore the capabilities axis could be unnecessary.

But I think there's value in having two axes. Another way to think about your AI alignment difficulty scale is like a vertical line in the 2D chart: for a given level of AI capability (e.g. pivotal AGI), there is uncertainty about how hard it would be to align such an AGI because the gradient of the diagonal line intersecting the vertical line is uncertain.

Instead of a single diagonal line, I now think the 2D model describes alignment difficulty in terms of the gradient of the line. An optimistic scenario is one where AI capabilities are scaled and few additional alignment problems arise or existing alignment problems do not become more severe because more capable AIs naturally follow human instructions and learn complex values. A highly optimistic possibility is that increased capabilities and alignment are almost perfectly correlated and arbitrarily capable AIs are no more difficult to align than current systems. Easy worlds correspond to lines in the 2D chart with low gradients and low-gradient lines intersect the vertical line corresponding to the 1D scale at a low point.

A pessimistic scenario can be represented in the chart as a steep line where alignment problems rapidly crop up as capabilities are increased. For example, in such hard worlds, increased capabilities could make deception and self-preservation much more likely to arise in AIs. Problems like goal misgeneralization might persist or worsen even in highly capable systems. Therefore, in hard worlds, AI alignment difficulty increases rapidly with capabilities and increased capabilities do not have helpful side effects such as the formation of natural abstrations that could curtail the increasing difficulty of the AI alignment problem. In hard worlds, since AI capabilities gains cause a rapid increase in alignment difficulty, the only way to ensure that alignment research keeps up with the rapidly increasing difficulty of the alignment problem is to limit progress in AI capabilities.

I highly recommend this interview with Yann LeCun which describes his view on self-driving cars and AGI.

Basically, he thinks that self-driving cars are possible with today's AI but would require immense amounts of engineering (e.g. hard-wired behavior for corner cases) because today's AI (e.g. CNNs) tends to be brittle and lacks an understanding of the world.

My understanding is that Yann thinks we basically need AGI to solve autonomous driving in a reliable and satisfying way because the car would need to understand the world like a human to drive reliably.

My summary of the podcast


The superalignment team is OpenAI’s new team, co-led by Jan Leike and Ilya Sutskever, for solving alignment for superintelligent AI. One of their goals is to create a roughly human-level automated alignment researcher. The idea is that creating an automated alignment researcher is a kind of minimum viable alignment project where aligning it is much easier than aligning a more advanced superintelligent sovereign-style system. If the automated alignment researcher creates a solution to aligning superintelligence, OpenAI can use that solution for aligning superintelligence.

The automated alignment researcher

The automated alignment researcher is expected to be used for running and evaluating ML experiments, suggesting research directions, and helping with explaining conceptual ideas. The automated researcher needs two components: a model capable enough to do alignment research, which will probably be some kind of advanced language model, and alignment for the model. Initially, they’ll probably start with relatively weak systems with weak alignment methods like RLHF and then scale up both using a bootstrapping approach where the model increases alignment and then the model can be scaled as it becomes more aligned. The end goal is to be able to convert compute into alignment research so that alignment research can be accelerated drastically. For example, if 99% of tasks were automated, research would be ~100 times faster.

The superalignment team

The automated alignment researcher will be built by the superalignment team which currently has 20 people and could have 30 people by the end of the year. OpenAI also plans to allocate 20% of its compute to the superalignment team. Apart from creating the automated alignment researcher, the superalignment team will continue doing research on feedback-based approaches and scalable oversight. Jan emphasizes that the superalignment team will still be needed even if there is an automated alignment researcher because he wants to keep humans in the loop. He also wants to avoid the risk of creating models that seek power, self-improve, deceive human overseers, or exfiltrate (escape).

Why Jan is optimistic

Jan is generally optimistic about the plan succeeding and estimates that it has an ~80% chance of succeeding even though Manifold only gives the project a 22% chance of success. He gives 5 reasons for being optimistic:

  • LLMs understand human intentions and morality much better than other kinds of agents such as RL game-playing agents. For example, often you can simply ask them to behave a certain way.
  • Seeing how well RLHF worked. For example, training agents to play Atari games works almost as well as using the reward signal. RLHF-aligned LLMs are much more aligned than base models.
  • It’s possible to iterate and improve alignment solutions using experiments and randomized controlled trials.
  • Evaluating research is easier than generating it.
  • The last reason is a bet on language models. Jan thinks many alignment tasks can be formulated as text-in-text-out tasks.

Controversial statements/criticisms

Jan criticizing interpretability research:

I think interpretability is neither necessary, nor sufficient. I think there is a good chance that we could solve alignment purely behaviorally without actually understanding the models internally. And I think, also, it’s not sufficient where: if you solved interpretability, I don’t really have a good story of how that would solve superintelligence alignment, but I also think that any amount of non-trivial insight we can gain from interpretability will be super useful or could potentially be super useful because it gives us an avenue of attack.

Jan criticizing alignment theory research:

I think there’s actually a lot more scope for theory work than people are currently doing. And so I think for example, scalable oversight is actually a domain where you can do meaningful theory work, and you can say non-trivial things. I think generalization is probably also something where you can say… formally using math, you can make statements about what’s going on (although I think in a somewhat more limited sense). And I think historically there’s been a whole bunch of theory work in the alignment community, but very little was actually targeted at the empirical approaches we tend to be really excited [about] now. And it’s also a lot of… Theoretical work is generally hard because you have to… you’re usually either in the regime where it’s too hard to say anything meaningful, or the result requires a bunch of assumptions that don’t hold in practice. But I would love to see more people just try … And then at the very least, they’ll be good at evaluating the automated alignment researcher trying to do it.

Ideas for complementary research

Jan also gives some ideas that would be complementary to OpenAI’s alignment research agenda:

  • Creating mechanisms for eliciting values from society.
  • Solving current problems like hallucinations, jailbreaking, or mode-collapse (repetition) in RLHF-trained models.
  • Improving model evaluations and evals to measure capabilities and alignment.
  • Reward model interpretability.
  • Figuring out how models generalize. For example, figuring out how to generalize alignment for easy-to-supervised tasks to hard tasks.
Load More