LLMs for Alignment Research: a safety priority?

abramdemski

A recent short story by Gabriel Mukobi illustrates a near-term scenario where things go bad because new developments in LLMs allow LLMs to accelerate capabilities research without a correspondingly large acceleration in safety research.

This scenario is disturbingly close to the situation we already find ourselves in. Asking the best LLMs for help with programming vs technical alignment research feels very different (at least to me). LLMs might generate junk code, but you can keep pointing out the problems with the code, and the code will eventually work. This can be faster than doing it myself, in cases where I don't know a language or library well; the LLMs are moderately familiar with everything.

When I try to talk to LLMs about technical AI safety work, however, I just get garbage.

I think a useful safety precaution for frontier AI models would be to make them more useful for safety research than capabilities research. This extends beyond applying AI technology to accelerate safety research within top AI labs; models available to the general public (such as GPT-N, Claude-N) should also accelerate safety more than capabilities.

What is wrong with current models?

My experience is mostly with Claude, and mostly with versions of Claude before the current one (Claude 3).^[1] I'm going to complain about Claude here; but everything else I've tried seemed worse. In particular, I found GPT4 to be worse than Claude2 for my purposes.

As I mentioned in the introduction, I've been comparing how helpful these models feel for programming with how useless they feel for technical AI safety; specifically, technical AI safety of the mathematical-philosophy flavor that I usually think about. This is not, of course, a perfect experiment to compare capability-research-boosting to safety-research-boosting. However, the tasks feel comparable in the following sense: programming involves translating natural-language descriptions into formal specifications; mathematical philosophy also involves translating natural-language descriptions into formal specifications. From this perspective, the main difference is what sort of formal language is being targeted (IE, programming languages vs axiomatic models).

I don't have systematic experiments to report; just a general feeling that Claude's programming is useful, but Claude's philosophy is not.^[2] It is not obvious, to me, why this is. I've spoken to several people about it. Some reactions:

- If it could do that, we would all be dead!
  - I think a similar mindset would have said this about programming, a few years ago. I suspect there are ways for modern LLMs to be more helpful to safety research in particular which do not also imply advancing capabilities very much in other respects. I'll say more about this later in the essay.
- There's probably just a lot less training data for mathematical philosophy than for programming.
  - I think this might be an important factor, but it is not totally clear to me.
- Mathematical philosophy is inherently more difficult than programming, so it is no surprise.
  - This might also be an important factor, but I consider it to be only a partial explanation. What is more difficult, exactly? As I mentioned, programming and mathematical philosophy have some strong similarities.

Problems include a bland, people-pleasing attitude which is not very helpful for research. By default, Claude (and GPT4) will enthusiastically agree with whatever I say, and stick to summarizing my points back at me rather than providing new insights or adding useful critiques. When Claude does engage in more structured reasoning, it is usually wrong and bad. (I might summarize it as "based more on vibes than logic".)

Is there any hope for better?

As a starting observation: although a given AI technology, such as GPT4, might not meet some safety standards we'd like to impose (eg, transparency/interpretability), its widespread use means we are already forced to gamble on its relative safety. In some weak sense, this gives us a resource: a technology which we can use without increasing risks. This certainly doesn't imply that any arbitrary use of GPT4 is non-risk-increasing. However, it does suggest approaches involving cautiously harnessing modern AI technology for what it's good for, without placing it in the driver's seat.

We're at a point in history where suddenly many new things are possible; it's a point where it makes a lot of sense to look around, explore, and see whether you can find a significant way to leverage the new technologies for good. With the technology being so new, I don't think we should stop at the obvious (EG, give up because chatting with modern LLMs about safety research did not feel fruitful).

Some obvious things to try include better prompting strategies, and fine-tuning models specifically for helping with this sort of work. It might be useful to attach LLMs to theorem-proving assistants and teach the LLMs to (selectively) formalize what the user is trying to reason about as axioms or proofs in the connected formal system.

It would also be helpful to simply make a more systematic study of what these models can and cannot help with, relating to AI safety research.

I'll state some more specific ideas about how to use modern LLMs to benefit safety research towards the end of this essay; there are some more intuitions I want to communicate first.

What follows is my personal vision for how modern LLMs could be more useful for safety research; I don't want to put overmuch emphasis on it. My main point has already been made: making LLMs comparatively more useful for AI safety work as opposed to AI capabilities work should itself be considered a safety priority.

Against Autonomy

I think there's a dangerous bias toward autonomy in AI -- both in terms of what AI companies are trying to produce, and also in what consumers are asking for. I want to advocate for a greater emphasis on collaborative AI, rather than AI which takes requests and then delivers.

Servant vs Collaborator

Big AI companies are for the most part fine-tuning models to take a prompt and return an answer. This is a pretty reasonable idea, but it sometimes feels like interacting with a nervous intern desperate to prove themselves in their first week on the job.

For example, my brother started a conversation with something like "I'm thinking about making an RPG". Bing responded with a very reasonable list of things to think about when making an RPG. The problem is that my brother actually had a very specific idea in mind, and the advice was very generic. Simply put, my brother hadn't finished explaining what he wanted before pressing enter. It would have been more useful for Bing to engage in active listening: "What kind of RPG are you interested in making?" or similar conversational questions; and only write a research report giving advice after the general shape of the request was clear. You have to be careful what you say to the nervous intern, because the nervous intern will scurry off and write up a report at the drop of a hat.

Similarly, this video argues that Sudowrite (an AI novel-writing tool) is less useful to authors than NovelCrafter (also an AI novel-writing tool) because Sudowrite's philosophy is closer to "click a button for the AI to write a novel for you" while NovelCrafter is oriented toward a more collaborative model.

I think there are a few sources of autonomy-bias which I want to point out, here:

- Autonomy is often easier to train into AI. For example, to generate whole pictures, you just need a data-set consisting of finished art. More sophisticated image manipulation sometimes requires more complex data-sets which might be more difficult to obtain.
- Autonomy is easier to conceive of. Push a button and it does what you want. Collaboration often requires more sophisticated user interfaces and more complex ideas about workflows -- perhaps involving domain-specific knowledge about how domain experts actually go about their business.
- Autonomy is more appealing to the people in charge of corporate budgets. My brother is currently working as a programmer, and his boss says he can't wait till the AI is at the point where you just push a button and get the code you asked for. My brother, due to having a closer relationship with the code, has a much more collaborative relationship with the AI. To programmers, the inadequacies of the "just push a button" model are more apparent.

Notions of Alignment

Garret Baker recently commented:

To my ears it sounded like Shane [Legg]'s solution to "alignment" was to make the models more consequentialist. I really don't think he appreciates most of the difficulty and traps of the problems here. This type of thinking, on my model of their models, should make even alignment optimists unimpressed, since much of the reason for optimism lies in observing current language models, and interpreting their outputs as being nonconsequentialist, corrigible, and limited in scope yet broad in application.

Let's set aside whether Garret Baker's analysis of Shane Legg is correct. If it were correct, could you really blame Legg? Someone could be quite up-to-date with the alignment literature and still take the view that "alignment" basically means "value alignment" -- which is to say, absorbing human values and then optimizing them. Some of the strongest advocates of alternate ideas like "corrigibility" will still say that progress towards it has stalled and the evidence points toward it being a very unnatural concept.

Simply put, we don't yet have a strong alternative to agent-centric (autonomy-centric) alignment.

A couple of people who I talk to have been moving away from the value-alignment picture, recently, instead replacing it with the following picture: aligned AI systems are systems which increase, rather than decrease, the agency of humans. This is called capabilitarianism (in contrast to utilitarianism).^[4]

Think of social media vs wikis. Social media websites are attention-sucking machines which cause addictive scrolling. Wikis, such as Wikipedia, are in contrast incredibly useful.

Or think of a nanny state which makes lots of decisions for its citizens on utilitarian grounds, vs a government which emphasizes freedom and informed decision-making, fostering the ability of its citizens to optimize their own lives, rather than doing it for them.

This notion of alignment is still lacking the level of clarity which the more consequentialist notion possesses, but it sure seems like there are fewer ways for this kind of vision to go wrong.

The Need for an Independent Safety Approach

OpenAI says:

Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.

I think this sort of plan can easily go wrong. Broadly speaking, aiming to take the human out of the loop seems like a mistake. We want to be on a trajectory where humans very much remain in the loop. Of course I don't think the superalignment team at OpenAI are trying to take humans entirely out of the loop in a broader sense. But I don't think "automated alignment researcher" should be the way we think about the end goal.

If you are trying to use AI to accelerate alignment work, but your main approach to alignment work is "use AI to accelerate alignment work" -- it seems to me that it is easy to miss a certain sort of grounding. You're solving for X in the equation "use AI to accelerate X".

Instead, I would propose that people working on LLMs should work to make LLMs useful to alignment researchers whose main approach to alignment IS NOT "make LLMs useful to alignment researchers".

This prevents the snake from eating its own tail (and thereby killing itself).

Engineer-Researcher Collaboration

My main proposal for making modern LLMs comparatively more useful for AI safety research is to pair AI safety researchers with generative AI engineers. The engineers would try to create tools useful for accelerating safety research, while the safety researchers would provide testing and feedback. This setup also provides some distance between the LLM engineering and the safety work, to avoid the eating-its-own-tail problem. The safety researchers are bringing their own approach to safety work, so that "automating safety research" does not become the whole safety approach.

This could range from single safety researchers working with single engineers, to an org with a team of safety researchers working with a team of engineers, all the way to a whole safety-research org working with a whole engineering org.

Although my intuition here is that it is important for the safety researcher to have their own safety work which they are trying to accelerate using LLMs, it is plausible that most of the impact comes from building tools which are able to help a larger number of safety researchers; for example, the 'end product' might be an LLM which has been trained to be a helpful assistant for a broad variety of safety researchers. I therefore imagine this LLM serving as something like a wiki for the AI safety community: like a more sophisticated version of Stampy,^[5] where research-oriented conversation styles are also curated, rather than only question-answer pairs.

Aside: Hallucinations

I want to mention my personal model of "AI hallucination". Here's a pretty standard example: when I ask Claude or GPT4 for references to papers on a very niche topic, the references it comes up with are usually made up. However, they are generally plausible -- I often can't tell whether they are made up or not before searching for those references myself.

I think there's a common mindset which sees these hallucinations as some kind of malfunction. And they are, if the purpose of LLMs is seen as delivering truthful information to the user. But if we think of the LLM as a really good prior distribution over what humans might say, then it starts to look less like a malfunction and more like fairly good performance: the details filled in are quite plausible, even if incorrect.

If the prior lacks specific information we want it to have, the thing to do is update on that information. OpenAI and Anthropic provide interfaces where you can give a thumbs-up or thumbs-down; notably, this is a common social-media interface. But this feedback is not nearly rich enough. And it takes an autonomy-centric, reinforcement-learning-like attitude (the AI is learning to please users) rather than putting the users in the pilot seat.

In order to get models to be useful for the sorts of tasks I try to use them for, it seems to me like what's needed is a way to give specific feedback on specific outputs (such as my own list of references for the topic I queried about) and to update the text-generating distribution in response to this feedback, so that it will be remembered later. (This can be done to varying degrees, and with varying trade-offs, EG using prompt-engineering solutions vs fine-tuning solutions.)

This way, knowledge flows in both directions, and users can build up a shared context with LLMs (including both object-level knowledge, like specific citations, and process-level knowledge, like how verbose/succinct to be).
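To make the prompt-engineering end of that spectrum concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the names `Correction` and `FeedbackMemory` are made up, the word-overlap relevance test is deliberately naive, and no real LLM API is called. It only shows the shape of the idea: store specific feedback on specific outputs, then replay it as context for later queries.

```python
from dataclasses import dataclass, field


@dataclass
class Correction:
    """One piece of user feedback tied to a specific model output."""
    topic: str         # what the user asked about
    model_output: str  # the specific text being corrected
    feedback: str      # the user's correction, e.g. their own reference list


@dataclass
class FeedbackMemory:
    """Hypothetical store that replays corrections as prompt context."""
    corrections: list = field(default_factory=list)

    def add(self, topic: str, model_output: str, feedback: str) -> None:
        self.corrections.append(Correction(topic, model_output, feedback))

    def relevant(self, query: str) -> list:
        # Naive relevance test: the stored topic shares a word with the query.
        # A real system would more likely use search or embeddings.
        query_words = set(query.lower().split())
        return [c for c in self.corrections
                if query_words & set(c.topic.lower().split())]

    def build_prompt(self, query: str) -> str:
        notes = [f"- Earlier output: {c.model_output!r}. "
                 f"User correction: {c.feedback}"
                 for c in self.relevant(query)]
        if not notes:
            return query
        return "Remember these corrections:\n" + "\n".join(notes) + "\n\n" + query


memory = FeedbackMemory()
memory.add("niche topic", "a plausible but made-up citation",
           "that paper does not exist; use my own reading list instead")
prompt = memory.build_prompt("references on this niche topic")
```

A fine-tuning solution would instead feed the stored corrections into training; the trade-off is persistence and generalization against cost and the risk of unintended side effects.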

Updating (on relatively small amounts of text)

A main weakness of deep learning has been how data-hungry it tends to be. For example, deep-learning systems can play Atari games at the human level, but achieving that level of competence requires many more hours of play for deep learning than humans need. Similar remarks apply for tasks ranging from chess to image recognition.

LLMs require lots and lots of data for the generative pre-training, but once you've done that, you've got a "really good prior" -- my impression is, relatively small amounts of data can be used to fine-tune the model. (Unfortunately, I was not able to quickly find recommended sizes for fine-tuning datasets, so take this with a grain of salt.)

For LLMs above approximately 40 billion parameters,^[6] these updates can be quite good, in the sense that new knowledge seems to integrate itself across a broad variety of conversational contexts which were not explicitly trained.

My favorite example of this: Claude was trained using a technique called Constitutional AI. I've had some extensive conversations with Claude about AI alignment problems. In my experience, whenever AI alignment is involved, Claude tries to shoe-horn Constitutional AI into the conversation as the solution to whatever problem we're talking about. The arguments for the relevance of Constitutional AI might be incoherent,^[7] but Claude's understanding that Constitutional AI is an alignment idea is coherent, as is Claude's enthusiasm for that particular technique.^[8]

This was not the intention of Claude's training. Anthropic simply wanted Claude to know a reasonable amount about itself, so that it could say things like "I'm Claude, an AI designed by Anthropic" and explain some basic facts about how it was made.^[6]

More generally, I have found Claude to be enthusiastic/defensive about the more empirical type of safety work which takes place at Anthropic. I'm unable to find the chat in question now, but there was one conversation where it passionately advocated for understanding what a neural network was doing "weight by weight" in contrast to more theoretical approaches.

So, as you can see, consequences of updates might be unintended and undesirable, but they are clearly smart in a significant sense. Concepts are being combined in meaningful ways. This is not "just autocomplete".

Such smart updates are a double-edged sword. For "the wiki model" of LLMs to work well, it would be helpful to develop tools to search for (possibly unintended & undesirable) consequences of updates.
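As a rough illustration of what such a search tool might do at its simplest, here is a sketch that runs the same probe prompts through a model before and after an update and flags where a given term newly appears. The `Model` interface, the function names, and the toy models are all assumptions made for this example; exact string comparison is also a crude stand-in, since real LLM outputs are stochastic and would need sampling and a fuzzier notion of "changed".

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def changed_responses(before: Model, after: Model, probes: list) -> list:
    """Return (probe, old_answer, new_answer) for each probe whose answer changed."""
    changes = []
    for probe in probes:
        old, new = before(probe), after(probe)
        if old != new:
            changes.append((probe, old, new))
    return changes


def newly_mentions(term: str, changes: list) -> list:
    """Probes where the update introduced `term` into the answer."""
    term = term.lower()
    return [probe for probe, old, new in changes
            if term in new.lower() and term not in old.lower()]


# Toy stand-ins for a model before and after an update.
def base_model(prompt: str) -> str:
    return "Here is a generic answer."


def tuned_model(prompt: str) -> str:
    if "alignment" in prompt.lower():
        return "Constitutional AI solves this."
    return "Here is a generic answer."


probes = ["Who made you?",
          "How should we approach alignment?",
          "Suggest a pasta recipe."]
changes = changed_responses(base_model, tuned_model, probes)
flagged = newly_mentions("Constitutional AI", changes)
```

The hard part, which this sketch does not address, is choosing probes broad enough to catch generalizations nobody thought to look for.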

Note that fine-tuning smaller LLMs, around 8 billion parameters, is feasible for individuals and small groups with modest amounts of money; but fine-tuning models larger than 40 billion parameters, where we see the phenomenon of really smart generalizations from fine-tuning examples, is still out of reach afaik.

Feedback Tools

So: I imagine that for modern LLMs to be very useful for experts in the field of AI safety, some experts will need to spend a lot of time giving LLMs specific feedback. This feedback would include specific information (refining the LLM's knowledge) as well as training on useful interaction styles for research.

In order to facilitate such feedback, I think it would be important to develop tools which help rapidly indicate specific problems with text (in contrast to a mere thumbs-up or thumbs-down), and see a preview of how the LLM would adapt based on this feedback, so that the feedback can be tweaked to achieve the desired result.

To give a simple idea for what this could look like: a user might highlight a part of an AI-generated response that they would like to give feedback on. A pop-up feedback box appears, listing some AI-generated potential corrections for the user to select, and also allowing the user to type their own correction. Once a correction has been selected/written, the AI generates some potential amendments to its constitution which would detect this problem and correct it in the future; again the user can look at these and select one or write their own proposed amendment. Finally, the system then generates some examples of the impact the proposed amendment would have (probing for unintended and undesirable consequences). The user can revise the amendment until it has the desired effect, at which point they would finalize it.
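The workflow in the previous paragraph can be sketched as a small state machine. This is only one possible shape under stated assumptions: `FeedbackSession`, its method names, and the three callables it takes are hypothetical, and the lambdas in the usage lines stand in for calls to an actual LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interfaces; none of these come from a real library.
Suggest = Callable[[str], list]        # text -> candidate rewrites
Preview = Callable[[list, str], str]   # (constitution, prompt) -> response


@dataclass
class FeedbackSession:
    constitution: list
    suggest_corrections: Suggest
    suggest_amendments: Suggest
    preview: Preview
    correction: Optional[str] = None
    amendment: Optional[str] = None

    def highlight(self, span: str) -> list:
        """Step 1: user highlights text; the system offers candidate corrections."""
        return self.suggest_corrections(span)

    def choose_correction(self, correction: str) -> list:
        """Step 2: user picks or writes a correction; the system proposes amendments."""
        self.correction = correction
        return self.suggest_amendments(correction)

    def propose(self, amendment: str, probes: list) -> dict:
        """Step 3: preview responses under the draft amendment, without saving it."""
        self.amendment = amendment
        draft = self.constitution + [amendment]
        return {p: self.preview(draft, p) for p in probes}

    def finalize(self) -> list:
        """Step 4: commit the amendment once the previews look right."""
        if self.amendment is None:
            raise ValueError("no amendment has been proposed")
        self.constitution.append(self.amendment)
        return self.constitution


session = FeedbackSession(
    constitution=["Be concise."],
    suggest_corrections=lambda span: [f"Remove: {span}"],
    suggest_amendments=lambda corr: [f"Rule derived from: {corr}"],
    preview=lambda rules, prompt: f"[{len(rules)} rules] reply to {prompt!r}",
)
options = session.highlight("made-up citation")
amendments = session.choose_correction(options[0])
previews = session.propose(amendments[0], ["List papers on topic X"])
session.finalize()
```

The user can call `propose` repeatedly with revised amendments; nothing touches the stored constitution until `finalize`.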

I have heard the term "reconstitutional AI" used to point in this general direction.

^[1]
My conversations with Claude3 so far do seem somewhat better. However, I suppose that its ability to program has similarly improved.
^[2]
Modern LLMs are more useful to beginners than experts.^[3] A highly experienced programmer can already easily write the kind of code that LLMs can help with, and with fewer errors. A beginner, however, has much more to gain from LLM assistance. Similarly, then, modern LLMs are probably a lot more helpful to people who are starting to get into AI safety research. It could be that what I'm observing is, really, that I'm a worse programmer than I am a safety researcher.
^[3]
Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. Generative AI at work. No. w31161. National Bureau of Economic Research, 2023.
^[4]
Some links about this, compiled by TJ:
https://thingofthings.substack.com/p/on-capabilitarianism
https://plato.stanford.edu/entries/capability-approach/
https://philpapers.org/rec/SENCAC
https://arxiv.org/abs/2308.00868
https://forum.effectivealtruism.org/posts/zy6jGPeFKHaoxKEfT/the-capability-approach-to-human-welfare
https://www.princeton.edu/~ppettit/papers/Capability_EconomicsandPhilosophy_2001.pdf
^[5]
Stampy is a Discord bot which facilitates curated Q&A about AI safety.
^[6]
According to private correspondence with a reliable-seeming source.
^[7]
Although, no more incoherent than I might expect of some human person who is very enthusiastic about constitutional AI.
^[8]
If Claude was trained to explain Constitutional AI factually, but not trained to be actively enthusiastic and push Constitutional AI via motivated arguments... is this an example of defensive reasoning? Did Claude generalize from the observation that people are generally defensive of their own background, arguing for the superiority of their profession or home country? Would Claude more generally try to bend arguments in its own favor, in some sense? Or is this a more benign generalization, perhaps from the idea that a character who explains concept X in depth is probably enthusiastic about concept X?
