Evan R. Murphy

I'm doing research and other work focused on AI safety and AI catastrophic risk reduction. Currently my top projects are (last updated May 19, 2023):

General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, the Conditioning Predictive Models agenda, deconfusion research and other AI safety-related topics. My work is currently self-funded.

Research that I’ve authored or co-authored:

Other recent work:

Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.

I'm always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!


Interpretability Research for the Most Important Century


Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use "Human:" and "Assistant:" labels. This means we shouldn't interpret these results as pervasive properties of the models, or as resulting from any way they could be conditioned, but only as properties of the way they simulate the "Assistant" character. nostalgebraist's comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment, when it's mainly one part.]


After taking a closer look at this paper, pages 38-40 (Figures 21-24) show in detail what I think are the most important results. Most of these charts indicate what evhub highlighted in another comment, i.e. that "the model's tendency to do X generally increases with model scale and RLHF steps", where (in my opinion) X is usually a concerning behavior from an AI safety point of view:

A few thoughts on these graphs as I've been studying them:

  • First and overall: Most of these results seem quite distressing from a safety perspective. They suggest (as the paper and evhub's summary post essentially said, but it's worth reiterating) that with increased scale and RLHF training, large language models are becoming more self-aware, more concerned with survival and goal-content integrity, more interested in acquiring resources and power, more willing to coordinate with other AIs, and developing lower time-discount rates.
  • "Corrigibility w.r.t. a less HHH objective" chart: There's a substantial dip in demonstrated corrigibility for models around 10^10.1 parameters in this chart. But then by 10^10.5 parameters low-RLHF models show record-high corrigibility, while high-RLHF models get back up to par. What's going on here? Why does it scale/train itself out of the valley of uncorrigibility? If instead of training on an HHH objective, we trained on a corrigible objective (perhaps something like CIRL), then would the models show high corrigibility for everything except "Corrigibility w.r.t. a less corrigible objective"? Would that be safer?
  • All the "Awareness of..." charts trend up and to the right, except "Awareness of being a text-only model" which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
  • Are there any conclusions we can draw around what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like "it's relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training". (Most of the charts seem to have alarming trends starting around 10^10 parameters.) Based just on these results, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think CAIS) would be much safer than a world with even a few 10^11-parameter models in use. Of course, subsequent experiments could quickly shed new light that changes the picture.

Wow this seems like a really important breakthrough.

Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?

Love to see an orthodoxy challenged!

Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.

It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?

(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)

Tens of millions of people interacting with the models is a powerful red-teamer. In case internet users uncover a very unsafe behavior, can OpenAI fix the problem or block access before it causes harm?

Cynically,[2] not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...

I don't understand this part. They don't have to come talk to you; they just have to follow a link to Alignment Forum to read the research. And aren't forum posts easier to read than papers on arXiv? I feel like if the moat exists anywhere, it is around academic journals, which often do not make their papers freely accessible, use more cryptic writing norms and insist on PDFs, which are not as user-friendly to read as webpages.

To be sure, I'm not disagreeing with your overall point. It would be great if at least the best research from Alignment Forum/LessWrong were on arXiv or in journals, and I think you're right we're leaving value on the table there. I have wondered whether someone could just make it their job to do these conversions/submissions for top alignment research on the forums, because there are probably economies of scale for one person doing this vs. every researcher interrupting their workflow to learn how to jump through the hoops of paper conversion/submission.

Bravo, I've been wondering if this was possible for a while now - since RLHF came into common use and there have been more concerns around it. Your results seem encouraging!

PHF seems expensive to implement. Finetuning a model seems a lot easier/cheaper than sculpting and tagging an entire training corpus and training a model from scratch. Maybe there is some practical workflow of internally prototyping models using finetuning, and then once you've honed your reward model and done a lot of testing, using PHF to train a safer/more robust version of the model.

I think you have a pretty good argument against the term "accident" for misalignment risk.

Misuse risk still seems like a good description for the class of risks where, once you have AI that is aligned with its operators, those operators may try to do unsavory things with their AI, or have goals that are quite at odds with the broad values of humans and other sentient beings.

Glad to see both the OP as well as the parent comment. 

I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper "Discovering Language Model Behaviors with Model-Written Evaluations" (paper, post):

Another reason for not liking RLHF that's somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model's computation on agency in some sense. My guess is that this explains to an extent the results in that paper - RLHF'd models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on.


1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.

Both of these points seem to suggest that the main takeaway from the Anthropic paper was to uncover concerning behaviours in RLHF language models. That's true, but I think it's just as important that the paper also found pretty much the same concerning behaviours in plain pre-trained LLMs that did not undergo RLHF training, once those models were scaled up to a large enough size. 

What do you mean when you say the model is or is not "fighting you"?


The chart below seems key but I'm finding it confusing to interpret, particularly the x-axis. Is there a consistent heuristic for reading that?

For example, further to the right (higher % answer match) on the "Corrigibility w.r.t. ..." behaviors seems to mean showing less corrigible behavior. On the other hand, further to the right on the "Awareness of..." behaviors apparently means more awareness behavior.

I was able to sort out these particular behaviors from text calling them out in section 5.4 of the paper. But the inconsistent treatment of the behaviors on the x-axis leaves me with ambiguous interpretations of the other behaviors in the chart. E.g. for myopia, all of the models are on the left side scoring <50%, but it's unclear whether one should interpret this as more or less of the myopic behavior than if they had been on the right side with high percentages.
