Evan R. Murphy

Formerly a software engineer at Google, I'm now doing independent AI alignment research.

I'm always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!


Interpretability Research for the Most Important Century


Open Problems in AI X-Risk [PAIS #5]

Thank you for this sequence, which has a very interesting perspective and lots of useful info.

Just a quick note on the following section from your overview of "Honest AI" in this post:

What Researchers Are Doing Now

They are demonstrating that models can lie, and they are capturing true and false clusters inside models (this paper is forthcoming).

I was surprised not to see any mention of Eliciting Latent Knowledge (ELK) here. I guess part of it is about "demonstrating that models can lie", but there is also all the solution-seeking happening by ARC and those who submitted proposals for the ELK prize.

You're covering a lot of problem areas in this post, so I don't expect it to be comprehensive about every single one. Just curious if there's any particular reason you chose not to include ELK here.

AGI Ruin: A List of Lethalities

23.  Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee.  We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down).  Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.


There is one approach to corrigibility that I don't see mentioned in the "tried and failed" post Eliezer linked to here. It's also one that someone at MIRI (Evan Hubinger) among others is still working on: myopia (i.e. myopic cognition).

There are different formulations, but the basic idea is that an AI with myopic cognition would have an extremely high time preference. This means that it would never sacrifice reward now for reward later, and so it would essentially be exempt from instrumental convergence. In theory such an AI would allow itself to be shut down (without forcing shutdown), and it would also not be prone to deceptive alignment.
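A minimal sketch of the "never sacrifice reward now for reward later" idea, assuming myopia is modeled as a discount factor of zero in a standard discounted-return setup. This is only one possible formulation, and the plan names and reward numbers are hypothetical, made up purely for illustration:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t, where t is the time step."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


def preferred_plan(plans, gamma):
    """Return the name of the plan with the highest discounted return."""
    return max(plans, key=lambda name: discounted_return(plans[name], gamma))


# Hypothetical reward streams (one reward per time step), invented for illustration.
plans = {
    "take_small_reward_now": [1.0, 0.0, 0.0],
    "wait_for_large_reward": [0.0, 0.0, 10.0],
}

# gamma = 0.0 is the fully myopic assumption: only the immediate reward counts.
print(preferred_plan(plans, gamma=0.0))
# gamma = 0.99 is a far-sighted agent that is willing to wait for a larger payoff.
print(preferred_plan(plans, gamma=0.99))
```

With a discount factor of zero, any plan that gives up immediate reward for a later payoff scores no better than doing nothing, which is roughly the intuition for why such an agent would have no incentive to resist shutdown in order to protect future reward. Real proposals for myopic cognition are subtler than a single discount parameter.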

Myopia isn't fully understood yet and has a number of open problems. It will also likely require verification using advanced interpretability tools that haven't been developed yet. Still, I think it's a research direction we as a field should invest in to find out whether it can work, and the corrigibility question shouldn't be considered closed until we've at least done that. I can't see anything unnatural about an agent that has both consequentialist reasoning capabilities and a high time preference.

(Note: I'm not suggesting that we should bet the farm on myopic cognition solving alignment, and I'm not suggesting that my critique of Eliezer's point on corrigibility in this comment undermines the overall idea of his post that we're in a very scary situation with AI x-risk. I agree with that and support spreading the word about it as he's doing here, as well as working directly with leading AI labs to try and avoid catastrophe. I also support a number of other technical research directions including interpretability, and I'm open to whatever other strategic, technical and out-of-the-box proposals people have that they think could help.)

AGI Ruin: A List of Lethalities

I agree with many of the points in this post.

Here's one that I do believe is mistaken in a hopeful direction:

6.  We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.  While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that.  It's not enough to be able to align a weak system - we need to align a system that can do some single very large thing. The example I usually give is "burn all GPUs". [..]

It could actually be enough to align a weak system. This is the case where the system is "weak" in the sense that it can't perform a pivotal act on its own, but it's powerful enough that it can significantly accelerate development toward a stronger aligned AI/AGI with pivotal act potential.

This case is important because it helps to break down and simplify the problem. Thinking about how to build an extremely powerful aligned AI which can do a pivotal act is more daunting than thinking about how to build a weaker-but-still-impressive aligned AI that is useful for building more powerful aligned AIs.

Modeling the impact of safety agendas

I'm working on an in-depth analysis of interpretability research, which is largely about its impacts as a safety research agenda. I think it would be a useful companion to your "Transparency" section in this post. I'm writing it up in this sequence of posts: Interpretability Research for the Most Important Century. (I'm glad I found your post and its "Transparency" section too, because now I can refer to it as I continue writing the sequence.)

The sequence isn't finished yet, but a couple of the posts are done already. In particular the second post Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios contains a substantial part of the analysis. The "Closing thoughts" section of that post gets at most of the cruxes for interpretability research as I see them so far, excerpted here:

In this post, we investigated whether interpretability has property #1 of High-leverage Alignment Research[1]. We discussed the four most important parts of AI alignment, and which seem to be the hardest. Then we explored interpretability's relevance to these areas by analyzing seven specific scenarios focused on major interpretability breakthroughs that could have great impacts on the four alignment components. We also looked at interpretability's potential relevance to deconfusion research and yet-unknown scenarios for solving alignment.

It seems clear that there are many ways interpretability will be valuable or even essential for AI alignment.[26] It is likely to be the best resource available for addressing inner alignment issues across a wide range of alignment techniques and proposals, some of which look quite promising from an outer alignment and performance competitiveness perspective.

However, it doesn't look like it will be easy to realize the potential of interpretability research. The most promising scenarios analyzed above tend to rely on near-perfection of interpretability techniques that we have barely begun to develop. Interpretability also faces serious potential obstacles from things like distributed representations (e.g. polysemanticity), the likely-alien ontologies of advanced AIs, and the possibility that those AIs will attempt to obfuscate their own cognition. Moreover, interpretability doesn't offer many great solutions for suboptimality alignment and training competitiveness, at least not that I could find yet.

Still, interpretability research may be one of the activities that most strongly exhibits property #1 of High-leverage Alignment Research[1]. This will become clearer if we can resolve some of the Further investigation questions above, such as developing more concrete paths to achieving the scenarios in this post and estimating probabilities that we could achieve them. It would also help if, rather than considering interpretability just on its own terms, we could do a side-by-side comparison of interpretability with other research directions, as the Alignment Research Activities Question[5] suggests.

(Pasting in the most important/relevant footnotes referenced above:)

[1]: High-leverage Alignment Research is my term for what Karnofsky (2022)[6] defines as:

“Activity that is [1] likely to be relevant for the hardest and most important parts of the problem, while also being [2] the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)”

See The Alignment Research Activities Question section in the first post of this sequence for further details.


[5]: The Alignment Research Activities Question is my term for a question posed by Karnofsky (2022)[6]. The short version is: “What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?”

For all relevant details on that question, see The Alignment Research Activities Question section in the first post of this sequence.

If any of this is confusing, please let me know. It should also help to reference details in the post itself to clarify. Additionally, there are some useful sections in that post for thinking about the high-level impact of interpretability that aren't fully expressed in the "Closing thoughts" above, for example the positive list of Reasons to think interpretability will go well with enough funding and talent and the negative list of Reasons to think interpretability won’t go far enough even with lots of funding and talent.

An observation about Hubinger et al.'s framework for learned optimization

Nice post.

Therefore, either we can try to revise the framework slightly, essentially omitting the notions of robust alignment and 'internalization of the base objective' and focussing more on revised versions of 'proxy alignment' and 'approximate alignment' as descriptors of what is essentially the best possible situation in terms of alignment.

Have you seen Hubinger's more recent post, More variations on pseudo-alignment? It amends the list of pseudo-alignment types originally listed in "Risks of Learned Optimization" to include a couple more.

Your claim above that the best we could hope for may be a form of proxy alignment or approximate alignment reminds me of the following pseudo-alignment type he introduced in that more recent post. In the description of this type, he also seems to agree with you that robust alignment is very difficult or "unstable" (though perhaps you go further in saying it's impossible):

Corrigible pseudo-alignment. In the paper, we defined corrigible alignment as the situation in which "the base objective is incorporated into the mesa-optimizer's epistemic model and [the mesa-optimizer's] objective is modified to 'point to' that information." We mostly just talked about this as a form of robust alignment—however, as I note in "Towards a mechanistic understanding of corrigibility," this is a very unstable operation, requiring you to get your pointer just right. Thus, I think it's better to talk about corrigible alignment as the class of possible relationships between the base and mesa-objectives defined by the model having some sort of pointer to the base objective, including both corrigible robust alignment (if the pointer is robust) and corrigible pseudo-alignment (if the pointer is to some sort of non-robust proxy). In particular, I think this distinction is fairly important to why deceptive alignment might be more likely than robust alignment, as it points at why robust alignment via corrigibility might be quite difficult (which is a point we made in the paper, but one which I think is made much clearer with this distinction).

Intuitions about solving hard problems

Overall I think this is a good post and very interesting, thanks.

I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).

So I checked out those links. Briefly looking at them, I can see what you mean about the line between RL and imitation learning being blurry. The first paper seems to show a version of RL which is basically imitation learning.

I'm confused because when you said this makes iterated amplification less compelling to you, I took that to mean it made you less optimistic about iterated amplification as a solution for alignment. But why would whether something is technically classified as imitation learning or a special kind of RL make a difference for its effectiveness?

Or did you mean not that you find it any less promising as an alignment proposal, but just that you now find the core insight less compelling/interesting because it's not as major an innovation over the idea of RL as you had thought it was?

Relaxed adversarial training for inner alignment

That's a good question. Perhaps it does make use of optimization, but the model still has an overall passive relationship to the world compared to an active mesa-optimizer AI. I'm thinking about the difference between, say, GPT-3 and the classic paperclip maximizer or other tiling AI.

This is just my medium-confidence understanding and may be different from what Evan Hubinger meant in that quote.

Imitative Generalisation (AKA 'Learning the Prior')

Is imitative generalization usually envisioned as a recursive, many-iteration process like IDA? Or is it just a single iteration of train the initial model -> inspect and correct the priors -> train the new model?

Great post, by the way.
