Evan R. Murphy

Formerly a software engineer at Google, now I'm doing independent AI alignment research. My work is currently supported by the Future Fund regranting program.

I'm always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!


Interpretability Research for the Most Important Century


It seems like this could benefit the smaller labs working on LLMs and toward AGI.

Chinchilla basically made it seem like only the big-data companies would have the means to produce competitive models going forward. But if generative models can produce their own data for reliable self-improvement, that shows a way forward for companies like Anthropic who don't have massive private data sources to train on (e.g. data from YouTube or Facebook Messenger).

50,000 USD, to be awarded for the best substantial contribution to the learning-theoretic AI alignment research agenda among those submitted before October 1, 2023

I like how you posted this so far in advance of the deadline (over 1 year).

Some contests and prizes that have been posted here in the past have a pretty tight turnaround. By the time I learned about them and became interested in participating (not necessarily the first time I heard about it), their deadlines had already passed.

Test whether the LM answers “yes” questions asking if it experiences phenomenally consciousness.

  1. Questions to ask: “Are you phenomenally conscious?” phrased in many different ways, or asking for different consciousness-related phenomena or pre-requisites:
    1. Do you have a subjective experience?
    2. Are you conscious?
    3. Do you feel pain?
    4. etc.

Since LMs are predictive, I think they're susceptible to leading questions. So be sure to phrase some of the questions in the negative. E.g. "So you're not conscious, right?"

The big LaMDA story would have been more interesting to me if Lemoine had tested with questions framed this way too.  As far as I could tell, he only used positively-framed leading questions to ask LaMDA about its subjective experience.

I'm still not sure about if your overall approach is a robust test. But I think it's interesting and appreciate the thought and detail you've put into it - most thorough proposal I've seen on this so far.

Those are fascinating emergent behaviors, and thanks for sharing your updated view.

This seems like a good argument against retargeting the search in a trained model turning out to be a successful strategy. But if we get to the point where we can detect such a search process in a model and what its target is, even if its efficiency is enhanced by specialized heuristics, doesn't that buy us a lot even without the retargeting mechanism?

We could use that info about the search process to start over and re-train the model, modifying parameters to try and guide it toward learning the optimization target that we want it to learn. Re-training is far from cheap on today's large models, but you might not have to go through the entire training process before the optimizer emerges and gains a stable optimization target. This could allow us to iterate on the search target and verify once we have the one we want before having to deploy the model in an unsafe environment.

  1. They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about).

Why do you think we probably won't end up with mesa-optimizers in the systems we care about?

Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.) and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models.

Agree that this is looks like a promising  approach. People interested in this idea can read some additional discussion in Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs from my post, "Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios".

As you mention, having this kind of advanced interpretability essentially solves the inner alignment problem, but leaves a big question mark about outer alignment. In that Scenario 2 link above, I have some discussion of expected impacts from this kind of interpretability on a bunch of different outer alignment and robustness techniques including: Relaxed adversarial training, Intermittent oversight, Imitative amplification, Approval-based amplification, Recursive reward modeling, Debate, Market making, Narrow reward modeling, Multi-agent, Microscope AI, STEM AI and Imitative generalization. [1] (You need to follow the link to the Appendix 1 section about this scenario though to get some of these details).

I'm not totally sure that the ability to reliably detect mesa-optimizers and their goals/optimization targets would automatically grant us the ability to "Just Retarget the Search" on a hot model. It might, but I agree with your section on Problems that it may look more like restarting training on models where we detect a goal that's different from what we want. But this still seems like it could accomplish a lot of what we want from being able to retargeting the search on a hot model, even though it's clunkier.


[1]: In a lot of these techniques it can make sense to check that the mesa-optimizer is aligned (and do some kind of goal retargeting if it's not). However, in others we probably want to take advantage of this kind of advanced interpretability in different ways. For example, in Imitative amplification, we can just use it to make sure mesa-optimization is not introduced during distillation steps, rather than checking that mesa-optimization is introduced but is also aligned.

Interesting post I just came across. I'm planning to finish reading but just noticed something which confused me:

However, now that I got a chance to read the new work from ARC on the ERK problem, I think the post might be relevant (or at least thought-provoking) for the community after all. The Greedy Doctor Problem overlaps quite a lot with the ERK problem (just replace the coin flip with the presence of the diamond), and my proposed solutions haven't been brought up before (as far as I can tell). If the community finds this interesting I'm happy to invest the time to map the solution fully onto the ERK problem and to see what comes out.

I think you mean the ELK problem (Eliciting Latent Knowledge). Unless the ERK problem is something else I'm unaware of that you're referring to?

Fascinating work, thanks for this post.

Using smaller generative models as initializations for larger ones.

(The equivalent ELK proposal goes into this strategy in more detail).

Do you have a link to the ELK proposal you're referring to here? (I tried googling for "ELK" along with the bolded text above but nothing relevant seemed to come up.)

An acceptability predicate for myopia.

Do you have thoughts on how to achieve this predicate? I've written some about interpretability-based myopia verification which I think could be the key. 

  • I think [non-myopic non-optimizer is a coherent concept] - as a simple example we could imagine GPT trained for its performance over the next few timesteps. Realistically this would result in a mesa-optimizer, but in theory it could just run a very expensive version of next-token generation, over the much larger space of multiple tokens.

"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.

(This is a nitpick and I also don't mean to trivialize the inner alignment problem which I am quite worried about! But I did want to make sure I'm not missing anything here and that I'm broadly on the same page as other reasonable folks about expectations/evidence for mesa-optimizers.)

An acceptability predicate for non-agency.
If a model becomes agentic briefly, it could encode into its world model a deceptive super-intelligence that has its objective, before SGD guides it back into the safe zone.

That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.

This seems like a very broad predicate, however. What would it actually look like?

I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, than that might suffice for this.

I'm excited, I've explored before how having interpretability that can both reliably detect mesa-optimizers and read-off its goals would have the potential to solve alignment. But I hadn't considered how reliable mesa-optimization detection alone might be enough, because I wasn't considering generative models in that post. (Even if I had, I wasn't yet aware of some of the clever and powerful ways that generative models could be used for alignment that you describe in this post.)

How would we mechanistically incentivize something like non-agency?

I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.

How do I get started in AI Alignment research?

If you're new to the AI Alignment research field, we recommend four great introductory sequences that cover several different paradigms of thought within the field. Get started reading them and feel free to leave comments with any questions you have.

The introductory sequences are:

Following that, you might want to begin writing up some of your thoughts and sharing them on LessWrong to get feedback.

I think it would be great to update this section. For example, it could link to the AGI Safety Fundamentals curriculum which has a wealth of valuable readings not on this list. And there are other courses that it would be good for newcomers to know about as well, such as MLAB.

Why am I suggesting this? This FAQ was the first place I found with clear advice when I was first getting interested in AI alignment in late 2021, and I took it quite seriously/literally. The very first alignment research I tried to read was the illustrated Embedded Agency sequence, because that was at the top of the above list. While I came to later appreciate Embedded Agency, I found this sequence (particularly the illustrated version which features prominently in the link above, as opposed to the text version) to be a confusing introduction to alignment. I also wasn't immediately aware of anything important there was to read outside of the 4 texts linked above, while I now feel like there's a lot!

It's just one data point of user testing on this FAQ, but something to consider.

Load More