# All of Evan R. Murphy's Comments + Replies

Interpretability

Several of the Circuits articles provide colab notebooks reproducing the results in the article, which may be helpful references if one wants to do Circuits research on vision models.

I'm starting to reproduce some results from the Circuits thread. It took me longer than expected just to find these colab notebooks so I wanted to share more specifically in case it saves anyone else some time.

A targeted Google search for the text "colab" on Distill and the Circuits thread doesn't really turn anything up. Also, if you open a post like Visualizing Weights and ... (read more)

Beware using words off the probability distribution that generated them.

Nice post, so many hidden assumptions behind the words we use.

I wonder what some concrete examples of this in alignment discussions would be, examples like yours about the probability that god exists.

One that comes to mind is a recent comment thread on one of the Late 2021 MIRI Conversations posts where we were assigning probabilities to "soft takeoff" and "hard takeoff" scenarios. Then Daniel Kokotajlo realized that "soft takeoff" had to be disambiguated because in that context some people were using it to mean any kind of gradual advancement in AI capabili... (read more)

Solving Interpretability Week

I'm interested in trying a co-work call sometime but won't have time for it this week.

Thanks for sharing about Shay in this post. I had not heard of her before. What a valuable resource she is, and what a valuable way to help the cause of AI safety.

(As for contact, I check my LessWrong/Alignment Forum inbox for messages regularly.)

More Christiano, Cotra, and Yudkowsky on AI progress

So here y'all have given your sense of the likelihoods as follows:

• Paul: 70% soft takeoff, 30% hard takeoff
• Daniel: 30% soft takeoff, 70% hard takeoff

How would Eliezer's position be stated in these terms? Similar to Daniel's?

[AN #61] AI policy and governance, from two people in the field

This work on learning with constraints seems interesting.

Looks like the paper "Bridging Hamilton-Jacobi Safety Analysis and Reinforcement Learning" has moved, so that link is currently broken. Here's a working URL: https://ieeexplore.ieee.org/document/8794107

Also, here's one more where the full paper is more easily accessible: http://files.davidqiu.com/research/papers/2019_fisac_Bridging%20Hamilton-Jacobi%20Safety%20Analysis%20and%20Reinforcement%20Learning%20[RL][Constraints].pdf

Interpreting Yudkowsky on Deep vs Shallow Knowledge

Great investigation/clarification of this recurring idea from the ongoing Late 2021 MIRI Conversations.

• "deep knowledge is far better at saying what won’t work than at precisely predicting the correct hypothesis." - very useful takeaway

You might not like his tone in the recent discussions, but if someone has been saying the same thing for 13 years, nobody seems to get it, and t

Ngo and Yudkowsky on alignment difficulty

Richard, summarized by Richard: "Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals. [...]"

Eliezer, summarized by Richard: "The former AI might be slightly safer than the latter if you could build it, but I think people are likely to dramatically overestimate how big the effect is. The difference could

Research Agenda v0.9: Synthesising a human's preferences into a utility function

So the subtle manipulation is to compensate for those rebellious impulses making U_H unstable?

Why not just let the human have those moments and alter their U_H if that's what they think they want? Over time, they may then learn that being capricious with their AI doesn't ultimately serve them very well. But if they find out the AI is trying to manipulate them, that could make them want to rebel even more and trust the AI less.

Research Agenda v0.9: Synthesising a human's preferences into a utility function

This is an impressive piece of work and I'm excited about your agenda.

And maybe, in that situation, if we are confident that U_H is pretty safe, we'd want the AI to subtly manipulate the human's preferences towards it.

Can you elaborate on this? Why would we want to manipulate the human's preferences?

Stuart Armstrong (2mo): Because our preferences are inconsistent, and if an AI says "your true preferences are U_H", we're likely to react by saying "no! No machine will tell me what my preferences are. My true preferences are U′_H, which are different in subtle ways".

Some Comments on Stuart Armstrong's "Research Agenda v0.9"

I found this to be interesting/valuable commentary after just reading through Stuart's agenda.

I think speculating about the human-sized logical relationships between speculative parts inside the AI is easier but less useful than speculating about the algorithm that will connect your inputs to your outputs with a big model and lots of computing power, which may or may not have your logical steps as emergent features.

With this more compute/less abstraction approach you're suggesting, do you mean that it may produce a model of the preferences that's inscrutab... (read more)

Biased reward-learning in CIRL

My informal critique of CIRL is that it assumes two untrue facts: that H knows θ (i.e. knows their own values) and that H is perfectly rational (or noisily rational in a specific way).

This seems like a valid critique. But do you see it as a deal breaker? In my mind, these are pretty minor failings of CIRL. Because if a person is being irrational and/or can't figure out what they want, then how can we expect the AI to? (Or is there some alternative scheme which handles these cases better than CIRL?)

(Update: Stuart replied to this comment on LessWrong: https:... (read more)

Preface to the sequence on iterated amplification

Is Iterated Amplification still a current alignment paradigm that's being pursued?

I found this sequence through the FAQ under "How do I get started in AI Alignment research?". I've really enjoyed reading the first few articles, but then I noticed a lot of the articles are from 2018. I found this March 2021 article, also by Paul Christiano, which makes it sound like he found some issues with Iterated Amplification and moved on to a different paradigm called Imitative Generalization.

I think iterated amplification (IDA) is a plausible algorithm to use for training superhuman ML systems. This algorithm is still not really fleshed out, there are various instantiations that are unsatisfactory in one way or another, which is why this post describes it as a research direction rather than an algorithm.

I think there are capability limits on models trained with IDA, which I tried to describe in more detail in the post Inaccessible Information. There are also limits to the size of implicit tree that you can really use, basically mirroring... (read more)

Oliver Habryka (3mo): This is a very good point. IIRC Paul is working on some new blog posts that summarize his more up-to-date approach, though I don't know when they'll be done. I will ask Paul when I next run into him about what he thinks might be the best way to update the sequence.