Evan R. Murphy


Comments

Interpretability

Several of the Circuits articles provide colab notebooks reproducing the results in the article, which may be helpful references if one wants to do Circuits research on vision models.


I'm starting to reproduce some results from the Circuits thread. It took me longer than expected just to find these colab notebooks, so I wanted to share more specific pointers in case it saves anyone else some time.

A targeted Google search for "colab" on Distill and the Circuits thread doesn't really turn anything up. And if you open a post like Visualizing Weights and do a Ctrl+F or Cmd+F search for "colab", you won't find any results either.

But if you open that same post and scroll down, you'll see button links that open the colab notebooks. These will also turn up in a Ctrl+F or Cmd+F search for "notebook", in case you want to jump around to different colab examples in the Circuits thread.
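In case it's useful context before you open one: below is a minimal sketch of the kind of feature visualization those notebooks do, using the lucid library (which I believe the Circuits colabs are built on). It assumes a TensorFlow 1.x environment, and the particular layer/unit name is just an illustrative choice on my part, not one taken from a specific article.

```python
# Minimal sketch of Circuits-style feature visualization with lucid.
# Assumes a TensorFlow 1.x environment; the unit below is just an example.
import lucid.modelzoo.vision_models as models
import lucid.optvis.render as render

model = models.InceptionV1()  # the vision model studied in the Circuits thread
model.load_graphdef()

# Optimize an input image to activate one channel of the mixed4a layer.
images = render.render_vis(model, "mixed4a_pre_relu:476")
```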

Beware using words off the probability distribution that generated them.

Nice post; there are so many hidden assumptions behind the words we use.

I wonder what some concrete examples of this are in alignment discussions, like your example about the probability that God exists.

One that comes to mind is a recent comment thread on one of the Late 2021 MIRI Conversations posts where we were assigning probabilities to "soft takeoff" and "hard takeoff" scenarios. Then Daniel Kokotajlo realized that "soft takeoff" had to be disambiguated, because in that context some people were using it to mean any kind of gradual advancement in AI capabilities, whereas others were using it to mean specifically "GDP doubling in 4 years, then doubling in 1 year".

Solving Interpretability Week

I'm interested in trying a co-work call sometime but won't have time for it this week.

Thanks for sharing about Shay in this post. I hadn't heard of her before; what a valuable resource she is, and what a valuable way of helping the cause of AI safety.

(As for contact, I check my LessWrong/Alignment Forum inbox for messages regularly.)

More Christiano, Cotra, and Yudkowsky on AI progress

So here y'all have given your sense of the likelihoods as follows:

  • Paul: 70% soft takeoff, 30% hard takeoff
  • Daniel: 30% soft takeoff, 70% hard takeoff

How would Eliezer's position be stated in these terms? Similar to Daniel's?

[AN #61] AI policy and governance, from two people in the field

This work on learning with constraints seems interesting.

Looks like the paper "Bridging Hamilton-Jacobi Safety Analysis and Reinforcement Learning" has moved, so that link is currently broken. Here's a working URL: https://ieeexplore.ieee.org/document/8794107 And here's one more where the full paper is more easily accessible: http://files.davidqiu.com/research/papers/2019_fisac_Bridging%20Hamilton-Jacobi%20Safety%20Analysis%20and%20Reinforcement%20Learning%20[RL][Constraints].pdf

Interpreting Yudkowsky on Deep vs Shallow Knowledge

Great investigation/clarification of this recurring idea from the ongoing Late 2021 MIRI Conversations.

  • outside vs. inside view - I've thought about this before but hadn't read as clear a description of the differences and tradeoffs (I'm still catching up on Eliezer's old writings)
  • "deep knowledge is far better at saying what won’t work than at precisely predicting the correct hypothesis." - very useful takeaway

You might not like his tone in the recent discussions, but if someone has been saying the same thing for 13 years, nobody seems to get it, and their model predicts that this will lead to the end of the world, maybe they can get some slack for talking smack.

Good point and we should. Eliezer is a valuable source of ideas and experience around alignment, and it seems like he's contributed immensely to this whole enterprise.

I just hope all his smack talking doesn't turn off or drive away talented people coming to lend a hand on alignment. I expect a lot of people on this (AF) forum found it, like me, after reading all of Open Phil's and 80,000 Hours' convincing writing about the urgency of solving the AI alignment problem. It seems silly to have those orgs working hard to recruit people to help out, only to have them come over here and find one of the leading thinkers in the community going on frequent tirades about how much EAs suck, even though he doesn't know most of us. Not to mention folks like Paul and Richard, who have been taking his heat directly in these marathon discussions!

Ngo and Yudkowsky on alignment difficulty

Richard, summarized by Richard: "Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals. [...]"

Eliezer, summarized by Richard: "The former AI might be slightly safer than the latter if you could build it, but I think people are likely to dramatically overestimate how big the effect is. The difference could just be one line of code: if we give the former AI our current scenario as its input, then it becomes the latter."


How does giving the former "planner" AI the current scenario as input turn it into the latter "acting" AI? It still only outputs a plan, which the operators can then review and decide whether or not to carry out.

Also, the planner AI that Richard put forth had two inputs, not one: 1) a scenario, and 2) a goal. So, to Eliezer (or anyone who is confident they understood this part of the discussion): which goal input are you providing to the planner AI in this situation? Are you saying that the planner AI becomes dangerous when it's provided with the current scenario and any goal as inputs?
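To make my question concrete, here is a toy sketch of the two designs as I understand them. This is purely illustrative pseudocode on my part, not anyone's actual proposal, and every function in it is a hypothetical placeholder.

```python
# Purely illustrative toy sketch of the planner-vs-actor distinction, not
# anyone's actual proposal. All functions here are hypothetical stand-ins.

def consequentialist_search(scenario: str, goal: str) -> str:
    # Stand-in for whatever reasoning produces a plan.
    return f"a plan to achieve {goal!r} given {scenario!r}"

def observe_current_scenario() -> str:
    # Stand-in for the AI's sensors / description of the real world.
    return "the current real-world scenario"

def execute(plan: str) -> None:
    # Stand-in for actuators that actually carry out a plan.
    print(f"executing: {plan}")

def planner_ai(scenario: str, goal: str) -> str:
    """Richard's planner: hypothetical scenario + goal in, plan out for humans to review."""
    return consequentialist_search(scenario, goal)

def acting_ai(goal: str) -> None:
    """My reading of the 'one line of code' claim: feed the planner the current
    scenario. But executing the output still looks like a separate step, and the
    goal input still has to be supplied from somewhere."""
    plan = planner_ai(observe_current_scenario(), goal)
    execute(plan)
```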

Research Agenda v0.9: Synthesising a human's preferences into a utility function

So the subtle manipulation is to compensate for those rebellious impulses making the synthesized utility function unstable?

Why not just let the human have those moments and alter their utility function if that's what they think they want? Over time, they may then learn that being capricious with their AI doesn't ultimately serve them very well. But if they find out the AI is trying to manipulate them, that could make them want to rebel even more and trust the AI less.

Research Agenda v0.9: Synthesising a human's preferences into a utility function

This is an impressive piece of work and I'm excited about your agenda.

And maybe, in that situation, if we are confident that [the synthesized utility function] is pretty safe, we'd want the AI to subtly manipulate the human's preferences towards it.

Can you elaborate on this? Why would we want to manipulate the human's preferences?

Some Comments on Stuart Armstrong's "Research Agenda v0.9"

I found this to be interesting and valuable commentary, having just read through Stuart's agenda.

I think speculating about the human-sized logical relationships between speculative parts inside the AI is easier but less useful than speculating about the algorithm that will connect your inputs to your outputs with a big model and lots of computing power, which may or may not have your logical steps as emergent features.

With this more-compute/less-abstraction approach you're suggesting, do you mean that it may produce a model of the preferences that's inscrutable to humans? If so, that could be an issue for getting the human's buy-in. Stuart talks about this some in section 4.5: there's "the human tendency to reject values imposed upon them, just because they are imposed upon them", and the AI may need to involve the human in constructing the utility function.

- How much of the hidden details are in doing meta-reasoning? If I don't trust an AI, more steps of meta-reasoning makes me trust it even less - humans often say things about meta-reasoning that would be disastrous if implemented. What kind of amazing faculties would be required for an AI to extract partial preferences about meta-reasoning that actually made things better rather than worse? If I was better at understanding what the details actually are, maybe I'd pick on meta-reasoning more.

Which part of his post are you referring to by "meta-reasoning"? Is it the "meta-preferences"?
