Neel Nanda

Comments

A Longlist of Theories of Impact for Interpretability

Honestly, I don't understand ELK well enough (yet!) to meaningfully comment. That one came from Tao Lin, who's a better person to ask.

More Is Different for AI

I've just binge-read the entire sequence and thoroughly enjoyed it, thanks a lot for writing! I really like the framework of three anchors - thought experiments, humans, and empirical ML - and emphasising the strengths and limitations of each and the need for all three. Most discourse I've seen tends to strongly favour just one anchor, but 'all are worth paying attention to, none should be totalising' seems obviously true.

My Overview of the AI Alignment Landscape: A Bird's Eye View

Thanks for the feedback! That makes sense; I've updated the intro paragraph of that section to:

There are a range of agendas proposed for how we might build safe AGI, though note that each agenda is far from a complete and concrete plan. I think of them more as a series of confusions to explore and assumptions to test, with the eventual goal of making a concrete plan. I focus on three agendas here; these are just the three I know the most about, have seen the most work on, and, in my subjective judgement, the ones most worth newcomers to the field learning about. This is not intended to be comprehensive; see eg Evan Hubinger’s Overview of 11 proposals for building safe advanced AI for more.

Does that seem better?

For what it's worth, my main bar was a combination of 'do I understand this agenda well enough to write a summary' and 'do I associate at least one researcher and some concrete work with this agenda'. I wouldn't think of corrigibility as passing the second bar, since I've only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems. It's very possible I've missed some important work though, and I'd love to hear pushback on this.

2021 AI Alignment Literature Review and Charity Comparison

Thanks! I'm probably not going to have time to write a top-level post myself, but I liked Evan Hubinger's post about it.

2021 AI Alignment Literature Review and Charity Comparison
I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model?

We actually released our first paper trying to extend Circuits from vision to language models yesterday! You can't quite interpret individual neurons, but we've found some examples where we can interpret what an individual attention head is doing.

Empirical Observations of Objective Robustness Failures

This seems like really great work, nice job! I'd be excited to see more empirical work around inner alignment.

One of the things I really like about this work is the cute videos that clearly demonstrate 'this agent is doing dumb stuff because its objective is non-robust'. Have you considered putting shorter clips of some of the best bits on YouTube, or making GIFs? (Eg, a 5-10 second clip of the CoinRun agent during training, followed by a 5-10 second clip of the CoinRun agent at test time.) It seemed that one of the major strengths of the CoastRunners clip was how easily shareable and funny it was, and I could imagine this research getting more exposure if it's easier to share highlights. I found the Google Drive pretty hard to navigate.

[AN #149]: The newsletter's editorial policy
One or two people suggested adding links to interesting papers that I wouldn't have time to summarize. I actually used to do this when the newsletter first started, but it seemed like no one was clicking on those links so I stopped doing that. I'm pretty sure that would still be the case now so I'm not planning to restart that practice.

A possible experiment: frame this as a 'request for summaries', link to the papers you won't get round to, and offer to publish in a future newsletter any sufficiently good summaries of those papers that someone sends you.

Also, damn! I really like the long summaries, and would be sad to see them go (though obviously you should listen to a survey of 66 people over my opinion).

AMA: Paul Christiano, alignment researcher
It's not exactly clear what you do with such a story or what the upside is; it's kind of a vague theory of change, and most people have some specific theory of change they are more excited about (even if this kind of story is a bit of a public good that's useful on a broader variety of perspectives / to people who are skeptical).

Ah, interesting! I'm surprised to hear that. I was under the impression that while many researchers had a specific theory of change, it was often motivated by an underlying threat model, and that different threat models lead to different research interests.

Eg, worrying about a future where AIs control the world but are not human-comprehensible feels very different from worrying about a world where we produce an expected utility maximiser with a subtly incorrect objective, resulting in bad convergent instrumental goals.

Do you think this is a bad model of how researchers think? Or are you, eg, arguing that having a detailed, concrete story isn't important here, just the vague intuition for how AI goes wrong?
