Recent Discussion

Thanks to Rohin Shah, Ajeya Cotra, Richard Ngo, Paul Christiano, Jon Uesato, Kate Woolverton, Beth Barnes, and William Saunders for helpful comments and feedback.

Evaluating proposals for building safe advanced AI—and actually building any degree of confidence in their safety or lack thereof—is extremely difficult. Previously, in “An overview of 11 proposals for building safe advanced AI,” I tried evaluating such proposals on the axes of outer alignment, inner alignment, training competitiveness, and performance competitiveness. While I think that those criteria were good for posing open questions, they didn’t lend themselves well to actually helping us understand what assumptions needed to hold for any particular proposal to work. Furthermore, if you’ve read that paper/post, you’ll notice that those evaluation criteria don’t even work for some of the proposals...

10 · Daniel Kokotajlo · 3d
I found myself coming back to this now, years later, and feeling like it is massively underrated. Idk, it seems like the concept of training stories is great and much better than e.g. "we have to solve inner alignment and also outer alignment" or "we just have to make sure it isn't scheming."  Anyone -- and in particular Evhub -- have updated views on this post with the benefit of hindsight? Should we e.g. try to get model cards to include training stories?

> Anyone -- and in particular Evhub -- have updated views on this post with the benefit of hindsight?

I intuitively don't like this approach, but I have trouble articulating exactly why. I've tried to explain a bit in this comment, but I don't think I'm quite saying the right thing.

One issue I have is that it doesn't seem to nicely handle interactions between the properties of the AI and how it's used. You can have an AI which is safe when used in some ways, but not always. This could be due to approaches like control (which mostly route around mechanistic... (read more)

4 · Evan Hubinger · 1d
I still really like my framework here! I think this post ended up popularizing some of the ontology I developed here, but the unfortunate thing about that post being the one that popularized it is that it doesn't really provide an alternative.
2 · Daniel Kokotajlo · 5d
Curious to hear whether I was one of the people who contributed to this.
2 · Alex Turner · 15h
Nope! I have basically always enjoyed talking with you, even when we disagree.

Ok, whew, glad to hear.

This is a linkpost for https://arxiv.org/abs/2403.00745

Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mechanistic interpretability team, with core contributors János Kramár and Tom Lieberum.

Tweet thread summary, paper

Abstract:

Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.
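To make the contrast concrete, here is a minimal sketch, assuming a standard PyTorch setup, of exhaustive activation patching versus the gradient-based AtP estimate on a toy model. The model, metric, and variable names are illustrative placeholders rather than the paper's code: activation patching re-runs the model once per patched component, while AtP approximates every component's effect from a single forward and backward pass as (clean activation − corrupted activation) · ∂metric/∂activation.

```python
# Toy contrast between activation patching and attribution patching (AtP).
# ToyModel, the inputs, and the scalar metric are illustrative assumptions,
# not the setup from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """Four ReLU blocks standing in for model components, plus a scalar readout."""
    def __init__(self, d=16):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(d, d) for _ in range(4)])
        self.readout = nn.Linear(d, 1)

    def forward(self, x, cache=None, patch=None):
        for i, layer in enumerate(self.layers):
            x = torch.relu(layer(x))
            if patch is not None and i in patch:
                x = patch[i]          # activation patching: overwrite this component
            if cache is not None:
                cache[i] = x
        return self.readout(x).sum()  # scalar metric of model behavior

model = ToyModel()
clean, corrupted = torch.randn(1, 16), torch.randn(1, 16)

# Exhaustive activation patching: one extra forward pass per component.
clean_cache = {}
with torch.no_grad():
    model(clean, cache=clean_cache)
    baseline = model(corrupted).item()
    exact = {i: model(corrupted, patch={i: clean_cache[i]}).item() - baseline
             for i in range(4)}

# Attribution patching: one forward + backward pass on the corrupted input,
# then a first-order estimate of every component's effect at once.
acts, x = [], corrupted
for layer in model.layers:
    x = torch.relu(layer(x))
    x.retain_grad()                   # keep gradients on intermediate activations
    acts.append(x)
model.readout(x).sum().backward()

atp = {i: ((clean_cache[i] - acts[i].detach()) * acts[i].grad).sum().item()
       for i in range(4)}

print("activation patching:", exact)
print("attribution patching (AtP):", atp)
```

The exact loop is what scales linearly in the number of components and becomes prohibitive for large models; the AtP estimate amortizes a single backward pass across all components, at the cost of the false negatives the paper analyzes and that AtP* is designed to address.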

Warning for anyone who has ever interacted with "robosucka" or been solicited for a new podcast series in the past few years: https://www.tumblr.com/rationalists-out-of-context/744970106867744768/heads-up-to-anyone-whos-spoken-to-this-person-i

9 · Zac Hatfield-Dodds · 2d
I agree that there's no substitute for thinking about this for yourself, but I think that morally or socially counting "spending thousands of dollars on yourself, an AI researcher" as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it's-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it's very easy for donating to people or organizations in your social circle to have substantial negative expected value. I'm glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.
7 · Oliver Habryka · 2d
I think people who give up large amounts of salary to work in jobs that other people are willing to pay for from an impact perspective should totally consider themselves to have done good comparable to donating the difference between their market salary and their actual salary. This applies to approximately all safety researchers. 
4 · Ben Pace · 2d
I don’t think it applies to safety researchers at AI labs, though; I am shocked at how much those folks can make.

They still make a lot less than they would if they optimized for profit (that said, I think most "safety researchers" at big labs are only safety researchers in name and I don't think anyone would philanthropically pay for their labor, and even if they did, they would still make the world worse according to my model, though others of course disagree with this).

This is a linkpost for https://gleave.me/post/why-do-phd/

Doing a PhD is a strong option to get great at developing and evaluating research ideas. These skills are necessary to become an AI safety research lead, one of the key talent bottlenecks in AI safety, and are helpful in a variety of other roles. By contrast, my impression is that currently many individuals with the goal of being a research lead pursue options like independent research or engineering-focused positions instead of doing a PhD. This post details the reasons I believe these alternatives are usually much worse at training people to be research leads.

I think many early-career researchers in AI safety are undervaluing PhDs. Anecdotally, I think it’s noteworthy that people in the AI safety community were often surprised to find out I was doing a PhD, and...

9 · AdamGleave · 2d
I'm sympathetic to a lot of this critique. I agree that prospective students should strive to find an advisor that is "good at producing clear, honest and high-quality research while acting in high-integrity ways around their colleagues". There are enough of these that you should be able to find one, and it doesn't seem worth compromising. Concretely, I'd definitely recommend digging into an advisor's research and asking their students hard questions prior to taking any particular PhD offer. There absolutely are labs that prioritize publishing above all else, turn a blind eye to academic fraud or at least brush accidental non-replicability under the rug, or just have a toxic culture. You want to avoid those at all costs.

But I disagree with the punchline that if this bar isn't satisfied then "almost any other job will be better preparation for a research career". In particular, I think there's a ton of concrete skills a PhD teaches that don't need a stellar advisor. For example, there are some remarkably simple things, like having an experimental baseline, running multiple seeds, and reporting confidence intervals, that a PhD will absolutely drill into you. These things are remarkably often missing from research produced by those I see in the AI safety ecosystem who have not done a PhD or been closely mentored by an experienced researcher.

Additionally, I've seen plenty of people do PhDs under an advisor who lacks one or more of these properties, and most of them turned out to be fine researchers. It's hard to say what the counterfactual is, and the admission process to the PhD might be doing a lot of work here, but I think it's important to recognize the advisor is only one of many sources of mentorship and support you get in a PhD: you also have taught classes, your lab mates, your extended cohort, senior post-docs, peer review, etc. To be clear, none of these mentorship sources are perfect, but part of your job as a student is to decide who to listen to & when. If someone ca
6 · OliverHayman · 4d
How often do people not do PhDs on the basis that they don't teach you to be a good researcher? Perhaps this is different in certain circles, but almost everyone I know doesn't want to do a PhD for personal reasons (and also timelines). The most common objections are the following:

* PhDs are very depressing and not very well paid.
* Advisors do not have strong incentives to put much effort into training you, and apparently often won't. This is pretty demotivating.
* A thing you seem to be advocating for is PhDs primarily at top programs. These are very competitive, it is hard to make progress towards getting into a better program once you graduate, and there is a large opportunity cost to devoting my entire undergraduate degree to doing enough research to be admitted.
* PhDs take up many years of your life. Life is short.
* It is very common for PhD students (not just in alignment) to tell other people not to do a PhD. This is very concerning.

If I was an impact-maximizer I might do a PhD, but as a person who is fairly committed to not being depressed, it seems obvious that I should probably not do a PhD and look for alternative routes to becoming a research lead instead.

I'd be interested to hear whether you disagree with these points (you seem to like your PhD!), or whether this post was just meant to address the claim that it doesn't train you to be a good researcher.

Whether a PhD is something someone will enjoy is so dependent on individual personality, advisor fit, etc., that I don't feel I can offer good generalized advice. Generally I'd suggest people trying to gauge fit try doing some research in an academic environment (e.g. undergrad/MS thesis, or a brief RA stint after graduating) and talk to PhD students at their target schools. If after that you think you wouldn't enjoy a PhD, then you're probably right!

Personally I enjoyed my PhD. I had smart & interesting colleagues, an advisor who wanted me to do high-qua... (read more)

4 · Richard Ngo · 4d
Note that these are very different claims, both because the half-life for a given value is below its mean, and because TAI doesn't imply doom. Even if you do have very high P(doom), it seems odd to just assume everyone else does too.

So? Your research doesn't have to be useful in every possible world. If a PhD increases the quality of your research by, say, 3x (which is plausible, since research is heavy-tailed), then it may well be better to do that research for half the time.

(In general I don't think x-risk-motivated people should do PhDs that don't directly contribute to alignment, to be clear; I just think this isn't a good argument for that conclusion.)