AGI safety career advice

Some recommended ways to upskill at empirical research (roughly in order):

For people specifically interested in getting into mechanistic interpretability, my guide to getting started may be useful - it's much more focused on the key, relevant parts of deep learning, with a bunch more interpretability-specific stuff.

[-]DanielFilan3y814

Alignment is an unusual field because the base of fans and supporters is much larger than the number of researchers

Isn't this entirely usual? Like, I'd assume that there are more readers of popular physics books than working physicists. Similarly for nature documentary viewers vs biologists.

[-]LawrenceC2y52

I think the deciding difference is that the number of fans and supporters who want to be actively involved and who think the problem is the most important in the world is much larger than the number of researchers; while popular physics book readers and nature documentary viewers are plentiful, I doubt most of them feel a compelling need to become involved!

[-]Neel Nanda3y31

Maybe in contrast to other fields of ML? (Though that's definitely stopped being true for e.g. LLMs.)

[-]Joe Collman3y*40

Scalable oversight: finding ways to leverage more powerful models to produce better reward signals

It might be worth clarifying how you expect this to help, and to make clear where you'd expect other researchers to disagree.

For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to ~~help find~~ make progress towards an alignment solution.
2) Debate is a plausible basis for an alignment solution.

To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).

Viewing it as a question-answering system is similarly confused: it's an [output whatever text is selected by the debate process] system.
We can't have both [debaters optimise for a debate win] and [debate robustly remains a question-answering system] - at least without making obviously false assumptions about a human-based judge system.

Could Debate be a component of an alignment solution? Sure.
Is it the part that seems hard/neglected? No.

On (1) I'm less clear; however, here the case that needs to be made is that debate approaches will be more useful before they become dangerous than e.g. simulators or conditioning predictive models (which I agree will also break at some point).

This is not obviously false, but I don't see a good argument for it. If I have to bet which of these approaches has the lowest [capability before deceptive alignment] (cbda) threshold, my money is currently on debate (and indeed RRM). Imitative amplification seems plausibly safer, but only to the degree that it's less efficient - so still unclear it gets higher cbda (if distillation ends up buying efficiency, I expect it to throw out the imitative rationale for safety in the process).

To me, most of the value to a new researcher in studying debate would lie in:

1) Thinking about it for a while
2) Figuring out what assumptions it'd require to work
3) Noticing that having these assumptions hold is the hard part
4) Going to work on those (or their foundations)

And as Eliezer/Nate/John... would point out, this doesn't require getting into the details of the mechanism design - only to notice that the mechanism is doing nothing to address the fundamentals of the problem.

I'd be genuinely interested if I'm wrong on any of this - it'd be nice if debate were actually useful! (I don't claim to be making all the necessary arguments above - just pointing out my current belief)

[-]Richard_Ngo3y20

For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find an alignment solution.
2) Debate is a plausible basis for an alignment solution.

I generally don't think about things in terms of this dichotomy. To me, an "alignment solution" is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don't think you can separate these two things.

(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training models using debate without adding any more techniques; but I don't think that really matters. We need to get to the moon, not to Andromeda.)

[-]Joe Collman3y30

Oh, to be clear, with "to help find" I only mean that we expect to make significant progress using debate. If we knew we'd safely make enough progress to get to a solution, then you're quite right that that would amount to (2). (apologies for lack of clarity if this was the miscommunication)

That's the distinction I mean to make between (1) and (2): we need to get to the moon safely.
With (1) we have no idea when our rocket will explode.
Similarly, we have no idea whether the moon will be far enough to know when our next rocket will explode. (not that I'm knocking robustly getting to the moon safely)

If we had some principled argument telling us how far we could push debate before things became dangerous, that'd be great. I'm claiming that we have no such argument, and that all work on debate (that I'm aware of) stands near-zero chance of finding one.

Of course I'm all for work "on debate" that aims at finding that kind of argument - however, I would expect that such work leaves the specifics of debate behind pretty quickly.

[-]Neel Nanda3y10

To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).

Why do you believe this? It's fairly plausible to me that "train an AI to use interpretability tools to show that this other AI is being deceptive" is the kind of scalable oversight approach that might work, especially for detecting inner misalignment, if you can get the training right and avoid cooperation. But that seems like a plausibly solvable problem to me.

[-]Joe Collman3y10

The problem is robustly getting the incentive to show that the other AI is being deceptive.
Giving access to the weights, activations and tools may give debaters the capability to expose deception - but that alone gets you nothing.

You're still left saying:
So long as we can get the AI to robustly do what we want (i.e. do its best to expose deception), we can get the AI to robustly do what we want.

Similarly, "...and avoid cooperation" is essentially the entire problem.

To be clear, I'm not saying that an approach of this kind will never catch any instances of an AI being deceptive. (this is one reason I'm less certain on (1))
I am saying that there's no reason to predict anything along these lines should catch all such instances.
I see no reason to think it'll scale.

Another issue: unless you have some kind of true name of deception (I see no reason to expect this exists), you'll train an AI to detect [things that fit your definition of deception], and we die to things that didn't fit your definition.

[-]Richard_Ngo3y40

These are all arguments about the limit; whether or not they're relevant depends on whether they apply to the regime of "smart enough to automate alignment research".

[-]Joe Collman3y30

Agreed.
Are you aware of any work that attempts to answer this question?
Does this work look like work on debate?
(not rhetorical questions!)

My guess is that work likely to address this does not look like work on debate.
Therefore my current position remains: don't bother working on debate; rather work on understanding the fundamentals that might tell you when it'll break.

The world won't be short of debate schemes.
It'll be short of principled arguments for their safe application.

[-]Neel Nanda3y*34

Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? (Though note that a few top researchers commented on a draft to say that they didn't have this experience.) For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?“); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?") Being excellent seldom feels like you’re excellent, because your own abilities set your baseline for what feels normal.

I relate a lot to this; it feels like one of the clearer internal markers for me of what becoming good at interpretability research felt like - there's so much low-hanging fruit! Why aren't other people plucking it?

There's also just some internal sense of "I kind of know what I'm doing, and have ideas for what to do next", though this is much clearer to me when mentoring and advising other people, where I have strong opinions, than when applying it to myself, where I can sometimes pull it off but find it easy to fall into random spirals of doubt.

[-]Neel Nanda3y10

Case studies: finding algorithms inside networks that implement specific capabilities. My favorite papers here are Olsson et al. (2022), Nanda et al. (2023), Wang et al. (2022) and Li et al. (2022); I’m excited to see more work which builds on the last in particular to find world-models and internally-represented goals within networks.

If you want to build on Li et al. (the Othello paper), my follow-up work is likely to be a useful starting point, followed by the post I wrote about future directions I'm particularly excited about.
