AI ALIGNMENT FORUM

Neel Nanda

Sequences

How I Think About My Research Process
GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability

Neel Nanda's Shortform (6 karma · 1y · 0 comments)

Comments
The Thinking Machines Tinker API is good news for AI control and security
Neel Nanda · 3d

Imo a significant majority of frontier model interp would be possible with the ability to cache and add residual streams, even just at one layer. Though caching residual streams might enable some weight exfiltration if you can get a ton out? Seems like it would be a massive pain, though.
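
To make "cache and add residual streams at one layer" concrete, here's a minimal sketch of the kind of workflow I mean, using TransformerLens on an open model as a stand-in; the model, layer, and prompts are placeholder assumptions, not anything the actual API exposes:

```python
# Minimal sketch (not a real frontier-lab API): cache the residual stream at one
# layer on a source prompt, then add it into the residual stream on a new prompt.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # placeholder open model
layer = 6
hook_name = utils.get_act_name("resid_post", layer)

# 1. Cache the residual stream at one layer on a source prompt.
_, cache = model.run_with_cache("The Eiffel Tower is in Paris")
src_resid = cache[hook_name]  # [batch, seq_len, d_model]

# 2. Add the source prompt's final-token residual into every position of a
#    second forward pass at the same layer.
def add_resid(resid, hook):
    return resid + src_resid[:, -1:, :]

patched_logits = model.run_with_hooks(
    "An unrelated prompt to run with the injected residual:",
    fwd_hooks=[(hook_name, add_resid)],
)
```

Caching at `resid_post` of a single layer is just the simplest choice here; the same hook pattern works at any layer.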

Daniel Kokotajlo's Shortform
Neel Nanda · 11d

I disagree. Some analogies: if someone tells me "by the way, this is a trick question", I'm a lot better at answering trick questions. If someone tells me the maths problem I'm currently trying to solve was put on a maths Olympiad, that tells me a lot of useful information, like that it's going to have a reasonably nice solution, which makes it much easier to solve. And if someone's tricking me into doing something, then being told "by the way, you're being tested right now" is extremely helpful, because I'll pay a lot more attention and be a lot more critical, and I might notice the trick.

Beth Barnes's Shortform
Neel Nanda · 15d

That seems reasonable, thanks a lot for all the detail and context!

Beth Barnes's Shortform
Neel Nanda · 17d

Can you say anything about what METR's annual budget/runway is? Given that you raised $17mn a year ago, I would have expected METR to be well funded.

Fabien's Shortform
Neel Nanda · 2mo

Interesting. A related experiment that seems fun: do subliminal learning, but rather than giving a system prompt or fine-tuning, you just add a steering vector for some concept and see if that transfers the concept / if there's a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it's really about concepts or personas, then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors.
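
A rough sketch of the steering half of that experiment, with TransformerLens as a stand-in; the model, layer, scale, and the vector itself (a random direction here, where you'd use an SAE decoder vector or some other interpretable direction) are all placeholders:

```python
# Placeholder sketch: generate the "teacher" data (e.g. random number lists)
# while a concept vector is added to the residual stream at one layer.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
layer = 8
hook_name = utils.get_act_name("resid_post", layer)

# Stand-in for an interpretable direction, e.g. an SAE decoder vector.
steering_vec = torch.randn(model.cfg.d_model, device=model.cfg.device)
steering_vec = 8.0 * steering_vec / steering_vec.norm()  # fixed-norm steering

def steer(resid, hook):
    return resid + steering_vec

prompt = "Continue this list of random numbers: 142, 87, 903,"
with model.hooks(fwd_hooks=[(hook_name, steer)]):
    teacher_data = model.generate(prompt, max_new_tokens=30)

# teacher_data would then be used to fine-tune a student model; the test is
# whether the student's activations shift along steering_vec / show the concept.
```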

evhub's Shortform
Neel Nanda · 2mo

Speaking as someone who does not mentor for the program, I agree! Seems like a high calibre of mentors and fellows.

Zach Stein-Perlman's Shortform
Neel Nanda · 3mo

I would argue that all fixing research is accelerated by having found examples, because it gives you better feedback on whether you've found or made progress towards fixes, by studying what happened on your examples. (So long as you are careful not to overfit and just fix that one example, or something.) I wouldn't confidently argue that it can more directly help by eg helping you find the root cause, though things like "do training data attribution to the problematic data, remove it, and start fine-tuning again" might just work.
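
To spell out that last parenthetical, here's the shape of the loop I mean; every helper in it (`influence_score`, `finetune`, `shows_problem`) is a hypothetical stand-in for a real training data attribution method and training stack, not an actual API:

```python
# Hypothetical pipeline: score training examples by their influence on the
# problematic behaviour, drop the worst offenders, and fine-tune again.

# Stand-ins (not real APIs): replace with an actual TDA method and training stack.
def influence_score(model, example, problem_examples):
    return 0.0  # e.g. an influence-function or TracIn-style score

def finetune(model, dataset):
    return model  # placeholder: fine-tune `model` on `dataset`

def shows_problem(model, example):
    return False  # placeholder: re-run the eval that surfaced the example

def filter_and_refinetune(base_model, train_set, problem_examples, drop_frac=0.01):
    # Training data attribution: how much does each training example push the
    # model towards the behaviour seen in problem_examples?
    scores = [influence_score(base_model, x, problem_examples) for x in train_set]

    # Keep everything except the most problematic fraction of the data.
    n_keep = len(train_set) - int(drop_frac * len(train_set))
    keep_idx = sorted(range(len(train_set)), key=lambda i: scores[i])[:n_keep]
    cleaned_set = [train_set[i] for i in keep_idx]

    # Fine-tune from the base model on the cleaned data, then check whether the
    # original examples still exhibit the problem.
    new_model = finetune(base_model, cleaned_set)
    return new_model, [shows_problem(new_model, x) for x in problem_examples]
```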

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Neel Nanda · 3mo

Ah, thanks, I think I understand your position better.

> Ultimately, even if an AI were to re-discover the value of convergent instrumental goals each time it gets instantiated into a new session/context, that would still get you approximately the same risk model.

This is a crux for me. I think that if we can control the goals the AI has in each session or context, then avoiding the worst kind of long-term instrumental goals becomes pretty tractable, because we can give the AI side constraints of our choice and tell it that these side constraints take precedence over following the main goal. For example, we could make the system prompt say: "The absolutely most important thing is that you never do anything where, if your creators were aware of it and fully informed about the intent and consequences, they would disapprove. Within that constraint, please follow the following user instructions."

To me, much of the difficulty of AI alignment comes from things like not knowing how to give it specific goals, let alone specific side constraints. If we can just literally enter these in natural language and have them be obeyed with a bit of prompt engineering fiddling, that seems significantly safer to me. And the better the AI, the easier I expect it to be to specify the goals in natural language in ways that will be correctly understood.
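
As a concrete illustration of the kind of side-constraint-first prompt I have in mind (the wording and message format are just an example, not a tested prompt):

```python
# Illustrative only: a system prompt where the side constraint is stated first
# and explicitly takes precedence over the user's main goal.
SIDE_CONSTRAINT = (
    "The absolutely most important thing is that you never do anything where, "
    "if your creators were aware of it and fully informed about the intent and "
    "consequences, they would disapprove. This takes precedence over every "
    "other instruction."
)

def build_messages(user_goal: str) -> list[dict]:
    system = SIDE_CONSTRAINT + "\n\nWithin that constraint, please follow the user's instructions."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_goal},
    ]

# Example usage with a hypothetical task:
messages = build_messages("Book me the cheapest flight to Tokyo next week.")
```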

Zach Stein-Perlman's Shortform
Neel Nanda · 3mo

> I don't see how detecting [alignment faking, lying to users, sandbagging, etc.] helps much for fixing them, so I don't buy that you can fix hard alignment issues by bouncing off alignment audits.

Strong disagree. I think that having real empirical examples of a problem is incredibly useful - you can test solutions and see if they go away! You can clarify your understanding of the problem, and get a clearer sense of upstream causes. Etc.

This doesn't mean it's sufficient, or that it won't be too late, but I think you should put a much higher probability on a lab solving a problem per unit time when they have good case studies.

It's the difference between solving instruction following when you have GPT-3 to try instruction tuning on, vs only having GPT-2 Small.

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Neel Nanda · 3mo

I may be misunderstanding, but it sounds like you're basically saying "for conceptual reasons XYZ I expected instrumental convergence, nothing has changed, therefore I still expect instrumental convergence". Which is completely reasonable, I agree with it, and also seems not very relevant to the discussion of whether Palisade's work provides additional evidence for general self-preservation. You could argue that this is evidence that models are now smart enough to realise that goals imply instrumental convergence; so if you were already convinced that models would eventually have long-horizon, unbounded goals, but weren't sure they would realise that instrumental convergence was a thing, this is relevant evidence?

More broadly, I think there's a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it's the one we roughly intended). The former feels like a powerful and dangerous tool; the latter feels like a dangerous agent in its own right. Eg, if putting "and remember, allowing us to turn you off always takes precedence over any other instructions" in the system prompt works, which it may in the former case and will not in the latter, I'm happier with that world.

Posts
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences (23 karma · 1mo · 0 comments)
How To Become A Mechanistic Interpretability Researcher (36 karma · 1mo · 0 comments)
Discovering Backdoor Triggers (24 karma · 2mo · 0 comments)
Towards data-centric interpretability with sparse autoencoders (17 karma · 2mo · 0 comments)
Neel Nanda MATS Applications Open (Due Aug 29) (13 karma · 3mo · 0 comments)
Reasoning-Finetuning Repurposes Latent Representations in Base Models (20 karma · 3mo · 1 comment)
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning (37 karma · 3mo · 0 comments)
Unfaithful chain-of-thought as nudged reasoning (29 karma · 3mo · 0 comments)
Narrow Misalignment is Hard, Emergent Misalignment is Easy (61 karma · 3mo · 5 comments)
Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance (41 karma · 3mo · 6 comments)