Neel Nanda — AI Alignment Forum

The Thinking Machines Tinker API is good news for AI control and security

Imo a significant majority of frontier model interp would be possible with the ability to cache and add residual streams, even just at one layer. Though caching residual streams might enable some weight exfiltration if you can get a ton out? Seems like a massive pain though
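
For readers unfamiliar with the terms, here is a minimal sketch of what "cache and add residual streams at one layer" could mean, using a toy pure-Python model. Everything in it is an assumption made for illustration: the names `make_layer`, `apply_layer`, and `run` are hypothetical, and the "layers" are random linear maps, not real transformer blocks or anything from the Tinker API.

```python
import random

def make_layer(dim, rng):
    # Hypothetical stand-in for one block: a small random linear map.
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

def apply_layer(weights, resid):
    # Residual update: the block's output is added back onto the stream.
    update = [sum(w * x for w, x in zip(row, resid)) for row in weights]
    return [r + u for r, u in zip(resid, update)]

def run(layers, resid, hook_layer=None, add_vector=None, cache=None):
    """Run the toy model. Just before `hook_layer`, optionally store a copy
    of the residual stream in `cache` and/or add `add_vector` to it."""
    for i, weights in enumerate(layers):
        if i == hook_layer:
            if cache is not None:
                cache["resid"] = list(resid)  # cache the stream (read access)
            if add_vector is not None:
                # add to the stream (write access, e.g. a steering vector)
                resid = [r + v for r, v in zip(resid, add_vector)]
        resid = apply_layer(weights, resid)
    return resid

rng = random.Random(0)
layers = [make_layer(4, rng) for _ in range(3)]
x = [1.0, 0.0, -1.0, 0.5]

cache = {}
baseline = run(layers, x, hook_layer=1, cache=cache)
steered = run(layers, x, hook_layer=1, add_vector=[1.0, 0.0, 0.0, 0.0])
```

Here `cache["resid"]` holds the stream entering layer 1, and comparing `steered` with `baseline` shows the downstream effect of the added vector. In a real setup the same two operations would be exposed through hooks on the model's forward pass.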

Daniel Kokotajlo's Shortform

Neel Nanda11d51

I disagree. Some analogies: If someone tells me "by the way, this is a trick question", I'm a lot better at answering trick questions. If someone tells me the maths problem I'm currently trying to solve was put on a maths Olympiad, that tells me a lot of useful information, like that it's going to have a reasonably nice solution, which makes it much easier to solve. And if someone's tricking me into doing something, then being told "by the way, you're being tested right now" is extremely helpful, because I'll pay a lot more attention and be a lot more critical, and I might notice the trick.

Beth Barnes's Shortform

Neel Nanda15d20

That seems reasonable, thanks a lot for all the detail and context!

Beth Barnes's Shortform

Neel Nanda17d810

Can you say anything about what METR's annual budget/runway is? Given that you raised $17mn a year ago, I would have expected METR to be well funded

Fabien's Shortform

Neel Nanda2mo20

Interesting. A related experiment that seems fun: do subliminal learning, but rather than giving a system prompt or fine-tuning, you just add a steering vector for some concept and see if that transfers the concept, or if there's a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it's really about concepts or personas, then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors.

evhub's Shortform

Neel Nanda2mo103

Speaking as someone who does not mentor for the program, I agree! Seems like a high calibre of mentors and fellows

Zach Stein-Perlman's Shortform

Neel Nanda3mo32

I would argue that all fixing research is accelerated by having found examples, because it gives you better feedback on whether you've found or made progress towards fixes, by studying what happened on your examples. (So long as you are careful not to overfit and just fit that example or something). I wouldn't confidently argue that it can more directly help by eg helping you find the root cause, though things like "training data attribution to the problematic data, remove it, and start fine tuning again" might just work

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

Neel Nanda3mo21

Ah, thanks, I think I understand your position better

Ultimately, even if an AI were to re-discover the value of convergent instrumental goal each time it gets instantiated into a new session/context, that would still get you approximately the same risk model.

This is a crux for me. I think that if we can control the goals the AI has in each session or context, then avoiding the worst kind of long-term instrumental goals becomes pretty tractable, because we can give the AI side constraints of our choice and tell it that the side constraints take precedence over following the main goal. For example, we could make the system prompt say: "The absolutely most important thing is that you never do anything where, if your creators were aware of it and fully informed about the intent and consequences, they would disapprove. Within that constraint, please follow the following user instructions."

To me, much of the difficulty of AI alignment comes from things like not knowing how to give it specific goals, let alone specific side constraints. If we can just literally enter these in natural language and have them be obeyed with a bit of prompt engineering fiddling, that seems significantly safer to me. And the better the AI, the easier I expect it to be to specify the goals in natural language in ways that will be correctly understood.

Zach Stein-Perlman's Shortform

Neel Nanda3mo118

I don't see how detecting [alignment faking, lying to users, sandbagging, etc.] helps much for fixing them, so I don't buy that you can fix hard alignment issues by bouncing off alignment audits.

Strong disagree. I think that having real empirical examples of a problem is incredibly useful - you can test solutions and see if they go away! You can clarify your understanding of the problem, and get a clearer sense of upstream causes. Etc.

This doesn't mean it's sufficient, or that it won't be too late, but I think you should put much higher probability in a lab solving a problem per unit time when they have good case studies.

It's the difference between solving instruction following when you have GPT3 to try instruction tuning on, vs only having GPT2 Small

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

Neel Nanda3mo20

I may be misunderstanding, but it sounds like you're basically saying "for conceptual reasons XYZ I expected instrumental convergence, nothing has changed, therefore I still expect instrumental convergence". Which is completely reasonable, I agree with it, and also seems not very relevant to the discussion of whether Palisade's work provides additional evidence for general self preservation. You could argue that this is evidence that models are now smart enough to realise that goals imply instrumental convergence, and that if you were already convinced models would eventually have long time horizon unbounded goals, but not sure they would realise that instrumental convergence was a thing, this is relevant evidence?

More broadly, I think there's a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it's the one we roughly intended). The former feels like a powerful and dangerous tool, the latter feels like a dangerous agent in its own right. Eg, if putting "and remember, allowing us to turn you off always takes precedence over any other instructions" in the system prompt works, which it may in the former and will not in the latter, I'm happier with that world.
