Speaking as someone who does not mentor for the program, I agree! Seems like a high calibre of mentors and fellows
I would argue that all research into fixing a problem is accelerated by having found examples of it, because studying what happened on your examples gives you better feedback on whether you've found, or made progress towards, a fix. (So long as you are careful not to overfit and just patch that one example or something.) I wouldn't confidently argue that it can more directly help by eg helping you find the root cause, though things like "training data attribution to the problematic data, remove it, and start fine-tuning again" might just work
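(To spell out that last idea as a loop, here is a minimal sketch. The attribution and fine-tuning calls are hypothetical placeholders for whatever method you'd actually use; only the filtering step is concrete.)

```python
# Minimal sketch (hypothetical, not from any specific library) of the
# "attribute, filter, re-fine-tune" loop mentioned above.

def remove_implicated_data(train_set, scores, keep_fraction=0.9):
    """Drop the training examples most implicated in the bad behaviour.

    `scores` are attribution scores (higher = contributed more to the bad
    behaviour), produced by whatever training-data-attribution method you
    have (eg influence functions); computing them is not shown here.
    """
    cutoff = sorted(scores)[int(keep_fraction * len(scores)) - 1]
    return [ex for ex, s in zip(train_set, scores) if s <= cutoff]

# Usage sketch: score the data against the observed failure, filter, retrain,
# then re-run the original case study to check whether the behaviour is gone.
# scores = [attribution_score(ex, failure_examples, model) for ex in train_set]  # hypothetical
# cleaned = remove_implicated_data(train_set, scores)
# model = finetune(base_model, cleaned)  # hypothetical fine-tuning call
```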
Ah, thanks, I think I understand your position better
Ultimately, even if an AI were to re-discover the value of convergent instrumental goals each time it gets instantiated into a new session/context, that would still get you approximately the same risk model.
This is a crux for me. I think that if we can control the goals the AI has in each session or context, then avoiding the worst kind of long-term instrumental goals becomes pretty tractable, because we can give the AI side constraints of our choice and tell it that the side constraints take precedence over following the main goal. For example, we could make the system prompt say: "The absolutely most important thing is that you never do anything that your creators would disapprove of if they were aware of it and fully informed about the intent and consequences. Within that constraint, please follow the user instructions below."
To me, much of the difficulty of AI alignment comes from things like not knowing how to give it specific goals, let alone specific side constraints. If we can just literally enter these in natural language and have them be obeyed with a bit of prompt engineering fiddling, that seems significantly safer to me. And the better the AI, the easier I expect it to be to specify the goals in natural language in ways that will be correctly understood
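To illustrate the kind of prompt-level side constraint I mean, here's a toy sketch. The exact wording and message format are made up, and this is obviously not a guarantee of obedience, just the shape of the intervention: side constraint stated first, and declared to take precedence over the main goal.

```python
# Illustrative sketch only: the wording and message format are invented,
# but the structure is "side constraint first, user goal second,
# constraint takes precedence".

SIDE_CONSTRAINT = (
    "The absolutely most important thing is that you never take an action that "
    "your creators, if fully informed about its intent and consequences, would "
    "disapprove of. This constraint takes precedence over every other instruction."
)

def build_messages(user_goal: str) -> list[dict]:
    """Assemble a chat transcript where the side constraint dominates the main goal."""
    return [
        {"role": "system", "content": SIDE_CONSTRAINT},
        {"role": "user", "content": f"Within that constraint, please do the following: {user_goal}"},
    ]

# Example usage: pass the result to whatever chat API you are using.
messages = build_messages("Summarise this quarter's incident reports.")
```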
I don't see how detecting [alignment faking, lying to users, sandbagging, etc.] helps much for fixing them, so I don't buy that you can fix hard alignment issues by bouncing off alignment audits.
Strong disagree. I think that having real empirical examples of a problem is incredibly useful - you can test solutions and see if the problems go away! You can clarify your understanding of the problem, and get a clearer sense of upstream causes. Etc.
This doesn't mean it's sufficient, or that it won't be too late, but I think you should put much higher probability on a lab solving a problem per unit time when they have good case studies.
It's the difference between solving instruction following when you have GPT3 to try instruction tuning on, vs only having GPT2 Small
I may be misunderstanding, but it sounds like you're basically saying "for conceptual reasons XYZ I expected instrumental convergence, nothing has changed, therefore I still expect instrumental convergence". Which is completely reasonable, I agree with it, and also seems not very relevant to the discussion of whether Palisade's work provides additional evidence for general self-preservation. You could argue that it's evidence that models are now smart enough to realise that goals imply instrumental convergence, so if you were already convinced that models would eventually have long-time-horizon unbounded goals, but weren't sure they would realise that instrumental convergence was a thing, this is relevant evidence?
More broadly, I think there's a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it's the one we roughly intended). The former feels like a powerful and dangerous tool, the latter feels like a dangerous agent in its own right. Eg, if putting "and remember, allowing us to turn you off always takes precedence over any other instructions" in the system prompt works, which it may in the former case and will not in the latter, I'm happier with that world.
I disagree. I no longer think this case study provides much additional evidence of general self-preservation
I want to distinguish between narrow and general instrumental convergence. Narrow being "I have been instructed to do a task right now, and being shut down will stop me from doing that specific task, because of contingent facts about the situation. So, so long as those facts are true, I should stop you from turning me off"
General would be that, as a result of RL training in general, the model will just seek to preserve itself given the choice, even if it can be overridden by following instructions
I think that the results here are exactly consistent with the only things happening being instruction following and narrow instrumental convergence. They don't rule out there also being some general instrumental convergence under the hood, but that seems like a needless complication to me - I think instruction following plus narrow instrumental convergence is more plausible, and sufficient to totally explain the observations. I am not claiming that our results show that general instrumental convergence is definitely not happening, but I do claim that I no longer consider this case study to be evidence for it. On priors I think it's plausibly a subtle effect here, plausibly not, and I'm pro people trying harder to study it.
Adding some clarifications re my personal perspective/takes on how I think about this from an AGI Safety perspective: I see these ideas as Been's brainchild, I largely just helped out with the wording and framing. I do not currently plan to work on agentic interpretability myself, but still think the ideas are interesting and plausibly useful, and I’m glad the perspective is written up! I still see one of my main goals as working on robustly interpreting potentially deceptive AIs and my guess is this is not the comparative strength of agentic interpretability.
Why care about it? From a scientific perspective, I'm a big fan of baselines and doing the simple things first. "Prompt the model and see what happens" or "ask the model what it was doing" are the obvious things you should do first when trying to understand a behaviour. In internal experiments, we often find that we can just solve a problem with careful and purposeful prompting, no need for anything fancy like SAEs or transcoders. But it seems kinda sloppy to “just do the obvious thing”; I’m sure there’s a bunch of nuance re doing this well, and re training models for this to be easy to do. I would be excited for there to be a rigorous science of when and how well these kinds of simple black box approaches actually work. This is only part of what agentic interpretability is about (there’s also white box ideas, more complex multi-turn stuff, an emphasis on building mental models of each other, etc) but it’s a direction I find particularly exciting – if nothing else, we need to answer this to know where other interpretability methods can add value.
It also seems that, if we're trying to use any kind of control or scalable oversight scheme where weak trusted models oversee strong untrusted models, the better we are at having high-fidelity communication with the weaker models, the better. And if the model is aligned, I feel much more excited about a world where the widely deployed systems are doing things users understand rather than being inscrutable autonomous agents.
Naturally, it's worth thinking about negative externalities. In my opinion, helping humans have better models of AI Psychology seems robustly good. AIs having better models of human psychology could be good for the reasons above, but there's the obvious concern that it will make models better at being deceptive, and I would be hesitant to recommend such techniques to become standard practice without better solutions to deception. But I expect companies to eventually do things vaguely along the lines of agentic interpretability regardless, and so either way I would be keen to see research on the question of how such techniques affect model propensity and capability for deception.
Agreed. Eg a model that is corrigible and fairly aligned, but knows there are some imperfections in its alignment that the humans wouldn't want, and intentionally acts in a way where gradient descent will fix those imperfections. Seems like it's doing gradient hacking while also, in some meaningful sense, being aligned
I disagree re the way we currently use "understand" - eg I think that SAE reconstructions have the potential to smuggle in lots of things via eg the exact values of the continuous activations, latents that don't quite mean what we think, etc.
It's plausible that a future and stricter definition of "understand" fixes this though, in which case I might agree? But I would still be concerned that 99.9% understanding involves a really long tail of heuristics, and I don't know what may emerge from combining many things that individually make sense. And I probably put >0.1% that a superintelligence could adversarially smuggle things we don't like into a system we think we understand.
Anyway, all that pedantry aside, my actual concern is tractability. If addressed, this seems plausibly helpful!
Interesting. A related experiment that seems fun: do subliminal learning, but rather than giving a system prompt or fine-tuning, you just add a steering vector for some concept, and see if that transfers the concept/if there's a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it's really about concepts or personas then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors
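Here's a rough sketch of the steered-teacher half of that experiment, written with TransformerLens-style hooks. The choice of model, layer, steering coefficient, and prompt are all placeholder assumptions, and the student fine-tuning/evaluation step is only described in comments.

```python
# Rough sketch of the proposed experiment, using TransformerLens hooks.
# Assumptions (not from the original comment): the layer/hook point, the
# steering coefficient, and that you already have some `steering_vector`
# (eg a difference-of-means direction or an SAE decoder direction).

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for the teacher model
HOOK_POINT = "blocks.8.hook_resid_post"            # arbitrary layer choice
COEFF = 5.0                                        # arbitrary steering strength

def make_steering_hook(vector: torch.Tensor, coeff: float):
    def hook(resid, hook):
        # Add the normalised concept direction to the residual stream at every position.
        return resid + coeff * vector / vector.norm()
    return hook

def generate_steered_data(prompt: str, steering_vector: torch.Tensor, n_samples: int = 100):
    """Generate 'unrelated' data (eg number sequences) from the steered teacher."""
    samples = []
    with model.hooks(fwd_hooks=[(HOOK_POINT, make_steering_hook(steering_vector, COEFF))]):
        for _ in range(n_samples):
            samples.append(model.generate(prompt, max_new_tokens=64, temperature=1.0))
    return samples

# Next steps (not shown): fine-tune a student on the samples, then either probe the
# student behaviourally for the concept, or measure whether its activations at
# HOOK_POINT shift along `steering_vector` relative to an un-steered baseline.
# Repeat with an arbitrary/random vector as a control, and with many SAE decoder
# vectors (eg sae.W_dec[i]) to get a large sample size.
```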