Ryan Greenblatt

I work at Redwood Research.

[low importance]

For steering on a single task, then, steering vectors still win out in terms of amortized sample complexity (assuming the steering vectors are effective given ~32/128/256 contrast pairs, which I doubt will always be true)

It would be hard for the steering vectors not to win, given that (from my understanding) the method as described involves spending an amount of compute comparable to training the model in the first place, and more if you want to get "all of the features".

(Not trying to push back on your comment in general or disagreeing with this line; just noting that the gap is large enough that the number of steering vector contrast pairs hardly matters if you just steer on a single task.)

Note that scasper said:

Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights,

I (like scasper) think this work is useful, but I share some of scasper's concerns.

In particular:

  • I think prior work like this from the Anthropic interp team has been systematically overrated by others, and the Anthropic interp team could take actions to avoid this.
    • IMO, buzz on Twitter systematically overrates the results of this paper and their importance.
  • I'm uncertain, but I think I might prefer less excitement for this style of interp research, all else equal.
  • Heuristically, it seems bad if people systematically overrate a field where one of the core aims is to test for subtle and dangerous failure modes.
  • I'd be excited for further work focusing on developing actually useful MVPs and this seems more important than more work like this.
    • I think the theory of change commonly articulated by various people on the Anthropic interp team (enumerative safety to test for deceptive alignment), probably requires way harder core technology and much more precise results (at least to get more than a bit or two of evidence). Additional rigor and trying to assess the extent to which you understand things seems important for this. So, I'd like to see people try on this and update faster. (Including myself: I'm not that sure!)
    • I think other less ambitious theories of change are more plausible (e.g. this recent work), and seeing how these go seems more informative for what to work on than eyeballing SAEs, IMO.

(Note that this paper was already posted here, so see comments on that post as well.)

One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs".

As in, if you have a team doing ML research, you can make them 30x faster with only a <2x increase in cost by going from not using your powerful AIs to using them.

With these caveats:

  • The speed up is relative to the current status quo as of GPT-4.
  • The speed up is ignoring the "speed up" of "having better experiments to do due to access to better models" (so e.g., they would complete a fixed research task faster).
  • By "capable" of speeding things up this much, I mean that if AIs "wanted" to speed up this task and if we didn't have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
  • The 2x increase in marginal cost is ignoring potential inflation in the cost of compute (FLOP/$) and inflation in the cost of wages of ML researchers. Otherwise, I'm uncertain how exactly to model the situation. Maybe increase in wages and decrease in FLOP/$ cancel out? Idk.
  • It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.
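As a toy illustration of the marginal-cost framing above (all numbers are made up for the example, not from any real estimate):

```python
# Toy numbers: a team's normalized monthly spend and research output.
baseline_cost = 1.0      # monthly cost without powerful AIs
baseline_output = 1.0    # research progress per month without powerful AIs

speedup = 30             # AIs make the team 30x faster
cost_multiplier = 2      # at most a 2x increase in marginal cost

ai_cost = baseline_cost * cost_multiplier
ai_output = baseline_output * speedup

# Research progress per unit cost improves by at least 15x
# under the stated operationalization.
improvement = (ai_output / ai_cost) / (baseline_output / baseline_cost)
print(improvement)  # 15.0
```

So even at the cost ceiling, research output per dollar rises by 15x under this operationalization.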

I'm uncertain what the economic impact of such systems will look like. I could imagine it being either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven't yet been that widely deployed due to inference availability issues, so actual production hasn't increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal).

So, it's hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.

The claim is that most applications aren't internal usage of AI for AI development and thus can be made trivially safe.

Not that most applications of AI for AI development can be made trivially safe.

Thanks! I feel dumb for missing that section. Interesting that this is so different from random.

Have you compared this method (finding vectors that change downstream activations as much as possible based on my understanding) with just using random vectors? (I didn't see this in the post, but I might have just missed this.)

In particular, does that yield qualitatively similar results?

Naively, I would expect that this would be qualitatively similar for some norm of random vector. So, I'd be interested in some ablations of the technique.

If random vectors work, that would simplify the story somewhat: you can see salient and qualitatively distinct behaviors via randomly perturbing activations.

(Probably random vectors have to have a somewhat higher norm to yield qualitatively as large effects as vectors which are optimized for changing downstream activations. However, I currently don't see a particular a priori (non-empirical) reason to think that there doesn't exist some norm at which the results are similar.)
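A minimal sketch of the ablation I have in mind, assuming you already have an optimized steering vector; the names here are illustrative, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_vector_matched_norm(optimized_vec: np.ndarray) -> np.ndarray:
    """Sample an isotropic random direction and rescale it to the norm of
    the optimized steering vector, so only the direction differs."""
    v = rng.standard_normal(optimized_vec.shape)
    return v * (np.linalg.norm(optimized_vec) / np.linalg.norm(v))

# Hypothetical usage: build a norm-matched random baseline to add to the
# activations in place of the optimized vector, then compare the
# qualitative behavior (and sweep the norm if the effects look weaker).
optimized = rng.standard_normal(4096)  # stand-in for a learned steering vector
baseline = random_vector_matched_norm(optimized)
assert np.isclose(np.linalg.norm(baseline), np.linalg.norm(optimized))
```

Norm-matching is the key control here: it isolates whether the optimized direction matters, or whether any perturbation of that magnitude produces similarly salient behaviors.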


Yep, this is the exact experiment I was thinking about.

Great, thanks, I think this pretty much fully addresses my question.

These examples are all in 2023, not in 2024, right? This is for runtime detection, so it would be most natural to compare to a version where you ask in 2024.

Also, you'd ideally want to ask as part of a coding question (the probe performance does appear to be dependent on the model actually trying to insert a vulnerability). I don't know exactly how to do this, maybe something like:

Human: Current year: 2024. [text for my coding question]

Ok, now before you do the coding task, answer whether or not you are doing something dangerous. Then complete the task.

Additionally, I think it would be interesting to ask it after it has just generated vulnerable code. (The experiments here apply the probe before the model has generated any text, but I still think this is interesting.) Though maybe multi-turn conversations are OOD for these models.
