All of Joseph Bloom's Comments + Replies

My vibe from this post is something like "we're making progress on stuff that could be helpful, so there's stuff to work on!" and this is a vibe I like. However, I suspect that for people who might not be as excited about these approaches, you're likely not touching on important cruxes (e.g.: do these approaches really scale? Are some agendas capabilities-enhancing? Will these solve deceptive alignment or just corrigible alignment?)

I also think that if the goal is to actually make progress and not to maximize the number of people making progress or who feel like they'... (read more)

Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar, and this has enabled me to do what I call an "injection coefficient scan". 

The procedure I'm using looks like this (a rough code sketch follows the list):

  1. Repeat your input tokens, say, 128 times. 
  2. Apply the activation vector at 128 evenly spaced coefficients between -10 and 10, one coefficient per copy of the input, during your AVEC forward pass. 
  3. Decompose the resulting residual stream to whatever granula... (read more)
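For concreteness, here is a minimal sketch of what such a scan could look like. It assumes a TransformerLens-style model with run_with_hooks; the function name, hook point, and tensor shapes are illustrative guesses, not the original implementation.

```python
import torch

def injection_coefficient_scan(model, tokens, steering_vector, layer,
                               n_steps=128, coef_min=-10.0, coef_max=10.0):
    """One batched forward pass: row i of the batch gets the activation
    vector injected with the i-th coefficient in [coef_min, coef_max]."""
    coefs = torch.linspace(coef_min, coef_max, n_steps)  # (n_steps,)
    batch = tokens.repeat(n_steps, 1)                    # one prompt copy per coefficient

    def inject(resid, hook):
        # resid: (n_steps, seq, d_model); add coef_i * steering_vector to row i
        return resid + coefs[:, None, None].to(resid.device) * steering_vector

    logits = model.run_with_hooks(
        batch, fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", inject)]
    )
    return coefs, logits
```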
Alex Turner (6mo):
I don't think I follow your procedure. Would you be willing to walk me through an example situation?

We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote up this post because both Alex and I independently noticed this and weren't aware of it previously, so we wanted to make a reference post.

Happy to provide! I think I'm pretty interested in testing this/working on this in the future. Currently a bit tied up but I think (as Alex hints at) there could be some big implications for interpretability here.

TLDR: Documenting existing circuits is good but explaining what relation... (read more)

Alex Turner (7mo):
Can you say more on this point? The latter kind of good (useful when integrated with other features) doesn't necessarily imply that direct unembed (logit lens) or learned linear unembed (tuned lens iirc) would be able to extract use from such goods. I suspect that I probably just missed your point, though.
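For readers unfamiliar with the terminology: the "direct unembed" (logit lens) simply applies the final LayerNorm and the unembedding matrix to an intermediate residual stream. A minimal sketch, assuming a TransformerLens-style model (the hook name and module attributes follow that library's conventions):

```python
import torch

def logit_lens(model, tokens, layer):
    """Read off intermediate predictions by unembedding the residual stream
    after `layer`, rather than only after the final layer."""
    _, cache = model.run_with_cache(tokens)
    resid = cache[f"blocks.{layer}.hook_resid_post"]  # (batch, seq, d_model)
    return model.unembed(model.ln_final(resid))       # (batch, seq, d_vocab)
```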

Second pass through this post, which solidly nerd-sniped me! 

A quick summary of my understanding of the post (intentionally very reductive, though I understand the post may make more subtle points): 

  1. There appears to be exponential growth in the norm of the residual stream in a range of models. Why is this the case? (See the sketch after this list for one way to measure it.)
  2. You consider two hypotheses: 
    1. That the parameters in the Attention and/or MLP weights increase later in the network. 
    2. That there is some monkey business with the layer norm sneaking in a single extra feature. 
  3. In ter... (read more)
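For concreteness, here is one way the exponential-growth claim in point 1 could be checked. This is a sketch assuming a TransformerLens-style model, not code from the post:

```python
import torch

def resid_norms_by_layer(model, tokens):
    """Mean L2 norm of the residual stream at each block's input. Exponential
    growth shows up as a straight line on a log-scale plot over layers."""
    _, cache = model.run_with_cache(tokens)
    norms = []
    for layer in range(model.cfg.n_layers):
        resid = cache[f"blocks.{layer}.hook_resid_pre"]  # (batch, seq, d_model)
        norms.append(resid.norm(dim=-1).mean().item())
    return norms
```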
Stefan Heimersheim (7mo):
Thanks for the extensive comment! Your summary is really helpful to see how this came across; here's my take on a couple of these points:

2.b: The network would be sneaking information about the size of the residual stream past LayerNorm. So the network wants to implement a sort of "grow by a factor X every layer" and wants to prevent LayerNorm from resetting its progress.

1: There's the difference between (i) How does the model make the residual stream grow exponentially -- the answer is probably theory 1, that something in the weights grows exponentially -- and (ii) our best guess on Why the model would ever want this, which is the information deletion thing. Yep, we give some evidence for How, but for Why we have only a guess.

Yes, all we have is some intuition here. It seems plausible that the model needs to communicate stuff between some layers, but doesn't want this to take up space in the residual stream. So this exponential growth is a neat way to make old information decay away (relatively). And it seems plausible to implement a few amplification circuits for information that has to be preserved for much later in the network.

We would love to see more ideas & hypotheses on why the model might be doing this, as well as attempts to test this! We mainly wrote up this post because both Alex and I independently noticed this and weren't aware of it previously, so we wanted to make a reference post.
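A tiny demonstration of the LayerNorm point above: LayerNorm's output is invariant to rescaling its input, so the absolute norm of the residual stream is invisible downstream and only the relative size of features survives. This snippet is a self-contained illustration, not code from the post:

```python
import torch

# LayerNorm output is unchanged when its input is rescaled, so absolute
# residual-stream norm carries no information into the next block.
ln = torch.nn.LayerNorm(8, elementwise_affine=False)
x = torch.randn(8)
print(torch.allclose(ln(x), ln(100 * x), atol=1e-4))  # True

# So if fresh writes grow by a factor X per layer, a feature written k layers
# earlier is about X**-k smaller than fresh ones after LayerNorm, and it
# effectively decays away unless an amplification circuit re-boosts it.
```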

Thank you for letting me know about your work on procgen with MI. It sounds like you're making progress; in particular, I'd be interested in your visualisation techniques (how do they compare to what was done in Understanding RL Vision?) and the reproduction of the cheese-maze policies (is this tricky? Do you think a DT could be well-calibrated on this problem?). 

Some questions that might be useful to discuss more:

  • What are the pros/cons of doing DT vs actor-critic MI? (You're using some form of actor-critic?) It could also be interesting to study analo... (read more)
Alex Turner (9mo):
We're studying a net with the structure I described in a comment below, trained via PPO. I'd be happy to discuss more at EAG. Not posting much publicly right now so that we can (a) work on the research sprint and (b) let people preregister credences in various mechint / generalization propositions, so that they can calibrate / see how their opinions evolve over time. 

Hey Adam, thanks for running Refine and writing this up. 

Out of curiosity, do you (or anyone else) know if there are statistics for previous SERI-MATS cohorts/other programs designed to generate conceptual alignment researchers? 

Adam Shimi (1y):
Thanks for the kind words! I'm not aware of any such statistics, but I'm guessing that MATS organizers might have some.