Thanks for writing this up!
I'm curious about this:
I personally found the discussion useful for helping me understand what motivated some of the researchers I talked to. I was surprised by the diversity.
What motivated people in particular? What was surprising?
Minor clarifying point: Act-adds cannot be cast as ablations.
Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It's possible there's a better or more standard word that I can't think of right now.
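To make that concrete, here's a minimal sketch of the kind of intervention I mean, assuming a HuggingFace GPT-2 and a steering direction you've already found somehow; the layer index, coefficient, and `direction` below are made-up placeholders rather than values from any actual experiment:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer = 6                                      # which block's residual stream to intervene on
direction = torch.randn(model.config.n_embd)   # stand-in for a discovered direction
coeff = 5.0                                    # scale; negative values subtract instead of add

def steer(module, inputs, output):
    # Add coeff * direction to the residual stream at every position.
    hidden = output[0] + coeff * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tokenizer("The weather today is", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```

Comparing generations with the hook on vs. off (and with the sign of `coeff` flipped) is the "see what happens to the outputs" part.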
Also, another example of an attempt at interp -> alignment would arguably be the model editing stuff following causal tracing in the ROME paper?
This is why I'm pessimistic about most interpretability work. It just isn't focused enough.
Most of the "exploratory" interp work you suggest is trying to achieve an ambitious mechanistic understanding of models, which requires a really high degree of model understanding in general. They're not trying to solve particular concrete problems, and it seems unfair to evaluate them according to a different theory of change. If you're going to argue against this line of work, I think you should either argue that they're failing to achieve their theory of change, or...
Glad to see that this work is out!
I don't have much to say here, especially since I don't want to rehash the old arguments about the usefulness of prosaic adversarial ML research. (I think it's worth working on but the direct impacts of the work are unclear). I do think that most people in AIS agree that image advexes are challenging and generally unsolved, but the people who disagree on the relevance of this line of research tend to question the implied threat model.
The main funders are LTFF, SFF/Lightspeed/other S-process stuff from Jaan Tallinn, and Open Phil. LTFF is the main one that solicits independent researcher grant applications.
There are a lot of orgs. Off the top of my head, there's Anthropic/OpenAI/GDM as the scaling labs with decent-sized alignment teams, and then a bunch of smaller/independent orgs:
And there's always academia.
(I'm sure I'm missing a few though!)
(EDIT: added in RR and CLR)
I think this has gotten both worse and better in several ways.
It's gotten better in that ARC and Redwood (and to a lesser extent, Anthropic and OpenAI) have put out significantly more of their research. FAR Labs also exists and is doing some of the research proliferation that would've gone on inside of Constellation.
It's worse in that there's been some amount of deliberate effort to build more of an AIS community in Constellation, e.g. with explicit Alignment Days where people are encouraged to present work-in-progress and additional fellowships and workshops.
On net I think it's gotten better, mainly because there's just been a lot more content put out in 2023 (per unit research) than in 2022.
I suspect the underfitting explanation is probably a lot of what's going on given the small models used by the authors. But in the case of larger, more capable models, why would you expect it to be underfitting instead of generalization (properly fitting)?
I don't think so, unfortunately, and it's been so long that I don't think I can find the code, let alone get it running.
I think the deciding difference is that the number of fans and supporters who want to be actively involved and who think the problem is the most important in the world is much larger than the number of researchers; while popular physics book readers and nature documentary viewers are plentiful, I doubt most of them feel a compelling need to become involved!
Great work, glad to see it out!
...
- Why doesn't algebraic value editing break all kinds of internal computations?! What happened to the "manifold of usual activations"? Doesn't that matter at all?
- Or the hugely nonlinear network architecture, which doesn't even have a persistent residual stream? Why can I diff across internal activations for different observations?
- Why can I just add 10 times the top-right vector and still get roughly reasonable behavior?
- And the top-right vector also transfers across mazes? Why isn't it maze-specific?
- To make up
Thanks!
(As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly in the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)
For what it's worth, I buy the claim from Gwern that Microsoft trained Sydney pretty poorly, much worse than is achievable with SFT on highly rated data. For example, Sydney shows significant repetition, which you don't see even on text-davinci-002 or (early 2022) LaMDA, both trained without RLHF.
Quick clarifications:
Edit: played around with the models, it seems like the transformer only gets 99.7% train accuracy and 97.5% test accuracy!
I broadly agree with the points being made here, but allow me to nitpick the use of the word "predictive" here, and argue for the key advantage of the simulators framing over the prediction one:
Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next.
The simulators frame does make it very clear that there's a distinction between the simulator/GPT-3 and the simulacra/characters or situations it's making predictions abo...
The time-evolution rules of the state are simply the probabilities of the autoregressive model -- there's some amount of high level structure but not a lot. (As Ryan says, you don't get the normal property you want from a state (the Markov property) except in a very weak sense.)
I also disagree that purely thinking about the text as state + GPT-3 as evolution rules is the intention of the original simulators post; there's a lot of discussion about the content of the simulations themselves as simulated realities or alternative universes (though the post does...
Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting.
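To spell out the state/evolution picture being discussed (a toy sketch with GPT-2 standing in for GPT-3; the prompt and sampling choices are just illustrative): the state is the token sequence, truncated to the context window, and a time step is one autoregressive sample.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
CTX = model.config.n_positions  # the "state" is at most this many tokens

def step(state_ids: torch.Tensor) -> torch.Tensor:
    """One 'time step': the transition rule is just the model's next-token distribution."""
    with torch.no_grad():
        logits = model(state_ids[:, -CTX:]).logits[:, -1, :]
    next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
    # The new state is the old state with the sampled token appended; nothing else persists.
    return torch.cat([state_ids, next_id], dim=-1)

state = tokenizer("Once upon a time", return_tensors="pt").input_ids
for _ in range(20):
    state = step(state)
print(tokenizer.decode(state[0]))
```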
We were quite familiar with Geiger et al's work before writing the post, and think it's importantly different. Though it seems like we forgot to cite it in the Causal Scrubbing AF post, whoops.
Hopefully this will be fixed with the forthcoming arXiv paper!
Based on my convos with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer is important for interp -- superposition almost entirely stops you from using the standard features-as-directions approach!
In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I talk about engineering relevance in this sequence, I don't have big advancements in mind so much as stuff like fairly simple debugging work.
Fwiw this does not seem to be in the Dan Hendrycks post you linked!
Google’s event where they’re presumably unveiling their response will happen Feb 8th at 2:30 PM CET/5:30 AM PT:
That being said, it's possible that both group composition tasks (like the mod add stuff) and MNIST are pretty special datasets, in that generalizing solutions have small weight norm and memorization solutions have large weight norm. It might be worth constructing tasks where generalizing solutions have large weight norm, and seeing what happens.
The negative result tells us that the strong form of the claim "regularization = navigability" is probably wrong. Having a smaller weight norm actually is good for generalization (just as the learning theorists would have you believe). You'll have better luck moving along the set of minimum loss weights in the way that minimizes the norm than in any other way.
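To spell out the learning-theory picture I'm gesturing at (my gloss, not a derivation from the post): with weight decay the trained objective is

$$\mathcal{L}_{\text{total}}(\theta) = \mathcal{L}_{\text{train}}(\theta) + \lambda \lVert \theta \rVert_2^2,$$

so once the training loss is essentially zero, the only gradient signal left is (approximately) $-2\lambda\theta$ restricted to the directions that keep the training loss at zero -- i.e. the dynamics move along the set of minimum-loss weights in whatever direction shrinks $\lVert \theta \rVert_2$ fastest.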
Have you seen the Omnigrok work? It argues that weight norm is directly related to grokking:
Similarly, Figure 7 from https://arxiv.org/abs/2301.05217 also makes this point, but less str...
As for other forms of noise inducing grokking: we do see grokking with dropout! So there's some reason to think noise -> grokking.
(Source: Figure 28 from https://arxiv.org/abs/2301.05217)
Also worth noting that grokking is pretty hyperparameter sensitive -- it's possible you just haven't found the right size/form of noise yet!
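For reference, here's roughly the minimal kind of setup I have in mind when talking about hyperparameter sensitivity: a small model on modular addition trained full-batch with AdamW and weight decay. The architecture and numbers below are illustrative stand-ins (an MLP rather than the 1-layer transformers from the papers above), not an exact reproduction:

```python
import torch
import torch.nn as nn

P = 113  # modulus for the a + b (mod P) task
# Full dataset of (a, b) pairs, split into train/test.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: int(0.3 * len(pairs))], perm[int(0.3 * len(pairs)) :]

# A deliberately small model: token embeddings + MLP.
class ModAdd(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(P, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 512), nn.ReLU(), nn.Linear(512, P))

    def forward(self, ab):
        return self.mlp(self.emb(ab).flatten(1))

model = ModAdd()
# Full-batch AdamW with weight decay is the standard grokking setup; setting
# weight_decay=0 with an adaptive optimizer is where you'd look for slingshot-style runs.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"{step}: train_loss={loss.item():.4f} test_acc={test_acc.item():.3f}")
```

Toggling weight_decay, adding dropout inside the MLP, or injecting noise into the inputs/gradients is where the "right size/form of noise" question would bite.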
In particular, can we use noise to make a model grok even in the absence of regularization (which is currently a requirement to make models grok with SGD)?
Worth noting that you can get grokking in some cases without explicit regularization with full batch gradient descent, if you use an adaptive optimizer, due to the slingshot mechanism: https://arxiv.org/abs/2206.04817
Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get it to happen with 2+ layer transformers but not reliably on 1 layer t...
Yep, this is correct - in the worst case, you could have performance that is exponential in the size of the interpretation.
(Redwood is fully aware of this problem and there have been several efforts to fix it.)
Yeah, I think it was implicitly assumed that there existed some such that no token ever had probability .
Thanks for the clarification!
I agree that your model of subagents in the two posts shares a lot of commonalities with parts of Shard Theory, and I should've done a lit review of your subagent posts. (I based my understanding of subagent models on some of the AI safety formalisms I've seen, as well as John Wentworth's Why Subagents?.) My bad.
That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness. I would've classified your work as closer to Shard Theory than the subagent models I normally think about.
Thanks!
just procrastination/lacking urgency
This is probably true in general, to be honest. However, it's an explanation for why people don't do anything, and I'm not sure it differentially leads to delaying contact with reality more than, say, delaying writing up your ideas in a Google doc.
Some more strategies I like for touching reality faster
I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above. I also think the meta strategy point of building a good workflow is super important!
Thanks for posting this! I agree that it's good to get it out anyways, I thought it was valuable. I especially resonate with the point in the Pure simulators section.
Some responses:
In general I'm skeptical that the simulator framing adds much relative to 'the model is predicting what token would appear next in the training data given the input tokens'. I think it's pretty important to think about what exactly is in the training data, rather than about some general idea of accurately simulating the world.
I think that the main value of the simula...
- C* What is the role of Negative/Backup/regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?
So, it turns out that negative prediction heads appear ~everywhere! For example, Noa Nabeshima found them on ResNeXts trained on ImageNet: there seem to be heads that significantly reduce the probability of certain outputs. IIRC the explanation we settled on was calibration; ablating these heads seemed to increase log loss via overconfident predictions on borderline cases?
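For the transformer case (i.e. IOI-style negative name movers, not Noa's ResNeXt setup), the ablate-and-check-log-loss experiment looks roughly like this sketch; the layer/head indices and prompt are made up for illustration:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
layer, head = 10, 7  # hypothetical head to ablate; not a claim about GPT-2's actual circuits

text = "The Eiffel Tower is located in the city of Paris."
ids = tokenizer(text, return_tensors="pt").input_ids

def log_loss(head_mask=None):
    with torch.no_grad():
        out = model(ids, labels=ids, head_mask=head_mask)
    return out.loss.item()  # mean next-token log loss over the sequence

# head_mask: 1 = keep head, 0 = zero-ablate it (shape [n_layer, n_head]).
mask = torch.ones(model.config.n_layer, model.config.n_head)
mask[layer, head] = 0.0

print("clean   :", log_loss())
print("ablated :", log_loss(mask))  # loss going up on confident cases is the calibration story
```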
The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.
I claim that this is 1) an instance of a common pattern that 2) is currently missing a step (the pre-newbie stage).
The general pattern is the following (terminology borrowed from Terry Tao):
Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.
In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:
Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.
Or see this post by ...
Thanks Nate!
I didn't add a 1-sentence bullet point for each thesis because I thought the table of contents on the left was sufficient, though in retrospect I should've written it up, mainly for the learning value. Do you still think it's worth doing after the fact?
Ditto the tweet thread, assuming I don't plan on tweeting this.
See also Superexponential Conceptspace, and Simple Words, from the Sequences:
...By the time you're talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you'd have to see over a trillion examples before you could say what was In, and what was Out. You'd have to see every possible example, in fact.
[...]
From this perspective, learning doesn't just rely on inductive bias, it is nea
wirehead-proof crib, and eventually it will be sufficiently self-aware and foresighted that when we let it out of the crib, it can deliberately avoid situations that would get it addicted to wireheading.
I feel like I'm saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where human approval diverges from what the humans "would want if fully informed".
I actually want to controversy that. I'm now going to write quickly about selection arguments in alignment more generally (thi...
Right, that's a decent objection.
I have three responses:
There’s no such thing as convergence in the real world. It’s essentially infinitely complicated. There are always new things to discover.
I would ask “how is it that I don’t want to take cocaine right now”? Well, if I took cocaine, I would get addicted. And I know that. And I don’t want to get addicted. So I have been deliberately avoiding cocaine for my whole life. By the same token, maybe we can raise our baby AGIs in a wirehead-proof crib, and eventually it will be sufficiently self-aware and foresighted that when we let it out of the crib, it can delibe...
I think the claim that an optimizer is a retargetable search process makes a lot of sense* and I've edited the post to link to this clarification.
That being said, I'm still confused about the details.
Suppose that I do a goal-conditioned version of the paper, where (hypothetically) I exhibit a transformer circuit that, conditioned on some prompt or the other, was able to alternate between performing gradient descent on three types of objectives (say, L1, L2, L∞) -- would this suffice? How about if, instead, there wasn't any prompt that let me swi...
Well, no, that's not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer:
A system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system
And the following definition of a mesa-optimizer:
...Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) fi
That definition of "optimizer" requires
some objective function that is explicitly represented within the system
but that is not the case here.
There is a fundamental difference between
The transformers in this paper are programs of the 2nd type. They don't contain any l...
I really do empathize with the authors, since writing an abstract fundamentally requires trading off faithfulness to the paper's content against the abstract's length and readability. But I do agree that they could've been more precise without a significant increase in length.
Nitpick: I think instead of expanding on the sentence
As a result we are able to train a more harmless and less evasive AI assistant than previous attempts that engages with harmful queries by more often explaining its objections to them than avoiding answering
My proposed rewrite ...
Thanks!
Yeah, I think ELK is surprisingly popular in my experience amongst academics, though they tend to frame it in terms of partial observability (as opposed to the measurement tampering framing I often hear EA/AIS people use).