## AI Alignment Forum

Lawrence Chan

I do AI Alignment research. Currently at ARC Evals, though I still dabble in interpretability in my spare time.

I'm also currently on leave from my PhD at UC Berkeley's CHAI.

# Sequences

[Redwood Research] Causal Scrubbing

# Wiki Contributions

The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.

I claim that this is 1) an instance of a common pattern that 2) is currently missing a step (the pre-newbie stage).

The general pattern is the following (terminology borrowed from Terry Tao):

1. The pre-rigorous stage: Really new people don't know how ~anything works in a field, and so use high-level abstractions that aren't necessarily grounded in reality.
2. The rigorous stage: Intermediate people learn the concrete grounding behind a field, but get bogged down in minutia.
3. The post-rigorous stage: Experts form correct high-level abstractions informed by their understanding of the grounding, but still fall back on the grounding when the high-level abstractions break down.

I think that many experts mainly notice the 2->3 transition, but not the 1->2 one, and so often do newbies a disservice by encouraging them not to work in the rigorous stage. I claim this is really, really bad, and that a solid understanding of the rigorous stage is a good idea for ~everyone doing technical work.

Here's a few examples:

• Terry Tao talks about this in math: Early students start out not knowing what a proof is and have to manipulate high-level, handwavy concepts. More advanced students learn the rigorous foundations behind various fields of math but are encouraged to focus on the formalism as opposed to 'focusing too much on what such objects actually “mean”'. Eventually, mathematicians are able to revisit their early intuitions and focus on the big picture, converting their intuitions to rigorous arguments when needed.
• The exact same thing is true in almost any discipline with mathematical proofs, e.g. physics or theoretical CS.
• Something very similar happens with programming as well.
• In many strategy games (e.g. Chess), you see high-level strategizing at both the very low and very high skill levels, while the middle levels are focused on improving technique.
• I'd also claim that something similar happens in psychology: freshmen undergrads come up with grand theories of human cognition that are pretty detached from reality, many intermediate researchers get bogged down in experimental results, while the good psychology researchers form high level representations and theories based on their knowledge of experiments (and are able to trivially translate intuitions into empirical claims).

Yep, this is correct - in the worst case, you could have performance that is exponential in the size of the interpretation.

(Redwood is fully aware of this problem and there have been several efforts to fix it.)
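To make the blowup concrete: in causal scrubbing, the model is "treeified" so that each input to each node of the interpretation can be resampled independently, and the number of distinct input copies multiplies at every level. A toy sketch of the counting argument (the full-tree shape is a simplifying assumption, not Redwood's actual implementation):

```python
def num_treeified_leaves(branching: int, depth: int) -> int:
    """Count input copies in a treeified computation.

    If every node of a full `branching`-ary interpretation tree of the
    given depth resamples each of its inputs independently, the
    treeified graph ends up with branching**depth distinct input
    copies at the leaves -- exponential in interpretation depth.
    """
    leaves = 1
    for _ in range(depth):
        leaves *= branching
    return leaves

# Even a modest depth-10 binary interpretation needs 1024 input copies.
print(num_treeified_leaves(2, 10))  # 1024
```

Since each distinct combination of resampled inputs can require its own forward pass, the naive evaluation cost tracks this leaf count.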

Yeah, I think it was implicitly assumed that there existed some ε > 0 such that no token ever had probability less than ε.
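Such a lower bound on token probabilities immediately caps the per-token log loss, which is presumably why the assumption matters. A minimal sketch (the ε value is just an example):

```python
import math

def worst_case_log_loss(eps: float) -> float:
    """If the model never assigns any token probability below eps,
    then the per-token log loss -log(p) is bounded above by -log(eps)."""
    return -math.log(eps)

# With eps = 1e-4, no single token can contribute more than ~9.21 nats.
bound = worst_case_log_loss(1e-4)
print(round(bound, 2))  # 9.21
```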

Thanks for the clarification!

I agree that your model of subagents in the two posts share a lot of commonalities with parts of Shard Theory, and I should've done a lit review of your subagent posts. (I based my understanding of subagent models on some of the AI Safety formalisms I've seen as well as John Wentworth's Why Subagents?.) My bad.

That being said, I think it's a bit weird to have "habitual subagents", since the word "agent" seems to imply some amount of goal-directedness. I would've classified your work as closer to Shard Theory than the subagent models I normally think about.

Thanks!

> just procrastination/lacking urgency

This is probably true in general, to be honest. However, it's an explanation for why people don't do anything, and I'm not sure this differentially leads to delaying contact with reality more than say, delaying writing up your ideas in a Google doc.

> Some more strategies I like for touching reality faster

I like the "explain your ideas to other people" point, it seems like an important caveat/improvement to the "have good collaborators" strategy I describe above. I also think the meta strategy point of building a good workflow is super important!

I think this is a good word of caution. I'll edit in a link to this comment.

Thanks for posting this! I agree that it's good to get it out anyway; I thought it was valuable. I especially resonate with the point in the Pure simulators section.

Some responses:

> In general I'm skeptical that the simulator framing adds much relative to 'the model is predicting what token would appear next in the training data given the input tokens'. I think it's pretty important to think about what exactly is in the training data, rather than about some general idea of accurately simulating the world.

I think that the main value of the simulators framing was to push back against confused claims that treat (base) GPT3 and other generative models as traditional rational agents. That being said, I do think there are some reasons why the simulator framework adds value relative to "the model is doing next token prediction":

• The simulator framework incorporates specific facts about the token prediction task. We train generative models on tokens from a variety of agents, as opposed to a single unitary agent a la traditional behavior cloning. Therefore, we should expect different behaviors when the context implies that different agents are "natural". In other words, the model learns to produce the text that the contextually-implied agent would produce.
• The simulator framework pushes back against "stochastic parrot" claims. In academia or on ML twitter (or, even more so, academic ML twitter), you often encounter claims that language models are "just" stochastic parrots -- i.e. they don't have "understanding" or "grounding". My guess is this comes from experience with earlier generations of language models, especially early n-gram/small HMM models that really do lack understanding or grounding. (This is less of a thing that happens on LW/AF.) The simulator framework provides a mechanistic model for how a sophisticated language model that does well on the next-token prediction task could end up developing a complicated world model and agentic behavior.
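A toy sketch of the first point: a single next-token predictor trained on text from several "agents" behaves differently depending on which agent the context implies. (The corpus and the trigram model here are hypothetical stand-ins for a real LM and its training data.)

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: two "agents" with different verbal habits.
corpus = [
    "alice says yes", "alice says yes", "alice says maybe",
    "bob says no", "bob says no", "bob says never",
]

# One trigram model trained on all agents' text at once.
counts = defaultdict(Counter)
for line in corpus:
    toks = line.split()
    for i in range(len(toks) - 2):
        counts[(toks[i], toks[i + 1])][toks[i + 2]] += 1

def most_likely_next(context):
    """Greedy next-token prediction given the last two tokens."""
    return counts[context].most_common(1)[0][0]

# The same model "simulates" different agents depending on context.
print(most_likely_next(("alice", "says")))  # yes
print(most_likely_next(("bob", "says")))    # no
```

The point is just that a unitary predictor plus agent-indicating context yields agent-dependent behavior, with no single "goal" baked into the model.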

My guess is you have a significantly more sophisticated, empirical model of LMs, such that the simulators framework feels like a simplification to your empirical knowledge + "the model is doing next token prediction". But I think the simulator framework is valuable because it incorporates additional knowledge about the LM task while pushing back against two significantly more confused framings. (Indeed, Janus makes these claims explicitly in the simulators post!)

> (Paul has a post which talks about this 'what is actually the correct generalization' thing somewhere that I wanted to link, but I can't currently find it)

Are you thinking of A naive alignment strategy and optimism about generalization?

(Paul does talk about intended vs unintended generalization in a bunch of posts, so it's conceivable you're thinking about something more specific.)

## GPT-style transformers are purely myopic

I'm not sure this is that important, or that anyone else actually thinks this, but it was something I got wrong for a while. I was thinking of everything that happens at sequence position n as being solely about myopically predicting the nth token.

I do think people think variants of this, see the comments of Steering Behaviour: Testing for (Non-)Myopia in Language Models for example.
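One way to see why the purely-myopic picture is wrong: in a causal transformer, the activations computed at position n are read by attention at every later position, so they also serve future predictions. A pure-Python toy with scalar "embeddings" and identity Q/K/V (an illustration of causal attention's information flow, not a real transformer layer):

```python
import math

def causal_attention(xs):
    """Minimal single-head self-attention over scalar inputs.
    Position i attends only to positions j <= i (causal mask)."""
    out = []
    for i in range(len(xs)):
        scores = [xs[i] * xs[j] for j in range(i + 1)]
        m = max(scores)                      # stabilize the softmax
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        out.append(sum(w * xs[j] for j, w in enumerate(ws)) / z)
    return out

# Changing only the input at position 0 changes the output at the
# last position: early-position computation feeds later predictions.
a = causal_attention([1.0, 0.5, 2.0])
b = causal_attention([9.0, 0.5, 2.0])
print(a[-1] != b[-1])  # True
```

So even though the training loss at position n only scores the next-token prediction, the features built at position n are under pressure to be useful for all later positions too.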

> C* What is the role of Negative/Backup/regular Name Mover Heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?

So, it turns out that negative prediction heads appear ~everywhere! For example, Noa Nabeshima found them on ResNeXts trained on ImageNet: there seem to be heads that significantly reduce the probability of certain outputs. IIRC the explanation we settled on was calibration; ablating these heads seemed to increase log loss via overconfident predictions on borderline cases?
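A hedged toy illustration of that calibration story: if the "positive" circuit is overconfident on a borderline case, a head that writes negatively against the favored logit reduces log loss, and ablating it (keeping only the overconfident logits) increases loss. All the numbers below are made up for illustration:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_loss(logits, true_idx):
    """Cross-entropy of the softmax distribution against the true label."""
    return -math.log(softmax(logits)[true_idx])

# Borderline case whose true label is class 1, but the "positive"
# circuit is overconfident in class 0 (hypothetical numbers).
overconfident = [4.0, 0.0]           # negative head ablated
calibrated = [4.0 - 3.0, 0.0]        # negative head pulls class 0 back down

print(log_loss(calibrated, 1) < log_loss(overconfident, 1))  # True
```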

Many forms of interpretability seek to explain how the network's outputs relate to high-level concepts without referencing the actual functioning of the network. Saliency maps are a classic example, as are "build an interpretable model" techniques such as LIME.

In contrast, mechanistic interpretability tries to understand the mechanisms that compose the network. To use Chris Olah's words:

> Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

Or see this post by Daniel Filan.