All of Neel Nanda's Comments + Replies

I'll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like "come in, the water is fine, don't worry, you won't end up with people criticizing you for maybe ending civilization or self-deceiving along the way or calling you unethical". While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in

... (read more)
2Ben Pace3d
Thanks for the link, I'll aim to give that podcast a listen, it's relevant to a bunch of my current thinking.

Another reason for publishing more quickly is that conversations with many interpretability researchers have led us to believe that there is a wealth of knowledge in short experiments and unpublished research that really should be shared. We’d encourage other organizations who think similarly to post frequently, and share results even if they’re not completely polished.

Extremely strong +1! There is so much low-hanging fruit in mechanistic interpretability (of the flavour "I could probably spend a full-time day working on this and find something mildly c... (read more)

I'd recommend editing a link to Ethan's comment into the top of the post - I think people could easily leave with a misleading impression otherwise

2Edouard Harris10d
Done, a few days ago. Sorry thought I'd responded to this comment.

Oh that's sketchy af lol. Thanks!

See my other comment - it turns out to be the boring fact that there's a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

I mean that, as far as I can tell (medium confidence) attn0 in GPT-2 isn't used for much, and MLP0 contains most of the information about the value of the toke... (read more)

0Aryaman Arora19d
Cool that you figured that out, easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo's later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more. Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?

Thanks for clarifying your position, that all makes sense.

I'd argue that most of the updating should already have been done, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits.

Huh, can you say more about this? I'm not familiar with that example (though have a fairly strong prior on there being at best a weak association between specific neuroscience results + specific AI interp results)

1Charlie Steiner19d
I'm thinking about the paper Ewert 1987, which I know about because it spurred Dennett's great essay Eliminate the Middletoad, but I don't really know the gory details of, sorry. I agree the analogy is weak, and there can be disanalogies even between different ANN architectures. I think my intuition is based more on some general factor of "human science being able to find something interesting in situations kinda like this," which is less dependent on facts of the systems themselves and more about, like, do we have a paradigm for interpreting signals in a big mysterious network?

Just dug into it more, the GPT-Neo embed just has a large constant offset. Average norm is 11.4, norm of mean is 11. Avg cosine sim is 0.93 before, after subtracting the mean it's 0.0024 (avg absolute value of cosine sim is 0.1831)
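For readers who want to try this check themselves, here's a minimal numpy sketch of the mean-subtraction test (synthetic vectors standing in for the real GPT-Neo embeddings; `avg_pairwise_cosine` is my own helper):

```python
import numpy as np

# Synthetic stand-in for the GPT-Neo embedding: small per-token vectors
# plus one large shared offset (not the real weights).
rng = np.random.default_rng(0)
n_tokens, d_model = 200, 64
offset = 10.0 * rng.normal(size=d_model)
embed = rng.normal(size=(n_tokens, d_model)) + offset

def avg_pairwise_cosine(X):
    # Normalise rows, then average the off-diagonal cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    return sims[~np.eye(len(X), dtype=bool)].mean()

before = avg_pairwise_cosine(embed)                      # near 1
after = avg_pairwise_cosine(embed - embed.mean(axis=0))  # near 0
```

If the real embedding behaves the same way, `before` is large purely because the shared offset dominates every dot product.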

avg. pairwise cosine similarity is 0.960

Wait, WTF? Are you sure? 0.96 is super high. The only explanation I can see for that is a massive constant offset dominating the cosine sim (which isn't crazy tbh).

The Colab claims that the logit lens doesn't work for GPT-Neo, but does work if you include the final block, which seems sane to me. I think that in GPT-2 the MLP0 is basically part of the embed, so it doesn't seem crazy for the inverse to be true (esp if you do the dumb thing of making your embedding + unembedding matrix the same)

1Aryaman Arora20d
I'm pretty sure! I don't think I messed up anywhere in my code (just nested for loop lol). An interesting consequence of this is that for GPT-2, applying logit lens to the embedding matrix (i.e. softmax(W_E W_U) = softmax(W_E W_E^T)) gives us a near-perfect autoencoder (the top output is the token fed in itself), but for GPT-Neo it always gets us the vector with the largest magnitude since in the dot product x·y = ‖x‖ ‖y‖ cos(θ) the cosine similarity is a useless term. What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?
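A toy illustration of the autoencoder point (random matrices standing in for real embeddings, with the unembedding tied to the embedding; "neo-like" here just means "with a large shared offset"):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 64
W_E = rng.normal(size=(vocab, d))            # GPT-2-like: no shared offset
W_E_neo = W_E + 10.0 * rng.normal(size=d)    # GPT-Neo-like: big shared offset

def embed_logit_lens_top_token(W):
    # Logit-lens the embedding itself: logits for token t are W[t] @ W.T.
    return (W @ W.T).argmax(axis=1)

gpt2_top = embed_logit_lens_top_token(W_E)      # each token maps to itself
neo_top = embed_logit_lens_top_token(W_E_neo)   # nearly all tokens collapse onto
                                                # whichever few rows best align
                                                # with the shared offset
```

Without the offset the top logit for each token is the token itself; with it, the dot product is dominated by the offset term and the "autoencoder" breaks, matching the behaviour described above.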

Really interesting, thanks for sharing!

I find it super surprising that the tasks worked up until Gopher, but stopped working at PaLM. That's such a narrow gap! That alone suggests some kind of interesting meta-level point re inverse scaling being rare, and that in fact the prize mostly picked up on the adverse selection of "the tasks that were inverse-y enough to not have issues on the models used."

One prediction this hypothesis makes is that people were overfitting to "what can GPT-3 not do" and thus that there's a bunch of submitted tasks that were U-shaped by Gopher, and the winning ones were just the ones that were U-shaped a bit beyond Gopher?

I'm also v curious how well these work on Chinchilla.

1Ethan Perez19d
See this disclaimer on how they've modified our tasks (they're finding u-shaped trends on a couple tasks that are different from the ones we found inverse scaling on, and they made some modifications that make the tasks easier)

Idk, I definitely agree that all data so far is equally consistent with 'mechanistic interp will scale up to identifying whether GPT-N is deceiving us' and with 'MI will work on easy problems but totally fail on hard stuff'. But I do take this as evidence in favour of it working really well.

What kind of evidence could you imagine seeing that mechanistic understanding is actually sufficient for understanding deception?

Good methods might not excite you much in terms of the mechanistic clarity they provide, and vice versa.

Not sure what you mean by this

1Charlie Steiner20d
I'd argue that most of the updating should already have been done, not even based on Chris Olah's work, but on neuroscientists working out things like the toad's prey-detection circuits. You seem pretty motivated by understanding in detail why and how NNs do particular things. But I think this doesn't scale to interpreting complicated world-modeling, and think that what we'll want is methods that tell us abstract properties without us needing to understand the details. To some aesthetics, this is unsatisfying. E.g. suppose we do the obvious extensions of ROME to larger sections of the neural network rather than just one MLP layer, driven by larger amounts of data. That seems like it has more power for detection or intervention, but only helps us understand the micro-level a little, and doesn't require it. If the extended-ROME example turns out to be possible to understand mechanistically through some lens, in an amount of work that's actually sublinear in the range of examples you want to understand, that would be strong evidence to me. If instead it's hard to understand even some behavior, and it doesn't get easier and easier as you go on (e.g. each new thing might be an unrelated problem to previous ones, or even worse you might run out of low-hanging fruit) that would be the other world.

Honestly I expect that training without dropout makes it notably better. Dropout is fucked! Interesting that you say logit lens fails and later layers don't matter - can you say more about that?

Arthur mentions something in the walkthrough about how GPT-Neo does seem to have some backup heads, which is wild - I agree that intuitively backup heads should come from dropout.

0Aryaman Arora20d
Huh interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have that--some ideas to consider: * the backup heads could have other main functions but incidentally are useful for the specific task we're looking at, so they end up taking the place of the main heads * thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren't interpretable in big models due to superposition Re: GPT-Neo being weird, one of the colabs in the original logit lens post shows that logit lens is pretty decent for standard GPT-2 of varying sizes but basically useless for GPT-Neo, i.e. outputs some extremely unlikely tokens for every layer before the last one. The bigger GPT-Neos are a bit better (some layers are kinda interpretable with logit lens) but still bad. Basically, the residual stream is just in a totally wacky basis until the last layer's computations, unlike GPT-2 which shows more stability (the whole reason logit lens works). One weird thing I noticed with GPT-Neo 125M's embedding matrix is that the input static embeddings are super concentrated in vector space, avg. pairwise cosine similarity is 0.960 compared to GPT-2 small's 0.225. On the later layers not doing much, I saw some discussion on the EleutherAI discord that probes can recover really good logit distributions from the middle layers of the big GPT-Neo models. I haven't looked into this more myself so I don't know how it compares to GPT-2. Just seems to be an overall profoundly strange model.

Yeah, agreed that's not an optimal arrangement, that was just a proof of concept for 'non-tegum things can get a lot of orthogonality'

I don't think so? If you have eg 8 vectors arranged evenly in a 2D plane (so at 45 degrees to each other) there's a lot of orthogonality, but no tegum product. I think the key weirdness of a tegum product is that it's a partition, where every pair in different bits of the partition is orthogonal. I could totally imagine that eg the best way to fit 2n vectors in n-dimensional space is two sets of n orthogonal vectors, but at some arbitrary angle to each other.

I can believe that tegum products are the right way to maximise the number of orthogonal pairs, tho... (read more)

2Adam Jermyn20d
Oh yes you're totally right. I think partitions can get you more orthogonality than your specific example of overlapping orthogonal sets. Take n vectors and pack them into d dimensions in two ways: 1. A tegum product with k subspaces, giving (n/k) vectors per subspace and n^2*(1-1/k) orthogonal pairs. 2. (n/d) sets of vectors, each internally orthogonal but each overlapping with the others, giving n*d orthogonal pairs. If d < n*(1-1/k) the tegum product buys you more orthogonal pairs. If n > d then picking large k (so low-dimensional spaces) makes the tegum product preferred. This doesn't mean there isn't some other arrangement that does better though...
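The pair-counting in the parent comment can be sanity-checked with a couple of throwaway functions (counting ordered pairs of distinct vectors; the function names are mine):

```python
def tegum_orthogonal_pairs(n: int, k: int) -> int:
    """n vectors split evenly across k disjoint subspaces: any two
    vectors in different subspaces are orthogonal -> ~ n^2 * (1 - 1/k)."""
    per_subspace = n // k
    return n * (n - per_subspace)

def overlapping_sets_orthogonal_pairs(n: int, d: int) -> int:
    """n/d sets of d mutually-orthogonal vectors in d dimensions:
    only within-set pairs are orthogonal -> ~ n * d."""
    n_sets = n // d
    return n_sets * d * (d - 1)

# With n=16 vectors in d=4 dims and k=4 subspaces, d < n*(1-1/k) = 12,
# so the tegum product should win:
print(tegum_orthogonal_pairs(16, 4))
print(overlapping_sets_orthogonal_pairs(16, 4))
```

This only checks the counting, not whether some third arrangement beats both, which is the open question the comment ends on.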

Thanks for the feedback! I'm impressed you had 5 people interested! What context was this in? (Ie, what do you mean by "here"?)

I use surface area as a fuzzy intuition around "having some model of what's going on, and understanding of what's happening in a problem/phenomenon". Which doesn't necessarily look like a full understanding, but looks like having a list in my head of confusing phenomena, somewhat useful ideas, and hooks into what I could investigate next.

I find this model useful both to recognise 'do I have any surface area on this problem' and to motivate next steps by 'what could give me more surface area on this problem' even if it's not a perfectly robust way.

Thanks for writing this! I found this a really helpful post for clarifying my own intuitions. Trying to operationalise what confused me before, and what now feels clear:

Confusion: Why does the model want to split vectors into these orthogonal subspaces? This seems somewhat unnatural and wasteful - it loses a lot of degrees of freedom, and surely it wants to spread out and minimise interference as much as possible?

Implicitly, I was imagining something like L2 loss where the model wants to minimise the sum of squared dot products.

New intuition: There is no i... (read more)

2Adam Jermyn20d
That's good to hear! And I agree with your new intuition. I think if you want interference terms to actually be zero you have to end up with tegum products, because that means you want orthogonal vectors and that implies disjoint subspaces. Right?

Makes sense, thanks! Fwiw, I think the correct takeaway is a mix of "try to form hypotheses about what's going on" and "it's much, much easier when you have at least some surface area on what's going on". There are definitely problems where you don't really know going in (eg, I did not expect modular addition to be solved with trig identities!), and there's also the trap of being overconfident in an incorrect view. But I think the mode of iteratively making and testing hypotheses is pretty good.

An alternate, valid but harder, mode is to first do some... (read more)

0Garrett Baker24d
What do you mean by “surface area”?

Awesome, really appreciate the feedback! And makes sense re copilot, I'll keep that in mind in future videos :) (maybe should just turn it off?)

I'd love to hear more re possible-mistakes if you're down to share!

1Garrett Baker24d
The main big one was that when I was making experiments, I did not have in mind a particular theory about how the network was doing a particular capability. I just messed around with matrices, and graphed a bunch of stuff, and multiplied a bunch of weights by a bunch of other weights. Occasionally, I'd get interesting looking pictures, but I had no clue what to do with those pictures, or what followup questions I could ask, and I think it's because I didn't have an explicit model of what I think it should be doing, and so couldn't update my picture of the mechanisms the network was using off the data I gathered about the network's internals.

Interesting results, thanks for sharing! To clarify, what exactly are you doing after identifying a direction vector? Projecting and setting its coordinate to zero? Actively reversing it?

And how do these results compare to the dumb approach of just taking the gradient of the logit difference at that layer, and using that as your direction?
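For concreteness, here's what that "dumb" baseline looks like for a purely linear readout (a toy sketch with made-up shapes and names; for a real model you'd take the gradient through the actual layers):

```python
import numpy as np

# With a linear readout logits = h @ W_U, the gradient of
# (logit_A - logit_B) with respect to the activation h is exactly
# W_U[:, A] - W_U[:, B], so that column difference is the direction.
rng = np.random.default_rng(0)
d_model, vocab = 16, 10
W_U = rng.normal(size=(d_model, vocab))
A, B = 3, 7

direction = W_U[:, A] - W_U[:, B]
direction /= np.linalg.norm(direction)

def logit_diff(h):
    return h @ W_U[:, A] - h @ W_U[:, B]

# "Projecting and setting its coordinate to zero" then removes all
# first-order effect on the logit difference:
h = rng.normal(size=d_model)
h_ablated = h - (h @ direction) * direction
```

In the toy linear case the ablation zeroes the logit difference exactly; in a real network with nonlinearities downstream it would only remove the first-order effect at that layer.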

Some ad-hoc hypotheses for what might be going on:

  • An underlying thing is probably that the model is representing several correlated features - is_woman, is_wearing_a_dress, has_long_hair, etc. Even if you can properly i
... (read more)

My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn't feel like a solution to interpretability. Instead it feels like a solution to amplification that's ultimately still uninterpretable by humans.

This somewhat feels like semantics to me - this still feels like a win condition! I don't personally care about whether interpretability helps via humans directly understanding the systems themselves, vs us somewhat understanding it ourselves and being able to use weaker AI systems ... (read more)

Strong tractability: We can build interpretable AGI-level systems without sacrificing too much competitiveness.

Interesting argument! I think my main pushback would be on clarifying exactly what "interpretable" means here. If you mean "we reverse engineer a system so well, and understand it so clearly, that we can use this understanding to build the system from scratch ourselves", then I find your argument somewhat plausible, but I also think it's pretty unlikely that we live in that world. My personal definition of strong tractability would be something... (read more)

1David Scott Krueger1mo
I agree it's a spectrum. I would put it this way: * For any point on the spectrum there is some difficulty in achieving it. * We can approach that point from either direction, 1) starting with a "big blob of compute" and encountering the difficulty in extracting these pieces from the blob, or 2) starting with assembling the pieces, and encountering the difficulty in figuring out how to assemble them. * It's not at all clear that (1) would be easier than (2). * Probably it's best to do some of both. Regarding difficulty of (1) vs. (2), OTMH, there may be some sort of complexity-style argument that engineering, say, a circuit is harder than recognizing it. However, the DNN doesn't produce the circuit, we still need to do that using interpretability techniques. So I'm not sure how I feel about this argument.

Really excited to see this come out! This feels like one of my favourite interpretability papers in a while. The part of this paper I found most surprising/compelling was just seeing the repeated trend of "you do a natural thing, form a plausible hypothesis with some evidence. Then dig further, discover your hypothesis was flawed in a subtle way, but then patch things to be better". Eg, the backup name movers are fucking wild.

I'm really excited about this program! Super curious to see what comes out of it - I expect I'll learn a lot whether it goes well, or struggles to get traction. And I want to see more of this kind of ambitious scalable alignment effort!

If you're interested in getting into mechanistic interpretability work, you should definitely apply to it

Thanks for the feedback! Glad to hear it was useful :)

intro to IB video series

What do you mean by this?

Thanks! I've been pretty satisfied by just how easy this was - one-shot recording, no prep, something I can do in the evenings when I'm otherwise pretty low energy. Yet making a product that seems good enough to be useful to people (even if it could be much better with more effort).

I'm currently doing ones for the toy model paper and induction heads paper, and experimenting with recording myself while I do research.

I'd love to see other people doing this kind of thing!

Thanks! I learned Python ~10 years ago and have no idea what sources are any good lol. I've edited the post with your recs :)

This is a fair point! I honestly have only vaguely skimmed that survey, and got the impression there was a lot of stuff in there that I wasn't that interested in. But it's on my list to read properly at some point, and I can imagine updating this a bunch.

For example, I think there is some chance that Neel Nanda’s mechanistic analysis of grokking will lead to capability improvements in the long run.

I'm curious if you have a particular concern in mind here?

My personal take is that this is the kind of interpretability work where I'm least concerned about it leading to capabilities improvements, since it's very specific to toy models and analysing deep learning puzzles, and pretty far from the state of the art frontier.

In a world where it does lead to advancements, my best guess is that it follows a pretty ind... (read more)

Thanks! So, I was trying to disentangle the two claims of "if examples are semantically similar (either similar patches of images, or words in similar contexts re predicting the next token) the model learns to map their full representations to be close to each other" and the claim of "if we pick a specific direction, the projection onto this direction is polysemantic. But it actually intersects many meaningful polytopes, and if we cluster the projection onto this direction (converting examples to scalars) we get clusters in this 1D space, and each cluster ... (read more)

Gotcha, thanks!

The polytope lens only becomes relevant when trying to explain what perfectly linear models can't account for. Although LN might create a bias toward directions, each layer is still nonlinear; nonlinearities probably still need to be accounted for somewhere in our explanations.

Re this, this somewhat conflicts with my understanding of the direction lens. The point is not that things are perfectly linear. The point is that we can interpret directions after a non-linear activation function. The non-linearities are used between interpretable spaces... (read more)

To verify this claim, here we collect together activations in a) the channel dimension in InceptionV1 and b) various MLP layers in GPT2 and cluster them using HDBSCAN, a hierarchical clustering technique

Are the clusters in this section clustering the entire residual stream (ie a vector) or the projection onto a particular direction? (ie a scalar)

1Lee Sharkey2mo
For GPT2-small, we selected 6/1024 tokens in each sequence (evenly spaced apart and not including the first 100 tokens), and clustered on the entire MLP hidden dimension (4 * 768). For InceptionV1, we clustered the vectors corresponding to all the channel dimensions for a single fixed spatial dimension (i.e. one example of size [n_channels] per image).

Interesting, thanks! Like, this lets the model somewhat localise the scaling effect, so there's not a ton of interference? This seems maybe linked to the results on Emergent Features in the residual stream

Excited to see this work come out!

One core confusion I have: Transformers apply a LayerNorm every time they read from the residual stream, which scales the vector to have unit norm (ish). If features are represented as directions, this is totally fine - it's the same feature, just rescaled. But if they're polytopes, and this scaling throws it into a different polytope, this is totally broken. And, importantly, the scaling factor is a global thing about all of the features currently represented by the model, and so is likely pretty hard to control. Shouldn't this create strong regularisation favouring using meaningful directions over meaningful polytopes?
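To make the worry concrete, here's a minimal numpy sketch (toy weights, not a real transformer) of how LayerNorm discards exactly the scale information that polytope membership can depend on:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # LayerNorm without learned scale/bias: centre, then normalise.
    x = x - x.mean()
    return x / np.sqrt((x ** 2).mean() + eps)

def polytope(x, W, b):
    # Which ReLU units fire, i.e. which polytope x sits in.
    return W @ x + b > 0

rng = np.random.default_rng(0)
W, b = rng.normal(size=(8, 4)), rng.normal(size=8)
x = rng.normal(size=4)

# Because of the bias b, rescaling x can move it across polytope
# boundaries -- but LayerNorm maps x and 5x to (nearly) the same point,
# so anything read off after LayerNorm sees only the direction.
same_after_ln = np.allclose(layernorm(x), layernorm(5.0 * x), rtol=1e-3)
```

So any feature computed downstream of a LayerNorm is (up to the eps term) a function of direction alone, which is the regularisation pressure the comment is pointing at.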

1Lee Sharkey2mo
Thanks for your interest! Yes, that seems reasonable! One thing we want to emphasize is that it's perfectly possible to have both meaningful directions and meaningful polytopes. For instance, if all polytope boundaries intersect the origin, then all polytopes will be unbounded. In that case, polytopes will essentially be directions! The polytope lens only becomes relevant when trying to explain what perfectly linear models can't account for. Although LN might create a bias toward directions, each layer is still nonlinear; nonlinearities probably still need to be accounted for somewhere in our explanations. All this said, we haven't thought a lot about LN in this context. It'd be great to know if this regularisation is real and if it's strong enough that we can reason about networks without thinking about polytopes.
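The "unbounded polytopes are essentially directions" point can be checked directly: in a bias-free toy layer every ReLU boundary passes through the origin, so polytope membership depends only on direction (toy weights, not a real network):

```python
import numpy as np

# With no bias term, W @ (c*x) = c * (W @ x) for c > 0, so the sign
# pattern -- i.e. the polytope -- is invariant to rescaling.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))   # toy weight matrix, no bias
x = rng.normal(size=4)

pattern = lambda v: W @ v > 0
scale_invariant = all(
    np.array_equal(pattern(x), pattern(c * x)) for c in (0.1, 2.0, 100.0)
)
```

With a nonzero bias this invariance breaks, which is exactly when the polytope lens starts saying something the direction lens doesn't.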

I think at least some GPT2 models have a really high-magnitude direction in their residual stream that might be used to preserve some scale information after LayerNorm. [I think Adam Scherlis originally mentioned or showed the direction to me, but maybe someone else?]. It's maybe akin to the water-droplet artifacts in StyleGAN touched on here:

We begin by observing that most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets. As shown in Figure 1, even when the droplet

... (read more)

Interesting post! I'm pretty curious about these.

A great resource for answering these questions is a set of model runs put out by the Stanford Center for Research on Foundation Models - they trained 5 runs of GPT-2 small and GPT-2 medium with 600 checkpoints and different random seeds, and released the weights. It seems like a good way to get some surface area on these questions with interesting real models. A few ideas that are somewhere on my maybe/someday research ideas list:

  • For each pair of models, feed in a bunch of text and look at the log prob for
... (read more)

Thanks, I really appreciate it.  Though I wouldn't personally consider this among the significant alignment work, and would love to hear about why you do!

Update 2: The nicely LaTeXed version of my Grokking post was also rejected from arXiv?! I'll revisit this at some point in the next few weeks, but I'm going to give up on this for now. I consider this a mark against putting posts on arXiv being an easy and fairly low-effort thing to do (though plausibly still worth the effort).

Note: While I think the argument in this post is important evidence, I overall expect phase changes to be a big deal. I consider my grokking work pretty compelling evidence that phase changes are tied to the formation of circuits, and I'm excited about doing more research in this direction. Though it's not at all obvious to me that sophisticated behaviors like deception will be a single circuit vs many.

The picture of phase changes from your post, as well as the theoretical analysis here, both suggest that you may be able to observe capabilities as they form if you know what to look for. It seems like a kind of similar situation to the one suggested in the OP, though I think with a different underlying mechanism (in the OP I think there probably is no similar phase change in the model itself for any of these tasks) and a different set of things to measure.

In general if we get taken by surprise by a capability, it seems fairly likely that the story in retr... (read more)

Thanks for the thoughts, and sorry for dropping the ball on responding to this!

I appreciate the pushback, and broadly agree with most of your points.

In particular, I strongly agree that if you're trying to form the ability to be a research lead in alignment (and less strongly, be an RE/otherwise notably contribute to research) that forming an inside view is important, totally independently from how well it tracks the truth, and agree that I undersold that in my post. 

In part, I think the audience I had in mind is different from you? I see this as part... (read more)

3Rohin Shah3mo
Oh wild. I assumed this must be directed at researchers since obviously they're the ones who most need to form inside views. Might be worth adding a note at the top saying who your audience is. For that audience I'd endorse something like "they should understand the arguments well enough that they can respond sensibly to novel questions". One proxy that I've considered previously is "can they describe an experiment (in enough detail that a programmer could go implement it today) that would mechanistically demonstrate a goal-directed agent pursuing some convergent instrumental subgoal". I think people often call this level of understanding an "inside view", and so I feel like I still endorse what-people-actually-mean, even though it's quantitatively much less understanding than you'd want to actively do research. (Though it also wouldn't shock me if people were saying "everyone in the less technical roles needs to have a detailed take on exactly which agendas are most promising and why and this take should be robust to criticism from senior AI safety people". I would disagree with that.) I would have said you don't understand an aspect of their view, and that's exactly the aspect you can't evaluate. (And then if you try to make a decision, the uncertainty from that aspect propagates into uncertainty about the decision.) But this is mostly semantics. Thanks, I'll keep that in mind. Tbc I did all of this too -- by reading a lot of papers and blog posts and thinking about them. (The main exception is "how to do research", that I think I learned from just practicing doing research + advice from my advisors.)

I'm pretty unconvinced by this. I do not think that any substantial fraction of AI x-risk comes from an alignment researcher who thinks carefully about x-risk deciding that a GPT-3 level system isn't scary enough to take significant precautions with re boxing.

I think taking frivolous risks is bad, but that risk aversion to the point of not being able to pursue otherwise promising research directions seems pretty costly, while the benefits of averting risks >1e-9 are pretty negligible in comparison.

(To be clear, this argument does not apply to more powerful... (read more)

like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons

This also seems like an odd statement - it seems reasonable to say "I think the net effect of InstructGPT is to boost capabilities" or even "If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT". But it feels like you're assuming some deep insight into the intention behind the people w... (read more)

Yeah, I agree that I am doing reasoning on people's motivations here, which is iffy and given the pushback I will be a bit more hesitant to do, but also like, in this case reasoning about people's motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.

I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by sa... (read more)

WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on

That seems weirdly strong. Why do you think that?

3Jacob Hilton3mo
For people viewing on the Alignment Forum, there is a separate thread on this question here. [] (Edit: my link to LessWrong is automatically converted to an Alignment Forum link, you will have to navigate there yourself.)

Thanks for writing this! I agree with most of the claims you consider to be objective, and appreciate you writing this up so clearly.

why training InstructGPT was safe?

Who is claiming that it is safe? I didn't get that implication from the post

"Safe" as in "safe enough for it to be on net better to run it" or "safe enough it wouldn't definitely kill everyone". It's not that I don't share the popular intuition that GPT wouldn't kill anyone. It's just that I don't think it's a good habit to run progressively more capable systems while relying on informal intuitions about their safety. And then maybe I will see an explanation for why future safety tools would outpace capability progress, when now we are already at the point where current safety tools are not applicable to current AI systems.

It can be as easy as creating a pdf of your post and submitting it (although if your post was written in LaTeX, they'll want the tex file). If everything goes well, this takes less than an hour. 

Hilariously, this does not work. I converted my Grokking post to a PDF (very crudely - just printing to PDF) and uploaded that, and it was rejected: 

Dear author,

Thank you for submitting your work to arXiv. We regret to inform you that arXiv's moderators have determined that your submission will not be accepted and made public on arXiv.

... (read more)

I should say formatting is likely a large contributing factor for this outcome. Tom Dietterich, an arXiv moderator, apparently had a positive impression of the content of your grokking analysis. However, research on arXiv will be more likely to go live if it conforms to standard (ICLR, NeurIPS, ICML) formatting and isn't a blogpost automatically exported into a TeX file.

Ah, sorry to hear. I wouldn't have predicted this from reading arXiv's content moderation guidelines.

I would love this! I'm currently paying someone ~$200 to port my grokking post to LaTeX, getting a PDF automatically would be great

Making it stronger means more weights so the regularization should push against it, UNLESS you can simultaneously delete or dampen weights from the memorized answer part, right?

I think this does happen (and is very surprising to me!). If you look at the excluded loss section, I ablate the model's ability to use one component of the generalising algorithm, in a way that shouldn't affect the memorising algorithm (much), and see the damage of this ablation rise smoothly over training. I hypothesise that it's dampening memorisation weights simultaneously, thou... (read more)

Thanks! I agree that they're pretty hard to distinguish, and evidence between them is fairly weak - it's hard to distinguish between a winning lottery ticket at initialisation vs one stumbled upon within the first 200 steps, say.

My favourite piece of evidence is this video from Eric Michaud - we know that the first 2 principal components of the embedding form a circle at the end of training. But if we fix the axes at the end of training, and project the embedding at the start of training, it's pretty circle-y
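The fix-the-axes check can be sketched like this (toy data standing in for the real model's embedding matrix; `W_final` and `W_init` are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 113, 128

# Stand-in "end of training" embedding: an exact circle sitting in a
# random 2-D plane of the residual space.
plane = np.linalg.qr(rng.normal(size=(d_model, 2)))[0]   # orthonormal columns
angles = 2 * np.pi * np.arange(n_tokens) / n_tokens
W_final = np.stack([np.cos(angles), np.sin(angles)], axis=1) @ plane.T

# Fix the axes at the end of training: top-2 principal components.
_, _, Vt = np.linalg.svd(W_final - W_final.mean(axis=0), full_matrices=False)
pcs = Vt[:2]

# Project a noisy "start of training" embedding onto those same axes,
# then plot proj_init to eyeball whether it's already circle-y.
W_init = 0.3 * W_final + 0.1 * rng.normal(size=(n_tokens, d_model))
proj_init = W_init @ pcs.T    # shape (n_tokens, 2)
```

The same projection applied to real checkpoints is what distinguishes "lottery ticket at initialisation" from "structure formed in the first few hundred steps".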

Interesting hypothesis, thanks!

My guess is that memorisation isn't really a discrete thing - it's not that the model has either memorised a data point or not, it's more that it's fitting a messed up curve to approximate all training data as well as it can. And more parameters means it memorises all data points a bit better, not that it has excellent loss on some data and terrible loss on others, and gradually the excellent set expands.

I haven't properly tested this though! And I'm using full batch training, which probably messes this up a fair bit.
