AI ALIGNMENT FORUM
RGRGRG

Posts

No posts to display.

Comments
Finding Sparse Linear Connections between Features in LLMs
RGRGRG · 2y

To confirm: the weights you share, such as 0.26 and 0.23, are each individual entries in the W matrix for y = Wx?

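For concreteness, here's a minimal sketch of the reading I have in mind (the dimensions and where those two entries sit in W are placeholders on my end, not details from your post):

```python
import numpy as np

# Minimal sketch of the reading I have in mind for y = Wx.
# The dimensions and the placement of the 0.26 / 0.23 entries are placeholder
# assumptions for illustration; only the two values come from the post.
n_in, n_out = 8, 8
W = np.zeros((n_out, n_in))
W[2, 5] = 0.26  # input feature 5 contributes with weight 0.26 to output feature 2
W[4, 1] = 0.23  # input feature 1 contributes with weight 0.23 to output feature 4

x = np.zeros(n_in)
x[5] = 1.0   # activate only input feature 5
y = W @ x    # y = Wx
print(y[2])  # 0.26 -- the single entry W[2, 5] is the whole feature-to-feature connection
```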
Become a PIBBSS Research Affiliate
RGRGRG · 2y

What city/country is PIBBSS based in, and where will the retreats be? (Asking as a Bay Area American without a valid passport.)

The positional embedding matrix and previous-token heads: how do they actually work?
RGRGRG · 2y

This is a surprising and fascinating result.  Do you have attention plots of all 144 heads you could share?

I'm particularly interested in the patterns, for all heads in layers 0 and 1, that match the following caption:

(Left: a 50x50 submatrix of LXHY's attention pattern on a prompt from openwebtext-10k. Right: the same submatrix of LXHY's attention pattern, when positional embeddings are averaged as described above.)

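In case it helps pin down what I'm asking for, here's a rough sketch of how I'd dump those patterns with TransformerLens (the model name, prompt, and a 50x50 window starting at position 0 are placeholder assumptions on my end, not details from the post):

```python
import matplotlib.pyplot as plt
from transformer_lens import HookedTransformer

# Sketch only: the model, prompt, and window placement are placeholder assumptions.
model = HookedTransformer.from_pretrained("gpt2")  # 12 layers x 12 heads = 144 heads
prompt = "The quick brown fox jumps over the lazy dog. " * 10  # stand-in for an openwebtext-10k prompt
_, cache = model.run_with_cache(prompt)

for layer in (0, 1):
    pattern = cache["pattern", layer][0]  # [n_heads, seq_q, seq_k] attention pattern for this layer
    for head in range(model.cfg.n_heads):
        sub = pattern[head, :50, :50].detach().cpu().numpy()  # 50x50 submatrix, as in the caption
        plt.imshow(sub, cmap="viridis")
        plt.title(f"L{layer}H{head} attention (first 50x50)")
        plt.xlabel("key position")
        plt.ylabel("query position")
        plt.savefig(f"attn_L{layer}H{head}.png")
        plt.close()
```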
Thoughts on sharing information about language model capabilities
RGRGRG · 2y

My primary safety concern is what happens if one of these analyses somehow leads to a large improvement over the state of the art. I don't know what form that would take, and it might be unexpected given the Bitter Lesson you cite above, but if it happens, what do we do then? Given that this is hypothetical and the next large improvement in LMs could come from elsewhere, I'm not suggesting we stop sharing now. But I think we should be prepared for the possibility that, at some point, such sharing demonstrably leads to significantly stronger models, and that we would then need to re-evaluate sharing this kind of eval work.

Wikitag Contributions

No wikitag contributions to display.