AI ALIGNMENT FORUM
RGRGRG

Posts

No posts to display.

Comments
Finding Sparse Linear Connections between Features in LLMs
RGRGRG · 2y

To confirm: the weights you share, such as 0.26 and 0.23, are each individual entries in the W matrix for y = Wx?

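For concreteness, here's a minimal sketch of the reading I have in mind (the dimensions and where those two entries sit in W are placeholders on my end, not details from your post):

```python
import numpy as np

# Minimal sketch of the reading I have in mind for y = Wx.
# The dimensions and the placement of the 0.26 / 0.23 entries are placeholder
# assumptions for illustration; only the two values come from the post.
n_in, n_out = 8, 8
W = np.zeros((n_out, n_in))
W[2, 5] = 0.26  # input feature 5 contributes with weight 0.26 to output feature 2
W[4, 1] = 0.23  # input feature 1 contributes with weight 0.23 to output feature 4

x = np.zeros(n_in)
x[5] = 1.0   # activate only input feature 5
y = W @ x    # y = Wx
print(y[2])  # 0.26 -- the single entry W[2, 5] is the whole feature-to-feature connection
```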
Become a PIBBSS Research Affiliate
RGRGRG · 2y

What city/country is PIBBSS based in, and where will the retreats be? (Asking as a Bay Area American without a valid passport.)

The positional embedding matrix and previous-token heads: how do they actually work?
RGRGRG · 2y

This is a surprising and fascinating result.  Do you have attention plots of all 144 heads you could share?

I'm particularly interested in the patterns, for all heads in layers 0 and 1, that match the following caption:

(Left: a 50x50 submatrix of LXHY's attention pattern on a prompt from openwebtext-10k. Right: the same submatrix of LXHY's attention pattern, when positional embeddings are averaged as described above.)

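In case it helps pin down what I'm asking for, here's a rough sketch of how I'd dump those patterns with TransformerLens (the model name, prompt, and a 50x50 window starting at position 0 are placeholder assumptions on my end, not details from the post):

```python
import matplotlib.pyplot as plt
from transformer_lens import HookedTransformer

# Sketch only: the model, prompt, and window placement are placeholder assumptions.
model = HookedTransformer.from_pretrained("gpt2")  # 12 layers x 12 heads = 144 heads
prompt = "The quick brown fox jumps over the lazy dog. " * 10  # stand-in for an openwebtext-10k prompt
_, cache = model.run_with_cache(prompt)

for layer in (0, 1):
    pattern = cache["pattern", layer][0]  # [n_heads, seq_q, seq_k] attention pattern for this layer
    for head in range(model.cfg.n_heads):
        sub = pattern[head, :50, :50].detach().cpu().numpy()  # 50x50 submatrix, as in the caption
        plt.imshow(sub, cmap="viridis")
        plt.title(f"L{layer}H{head} attention (first 50x50)")
        plt.xlabel("key position")
        plt.ylabel("query position")
        plt.savefig(f"attn_L{layer}H{head}.png")
        plt.close()
```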
Thoughts on sharing information about language model capabilities
RGRGRG · 2y

My primary safety concern is what happens if one of these analyses somehow leads to a large improvement over the state of the art. I don't know what form that would take, and it might be unexpected given the Bitter Lesson you cite above, but if it happens, what do we do then? Given that this is hypothetical and the next large improvement in LMs could come from elsewhere, I'm not suggesting we stop sharing now. But I think we should be prepared for the possibility that, at some point, such sharing demonstrably leads to significantly stronger models, and that we would then need to re-evaluate sharing this kind of eval work.

Wikitag Contributions

No wikitag contributions to display.