AI ALIGNMENT FORUM

Tim Hanson

Comments

Residual stream norms grow exponentially over the forward pass
Tim Hanson · 2y

I wonder if this is related to vector packing and unpacking via cosine similarity: the activation norm is increased so that layers can select a large and variable number of semi-orthogonal bases. (This is very much related to your information-packing idea.)

An easy experimental manipulation to test this would be to increase the number of heads, thereby decreasing the dimensionality of the cos_sim used in attention, which should increase the per-layer norm growth. (Alas, this will change the loss too, so it is not a perfect manipulation.)
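To make the geometry concrete, here is a rough sketch of the two quantities involved. It does not test the hypothesis itself, since that needs residual norms from trained models. It only shows (a) how the per-head dimension shrinks as heads are added at fixed model width, and how much noisier cosine similarity between random vectors becomes in fewer dimensions, and (b) one way to summarize per-layer norm growth as a single factor. The function names, the `d_model = 768` width, and the use of random Gaussian vectors as a stand-in for real activations are all my assumptions for illustration.

```python
import numpy as np


def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head dimension when d_model is split evenly across heads."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads


def cos_sim_spread(dim: int, n_vectors: int = 256, seed: int = 0) -> float:
    """Std of pairwise cosine similarity between random Gaussian vectors.

    Random vectors stand in for real activations (an assumption).
    For random directions this is close to 1 / sqrt(dim).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_vectors, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    # Keep each unordered pair once and drop the self-similarities.
    off_diag = sims[np.triu_indices(n_vectors, k=1)]
    return float(off_diag.std())


def per_layer_growth(norms) -> float:
    """Geometric per-layer growth factor from a linear fit of log(norm) vs. layer index."""
    norms = np.asarray(norms, dtype=float)
    layers = np.arange(len(norms))
    slope, _ = np.polyfit(layers, np.log(norms), 1)
    return float(np.exp(slope))


if __name__ == "__main__":
    d_model = 768  # hypothetical model width
    for n_heads in (6, 12, 24, 48):
        d = head_dim(d_model, n_heads)
        print(f"heads={n_heads:3d}  d_head={d:4d}  cos_sim std={cos_sim_spread(d):.3f}")
```

For the actual experiment, `per_layer_growth` would be applied to the measured residual-stream norms of models trained with different head counts, and the fitted factors compared. As noted above, the loss will differ between those models, so any difference in growth is only suggestive.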
