I wonder if this is related to vector packing and unpacking via cosine similarity: the activation norm increases so that layers can select a large and variable number of semi-orthogonal bases. (This is closely related to your information-packing idea.)

An easy experimental manipulation to test this would be to increase the number of heads at fixed model width, thereby decreasing the dimensionality of the cos_sim for attention, which should increase the per-layer norm growth. (Alas, this will change the loss too, so it is not a perfect manipulation.)
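
For intuition on the dimensionality part of that manipulation, here is a rough sketch (not a transformer experiment). It assumes a hypothetical model width `d_model = 512` split evenly across heads, and uses random Gaussian vectors as a stand-in for queries and keys, which real activations need not resemble. It only shows that the spread of cos_sim between unrelated vectors grows roughly like 1/sqrt(d_head) as heads get narrower; whether that actually drives norm growth is the thing the experiment would have to test.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cos_sim_spread(dim, n_pairs=2000, seed=0):
    """Standard deviation of cos_sim over random Gaussian vector pairs.

    Random vectors are an illustrative assumption, not real activations.
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sims.append(cosine_similarity(u, v))
    mean = sum(sims) / len(sims)
    return math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))


if __name__ == "__main__":
    d_model = 512  # hypothetical model width, chosen only for illustration
    for n_heads in (4, 8, 16, 32):
        d_head = d_model // n_heads  # per-head dimension shrinks as heads increase
        measured = cos_sim_spread(d_head)
        predicted = 1.0 / math.sqrt(d_head)
        print(f"heads={n_heads:2d} d_head={d_head:3d} "
              f"spread={measured:.3f} 1/sqrt(d_head)={predicted:.3f}")
```

So more heads means noisier cos_sim between unrelated directions in each head, i.e. fewer cleanly separable semi-orthogonal bases per head at a given norm.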