I was pretty sure this existed, maybe even built into LW. It seems like an obvious thing, and there are lots of parts of LW that for some reason are hard to find from the front page. Googling "lesswrong dictionary" yielded
https://www.lesswrong.com/w/lesswrong-jargon
https://www.lesswrong.com/w/r-a-z-glossary
https://www.lesswrong.com/posts/fbv9FWss6ScDMJiAx/appendix-jargon-dictionary
In Defence of Jargon
People used to say (maybe still do? I'm not sure) that we should use less jargon to increase the accessibility of writing on LW, i.e. make it easier for outsiders to read.
I think this is mostly a confused take. The underlying problem is inferential distance. Getting rid of the jargon is actually unhelpful, since it hides the fact that there is an inferential distance.
When I want to explain physics to someone and I don't know what they already know, I start by listing relevant physics jargon and asking them which words they know. This is a super quick way to find out what concepts they already have, and it lets me know what level I should start at. This works great in Swedish, since most Swedish physics words are distinct from ordinary words, but unfortunately it doesn't work as well in English, which means I have to probe a bit deeper than just checking whether they recognise the words.
Jargon isn't typically just a synonym for some common word, and when it is, I predict that it didn't start out that way, but that the real meaning was destroyed by too many people not bothering to learn the word properly. This is because people invent new words (jargon) when they need a word to point to a new concept that didn't already have one.
I've seen some posts by people who are not native to LW trying to fit in and be accepted by using LW jargon, without bothering to understand the underlying concepts, or even seeming to notice that this is something they're supposed to do. The result is very jarring, and rather than making the post read like a typical LW post, their misuse of LW jargon makes it extra obvious that they are not a native. Edit to add: This clearly illustrates that the jargon isn't just synonyms for words/concepts they already know.
The way to make LW more accessible is to embrace jargon, as a clear signal of assumed prior knowledge of some concept, and also have a dictionary, so people can look up words they don't know. I think this is also more or less what we're already doing, because it's kind of the obvious thing to do.
There is basically zero risk that the people wanting less jargon will win this fight, because jargon is just too useful for communication, and humans really like communication, especially nerds. But maybe it would be marginally helpful for more people to have an explicit model of what jargon is and what it's for, which is my justification for this quick take.
According to my calculation, this embedding will result in too much compounding noise. I get the same noise results as you for one layer, but the noise grows too much from layer to layer.
However, Lucius suggested a different embedding, which seems to work.
We'll have some publication on this eventually. If you want to see the details sooner you can message me.
Since Bayesian statistics is both fundamental and theoretically tractable
What do you mean by "tractable" here?
In standard form, a natural latent is always approximately a deterministic function of . Specifically: .
What does the arrow mean in this expression?
You can find their preferred contact info in each project document, in the Team section.
Yes there are, sort of...
You can apply to as many projects as you want, but you can only join one team.
The reason for this is: when we've let people join more than one team in the past, they usually ended up not having time for both and dropped out of one of the projects.
What this actually means:
When you join a team you're making a promise to spend 10 or more hours per week on that project. When we say you're only allowed to join one team, what we're saying is that you're only allowed to make this promise to one project.
However, you are allowed to help out other teams with their projects, even if you're not officially on the team.
@Samuel Nellessen
Thanks for answering Gunnar's question.
But also, I'm a bit nervous that posting their email here directly in the comments is too public, i.e. easy for spam-bots to find.
If the research lead wants to be contactable, their contact info is in their project document, under the "Team" section. Most (or all, I'm not sure) research leads have some contact info there.
Estimated MSE loss for three different ways of embedding features into neurons, when there are more possible features than neurons.
I've typed up some math notes on how much MSE loss we should expect for random embeddings, and some alternative embeddings, when you have more features than neurons. I don't have a good sense of how legible this is to anyone but me.
Note that neither of these embeddings is optimal. I believe that the optimal embedding for minimising MSE loss is to store the features in almost orthogonal directions, which is similar to random embeddings but can be optimised further. But I also believe that MSE loss doesn't prefer this solution very strongly, which means that when there are other tradeoffs, MSE loss might not be enough to incentivise superposition.
This does not mean we should not expect superposition in real networks.
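As a quick toy check of the "almost orthogonal" point: random unit directions in a $D$-dimensional neuron space have pairwise overlaps of typical size $1/\sqrt{D}$, which is where the $a^2/D$ interference terms in the estimates below come from. A minimal sketch, assuming unit-norm random embedding directions:

```python
import numpy as np

# Toy check: random unit directions in a D-dimensional neuron space have
# pairwise overlaps with mean square ~1/D, so reading one feature back off
# the neuron vector picks up interference of variance ~1/D from each other
# embedded feature (the a^2/D terms in the estimates below).

rng = np.random.default_rng(0)
D, n_features = 100, 500

W = rng.normal(size=(n_features, D))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm embedding directions

overlaps = W @ W.T
off_diag = overlaps[~np.eye(n_features, dtype=bool)]
print("mean squared overlap:", (off_diag ** 2).mean())   # ~ 1/D = 0.01
```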
Setup and notation
Assuming:
True feature values:
Using random embedding directions (superposition)
Estimated values:
Total Mean Squared Error (MSE)
$$\mathrm{MSE}_{\text{rand}} = z\left((1-a)^2 + (z-1)\frac{a^2}{D}\right) + (T-z)\frac{za^2}{D} \approx z(1-a)^2 + \frac{zT}{D}a^2$$
This is minimised by
$$a = \frac{D}{T+D}$$
making the MSE
$$\mathrm{MSE}_{\text{rand}} = \frac{zT}{T+D} = z\left(1 - \frac{D}{T+D}\right)$$
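Filling in the intermediate step, using the approximate form of the loss above: setting the derivative with respect to $a$ to zero,
$$-2z(1-a) + \frac{2zT}{D}a = 0 \quad\Longrightarrow\quad a = \frac{D}{T+D},$$
and substituting back,
$$z\left(\frac{T}{T+D}\right)^2 + \frac{zT}{D}\left(\frac{D}{T+D}\right)^2 = \frac{zT^2 + zTD}{(T+D)^2} = \frac{zT}{T+D}.$$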
One feature per neuron
We embed a single feature in each neuron, and the rest of the features are just not represented.
Estimated values:
Total Mean Squared Error (MSE)
$$\mathrm{MSE}_{\text{single}} = z\frac{T-D}{D}$$
One neuron per feature
We embed each feature in a single neuron.
We assume that the probability of co-activated features on the same neuron is small enough to ignore. We also assume that every neuron is used at least once. Then for any active neuron, the expected number of inactive features that will be wrongfully activated is $\frac{T-D}{D}$, giving us the MSE loss for this case as
$$\mathrm{MSE}_{\text{multi}} = z\left((1-a)^2 + \left(\frac{T}{D}-1\right)a^2\right)$$
We can already see that this is smaller than $\mathrm{MSE}_{\text{rand}}$, but let's also calculate what the minimum value is. $\mathrm{MSE}_{\text{multi}}$ is minimised by
$$a = \frac{D}{T}$$
making the MSE
$$\mathrm{MSE}_{\text{multi}} = z\left(1 - \frac{D}{T}\right)$$
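For anyone who wants to check these estimates numerically, here is a rough simulation sketch. It assumes my reading of the setup: each sample has $z$ features active with value 1 (the rest 0), features are read out linearly with the scale factors $a$ derived above, and "total MSE" means squared error summed over all $T$ features, averaged over samples.

```python
import numpy as np

# Rough numerical check of the MSE estimates above (under the assumptions
# stated in the lead-in: z unit-value active features per sample, linear
# readout with the scale factors a derived above, error summed over features).

rng = np.random.default_rng(0)
T, D, z = 1000, 100, 5        # possible features, neurons, active features
n_samples = 2000

# Random embedding directions (superposition): one random unit vector per feature.
W = rng.normal(size=(T, D))
W /= np.linalg.norm(W, axis=1, keepdims=True)
a_rand = D / (T + D)

# One neuron per feature: feature i is written to neuron i mod D.
assign = np.arange(T) % D
a_multi = D / T

mse_rand = 0.0
mse_multi = 0.0
for _ in range(n_samples):
    x = np.zeros(T)
    x[rng.choice(T, size=z, replace=False)] = 1.0

    # Superposition: write all features, read each back with its own direction.
    neurons = W.T @ x
    x_hat = a_rand * (W @ neurons)
    mse_rand += np.sum((x - x_hat) ** 2)

    # One neuron per feature: a neuron's value is the sum of its features.
    neurons = np.bincount(assign, weights=x, minlength=D)
    x_hat = a_multi * neurons[assign]
    mse_multi += np.sum((x - x_hat) ** 2)

print("random: simulated", mse_rand / n_samples, "predicted", z * T / (T + D))
print("multi:  simulated", mse_multi / n_samples, "predicted", z * (1 - D / T))
```

Both simulated values should land close to the predicted ones; the multi case differs slightly because the simulation does not ignore co-activations on the same neuron, unlike the calculation above.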