The common narrative in ML is that the MLP layers are effectively a lookup table (see e.g. “Transformer Feed-Forward Layers Are Key-Value Memories”). This is probably part of the correct explanation, but the true story is likely much more complicated. Nevertheless, it would be helpful to understand how NNs represent their mappings in settings where they are forced to memorize, i.e. where they can’t learn any general features and basically have to build a dictionary.

Most probably a noobish question but I couldn't resist asking.

If a neural network learns either to become a lookup table or to generalize over the data, what would happen if we initialized the weights of the network to be as much like a lookup table as possible?

For example, suppose you have N=1000 data points and only M=100 parameters. Initialize the 100 weights so that each neuron extracts only 1 random data point (sampled without replacement). Could that somehow speed up training compared to starting from pure randomness or Gaussian noise?
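To make this concrete, here is a rough sketch of what I have in mind, in NumPy. I'm reading "100 parameters" loosely as 100 hidden units, and assuming the inputs are normalized to unit length and go through a ReLU with a bias of -0.5. The name `lookup_init`, the input dimension of 20, and the bias value are all made up for illustration.

```python
import numpy as np

def lookup_init(X, n_hidden, rng):
    """Copy one randomly chosen training point into each hidden unit's weights.

    Assumes the rows of X have unit length, so a unit's pre-activation is the
    cosine similarity between the input and the point it has "stored".
    """
    # Sample n_hidden distinct row indices (without replacement).
    idx = rng.choice(X.shape[0], size=n_hidden, replace=False)
    return X[idx].copy(), idx

rng = np.random.default_rng(0)

# Toy data: N=1000 points in 20 dimensions, normalized to unit length.
X = rng.normal(size=(1000, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)

W, idx = lookup_init(X, n_hidden=100, rng=rng)

# Feed the stored points back in: each one should excite "its own" unit most.
scores = X[idx] @ W.T                # shape (100, 100)
H = np.maximum(scores - 0.5, 0.0)    # ReLU with a hypothetical bias of -0.5
```

With this setup, each stored point gives the highest pre-activation on the unit that memorized it, which is the "lookup table" behaviour I mean. The other 900 points are not stored anywhere at initialization.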

If so, we could also try initializing the lookup table from a quick clustering of the data, to ensure good representation of the different features from the get-go.
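For the clustering variant, I imagine something like the following: run a few rounds of plain k-means and use the centroids as the initial weights instead of raw data points. This is only a sketch under my own assumptions (a hand-rolled k-means, 10 iterations, toy Gaussian data); `kmeans_centroids` is a hypothetical helper, not a library function.

```python
import numpy as np

def kmeans_centroids(X, k, rng, n_iter=10):
    """A few iterations of Lloyd's algorithm; returns a (k, d) centroid array."""
    # Start from k distinct data points.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            # Leave a centroid where it is if its cluster ended up empty.
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 20))   # toy data: N=1000 points, 20 dimensions

# Use the 100 centroids as the initial weight rows of a 100-unit hidden layer.
W_clustered = kmeans_centroids(X_demo, k=100, rng=rng)
```

The idea would be that each unit starts out representing a whole cluster of similar points rather than a single arbitrary one.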

What should I know that would make this an obviously stupid idea?

Thanks!