Sorted by New

Wiki Contributions


As others have hinted at/pointed out in the comments, there is an entire science of deep learning out there, including on high-level (vs. e.g. most of low-level mech interp) aspects that can be highly relevant to alignment and that you seem to not be aware of/dismiss. E.g. follow the citation trail of An Explanation of In-context Learning as Implicit Bayesian Inference.

Related - I'd be excited to see connectome studies on how mice are mechanistically capable of empathy; this (+ computational models) seems like it should be in the window of feasibility given e.g. Towards a Foundation Model of the Mouse Visual Cortex: 'We applied the foundation model to the MICrONS dataset: a study of the brain that integrates structure with function at unprecedented scale, containing nanometer-scale morphology, connectivity with >500,000,000 synapses, and function of >70,000 neurons within a ∼ 1mm3 volume spanning multiple areas of the mouse visual cortex. This accurate functional model of the MICrONS data opens the possibility for a systematic characterization of the relationship between circuit structure and function.' 

The computational part could take inspiration from the large amounts of related work modelling other brain areas (using Deep Learning!), e.g. for a survey/research agenda: The neuroconnectionist research programme.

Excited to see people thinking about this! Importantly, there's an entire ML literature out there to get evidence from and ways to [keep] study[ing] this empirically. Some examples of the existing literature (also see Path dependence in ML inductive biases and How likely is deceptive alignment?): Linear Connectivity Reveals Generalization Strategies - on fine-tuning path-dependance, The Grammar-Learning Trajectories of Neural Language Models (and many references in that thread), Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets - on pre-training path-dependance. I can probably find many more references through my boorkmarks, if there's an interest for this.