PhD student in statistics and sociology at the University of Chicago. Co-founder of the Sentience Institute. Currently looking for AI alignment dissertation ideas, perhaps in interpretability, causality, or value learning.
This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently.
I think you might get better performance if you train your own DeBERTa XL-like model with snippet classification as a secondary objective alongside masked token prediction, rather than fine-tuning on that classification task only after the initial model training. (You might use different snippets in each step to avoid double-dipping the information in that sample, analogous to splitting text data for causal inference, e.g., Egami et al. 2018.) The Hugging Face DeBERTa XL might not contain the features that would be most useful for the follow-up task of nonviolence fine-tuning. However, that might be a less interesting exercise if you want to build tools for working with more naturalistic models.
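To make the suggestion concrete, here is a minimal, library-free sketch of the two pieces: a disjoint split of snippets between the two objectives, and a joint loss that adds a weighted classification term to the masked-token term. The function names, the 50/50 split, and the `cls_weight` parameter are all my own illustrative assumptions, not anything from the original training setup, and the logits are toy numbers rather than DeBERTa outputs.

```python
import math
import random


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def split_snippets(snippets, frac_mlm=0.5, seed=0):
    """Disjoint split (assumed 50/50): one part for masked-token
    prediction, the rest for the classification objective."""
    shuffled = snippets[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac_mlm)
    return shuffled[:cut], shuffled[cut:]


def joint_loss(mlm_logits, mlm_target, cls_logits, cls_target, cls_weight=1.0):
    """Masked-token loss plus a weighted classification loss.
    `cls_weight` is a hypothetical hyperparameter."""
    mlm_term = cross_entropy(mlm_logits, mlm_target)
    cls_term = cross_entropy(cls_logits, cls_target)
    return mlm_term + cls_weight * cls_term


# Toy usage: placeholder snippet IDs and made-up logits.
snippets = [f"snippet_{i}" for i in range(10)]
mlm_part, cls_part = split_snippets(snippets)
loss = joint_loss([2.0, 0.5, 0.1], 0, [0.3, 1.2], 1, cls_weight=0.5)
```

In a real run both terms would come from heads on a shared encoder and be backpropagated together. The point of the split is only that no snippet contributes to both objectives.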
I appreciate the effort to make these notions more precise. Model splintering seems closely related to other popular notions in ML, particularly underspecification ("many predictors f that a pipeline could return with similar predictive risk"), the Rashomon effect ("many different explanations exist for the same phenomenon"), and predictive multiplicity ("the ability of a prediction problem to admit competing models with conflicting predictions"), as well as more general notions of generalizability and out-of-sample or out-of-domain performance. I'd be curious what exactly makes model splintering different. Some example questions: Is the difference just the alignment context? Is it that "splintering" refers specifically to features and concepts within the model failing to generalize, rather than the model as a whole failing to generalize? If so, what would it even mean for the model as a whole to fail to generalize while its features do generalize? Is it that the aggregation of features is not itself a feature? And how are features and concepts different from each other, if they are?
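For readers less familiar with the related notions, here is a toy illustration of underspecification and predictive multiplicity: two hand-written classifiers that fit the same training data equally well but disagree once inputs move outside the training range. The data, the models, and the shifted inputs are all invented for illustration. This is not meant as a definition of model splintering.

```python
# Training data: 1-D inputs in [0, 1] with binary labels (made up).
train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]


def model_a(x):
    """Threshold rule: predicts 1 for everything above 0.5."""
    return 1 if x > 0.5 else 0


def model_b(x):
    """Band rule: predicts 1 only between 0.5 and 2.0."""
    return 1 if 0.5 < x < 2.0 else 0


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Both models have the same training accuracy ...
acc_a = accuracy(model_a, train)
acc_b = accuracy(model_b, train)

# ... but conflict on inputs far outside the training range.
shifted_inputs = [2.5, 3.0, 4.0]
disagreements = [x for x in shifted_inputs if model_a(x) != model_b(x)]
```

Here the training data cannot distinguish the two rules, so which one a pipeline returns is arbitrary, and the choice only matters after the distribution shifts.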