Jacy Reese Anthis

PhD student in sociology and statistics at the University of Chicago. Co-founder of the Sentience Institute. Interested in collaborating on and brainstorming alignment projects, perhaps in interpretability, causality, or value learning.

https://jacyanthis.com

Comments

Thanks for the comment. I take "beg the question" to mean "assumes its conclusion," but it seems like you just mean Point 2 assumes something you disagree with, which is fair. I can see reasonable definitions of aligned and misaligned under which brains would fall into either category. For example, insofar as our values are of a certain evolutionary sort (e.g., valuing reproduction), human brains exhibit misaligned mesa-optimization, such as craving sugar. If sugar craving itself is the value, then arguably we're well-aligned.

In terms of synthesizing an illusion, what exactly would make it illusory? If the synthesis (i.e., combination of the various shards and associated data) is leading to brains going about their business in a not-catastrophic way (e.g., not being constantly insane or paralyzed), then that seems to meet the bar for alignment that many, particularly agent foundations proponents, favor. See, for example, Nate's recent post:

Unfortunately, the current frontier for alignment research is “can we figure out how to point AGI at anything?”. By far the most likely outcome is that we screw up alignment and destroy ourselves.

The example I like is just getting an AI to fill a container with water, which human brains are able to do but which, in Fantasia, the sorcerer's apprentice Mickey Mouse was not! So that's a basic sense in which brains are aligned, but again I'm not sure how exactly you would differentiate alignment with its values from synthesis of an illusion.

This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently.

I think you might get better performance if you train your own DeBERTa XL-like model with classification of different snippets as a secondary objective alongside masked token prediction, rather than just fine-tuning on that classification after the initial model training. (You might use different snippets in each step to avoid double-dipping the information in that sample, analogous to splitting text data for causal inference, e.g., Egami et al. 2018.) The Hugging Face DeBERTa XL might not contain the features that would be most useful for the follow-up task of nonviolence fine-tuning. However, that might be a less interesting exercise if you want to build tools for working with more naturalistic models.
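To make the suggestion concrete, here is a minimal sketch of the two ingredients: splitting snippets into disjoint sets for the two stages, and a joint objective that adds a weighted classification loss to the masked-token loss. The function names, the `cls_weight` parameter, and the 50/50 split are all hypothetical choices for illustration, not anything from the original post or from the DeBERTa training setup. A real run would compute these losses over batches in a deep learning framework.

```python
import math
import random


def split_snippets(snippets, frac_first=0.5, seed=0):
    """Shuffle and split snippets into two disjoint sets, one per training stage.

    Hypothetical helper: the 50/50 default is an arbitrary illustrative choice.
    """
    rng = random.Random(seed)
    shuffled = list(snippets)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac_first)
    return shuffled[:cut], shuffled[cut:]


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class or token."""
    return -math.log(probs[target])


def joint_loss(mlm_probs, mlm_target, cls_probs, cls_target, cls_weight=0.5):
    """Masked-token loss plus a weighted classification loss (assumed form)."""
    mlm = cross_entropy(mlm_probs, mlm_target)
    cls = cross_entropy(cls_probs, cls_target)
    return mlm + cls_weight * cls


# Example: ten snippet IDs split so that no snippet is used in both stages.
stage_one, stage_two = split_snippets(list(range(10)))
```

The weight on the classification term controls how strongly the secondary objective shapes the learned features relative to masked token prediction; it would need tuning in practice.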

I appreciate making these notions more precise. Model splintering seems closely related to other popular notions in ML, particularly underspecification ("many predictors f that a pipeline could return with similar predictive risk"), the Rashomon effect ("many different explanations exist for the same phenomenon"), and predictive multiplicity ("the ability of a prediction problem to admit competing models with conflicting predictions"), as well as more general notions of generalizability and out-of-sample or out-of-domain performance. I'd be curious what exactly makes model splintering different. Some example questions: Is the difference just the alignment context? Is it that "splintering" refers specifically to features and concepts within the model failing to generalize, rather than the model as a whole failing to generalize? If so, what does it even mean for the model as a whole to fail to generalize but not features failing to generalize? Is it that the aggregation of features is not a feature? And how are features and concepts different from each other, if they are?
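A toy illustration of predictive multiplicity, as a minimal sketch: two models that fit the training data equally well but conflict once two features that were perfectly correlated in training come apart. The data and the two single-feature "models" are invented for illustration and are not drawn from the cited papers.

```python
# Training data: the two features always agree, so either one predicts the label.
train = [((0, 0), 0), ((1, 1), 1), ((0, 0), 0), ((1, 1), 1)]


def model_a(x):
    """Predicts the label from the first feature only."""
    return x[0]


def model_b(x):
    """Predicts the label from the second feature only."""
    return x[1]


def accuracy(model, data):
    """Fraction of (input, label) pairs the model classifies correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Out-of-domain inputs where the two features disagree.
shifted = [(0, 1), (1, 0)]
disagreements = [x for x in shifted if model_a(x) != model_b(x)]
```

Both models have identical in-sample accuracy, yet they give conflicting predictions on every shifted input. Whether that counts as the model failing to generalize or as a feature "splintering" is exactly the kind of distinction the questions above ask about.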