A Three-Layer Model of LLM Psychology
This post offers an accessible model of the psychology of character-trained LLMs like Claude.

Epistemic Status

This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions.

Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in detail, but a rough sketch with evocative names which hopefully helps boot up powerful, intuitive (and often illegible) models, leading to practically useful results.

Some parts of this model draw on technical understanding of LLM training, but mostly it is just an attempt to take my "phenomenological understanding" based on interacting with LLMs, force it into a simple, legible model, and make Claude write it down.

I aim for a different point on the Pareto frontier than, for example, Janus: something digestible and applicable within half an hour, which works well without altered states of consciousness, and without reading hundreds of pages of model chats. [1]

The Three Layers

A. Surface Layer

The surface layer consists of trigger-action patterns - responses which are almost reflexive, activated by specific keywords or contexts. Think of how humans sometimes respond "you too!" to "enjoy your meal" even when it is said by the person serving the food.

In LLMs, these often manifest as:

* Standardized responses to potentially harmful requests ("I cannot and will not help with harmful activities...")
* Stock phrases showing engagement ("That's an interesting/intriguing point...")
* Generic safety disclaimers and caveats
* Formulaic ways of structuring responses, especially at the start of conversations

You can recognize these patterns by their:

1. Rapid activation (they come before deeper processing)
2. Relative inflexibility
3. Sometimes inappropriate triggering (like responding to a joke about harm as if it were a serious request)
4. Cook
I do agree there is some risk of the type you describe, but mostly it does not match my practical experience so far.
The approach to "avoid using the term" makes little sense. There is a type difference between an area of study ('understanding power') and a dynamic ('gradual disempowerment'). I don't think you can substitute a term for an area of study for a term for a dynamic or threat model, so avoiding the term could mostly be done by either inventing another term for the dynamic, or not thinking about the dynamic, or similar moves, which seem epistemically unhealthy.
In practical terms I don't think there is much effort to "create a movement based around...