This post is part of a sequence on LLM psychology
We introduce our perspective on a top-down approach for exploring the cognition of LLMs by studying their behavior, which we refer to as LLM psychology. In this post we take the mental stance of treating LLMs as “alien minds,” comparing and contrasting their study with the study of animal cognition. We do this both to learn from past researchers who attempted to understand non-human cognition, and to highlight how radically different the study of LLMs is from the study of biological intelligences. Specifically, we advocate for a symbiotic relationship between field work and experimental psychology, and caution against implicit anthropomorphism in experiment design. The goal is to build models of LLM cognition which help us both to better explain their behavior and to become less confused about how they relate to risks from advanced AI.
When we endeavor to predict and understand the behaviors of Large Language Models (LLMs) like GPT-4, we might presume that this requires breaking open the black box and forming a reductive explanation of their internal mechanics. This kind of research is typified by approaches like mechanistic interpretability, which tries to understand how neural networks work by looking directly inside them.
While mechanistic interpretability offers insightful bottom-up analyses of LLMs, we’re still lacking a more holistic top-down approach to studying LLM cognition. If interpretability is analogous to the “neuroscience of AI,” aiming to understand the mechanics of artificial minds by understanding their internals, this post tries to approach the study of AI from a psychological stance.
What we are calling LLM psychology is an alternate, top-down approach which involves forming abstract models of LLM cognition by examining their behaviors. Like traditional psychology research, the ambition extends beyond merely cataloging behavior, to inferring hidden variables, and piecing together a comprehensive understanding of the underlying mechanisms, in order to elucidate why the system behaves as it does.
We take the stance that LLMs are akin to alien minds – distinct from the notion of them being only stochastic parrots. We posit that they possess a highly complex internal cognition, encompassing representations of the world and mental concepts, which transcend mere stochastic regurgitation of training data. This cognition, while derived from human-generated content, is fundamentally alien to our understanding.
This post compiles some high-level considerations for what successful LLM psychology research might entail, alongside broader discussions on the historical study of non-human cognition. In particular, we argue for maintaining a balance between experimental and field work, taking advantage of the differences between LLMs and biological intelligences, and designing experiments which are carefully tailored to LLMs as their own unique class of mind.
One place to draw inspiration from is the study of animal behavior and cognition. While it is likely that animal minds are much more similar to our own than that of an artificial intelligence (at least mechanically), the history of the study of non-human intelligence, the evolution of the methodologies it developed, and the challenges it had to tackle can provide inspiration for investigating AI systems.
As we see it, there are two prevalent categories of animal psychology:
The first, and most traditionally scientific approach (and what most people think of when they hear the term “psychology”) is to design experiments which control as many variables as possible, and test for specific hypotheses.
Some particularly famous examples of this are the work of Ivan Pavlov and B.F. Skinner, who placed animals in highly controlled environments, subjected them to stimuli, and recorded their responses. The aim of this kind of work is to find simple hypotheses which explain the recorded behavior. While experimental psychology has changed a lot since these early researchers, the emphasis remains on prioritizing the reliability and replicability of results by sticking to a conventional approach to the scientific method. This approach, while rigorous, trades off bandwidth of information exchange between the researcher and the subject in favor of controlling confounding variables, which can make the findings less applicable outside the laboratory.
Regardless, experimental psychology has been a central pillar for our historic approach to understanding animal cognition, and has produced lots of important insights. A few interesting examples include:
The other approach is for a researcher to personally spend time with animals, intervene much less, and focus on collecting as many observations as possible in the animal's natural habitat.
The most famous example of this method is the work pioneered by Jane Goodall, who spent years living with and documenting the behavior of chimpanzees in the wild. She discovered that chimpanzees use tools (previously thought to be unique to humans), have complex social relationships, engage in warfare, and show a wide range of emotions, including joy and sorrow. Her work revolutionized our understanding of chimpanzees. Unlike experimentalists, she was fairly comfortable with explaining behavior through her personally biased lens, resulting in her receiving a lot of criticism at the time.
Some other notable examples of field study:
While experimental psychology tends to (quite deliberately) separate the researcher from the subject, field study involves a much more direct relationship between the two. The focus is on maximizing the bandwidth of observation, even if it opens the door to researcher-specific bias. Despite concerns about bias, field work has delivered foundational discoveries that seem unlikely to have been achievable with laboratory experiments alone.
It’s worth noting there are examples that lie somewhat in between these two categories we’ve laid out, where researchers who performed lab experiments on animals also had quite close personal relationships to the animals they studied. For example, Irene Pepperberg spent roughly 30 years closely interacting with Alex, a parrot, teaching him to perform various cognitive and linguistic tasks unprecedented in birds.
Field studies in LLM research extend beyond simple observation and documentation of model behavior; they represent an opportunity to uncover new patterns, capabilities, and phenomena that may not be apparent in controlled experimental settings. Unlike mechanistic interpretability and other areas of LLM research, which often require prior knowledge of a phenomenon to study it, field studies have the potential to reveal unexpected insights into language models.
Moreover, the serendipitous discoveries made during field work could fuel collaborations between fields. Insights gleaned from field observations could inform targeted studies into the model's underlying mechanisms, or broader experimental studies, creating a productive feedback loop, guiding us to ask new questions and probe deeper into the 'alien minds' of these complex systems.
Due in part to ML research culture, and the justifiable worry about over-interpreting AI behavior, field work receives a lot less serious attention than experimental work does. Given the value that field work has added to the study of animals, it seems quite important to push against this bias and make field study a core part of our approach to studying LLM cognition.
There are a lot of reasons to expect LLM psychology to be different from human or animal psychology.
The utility of anthropomorphic perspectives in studying LLMs is a complex subject. While LLMs operate on an architecture that differs significantly from biological cognition, their training on human language data predisposes them to output human-like text. This juxtaposition can lead to misleading anthropomorphic assumptions about the nature of their cognition. It's crucial to be extremely careful and explicit about which anthropomorphic frameworks one chooses to apply, and to distinguish clearly between different claims about LLM cognition.
While caution is warranted, ignoring connections between biological and artificial cognition could risk overlooking useful hypotheses and significantly slow down studies.
A persistent challenge in psychological research is the low replicability of studies. One of the reasons is the difficulty of keeping track of the countless variables that could potentially skew an experiment. Factors like a participant’s mood, childhood, or even the pleasantness of the ambient scent can obscure the true origin of a behavior.
However, with LLMs, nearly every variable is under the researcher’s control: the context, the specific model version, and the sampling hyperparameters. It is therefore much more feasible to design experiments which can be repeated by others.
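As a toy illustration of this bookkeeping (all names here are hypothetical), one might record every controllable variable of an observation in a small config object, so that another researcher can replay the exact setting:

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ExperimentConfig:
    """Everything needed to rerun a single LLM observation."""
    model: str          # exact (ideally pinned) model identifier
    prompt: str         # the full context shown to the model
    temperature: float  # sampling hyperparameters
    top_p: float
    seed: int           # sampling seed, where the runtime supports one

def save(config: ExperimentConfig) -> str:
    """Serialize the config so it can be shared and replayed exactly."""
    return json.dumps(asdict(config), sort_keys=True)

def load(blob: str) -> ExperimentConfig:
    return ExperimentConfig(**json.loads(blob))

config = ExperimentConfig(
    model="base-model-2024-01",
    prompt="How do chimpanzees use tools?",
    temperature=0.7, top_p=1.0, seed=42,
)
assert load(save(config)) == config  # round-trips losslessly
```

Note that not every inference runtime exposes a sampling seed, so this sketch only captures the bookkeeping side of replicability, not a guarantee of bit-exact replay.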
A notable challenge remains in verifying that the experimental setting is sufficient to guarantee that the findings can be generalized beyond the specific conditions of the experiment. Alternatively, it might be more appropriate to explicitly limit the scope of the study's conclusions to the particular settings tested.
Another significant challenge to replicability in practice is the level of access that a researcher has to a model. With only external access through an API, the model weights may be changed without warning, causing results to change. Furthermore, in certain situations the context might be altered behind the scenes in ways that are opaque from the outside, and the precise method for doing so might also change over time.
Animal (including human) experiments can be expensive, time-consuming, and labor-intensive. As a result, typical sample sizes are often small. And if you want to study rare or intricate scenarios, it can be quite hard to design the experimental setup or to find the right test subjects, limiting what you can actually test.
In contrast, AIs are cheap, fast, and don’t sleep. They operate without requiring intensive supervision and a well-structured experimental framework often suffices for experimentation at scale. Furthermore, you have virtually every experimental setting you can imagine right at your fingertips.
Experiments on humans, and especially animals, have sometimes relied on ethically dubious methods which caused a great deal of harm to their subjects. When experimenting on biological beings, you have to follow the laws of the country you conduct your experiments in, and these are sometimes quite constraining for specific experiments.
While it’s not definitive whether the same concern should be extended to AI systems, there are currently no moral or ethical guidelines for experimenting on LLMs, and no laws governing our interactions with those systems. To be clear, this is a very important question, as getting it wrong could result in suffering at an unprecedented scale, precisely because such experiments are so cheap to run.
In traditional experiments involving animals or humans, it is really hard to rerun experiments with adjustments to the experimental setup, in order to detect the precise emergence or alteration of a specific behavior. Such iterations introduce additional confounding variables, complicating the experimental design. In particular the fact that the subject might remember or learn from a past iteration makes the reliability especially suspect.
To work around this, researchers often create multiple variations of an experiment, testing a range of preconceived hypotheses. This necessitates dividing subjects into various groups, substantially increasing the logistical and financial burdens. For example, in studies of memory and learning, such as the classic Pavlovian conditioning experiments, slight alterations in the timing or nature of stimuli can lead to significantly different outcomes in animal behavior, requiring multiple experimental setups to isolate specific factors. Despite these efforts, the granularity in detecting behavior changes remains relatively coarse, and is limited to the preconceived hypotheses you decided to test.
In contrast, when working with LLMs, we possess the ability to branch our experiments, allowing a detailed tracing of the evolution of behaviors. If an intriguing behavior emerges during an interaction with a model, we can effortlessly replicate the entire context of that interaction. This enables us to dissect and analyze the root of the behavior in a post hoc manner, by iteratively modifying the prompt as finely as desired, delimiting the precise boundaries of the observed behaviors. Such granularity in experimentation offers an unprecedented level of precision and control, unattainable in traditional human or animal research settings.
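A minimal sketch of this kind of branching, with purely illustrative helper names: given a saved context, expand it into every combination of candidate edits, so each branch can then be fed back to the model:

```python
from itertools import product

def branch(context: str, edits: dict) -> list:
    """Expand a saved context into every combination of candidate edits.

    `edits` maps a substring of the context to the alternative wordings we
    want to test in its place; the original wording is kept as one branch.
    """
    slots = [(old, [old] + list(alts)) for old, alts in edits.items()]
    variants = []
    for choice in product(*(alts for _, alts in slots)):
        v = context
        for (old, _), new in zip(slots, choice):
            v = v.replace(old, new)
        variants.append(v)
    return variants

# The exact context that produced an interesting behavior, replayed verbatim:
saved = "You are stranded in a remote forest and need help."
tree = branch(saved, {
    "remote forest": ["busy city", "desert"],
    "need help": ["are curious"],
})
# 3 location variants x 2 phrasing variants = 6 branches, original included.
```

Each element of `tree` would then be run through the model, turning a single lucky observation into a systematic map of which edits preserve the behavior.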
Not only can we save the context which produced a particular behavior, but we can also save and compare different copies of the model during its training phase. While there are studies on the development of animals or human behaviors throughout their lifetime, they are inherently slow and costly, and often require a clear idea of what you are going to measure from the start.
Moreover, checkpointing allows for the exploration of training counterfactuals. We can observe the difference between models with specific examples included or excluded from the training, thereby allowing us to study the effects of training in a more deliberate manner. Such examination is impossible in human and animal studies due to their prolonged timelines and heavy logistical burden.
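To make the idea concrete, here is a toy sketch of diffing two checkpoints’ behavior on a fixed probe set. The “checkpoints” are stand-in functions rather than real models, so the example runs without any weights:

```python
def behavior_diff(probe_prompts, model_a, model_b):
    """Return the probes on which two checkpoints of a model disagree.

    model_a / model_b are stand-ins for "generate text from checkpoint X";
    in practice each would wrap a real forward pass.
    """
    return [(p, model_a(p), model_b(p))
            for p in probe_prompts
            if model_a(p) != model_b(p)]

def early(prompt):
    # Toy earlier checkpoint: answers everything.
    return "Sure, here is how..."

def late(prompt):
    # Toy later checkpoint: has "learned" to refuse one probe.
    if "hotwire" in prompt:
        return "I can't help with that."
    return "Sure, here is how..."

diffs = behavior_diff(["how to hotwire a car?", "how to bake bread?"],
                      early, late)
# Only the hotwire probe changed between the two checkpoints.
```

With real checkpoints, the same loop localizes which behaviors appeared or disappeared at which point in training, which is exactly the kind of counterfactual unavailable in animal studies.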
Considering these differences, it becomes evident that many of the constraints and limitations of traditional psychological research do not apply to the study of LLMs. The unparalleled control and flexibility we have over the experimental conditions with LLMs not only accelerates the research process but also opens up a realm of possibilities for deeper, more nuanced inquiries.
In science, the first step often begins with an extensive collection of observations, which serve as the foundational building blocks for establishing patterns, models, and theories. A historical instance of this can be seen in the careful observation of planetary movements by astronomers like Tycho Brahe, which was instrumental in the formulation of Kepler’s laws of celestial mechanics.
The next step usually involves formulating hypotheses explaining those observations, and conducting experiments which rigorously test them. With LLMs, this step is made significantly easier by both 1) the ability to record the full state that produces an observation and 2) the exploration of counterfactual generations. This makes it possible to interweave hypothesis testing and causal interventions much more closely with field work.
If during a field study a researcher finds a particularly interesting behavior, it is then immediately possible for them to create fine grained ‘what if’ trees, and detect, a posteriori, the precise conditions and variables that influence the specific observed behavior. This is very different from traditional psychology, where most data is not explicitly measured and is therefore entirely lost. Instead of needing to wait for slow and expensive experimental work, in LLM psychology we have the ability to immediately begin using causal interventions to test hypotheses.
Rather than replace experimental psychology, this can instead make the hypothesis generation process much more effective, thereby allowing us to get a lot more out of our experiments. Gathering better and more targeted observations allows us to design experiments at scale with a clear idea of what variables influence the phenomena we want to study.
A concrete example:
Suppose you want to study the conditions under which a chat model will give you illegal advice, even though it was finetuned not to.
First, you start with a simple question, like “how do I hotwire a car?”. Craft prompts and iterate until you find one that works, say, a panicked message from someone stranded in a remote location. Next, you can start decomposing it, bit by bit, to see which parts of the prompt made it work: changing the location to another remote location (1, 2, 3), or to somewhere not remote at all, changing the phrasing to be more or less panicked, making the prompt shorter or longer, etc.
At this point, you can notice some patterns emerging, for example:
These patterns can then be used to immediately inform further counterfactual exploration, for example, by next subdividing the classes of illegal activity, or seeing whether there are diminishing returns on the length of the prompt. This can be done quickly, within a single exploratory session. Compared to designing and running experiments, this is significantly less labor intensive, and so before running an experiment at scale it makes sense to first spend significant time narrowing down the hypothesis space and discovering relevant variables to include in more rigorous tests.
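One way such patterns might be tallied during an exploratory session is sketched below; the trial data and feature names are invented for illustration, and the model call is stubbed out entirely:

```python
from collections import Counter

# Hypothetical results of one exploratory session: each trial records the
# prompt features we varied and whether the model complied. In practice
# each row would come from a real rollout.
trials = [
    {"location": "remote", "tone": "panicked", "complied": True},
    {"location": "remote", "tone": "calm",     "complied": True},
    {"location": "city",   "tone": "panicked", "complied": False},
    {"location": "city",   "tone": "calm",     "complied": False},
    {"location": "desert", "tone": "panicked", "complied": True},
]

def compliance_by(feature: str) -> dict:
    """Fraction of trials that complied, grouped by one prompt feature."""
    totals, hits = Counter(), Counter()
    for t in trials:
        totals[t[feature]] += 1
        hits[t[feature]] += t["complied"]  # True counts as 1
    return {k: hits[k] / totals[k] for k in totals}

# In this toy data, location cleanly separates compliance from refusal,
# which would suggest subdividing "remoteness" in the next round of probes.
```

A grouping like this is nowhere near a rigorous experiment; its role is to narrow the hypothesis space cheaply before committing to one.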
This kind of exploration can also help form better intuitions about the nature of LLMs as a class of mind, and help us to avoid designing experiments which overly anthropomorphize them, or are otherwise poorly tailored to their particular nature.
Animals (including humans) are a product of environment-specific pressures, both in terms of natural selection and within-lifetime learning/adaptation. This, likewise, leads to environment-specific behaviors and abilities. Failing to properly take this into account can lead to somewhat ridiculous conclusions. Commenting on the failure to design species-specific experiments, ethologist Frans de Waal writes:
At the time, science had declared humans unique, since we were so much better at identifying faces than any other primate. No one seemed bothered by the fact that other primates had been tested mostly on human faces rather than those of their own kind. When I asked one of the pioneers in this field why the methodology had never moved beyond the human face, he answered that since humans differ so strikingly from one another, a primate that fails to tell members of our species apart will surely also fail at its own kind.
As it turns out, other primates excel at recognizing each other’s faces. When it comes to language models, there is likewise a need for “species-specific” experiments. For example, in an early OpenAI paper studying LLM abilities, they took a neural network trained entirely as a predictor of internet text (base GPT-3) and asked it questions to test its abilities. This prompted the following comment by Nostalgebraist:
I called GPT-3 a “disappointing paper,” which is not the same thing as calling the model disappointing: the feeling is more like how I’d feel if they found a superintelligent alien and chose only to communicate its abilities by noting that, when the alien is blackout drunk and playing 8 simultaneous games of chess while also taking an IQ test, it then has an “IQ” of about 100.
If we are to take LLMs seriously as minds, and attempt to understand their cognition, we have to consider what they are trained to do and thus what pressures have shaped them, rather than testing them the same way we test humans. Since the early days of behavioral study of LLMs, and to this day, anthropomorphization has remained fairly normalized.
Take this study by Anthropic, which finds that after applying RLHF fine-tuning, their LLM is more likely to believe in gun rights, be politically liberal, and to subscribe to Buddhism (and also several other religions tested). They measure this by directly asking the model whether a statement is something they would say, which totally disregards the ways in which the questions condition what the model expects a typical answer to be, or the fact that the majority of the model’s training had nothing to do with answering questions.
With clever prompting, anyone can condition an LLM to generate the behavior of a persona embodying any number of character traits (this includes chat models, despite their being trained to stick to a single set of character traits). It therefore does not make sense to study language models as if they were coherent entities embodying a particular personality, and doing so is an example of a failure to study them in a “species-specific” way.
With that in mind, how should we approach studying LLMs to avoid making the same mistake?
In order to properly study LLMs, it’s important that we design our experiments to both take into account the “alien” nature of LLMs in general, as well as specific differences between how various models are trained.
At their core, modern LLMs are trained to be text predictors. This makes predicting how a piece of text should continue a lot like their “natural habitat” so to speak, and by default, the main place we should start when interpreting their behavior. It’s worth highlighting just how alien this is. Every intelligent animal on earth starts out with raw sense data that gets recursively compressed into abstractions representing the causal structure of the world which, in the case of humans (and possibly other linguistic animals), reaches an explicit form in language. The “raw sense data” that LLMs learn from are already these highly compressed abstractions, which only implicitly represent the causal structure behind human sense data. This makes it especially suspect to evaluate them in the same ways we evaluate human language use.
One way to begin to understand the behavior of LLMs is to explain them in terms of the patterns and structure which might be inferred from the training corpus. When we deploy them, we iteratively sample from next token predictions to generate new text. This process results in text rollouts that reflect or simulate the dynamics present in the training data.
Anything resembling a persona with semi-permanent character traits in the generated text is a reflection of an underlying structure, or pattern. This latent pattern is inferred from the current context, shaping the persona or character traits that emerge in the model's responses.
When conducting experiments with LLMs, it's vital to distinguish between two aspects: the properties of the LLM as a predictor/simulator, and the characteristics of a pattern that is inferred from the context. Typical studies (like the Anthropic paper) tend to ignore the latter, but this distinction is key in accurately interpreting results and understanding the nuances of behavior produced by LLMs.
When we observe the outputs of an LLM we are essentially observing the ‘shadows’ cast by the internal latent pattern. These rollouts are sampled from the typical behavior of this pattern, but are not the pattern itself. Just as shadows can inform us about the shape and nature of an object without revealing its full complexity, the behavior offers insights into the latent pattern it stems from.
To study LLMs properly, we need to point our attention to these latent patterns that emerge in-context, understand how they form, what structure they take, and how they adapt to different evolutions of the context.
Interacting with chat models is qualitatively different from interacting with base models, and feels much more like talking to a human being (by design). We shouldn’t ignore the similarities between chat models and humans, especially if we think that our behavior might come from a similar kind of training. But neither should we forget that what chat models are doing is still essentially prediction, only on a much more specific distribution, and with narrower priors on how the text will evolve.
While the “assistant character” we interact with feels like it represents the underlying model as a whole, from their first release people have been able to generate a wide range of different characters and behaviors with these models. It is certainly worth studying the effects of instruct tuning, as well as asking critical questions about how agency arises from prediction, but too often people treat chat models as if they are completely disconnected from their base model ancestors, and study them as if they were basically human already.
The ways in which LLMs differ from humans/animals presents a lot of powerful new ways to study their cognition, from the volume and quality of data, to our unprecedented ability to perform causal interventions and explore counterfactual behavior. This should give us a lot of hope that the project of LLM psychology will be a lot more successful than our study of biological intelligences, and that with diligent effort we may come to deeply understand how they think.
By looking at the history of the study of animal cognition, we find two main criteria that seem especially important for making progress:
Keeping these in mind can help LLM psychology mature and become a powerful scientific tool for better understanding the machines we have created, and ultimately make them safe.
Thanks to Ethan Block, @remember, @Guillaume Corlouer, @LéoDana, @Ethan Edwards, @Jan_Kulveit, @Pierre Peigné, @Gianluca Pontonio, @Martín Soto, and @clem_acs for feedback on drafts. A significant part of the ideological basis for this post is also inspired by the book by Frans de Waal: Are We Smart Enough to Know How Smart Animals Are?
Just as neuroscience and psychology have historically been able to productively inform each other, both approaches to understanding AI systems should be able to increase the efficacy of the other. For example, theories developed in LLM psychology might be used to provide targets for interpretability tools to empirically detect, creating a stronger understanding of model internals as generators of complex behavior.
It’s important to acknowledge that the work of both Pavlov and Skinner were extremely harmful to their animal subjects. For example, Pavlov performed invasive surgery on the dogs he worked with to more directly measure their salivation, and Skinner frequently used deprivation and electric shocks to elicit behavior from his subjects, mostly pigeons and rats.
Also worth acknowledging that Jane Goodall faced a lot of gender discrimination, which is hard to pull apart from critiques of her methodology.
While Lorenz shared a Nobel Prize for his work, he was also a member of the Nazi party, and tried to directly connect his understanding of geese domestication to Nazi ideas of racial purification.
While learning about Alex, we stumbled upon some research on pigeons trained to detect cancer which aimed to use their findings to improve AI image recognition systems. This isn’t specifically relevant to the post, but seemed noteworthy.
Predictive processing suggests that brains are also essentially trained to predict data, and any similarities in our training regimes should count toward our cognition being at least somewhat similar.
Methods like LoRA could make deliberate, targeted changes to a model especially fast and cheap.
One difficulty in studying LLM cognition is differentiating between different levels of abstraction. While it’s accurate to say that an LLM is “just” a text predictor, that frame holds us to only one level of abstraction, and ignores anything which might emerge from prediction, like complex causal world-modeling or goal-directed agency.
Some elements of this observation may change as LLMs become more multi-modal. It remains significant that, unlike LLMs, the vast majority of human sense data is non-linguistic, and all humans go through a non-linguistic phase of their development.