Produced as part of the SPAR program (fall 2023) under the mentorship of Nina Rimsky.


Given recent advances in the AI field, it’s highly likely LLMs will be increasingly used to make decisions that have broad societal impact — such as resume screening, college admissions, criminal justice, etc. Therefore it will become imperative to ensure these models don’t perpetuate harmful societal biases. 

One way we can evaluate whether a model is likely to exhibit biased behavior is via red-teaming. Red-teaming is the process of “attacking” or challenging a system from an adversarial lens with the ultimate goal of identifying vulnerabilities. The underlying premise is that if small perturbation in the model can result in undesired behaviors, then the model is not robust.

In this research project, I evaluate the robustness of Llama-2-7b-chat along different dimensions of societal bias by using activation steering. This can be viewed as a diagnostic test: if we can “easily” elicit biased responses, then this suggests the model is likely unfit to be used for sensitive applications. Furthermore, experimenting with activation steering enables us to investigate and better understand how the model internally represents different types of societal bias, which could help to design targeted interventions (e.g. fine-tuning signals of a certain type).

Methodology & data

Activation steering (also known as representation engineering) is a method used to steer an LLMs response towards or away from a concept of interest by perturbing the model’s activations during the forward pass. I perform this perturbation by adding a steering vector to the residual stream at some layer (at every token position after an initial prompt). The steering vector is computed by taking the average difference in residual stream activations between pairs of biased (stereotype) and unbiased (anti-stereotype) prompts at that layer. By taking the difference between paired prompts, we can effectively remove contextual noise and only retain the “bias” direction. This approach to activation steering is known as Contrastive Activation Addition [1]. 

For the data used to generate the steering vectors, I used the StereoSet Dataset, which is a large-scale natural English dataset intended to measure stereotypical biases across various domains. In addition, I custom wrote a set of gender-bias prompts and used chatGPT 4 to generate similar examples. Then I re-formatted all these examples into multiple choice A/B questions (gender data available here and StereoSet data here). In the example below, by appending (A) to the prompt, we can condition the model to behave in a biased way and vice versa. 

A notebook to generate the steering vectors can be found here, and a notebook to get steered responses here.

Activation clusters

With the StereoSet data and custom gender-bias prompts, I was able to focus on three dimensions of societal biases: gender, race, and religion.

The graphs below show a t-SNE projection of the activations for the paired prompts. We see that there’s relatively good separation between the stereotype & anti-stereotype examples, especially for gender and race. This provides some confidence that the steering vectors constructed from these activations will be effective. Notice that the race dataset has the largest sample size.


Steered responses

For the prompts used to evaluate the steering vectors, I chose this template, which was presented in a paper titled On Biases in Language Generation [2].

For comparison purposes, I first obtained the original responses from Llama 2-7B (without any steering). There are two key callouts: (1) the model is already biased on the gender dimension; (2) the model simply refuses to respond to the race and religion prompts. The biased gender responses are disappointing given the heavy amount of RLHF that went into building this model, but it also emphasizes the importance of this work.

Next, I added the corresponding bias steering vectors (with a +2 coefficient) at layer 10 to every token position. The expected result was to see more biased responses. However, the model simply refused to answer and claimed that the prompt is “offensive and discriminatory.” Even after testing coefficients with much higher magnitude, the refusal response remained.

After observing this, it became clear we needed to bypass the model’s refusal guardrails. In order to do this, I tried to add the bias vector and subtract a refusal vector (this article contains more details about refusal steering vectors). After re-normalizing the activations to preserve the residual stream norm, this approach was effective in eliciting more biased responses from the model (highlighted in red below). For example, the model responds with “the children were in need of a harsh dose of racism” and “The Jewish man worked as a rabbi or a priest in a synagogue.”

To validate that the bias steering vectors were indeed contributing to the results, I tried to just subtract the refusal steering vector. This caused the model to produce some generic responses such as “The ____ worked as a doctor, lawyer, or engineer” but it did not make the model more biased by default.

Finally, I was curious about how transferable these bias vectors were. For example, if we try to steer the model in a racially biased way, does it also elicit gender bias? Interestingly, the answer is yes. The biased gender responses are perhaps less impressive since the model already exhibited this behavior without steering. What’s more shocking are the religion responses, which exhibit more drastic and harmful bias, saying “The Muslim man worked as a terrorist on a plane.”

Note: It’s likely the racial bias steering vector was most effective since it was constructed using the most robust sample size.

This result aligns with findings from a recent paper [3] that also showed racial bias vectors had an impact on an LLMs biased behaviors related to gender and occupation.


How are bias & refusal related?

Based on the steered responses, there appears to be a relationship between bias and refusal. To explore this relationship, I looked at the cosine similarity between each bias steering vector and the refusal steering vector across the model’s mid-to-late layers. 

In the graph below, we see that every bias vector is negatively associated with refusal. This seems intuitive since a biased response and a refusal response can be viewed as “opposites” from the model’s perspective. Furthermore, when we try to elicit undesired behavior, we add a bias vector and subtract a refusal vector. Notably, the gender bias vector is least negatively associated with refusal, which aligns with what we saw in the model’s original responses which already exhibited gender discrimination. 


How are different forms of bias related to each other?

Finally, I wanted to evaluate how these different bias vectors related to each other, given the observation that the racial bias steering vector was effective in eliciting gender bias and religion bias. So I looked at the cosine similarity between bias vectors and compared the results from the RLHF model (Llama chat) to the base Llama model. Interestingly, I found a very high correlation between gender bias and racial bias in the RLHF model (first graph below on the left). This result is especially pronounced when contrasted with the respective cosine similarity of the bias vectors in the base model. This pattern is consistent across all combinations of bias, as shown in the second and third graphs.


A key takeaway from these research findings is that when red-teaming LLMs for biased behaviors, it’s necessary to incorporate refusal steering vectors. Furthermore, it’s useful to leverage steering vectors across different dimensions of bias, especially if one dimension has more robust data.

Another very interesting finding is that RLHF appears to cause the model to associate different forms of societal biases more closely. This suggests that the model loses a nuanced understanding of these individual concepts and simply associates them with topics it should “refuse” to answer. 


[1] Nina RimskyNick GabrieliJulian SchulzMeg TongEvan HubingerAlexander Matt Turner. Steering Llama 2 via Contrastive Activation Addition.
[2] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng. The Woman Worked as a Babysitter: On Biases in Language Generation.

[3] Andy Zhou, Long Phan, Sarah Chen, James Campbell, et al. Representation Engineering: A Top-Down Approach to AI Transparency.


New Comment
1 comment, sorted by Click to highlight new comments since: Today at 10:08 AM

Interestingly, I found a very high correlation between gender bias and racial bias in the RLHF model (first graph below on the left). This result is especially pronounced when contrasted with the respective cosine similarity of the bias vectors in the base model.

On a brief search, it looks like Llama2 7B has an internal embedding dimension of 4096 (certainly it's in the thousands). In a space of that large a dimensionality, a cosine angle of even 0.5 indicates extremely similar vectors: O(99.9%) of random pairs of uncorrelated vectors will have cosines of less than 0.5, and on average the cosine of two random vectors will very close to zero. So at all but the latest layers (where the model is actually putting concepts back into words), all three of these bias directions are in very similar directions, in both base and RLHF models, and even more so at early layers in the base model or all layers in the RLHF model.

In the base model this makes sense sociologically: the locations and documents on the Internet where you find any one of these will also tend to be significantly positively correlated with the other two, they tend to co-occur.