Clement Neo

Twitter: _clementneo
Site: clementneo.com

Analysing Adversarial Attacks with Linear Probing

Yoann Poupart, Imene Kerboua, Clement Neo, Jason Hoelscher-Obermaier

This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.

Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any suggested setup change, experiment advice, or idea is welcome. We also welcome pointers to any relevant papers that we may have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper’s post: EIS IX: Interpretability and Adversaries.

Code available on GitHub (still drafty).

TL;DR

Basic adversarial attacks produce examples that are indistinguishable from the originals to the human eye.
In order to explore the features vs bugs view of adversarial attacks, we trained linear probes on a classifier’s activations (we used CLIP-based models for further multimodal work). The goal is

...


We Found An Neuron in GPT-2

Joseph Miller, Clement Neo

This is a linkpost for https://clementneo.com/posts/2023/02/11/we-found-an-neuron

We started out with the question: How does GPT-2 know when to use the word "an" over "a"? The choice depends on whether the word that comes after starts with a vowel or not, but GPT-2 can only output one word at a time.

We still don’t have a full answer, but we did find a single MLP neuron in GPT-2 Large that is crucial for predicting the token " an". We also found that the weights of this neuron correspond to the embedding of the " an" token, which led us to find other neurons that predict a specific token.
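To make the idea of neuron weights "corresponding" to a token embedding concrete, here is a minimal sketch. It assumes the comparison is a plain dot product between a neuron's output weight vector and each token's embedding; that choice, the function name `top_tokens_for_neuron`, and the tiny hand-written vectors are all illustrative assumptions, not the method or the weights from the post.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_tokens_for_neuron(neuron_out_weights, token_embeddings, k=3):
    """Rank tokens by how strongly a neuron's output weights align with each embedding.

    Hypothetical helper: scores every token by the dot product between the
    neuron's output weight vector and that token's embedding vector.
    """
    scores = {tok: dot(neuron_out_weights, emb) for tok, emb in token_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy 4-dimensional embeddings (made up for illustration, NOT GPT-2 weights).
token_embeddings = {
    " a":   [1.0, 0.0, 0.0, 0.2],
    " an":  [0.0, 1.0, 0.0, 0.2],
    " the": [0.0, 0.0, 1.0, 0.2],
}

# A toy neuron whose output weights point mostly along the " an" direction.
neuron_out_weights = [0.1, 0.9, -0.1, 0.2]

ranking = top_tokens_for_neuron(neuron_out_weights, token_embeddings)
print(ranking)  # " an" ranks first for this toy neuron
```

With real model weights, the same ranking could be run over the whole vocabulary for each neuron, which is one way to look for other neurons that line up with a single token.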

Discovering the Neuron

Choosing the prompt

It was surprisingly hard to think of a prompt where GPT-2 would output “ an” (the leading space is part of the token)...
