Christopher Ackerman — AI Alignment Forum

Role embeddings: making authorship more salient to LLMs

This is an interim research report on role embeddings, an approach to make language models more robust to many-shot jailbreaks and prompt injections by adding role information at every token position in the context rather than just at special token delimiters. We credit Cem Anil for originally proposing this idea....

Jan 7, 2025 · 50

Investigating the Ability of LLMs to Recognize Their Own Writing

This post is an interim progress report on work being conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR). Summary of Key Points: We test the robustness of an open-source LLM’s (Llama3-8b) ability to recognize its own outputs on a diverse mix of datasets, two different tasks...

Jul 30, 2024 · 32

Representation Tuning

Summary: First, I identify activation vectors related to honesty in an RLHF’d LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can...

Jun 27, 2024 · 35