This is an interim research report on role embeddings, an approach to making language models more robust to many-shot jailbreaks and prompt injections by adding role information at every token position in the context rather than only at special-token delimiters. We credit Cem Anil for originally proposing this idea....
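To make "role information at every token position" concrete, here is a minimal sketch. The excerpt does not say how the role signal is injected, so this assumes the simplest option: one vector per role, added elementwise to each token's embedding. The names `add_role_embeddings` and `role_table`, and the role labels, are hypothetical and not taken from the report.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def add_role_embeddings(
    token_embeddings: Sequence[Vector],
    roles: Sequence[str],
    role_table: Dict[str, Vector],
) -> List[Vector]:
    """Add a per-role vector to every token embedding (assumed additive scheme).

    token_embeddings: one vector per token position.
    roles: one role label per token position (hypothetical labels).
    role_table: maps each role label to a vector of the same width.
    """
    if len(token_embeddings) != len(roles):
        raise ValueError("need exactly one role label per token position")

    tagged: List[Vector] = []
    for embedding, role in zip(token_embeddings, roles):
        role_vector = role_table[role]
        if len(role_vector) != len(embedding):
            raise ValueError("role vector width must match embedding width")
        # Every position carries its role, not just the delimiter tokens.
        tagged.append([e + r for e, r in zip(embedding, role_vector)])
    return tagged
```

Because the role is attached to each position, text that merely imitates a delimiter inside a user turn would still carry the user role vector under this scheme.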
This post is an interim progress report on work being conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR). Summary of Key Points: We test how robustly an open-source LLM (Llama3-8b) can recognize its own outputs across a diverse mix of datasets, two different tasks...
Summary: First, I identify activation vectors related to honesty in an RLHF’d LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can...
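The steering step described here, adding a positive or negative multiple of a vector to residual stream activations, can be sketched in a few lines. This is an illustration of the arithmetic only, using plain Python lists in place of model tensors. The function name `steer_residual` and the choice to apply the same shift at every position are assumptions, not details from the post.

```python
from typing import List, Sequence

Vector = List[float]


def steer_residual(
    residual: Sequence[Vector],
    direction: Vector,
    multiplier: float,
) -> List[Vector]:
    """Return residual activations shifted by multiplier * direction.

    residual: one activation vector per token position.
    direction: the steering vector (for example, an honesty-related direction).
    multiplier: positive to push along the direction, negative to push against it.
    """
    steered: List[Vector] = []
    for activation in residual:
        if len(activation) != len(direction):
            raise ValueError("direction width must match activation width")
        # Assumption: the same shift is applied at every token position.
        steered.append([a + multiplier * d for a, d in zip(activation, direction)])
    return steered
```

In a real model this addition would typically be applied inside a hook on a chosen layer during generation; which layer and which positions to steer are choices the excerpt does not specify.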