AI ALIGNMENT FORUM
Activation Engineering
• Applied to: Avoiding jailbreaks by discouraging their representation in activation space (by Guido Bergman, 13d ago)
• Applied to: Validating / finding alignment-relevant concepts using neural data (by Bogdan Ionut Cirstea, 20d ago)
• Applied to: [Paper] Programming Refusal with Conditional Activation Steering (by Bruce W. Lee, 1mo ago)
• Applied to: Activation Engineering Theories of Impact (by Jakub Nowak, 3mo ago)
• Applied to: I found >800 orthogonal "write code" steering vectors (by Jacob G-W, 3mo ago)
• Applied to: An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs (by Jan Wehner, 3mo ago)
• Applied to: Control Vectors as Dispositional Traits (by Jan Wehner, 3mo ago)
• Applied to: Representation Tuning (by Jan Wehner, 3mo ago)
• Applied to: LLMs Universally Learn a Feature Representing Token Frequency / Rarity (by Sean Osier, 3mo ago)
• Applied to: Jailbreak steering generalization (by Nina Panickssery, 4mo ago)
• Applied to: Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller (by Henry Cai, 4mo ago)
• Applied to: Introducing SARA: a new activation steering technique (by Alejandro Tlaie Boria, 4mo ago)
• Applied to: Mechanistically Eliciting Latent Behaviors in Language Models (by Alex Turner, 5mo ago)
• Applied to: How well do truth probes generalise? (by mishajw, 8mo ago)
• Applied to: Causal Graphs of GPT-2-Small's Residual Stream (by David Udell, 8mo ago)