AI Alignment Forum: Tags

Activation Engineering

• Applied to Mechanistically Eliciting Latent Behaviors in Language Models by Alex Turner 20d ago
• Applied to How well do truth probes generalise? by mishajw 3mo ago
• Applied to Auto-matching hidden layers in Pytorch LLMs by chanind 3mo ago
• Applied to What's the theory of impact for activation vectors? by jacobjacob 3mo ago
• Applied to Implementing activation steering by Annah 3mo ago
• Applied to Investigating Bias Representations in LLMs via Activation Steering by kave 4mo ago
• Applied to Striking Implications for Learning Theory, Interpretability — and Safety? by Roger Dearnaley 4mo ago
• Applied to Steering Llama-2 with contrastive activation additions by Alex Turner 5mo ago
• Applied to Classifying representations of sparse autoencoders (SAEs) by Annah 6mo ago
• Applied to Features and Adversaries in MemoryDT by Tassilo Neubauer 7mo ago
• Applied to Comparing representation vectors between llama 2 base and chat by Nina Rimsky 7mo ago
• Applied to Paper: Understanding and Controlling a Maze-Solving Policy Network by Alex Turner 7mo ago
• Applied to Inference-Time Intervention: Eliciting Truthful Answers from a Language Model by Zach Stein-Perlman 8mo ago
• Applied to Evaluating hidden directions on the utility dataset: classification, steering and removal by Annah 8mo ago
• Applied to Understanding and controlling a maze-solving policy network by Alex Turner 8mo ago
• Applied to Sparse Coding, for Mechanistic Interpretability and Activation Engineering by David Udell 8mo ago
• Applied to ActAdd: Steering Language Models without Optimization by Alex Turner 9mo ago
• Applied to Modulating sycophancy in an RLHF model via activation steering by David Udell 9mo ago
• Applied to Understanding Counterbalanced Subtractions for Better Activation Additions by David Udell 9mo ago