AI ALIGNMENT FORUM

Activation Engineering

Edited by David Udell. Last updated 29th Aug 2023.

Activation engineering is the direct manipulation of activation vectors inside a trained machine learning model, typically at inference time and without changing the model's weights. It is a potential way to steer a model's behavior.

Activation engineering can be contrasted with two other strategies for steering models: fine-tuning a model toward the desired behavior, and crafting prompts that elicit a particular response.
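To make the idea concrete, here is a minimal sketch of one common form of activation engineering: adding a steering vector to a hidden layer during the forward pass. Everything in it is an assumption made for illustration: the tiny two-layer network, its made-up weights, the names `hidden`, `forward`, `steer` and `coeff`, and the choice to build the steering vector as the difference between activations on two contrasting inputs. It is a toy, not the implementation used in any of the posts listed on this page.

```python
import math

# Toy two-layer network with fixed, made-up weights (hypothetical values,
# not taken from any real model).
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]  # 3 hidden units x 2 inputs
W2 = [[1.0, -1.0, 0.5]]                      # 1 output x 3 hidden units


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def hidden(x):
    """Return the hidden-layer activation vector for input x."""
    return [math.tanh(v) for v in matvec(W1, x)]


def forward(x, steering=None, coeff=1.0):
    """Run the network, optionally adding coeff * steering to the hidden activations."""
    h = hidden(x)
    if steering is not None:
        # The intervention: edit the activations directly; the weights stay untouched.
        h = [hi + coeff * si for hi, si in zip(h, steering)]
    return matvec(W2, h)[0]


# Assumed recipe for the steering vector: the difference between the hidden
# activations produced by two contrasting inputs.
x_pos, x_neg = [1.0, 0.0], [0.0, 1.0]
steer = [a - b for a, b in zip(hidden(x_pos), hidden(x_neg))]

x = [0.2, 0.2]
baseline = forward(x)
steered = forward(x, steering=steer, coeff=2.0)
print("baseline:", baseline)
print("steered: ", steered)
```

The same input produces a different output once the steering vector is added, even though `W1` and `W2` never change. That is the contrast with fine-tuning (which edits weights) and prompting (which edits the input). In a real model the same intervention would usually be applied with the framework's hook mechanism at a chosen layer, and the coefficient would be tuned empirically.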

Posts tagged Activation Engineering
Steering GPT-2-XL by adding an activation vector
Authors: TurnTrout, Monte M, David Udell, lisathiergart, Ulisse Mini | Age: 2y | Karma: 121 | Comments: 63

Modulating sycophancy in an RLHF model via activation steering
Authors: Nina Panickssery | Age: 2y | Karma: 33 | Comments: 19

Mechanistically Eliciting Latent Behaviors in Language Models
Authors: Andrew Mack, TurnTrout | Age: 1y | Karma: 97 | Comments: 20

Reducing sycophancy and improving honesty via activation steering
Authors: Nina Panickssery | Age: 2y | Karma: 52 | Comments: 4

Maze-solving agents: Add a top-right vector, make the agent go to the top-right
Authors: TurnTrout, peligrietzer, lisathiergart | Age: 2y | Karma: 37 | Comments: 7

Extracting and Evaluating Causal Direction in LLMs' Activations
Authors: Fabien Roger, simeon_c | Age: 3y | Karma: 8 | Comments: 2

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Authors: j_we | Age: 1y | Karma: 23 | Comments: 1

ActAdd: Steering Language Models without Optimization
Authors: technicalities, TurnTrout, lisathiergart, David Udell, Ulisse Mini, Monte M | Age: 2y | Karma: 45 | Comments: 2

Representation Tuning
Authors: Christopher Ackerman | Age: 1y | Karma: 14 | Comments: 3

Evaluating hidden directions on the utility dataset: classification, steering and removal
Authors: Annah, shash42 | Age: 2y | Karma: 13 | Comments: 0

Understanding and controlling a maze-solving policy network
Authors: TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell | Age: 2y | Karma: 140 | Comments: 23

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Authors: likenneth | Age: 2y | Karma: 64 | Comments: 0

Steering Llama-2 with contrastive activation additions
Authors: Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub, TurnTrout | Age: 2y | Karma: 49 | Comments: 23

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Authors: Andrew Mack, TurnTrout | Age: 9mo | Karma: 51 | Comments: 1

Steering Gemini with BiDPO
Authors: TurnTrout | Age: 7mo | Karma: 52 | Comments: 2