AI Alignment Forum: Tags

Gradient Hacking

• Applied to Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor by Roger Dearnaley 3mo ago
• Applied to Interpreting the Learning of Deceit by Roger Dearnaley 4mo ago
• Applied to Research Log, RLLMv2: Phi-1.5, GPT2XL and Falcon-RW-1B as paperclip maximizers by Miguel de Guzman 4mo ago
• Applied to Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation by duck_master 6mo ago
• Applied to Eliciting Credit Hacking Behaviours in LLMs by omegastick 7mo ago
• Applied to What an actually pessimistic containment strategy looks like by elbow921 9mo ago
• Applied to Gradient hacking via actual hacking by Max H 1y ago
• Applied to Challenge: construct a Gradient Hacker by RobertM 1y ago
• Applied to Gradient hacking is extremely difficult by Beren Millidge 1y ago
• Applied to Gradient Filtering by Arun Jose 1y ago
• Applied to [ASoT] Simulators show us behavioural properties by default by Arun Jose 1y ago
• Applied to (Extremely) Naive Gradient Hacking Doesn't Work by ojorgensen 1y ago
• Applied to Gradient Hacker Design Principles From Biology by Multicore 2y ago

Multicore v1.1.0, Aug 27th 2022 (+262)

Gradient hacking describes a scenario in which a mesa-optimizer within an AI system deliberately acts to influence how gradient descent updates it, most likely in order to preserve its own mesa-objective in future iterations of the system.

See also: Inner Alignment

• Applied to Gradient hacking: definitions and examples by Raymond Arnold 2y ago
• Applied to Is Fisherian Runaway Gradient Hacking? by Ryan Kidd 2y ago