AI ALIGNMENT FORUM

baturinsky
Ω-1000

Posts

Comments

0baturinsky's Shortform
2y
0
Attribution Patching: Activation Patching At Industrial Scale
baturinsky 2y -10

Can this be used as some kind of lie detector?

[Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy
baturinsky 3y 09

Meanwhile, the model has the advantage of only having to "win" once.
