AI ALIGNMENT FORUM

Tim Kostolansky

learning and loving

my site, my twitter

Posts

Comments

The blue-minimising robot and model splintering
Timothy Kostolansky, 8mo

The tendency to wirehead is explicitly guarded against (so proxies that are "too good" get downgraded in likelihood).

But what if it has found the golden feature that determines everything one needs to know for the task? Wouldn't that be desirable?

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring
3mo