AI ALIGNMENT FORUM

Jan Betley

Posts


Wikitag Contributions

Comments

3 Jan Betley's Shortform
6mo
0
No wikitag contributions to display.
Can you control the past?
Jan Betley 6mo 11

Hey, this post is great - thank you.

There's one thing I don't get: the violation of Guaranteed Payoffs in the case of precommitment. If I understand correctly, the claim is that if you precommit to pay while in the desert, then you "burn value for certain" once you are in the city. But you can only "burn value" / violate Guaranteed Payoffs when you make a decision, and if you successfully precommitted earlier, then you're no longer making any decision in the city. You just go to the ATM and pay, because that's literally the only thing you can do.

What am I missing?

110 Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
6mo
1
19 Localizing goal misgeneralization in a maze-solving policy network
2y
0