AI ALIGNMENT FORUM
Jan Betley

Comments

Can you control the past?
Jan Betley · 7mo · 11

Hey, this post is great - thank you.

I don't get one thing: the violation of Guaranteed Payoffs in the case of precommitment. If I understand correctly, the claim is that if you precommit to pay while in the desert, then you "burn value for certain" while in the city. But you can only "burn value" (that is, violate Guaranteed Payoffs) when you make a decision, and if you successfully precommitted earlier, then you're no longer making any decision in the city. You just go to the ATM and pay, because that's literally the only thing you can do.

What am I missing?

3 · Jan Betley's Shortform · 7mo · 0
111 · Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 8mo · 1
19 · Localizing goal misgeneralization in a maze-solving policy network · 2y · 0