AI ALIGNMENT FORUM
Abhimanyu Pallavi Sudhir
CS PhD student

Posts

Abhimanyu Pallavi Sudhir's Shortform (1y)
Inference-Only Debate Experiments Using Math Problems (1y)

Comments

o1: A Technical Primer
Abhimanyu Pallavi Sudhir, 10mo

"we'll elide all of the subtle difficulties involved in actually getting RL to work in practice"

I haven't properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.

The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings of fewer than n tokens, the transition is deterministic (s' = append(s, a), i.e. P(s'|s,a) = 1 exactly when s' = append(s, a)), and reward is given only to states ending in a stop token, based on some ground-truth verifier like unit tests or formal verification.
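
To make that concrete, here is a minimal sketch of the MDP as I've described it, not anything taken from the post. The names `StringMDP`, `STOP`, and `toy_verifier` are made up for illustration, strings are represented as tuples of tokens, and the verifier is a toy stand-in for real unit tests or a proof checker. I'm also assuming a binary 0/1 reward, which the description above doesn't pin down.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Token = str
State = Tuple[Token, ...]   # a string, represented as a tuple of tokens
Action = Tuple[Token, ...]  # a chunk of fewer than n tokens

STOP = "<stop>"  # hypothetical stop token, chosen for illustration


@dataclass(frozen=True)
class StringMDP:
    max_action_len: int                # the "n" bounding action length
    verifier: Callable[[State], bool]  # hypothetical ground-truth checker

    def is_terminal(self, state: State) -> bool:
        # A state is terminal once it ends in the stop token.
        return len(state) > 0 and state[-1] == STOP

    def step(self, state: State, action: Action) -> State:
        # Deterministic transition: s' = append(s, a).
        if len(action) >= self.max_action_len:
            raise ValueError("action must have fewer than n tokens")
        if self.is_terminal(state):
            raise ValueError("cannot act in a terminal state")
        return state + action

    def reward(self, state: State) -> float:
        # Zero everywhere except terminal states, where the verifier decides.
        # The 0/1 scale is an assumption of this sketch.
        if not self.is_terminal(state):
            return 0.0
        return 1.0 if self.verifier(state) else 0.0


def toy_verifier(state: State) -> bool:
    # Toy stand-in for unit tests or formal verification:
    # accept any completion that contains the token "4".
    return "4" in state


mdp = StringMDP(max_action_len=4, verifier=toy_verifier)
s0: State = ("2", "+", "2", "=")
s1 = mdp.step(s0, ("4",))   # non-terminal, so no reward yet
s2 = mdp.step(s1, (STOP,))  # terminal, so the verifier is consulted
```

Nothing here touches the hard parts (sparse reward, credit assignment over long strings, verifier quality); it only writes down the environment.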
