AI ALIGNMENT FORUM

Abhimanyu Pallavi Sudhir

CS PhD student

Posts

Comments

0 · Abhimanyu Pallavi Sudhir's Shortform · 1y · 0
o1: A Technical Primer
Abhimanyu Pallavi Sudhir · 9mo

"we'll elide all of the subtle difficulties involved in actually getting RL to work in practice"

I haven't properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.

The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings of fewer than n tokens, the transition is deterministic (s' = append(s, a), i.e. P(s'|s,a) = 1 for that s' and 0 otherwise), and reward is given to states that end with a stop token, based on some ground-truth verifier like unit tests or formal verification.
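
For concreteness, here is a minimal Python sketch of the MDP described above. It is only an illustration under stated assumptions, not the setup from the original post: strings are represented as tuples of tokens, `STOP` is a hypothetical stop-token name, a state counts as terminal only when its last token is `STOP`, actions must be non-empty, and `toy_verifier` is a made-up stand-in for a real ground-truth verifier such as unit tests or a proof checker.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

STOP = "<stop>"  # hypothetical stop token (assumption, not from the post)

State = Tuple[str, ...]   # a string, represented as a tuple of tokens
Action = Tuple[str, ...]  # a string of fewer than n tokens


@dataclass(frozen=True)
class StringMDP:
    n: int                             # actions must have fewer than n tokens
    verifier: Callable[[State], bool]  # stand-in for a ground-truth verifier

    def is_terminal(self, s: State) -> bool:
        # Simplifying assumption: terminal means the last token is STOP.
        return len(s) > 0 and s[-1] == STOP

    def step(self, s: State, a: Action) -> State:
        # Deterministic transition: s' = append(s, a).
        if self.is_terminal(s):
            raise ValueError("episode has already ended")
        if not 0 < len(a) < self.n:
            raise ValueError("action must have between 1 and n - 1 tokens")
        return s + a

    def reward(self, s: State) -> float:
        # Reward only at terminal states, as judged by the verifier.
        if not self.is_terminal(s):
            return 0.0
        return 1.0 if self.verifier(s) else 0.0


def toy_verifier(s: State) -> bool:
    # Hypothetical check: the token just before STOP must be "4".
    return len(s) >= 2 and s[-2] == "4"


mdp = StringMDP(n=4, verifier=toy_verifier)
prompt = ("2", "+", "2", "=")
final = mdp.step(prompt, ("4", STOP))
print(final, mdp.reward(final))
```

Everything hard about the problem (credit assignment over long strings, exploration, verifier coverage) lives outside this definition, which is the sense in which the setup itself is simple.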

15 · Inference-Only Debate Experiments Using Math Problems · 1y · 0