AI ALIGNMENT FORUM

porby

Comments
Thoughts on the impact of RLHF research
porby · 3y · 33

One consequence downstream of this that seems important to me in the limit:

  1. Non-conditioning fine-tuned predictor models make biased predictions. If those biases happen to take the form of a misaligned agent, the model itself is fighting you.
  2. Conditioned predictor models make unbiased predictions. The conditioned sequence could still represent a misaligned agent, but the model itself is not fighting you.

I think having that one extra layer of buffer provided by (2) is actually very valuable. A goal-agnostic model (absent strong gradient hacking) seems more amenable to honest and authentic intermediate reporting, and to direct mechanistic interpretation.
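The conditioning-vs-fine-tuning distinction above can be sketched with a toy three-outcome predictor. Everything here is an illustrative assumption, not the comment's model: the outcome names, the reward values, and the exponential-tilting form (the standard KL-regularized RL optimum) all stand in for whatever the real training process does. The point it demonstrates: conditioning renormalizes within a subset and preserves relative probabilities there, while reward-based tilting shifts relative probabilities everywhere.

```python
import math

# Toy next-token distribution learned by a "pure predictor".
base = {"helpful": 0.5, "neutral": 0.3, "harmful": 0.2}

# (1) Conditioning: restrict to outcomes consistent with an observation
# (e.g. "the text is non-harmful") and renormalize. Within the subset,
# relative probabilities are unchanged -> prediction stays unbiased.
condition = {"helpful", "neutral"}
z = sum(p for t, p in base.items() if t in condition)
conditioned = {t: p / z for t, p in base.items() if t in condition}

# (2) RL-style fine-tuning: exponentially tilt the whole distribution
# toward a reward signal (hypothetical rewards and strength). Relative
# probabilities shift everywhere, so the model's output no longer
# matches any true conditional of the original distribution.
reward = {"helpful": 1.0, "neutral": 0.0, "harmful": -2.0}
beta = 2.0  # tilting strength (hypothetical)
unnorm = {t: p * math.exp(beta * reward[t]) for t, p in base.items()}
z2 = sum(unnorm.values())
tuned = {t: p / z2 for t, p in unnorm.items()}

# Conditioning preserves the helpful:neutral ratio of 5:3 (~1.667);
# fine-tuning does not.
print(conditioned["helpful"] / conditioned["neutral"])
print(tuned["helpful"] / tuned["neutral"])
```

In this sketch the fine-tuned model is not "the predictor restricted to good outcomes"; it is a different distribution, which is the sense in which its biases can themselves encode an agent.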

Posts

2 · porby's Shortform · 2y · 0 comments
15 · Implied "utilities" of simulators are broad, dense, and shallow · 3y · 1 comment
14 · Instrumentality makes agents agenty · 3y · 0 comments
14 · Simulators, constraints, and goal agnosticism: porbynotes vol. 1 · 3y · 0 comments