AI ALIGNMENT FORUM

Jemal Young

I specialize in regulatory affairs for AI-enabled Software as a Medical Device, and I'm interested in the safe development of frontier AI.

Comments

Discovering Language Model Behaviors with Model-Written Evaluations
Jemal Young · 3y

How do we know that an LM's natural language responses can be interpreted literally? For example, if given a choice between "I'm okay with being turned off" and "I'm not okay with being turned off", and the model chooses either alternative, how do we know that it understands what its choice means? How do we know that it has expressed a preference, and not simply made a prediction about what the "correct" choice is?
