I specialize in regulatory compliance and quality management for software as a medical device (SaMD) development. I want to work on AI safety research, so I spend as much time as I can trying to develop the skills I'll need. I live in Los Angeles.
How do we know that a language model's (LM's) natural language responses can be interpreted literally? For example, if a model is given a choice between "I'm okay with being turned off" and "I'm not okay with being turned off" and picks one, how do we know that it understands what its choice means? How do we know that it has expressed a preference, and not simply made a prediction about what the "correct" choice is?