Jemal Young

I specialize in regulatory affairs for Software as a Medical Device and hope to work in AI risk mitigation. I enjoy studying machine learning and math, trying to keep up with capabilities research, reading fantasy, sci-fi, and horror, and spending time with my family.


How do we know that an LM's natural language responses can be interpreted literally? For example, if a model is given a choice between "I'm okay with being turned off" and "I'm not okay with being turned off" and it picks one, how do we know that it understands what its choice means? How do we know that it has expressed a preference, and not simply made a prediction about what the "correct" choice is?