AI ALIGNMENT FORUM
AF

1446
John Steidley
Ω2010
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
No posts to display.
interpreting GPT: the logit lens
John Steidley5y20

It doesn't sound hard at all. The things Gwern is describing are the same sort of thing that people do for interpretability where they, eg, find an image that maximizes the probability of the network predicting a target class.

Of course, you need access to the model, so only OpenAI could do it for GPT-3 right now.

Reply