(Adapted from Nora's tweet thread here.) Consider a trained, fully functional language model. What are the chances you'd get that same model -- or something functionally indistinguishable -- by randomly guessing the weights? We crunched the numbers, and here's the answer: We've developed a method for estimating the probability of...
Alternate title: "Somewhat Contra Scott On Simulators". Scott Alexander has a recent post up on large language models as simulators. I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also...
I wrote this doc in December 2021, while working at Redwood Research. It summarizes a handful of observations about GPT-2-small's weights -- mostly the embedding matrix, but also the LayerNorm gain parameters -- that I found while doing some open-ended investigation of the model. I wanted to see how much...
This post is the result of my attempts to understand what’s going on in these two posts from summer 2021: Paul Christiano: https://www.alignmentforum.org/posts/QqwZ7cwEA2cxFEAun/teaching-ml-to-answer-questions-honestly-instead-of Evan Hubinger: https://www.alignmentforum.org/posts/gEw8ig38mCGjia7dj/answering-questions-honestly-instead-of-predicting-human The underlying problem is similar in some ways to Eliciting Latent Knowledge, although not quite the same. According to Paul, understanding the two-headed algorithm...
Thanks to several people for useful discussion, including Evan Hubinger, Leo Gao, Beth Barnes, Richard Ngo, William Saunders, Daniel Ziegler, and probably others -- let me know if I've left you out. See also: Obstacles to Gradient Hacking by Leo Gao. I want to argue that gradient hacking (strongly enough...