Hi, thanks both for writing this - I enjoyed it.
I can share some of my thoughts first, and would be keen to hear (both/either of) yours.
A while ago I made a very quick Python script to pull Markdown from LW, then use pandoc to export to a PDF (because I prefer reading physical papers and Latex formatting). I used it somewhat regularly for ~6 months and found that it was good enough for my purposes. I assume the LW developers could write something much better, but I've thrown it into this Github [repo](https://github.com/juesato/lw_pdf_exporter/tree/main) in case it's of help or interest.
I'd still describe my optimistic take as "do imitative generalization.
Zooming out a bit, I would summarize a few high-level threads as:
Thanks for these thoughts. Mostly just responding to the bits with questions/disagreements, and skipping the parts I agree with:
That's basically right, although I think the view is less plausible for "decisions" than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).
Thanks for sharing these thoughts. I'm particularly excited about the possibility of running empirical experiments to better understand potential risks of ML systems and and contribute to debates about difficulties of alignment.1. Potential implications for optimistic views on alignment
If we observe systems that learn to bullshit convincingly, but don't transfer to behaving honestly, I think that's a real challenge to the most optimistic views about alignment and I expect it would convince some people in ML.
I'm most interested in this point. IIUC, the view... (read more)
Thanks for writing this. I've been having a lot of similar conversations, and found your post clarifying in stating a lot of core arguments clearly.
Is there an even better critique that the Skeptic could make?
Focusing first on human preference learning as a subset of alignment research: I think most ML researchers "should" agree on the importance of simple human preference learning, both from a safety and capabilities perspective. If we take the narrower question "should we do human preference learning, or is pretraining + minimal prompt engineering enough... (read more)
I'd be interested in the relationship between this and Implicit Gradient Regularization and the sharp/flat minima lit.The basic idea there is to compare the continuous gradient flow on the original objective, to the path followed by SGD due to discretization. They show that the latter can be re-interpreted as optimizing a modified objective which favors flat minima (low sensitivity to parameter perturbations). This isn't clearly the same as what you're analyzing here, since you're looking at variance due to sampling instead, but they might be related under... (read more)
Thanks for the great post. I found this collection of stories and framings very insightful.1. Strong +1 to "Problems before solutions." I'm much more focused when reading this story (or any threat model) on "do I find this story plausible and compelling?" (which is already a tremendously high bar) before even starting to get into "how would this update my research priorities?"2. I wanted to add a mention to Katja Grace's "Misalignment and Misuse" as another example discussing how single-single alignment problems and bargaining failures can blur together an... (read more)