Thanks for writing this! Would "fine-tune on some downstream task and measure the accuracy on that task before and after fine-tuning" count as measuring misalignment as you're imagining it? My sense is that there might be a bunch of existing work like that.
This RFP is an experiment for us, and we don't yet know if we'll be doing more of them in the future. I think we'd be open to including research directions that we think are promising and that apply equally well to both DL and non-DL systems; I'd be interested in hearing any particular suggestions you have. (We'd also be happy to fund particular proposals in the research directions we've already listed that apply to both DL and non-DL systems, though we will be evaluating them on how well they address the DL-focused challenges we've presented.)
Getting feedback in the next week would be ideal; September 15th will probably be too late.
Different request for proposals!
Thank you so much for writing this! I've been confused about this terminology for a while and I really like your reframing. An additional terminological point that I think it would be good to solidify is what people mean when they refer to "inner alignment" failures. As you allude to, my impression is that some people use it to refer to objective robustness failures, broadly, whereas others (e.g. Evan) use it to refer to failures that involve mesa optimization. There is then additional confusion around whether we should think "inner alignment" failures that ...
Planned summary for the Alignment Newsletter:
This post describes the author’s insights from extrapolating the performance of GPT on the benchmarks presented in the <@GPT-3 paper@>(@Language Models are Few-Shot Learners@). The author compares cross-entropy loss (which measures how good a model is at predicting the next token) with benchmark performance normalized to the difference between random performance and the maximum possible performance. Since <@previous work@>(@Scaling Laws for Neural Language Models@) has shown that cross-entropy loss s
AI Impacts now has a 2020 review page so it's easier to tell what we've done this year-- this should be more complete / representative than the posts listed above. (I appreciate how annoying the continuously updating wiki model is.)
From Part 4 of the report:
Nonetheless, this cursory examination makes me believe that it’s fairly unlikely that my current estimates are off by several orders of magnitude. If the amount of computation required to train a transformative model were (say) ~10 OOM larger than my estimates, that would imply that current ML models should be nowhere near the abilities of even small insects such as fruit flies (whose brains are 100 times smaller than bee brains). On the other hand, if the amount of computation required to train a transformative model were
So exciting that this is finally out!!!
I haven't gotten a chance to play with the models yet, but thought it might be worth noting the ways I would change the inputs (though I haven't thought about it very carefully):
I'm a bit confused about this as a piece of evidence-- naively, it seems to me like not carrying the 1 would be a mistake that you would make if you had memorized the pattern for single-digit arithmetic and were just repeating it across the number. I'm not sure if this counts as "memorizing a table" or not.
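To make the failure mode I have in mind concrete, here is a minimal sketch of what "repeating the single-digit pattern across the number" could look like. The function name `add_without_carry` is my own illustration, not anything from the paper, and this is only one guess at how such a mistake might arise:

```python
def add_without_carry(a: int, b: int) -> int:
    """Add two non-negative integers column by column, dropping every carry.

    Hypothetical illustration: each column is handled as an isolated
    single-digit sum, which is the pattern a model might repeat if it had
    only memorized single-digit arithmetic.
    """
    result, place = 0, 1
    while a > 0 or b > 0:
        # Keep only the ones digit of this column's sum; the carry is lost.
        digit = (a % 10 + b % 10) % 10
        result += digit * place
        a //= 10
        b //= 10
        place *= 10
    return result


# Agrees with true addition when no column sums to 10 or more...
print(add_without_carry(123, 456), 123 + 456)
# ...and differs exactly where a carry would have been needed.
print(add_without_carry(48, 37), 48 + 37)
```

On inputs with no carries this matches real addition, so errors of this kind would show up only on the problems that need carrying.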