This is honestly some of the most significant alignment work I've seen in recent years (for reasons I plan to post on shortly). Thank you for going to such lengths!
Typo: "Thoughout this process test loss remains low - even a partial memorising solution still performs extremely badly on unseen data!", 'low' should be 'high' (and 'throughout' is misspelled too).
There are some posts with perennial value, and some which depend heavily on their surrounding context. This post is of the latter type. I think it was pretty worthwhile in its day (and in particular, the analogy between GPT upgrades and developmental stages is one I still find interesting), but I leave it to you whether the book should include time capsules like this.
It's also worth noting that, in the recent discussions, Eliezer has pointed to the GPT architecture as an example that scaling up has worked better than expected, but he diverges from the thesis of this post on a practical level:
"I suspect that you cannot get this out of large amounts of gradient descent on large layered transformers, and therefore I suspect that GPT-N does not approach superintelligence before the world is ended by systems that look differently, but I could be wrong about that."
I unpack this as the claim that someone will always be working on directly goal-oriented AI development, and that inner optimizers in an only-indirectly-goal-oriented architecture like GPT-N will take enough hardware that someone else will have already built an outer optimizer by the time it happens.
That sounds reasonable; it's a consideration I'd missed at the time, and I'm sure that OpenAI-sized amounts of money will be paid into more goal-oriented natural language projects adapted to whatever paradigm is prominent at the time. But I still agree with Eliezer's "but I could be wrong" here.
One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for so doing are not in any way an indictment of MIRI. (I was having some me-problems.)
I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has... disconcerting effects on my own level of optimism.
I'd additionally expect the death of pseudonymity on the Internet, as AIs will find it easy to detect similar writing style and correlated posting behavior. What at present takes detective work will in the future be cheaply automated, and we will finally be completely in Zuckerberg's desired world where nobody can maintain a second identity online.
Oh, and this is going to be retroactive, so be ready for the consequences of everything you've ever said online.
If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.
I think this post and, similarly, Evan's summary of Chris Olah's views are essential both in their own right and as mutual foils to MIRI's research agenda. We see related concepts (mesa-optimization originally came out of Paul's talk of daemons in Solomonoff induction, if I remember right) but very different strategies for achieving both inner and outer alignment. (The crux of the disagreement seems to be the probability of success from adapting current methods.)
Strongly recommended for inclusion.
It's hard to know how to judge a post that deems itself superseded by a post from a later year, but I lean toward taking Daniel at his word and hoping we survive until the 2021 Review comes around.
The content here is very valuable, even if the genre of "I talked a lot with X and here's my articulation of X's model" comes across to me as a weird kind of intellectual ghostwriting. I can't think of a way around that, though.
This reminds me of That Alien Message, but as a parable about mesa-alignment rather than outer alignment. It reads well, and helps make the concepts more salient. Recommended.
Remind me which bookies count and which don't, in the context of the proofs of properties?
If any computable bookie is allowed, a non-Bayesian is in trouble against a much larger bookie who can just (maybe through its own logical induction) discover who the bettor is and how to exploit them.
[EDIT: First version of this comment included "why do convergence bettors count if they don't know the bettor will oscillate", but then I realized the answer while Abram was composing his response, so I edited that part out. Editing it back in so that Abram's reply has context.]