I reached this via Joachim pointing it out as an example of someone urging epistemic defection around AI alignment, and I have to agree with him there. I think the higher difficulty posed by communicating "we think there's a substantial probability that AGI happens in the next 10 years" vs "AGI is near" is worth it even from a PR perspective, because pretending you know the day and the hour smells like bullshit to the most important people who need convincing that AI alignment is nontrivial.
GPT-4 is good enough to identify you if you're a prolific writer.
I can imagine this coming from the equivalent of "adapt someone else's StackOverflow code" level capability, which is still pretty impressive.
In my opinion, the scariest thing I've seen so far is coding Game Of Life Pong, which doesn't seem to resemble any code GPT-4 would have had in its training data. Stitching those things together means coding for real for real.
Kudos for talking about learning empathy in a way that seems meaningfully different and less immediately broken than adjacent proposals.
I think what you should expect from this approach, should it in fact succeed, is not nothing- but still something more alien than the way we empathize with lower animals, let alone higher animals. Consider the empathy we have towards cats... and the way it is complicated by their desire to be a predator, and specifically to enjoy causing fear/suffering. Our empathy with cats doesn't lead us to abandon our empathy for their prey, and so we are inclined to make compromises with that empathy.
Given better technology, we could make non-sentient artificial mice that are indistinguishable by the cats (but their extrapolated volition, to some degree, would feel deceived and betrayed by this), or we could just ensure that cats no longer seek to cause fear/suffering.
I hope that humans' extrapolated volitions aren't cruel (though maybe they are when judged by Superhappy standards). Regardless, an AI that's guaranteed to have empathy for us is not guaranteed, and in general quite unlikely, to have no other conflicts with our volitions; and the kind of compromises it will analogously make will probably be larger and stranger than the cat example.
Better than paperclips, but perhaps missing many dimensions we care about.
Very cool! How does this affect your quest for bounded analogues of Löbian reasoning?
This is honestly some of the most significant alignment work I've seen in recent years (for reasons I plan to post on shortly), thank you for going to all this length!
Typo: "Thoughout this process test loss remains low - even a partial memorising solution still performs extremely badly on unseen data!", 'low' should be 'high' (and 'throughout' is misspelled too).
There are some posts with perennial value, and some which depend heavily on their surrounding context. This post is of the latter type. I think it was pretty worthwhile in its day (and in particular, the analogy between GPT upgrades and developmental stages is one I still find interesting), but I leave it to you whether the book should include time capsules like this.
It's also worth noting that, in the recent discussions, Eliezer has pointed to the GPT architecture as an example that scaling up has worked better than expected, but he diverges from the thesis of this post on a practical level:
I suspect that you cannot get this out of small large amounts of gradient descent on small large layered transformers, and therefore I suspect that GPT-N does not approach superintelligence before the world is ended by systems that look differently, but I could be wrong about that.
I unpack this as the claim that someone will always be working on directly goal-oriented AI development, and that inner optimizers in an only-indirectly-goal-oriented architecture like GPT-N will take enough hardware that someone else will have already built an outer optimizer by the time it happens.
That sounds reasonable, it's a consideration I'd missed at the time, and I'm sure that OpenAI-sized amounts of money will be paid into more goal-oriented natural language projects adapted to whatever paradigm is prominent at the time. But I still agree with Eliezer's "but I could be wrong" here.
One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for so doing are not in any way an indictment of MIRI. (I was having some me-problems.)
I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has... disconcerting effects on my own level of optimism.
I'd additionally expect the death of pseudonymity on the Internet, as AIs will find it easy to detect similar writing style and correlated posting behavior. What at present takes detective work will in the future be cheaply automated, and we will finally be completely in Zuckerberg's desired world where nobody can maintain a second identity online.
Oh, and this is going to be retroactive, so be ready for the consequences of everything you've ever said online.
If this post is selected, I'd like to see the followup made into an addendum—I think it adds a very important piece, and it should have been nominated itself.