Regarding 1, it seems like either:
a) There are true adversarial examples for human values, situations where our values misbehave and we have no way of ever identifying that, in which case we have no hope of solving this problem, because solving it would mean we are in fact able to identify the adversarial examples.
b) Humans are actually immune to adversarial examples, in the sense that we can identify the situations in which our values (or rather, a subset of them) would misbehave (like being addicted to social media), such that our true, complete values never do. An AI that accurately models humans would then have the same immunity.
One thing that's bothering me is... Google/DeepMind aren't stupid. The transformer model was invented at Google. What has stopped them from having *already* trained such large models privately? GPT-3 isn't that much new evidence for the effectiveness of scaling transformer models; GPT-2 was already a shock and caused a huge public commotion. And in fact, if you were close to building an AGI, it would make sense not to announce this to the world, especially as open research that anyone could copy or reproduce, for obvious safety and economic reasons.
Maybe there are technical issues keeping us from making large jumps in scale (i.e., we only learn how to train a 1-trillion-parameter model after we've trained a 100-billion-parameter one)?