I think I disagree with you, but I don't really understand what you're saying or how these analogies are being used to point to the real world anymore. It seems to me like you might be taking something that makes the problem of "learning from evolution" even more complicated (evolution -> protein -> something -> brain vs. evolution -> protein -> brain) and using that to argue the issues are solved, in the same vein as the "just don't use a value function" people. But I haven't read shard theory, so, GL.
In the evolution/mainstream-ML analogy, we humans are specifying the DNA, not the search process over DNA specifications.
You mean, we are specifying the ATCG strands, or we are specifying the "architecture" behind how DNA influences the development of the human body? It seems to me like we are definitely also choosing how the search for the correct ATCG strands is run and how they're identified, in this analogy. The DNA doesn't "align" new babies out of the womb, it's just a specification of how to copy the existing, already """aligned""" code.
Ah, I misunderstood.
Well, for starters, because if the history of ML is anything to go by, we're gonna be designing the thing analogous to evolution, and not the brain. We don't pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm. That meta-learning algorithm is going to be what decides what goes in the DNA, so in order to get the DNA right, we will need to get the meta-learning algorithm correct. Evolution doesn't have much to teach us about that except as a negative example.
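To make the "we design the search, not the weights" point concrete, here's a toy sketch. Everything in it is my own illustration and not anything from an actual ML codebase: the function names `make_data` and `train_linear_model`, the one-parameter-plus-bias "architecture", and the hyperparameters are all assumptions. The only things the designer writes down are the model form, the loss, and the update rule; the final weights come out of the search.

```python
import random

def make_data(true_w=3.0, true_b=-1.0, n=50, seed=1):
    """Hypothetical toy dataset: noise-free points on the line y = true_w*x + true_b."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + true_b) for x in xs]

def train_linear_model(data, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b with plain SGD and return the learned (w, b).

    The designer chooses the model form, the squared-error loss, and the
    optimizer. Nobody ever writes the final values of w and b by hand.
    """
    rng = random.Random(seed)
    # Random initialization: the starting weights are not chosen by us either.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            error = (w * x + b) - y           # prediction error on one sample
            w -= learning_rate * error * x    # gradient of 0.5*error**2 w.r.t. w
            b -= learning_rate * error        # gradient of 0.5*error**2 w.r.t. b
    return w, b

data = make_data()
w, b = train_linear_model(data)
print(f"learned w={w:.4f}, b={b:.4f}")  # should land close to the true 3.0 and -1.0
```

Changing what ends up in `w` and `b` means changing the data, the loss, or the update rule, which is the sense in which getting the "DNA" right means getting the search right.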
But (I think) the answer is similar to this:
Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?
Humans came to their goals while being trained by evolution on genetic inclusive fitness, but they don't explicitly optimize for that. They "optimize" for something pretty random, that looks like genetic inclusive fitness in the training environment but then in this weird modern out-of-sample environment looks completely different. We can definitely train an AI to care about the real world, but his point is that, by doing something analogous to what happened with humans, we will end up with some completely different inner goal than the goal we're training for, as happened with humans.
That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That's not what surviving worlds look like.
Something bugged me about this paragraph, until I realized: If you actually wanted to know whether or not this was true, you could have just asked Nate Soares, Paul Christiano, or anybody else you respected to write this post first, then removed all doubt by making a private comparison. If you had enough confidence in the community you could have even made it into a sequence; gather up all of the big alignment researchers' intuitions on where the Filters are and then let us make up our own minds about which were most salient.
Instead, now we're in a situation where, I expect, if anybody writes something basically similar you will just posit that they can't really do alignment research because they couldn't have written it "from the null string" like you did. Asking first would literally have saved you work in expectation, and it seems obvious enough that I'm suspicious as to why you didn't think of it.
Fiscal limits notwithstanding, doesn't this suggest MIRI should try hiring a lot more researchers, maybe B-tier ones?
So what terminology do you want to use to make this distinction then?