One scenario that worries me: At first the number of AIs is small, and they aren't super smart, so they mostly just host normal human memes and seem as far as we (and even they) can tell to be perfectly aligned. Then, they get more widely deployed, and now there are many AIs and maybe they are smarter also, and alas it turns out that AIs are a different environment than humans, in a way which was not apparent until now. So different memes flourish and spread in the new environment, and bad things happen.
Typo? " A is aware that getting caught out making wrong predictions will lower its standing as a good hypothesis with A. "
Thanks! It sounds like you are saying the task is more important than the architecture, so we should talk less about architectures and more about tasks.
That seems plausible to me, with the caveat that I think it's still worth talking about architecture sometimes. For example, when thinking about the safety or generalization properties of a system, the architecture might be more important, no?
If I could go back in time, I'd change the question to be about "Architecture+training setups" instead of just "architectures."
How many samples did you prune through to get this? Did you do any re-rolls? What was your stopping procedure?
I don't have an elevator pitch summary of my views yet, and it's possible that my interpretation of the classic arguments is wrong; I haven't reread them recently. But here's an attempt:
--The orthogonality thesis and convergent instrumental goals arguments, respectively, attacked and destroyed two views which were surprisingly popular at the time: 1. that smarter AI would necessarily be good (unless we deliberately programmed it not to be) because it would be smart enough to figure out what's right, what we intended, etc. and 2. that smarter AI wouldn't lie to us, hurt us, manipulate us, take resources from us, etc. unless it wanted to (e.g. because it hates us, or because it has been programmed to kill, etc) which it probably wouldn't. I am old enough to remember talking to people who were otherwise smart and thoughtful who had views 1 and 2.
--As for whether the default outcome is doom, the original argument makes clear that "default outcome" means absent any special effort to make AI good, i.e. assuming everyone just tries to make it intelligent, but no effort is spent on making it good, the outcome is likely to be doom. This is, I think, true. Later the book goes on to talk about how making it good is more difficult than it sounds. Moreover, Bostrom doesn't wave around his arguments as if they are proofs; he includes lots of hedge words and maybes. I think we can interpret it as a burden-shifting argument: "Look, given the orthogonality thesis and instrumental convergence, and various other premises, and given the enormous stakes, you'd better have some pretty solid arguments that everything's going to be fine in order to disagree with the conclusion of this book (which is that AI safety is extremely important)." As far as I know no one has come up with any such arguments, and in fact it's now the consensus in the field that no one has found such an argument.
Proceeding from the idea of first-mover advantage, the orthogonality thesis, and the instrumental convergence thesis, we can now begin to see the outlines of an argument for fearing that a plausible default outcome of the creation of machine superintelligence is existential catastrophe.
Second, the orthogonality thesis suggests that we cannot blithely assume that a superintelligence will necessarily share any of the final values stereotypically associated with wisdom and intellectual development in humans—scientific curiosity, benevolent concern for others, spiritual enlightenment and contemplation, renunciation of material acquisitiveness, a taste for refined culture or for the simple pleasures in life, humility and selflessness, and so forth. We will consider later whether it might be possible through deliberate effort to construct a superintelligence that values such things, or to build one that values human welfare, moral goodness, or any other complex purpose its designers might want it to serve. But it is no less possible—and in fact technically a lot easier—to build a superintelligence that places final value on nothing but calculating the decimal expansion of pi. This suggests that—absent a special effort—the first superintelligence may have some such random or reductionistic final goal.
I think the errors in the classic arguments have been greatly exaggerated. So for me the update is just in one direction.
Can you explain more?
I think GPT-N is definitely not aligned, for mesa-optimizer reasons. It'll be some unholy being with a superhuman understanding of all the different types of humans, all the different parts of the internet, all the different kinds of content and style... but it won't itself be human, or anything close.
Of course, it's also not outer-aligned in Evan's sense, because of the universal prior being malign etc.
Unfortunately what you say sounds somewhat plausible to me; I look forward to hearing the responses.
I'll add this additional worry: If you are an early chemist exploring the properties of various metals, and you discover a metal that gets harder as it gets colder, this should increase your credence that there are other metals that share this property. Similarly, I think, for AI architectures. The GPT architecture seems to exhibit pretty awesome scaling properties. What if there are other architectures that also have awesome scaling properties, such that we'll discover this soon? How many architectures have had 1,000+ PF-days pumped into them? Seems like just two or three. And equally importantly, how many architectures have been tried with 100+ billion parameters? I don't know, please tell me if you do.
EDIT: By "architectures" I mean "Architectures + training setups (data, reward function, etc.)"