I think you're moving the goal-posts, since earlier you mentioned "without external calculators". I think external tools are likely to be critical to doing this, and I'm much more optimistic about that path to this kind of robust generalization. I don't think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.
I disagree; I think we have intuitive theories of causality (like intuitive physics) that are very helpful for human learning and intelligence.
RE GPT-3, etc. doing well on math problems: the key word in my response was "robustly". I think there is a big qualitative difference between "doing a good job on a certain distribution of math problems" and "doing math (robustly)". This could be obscured by the fact that people also make mathematical errors sometimes, but I think the types of errors humans make are importantly different from those made by DNNs.
Are you aware of any examples of the opposite happening? I guess it should for some tasks.
I can interpret your argument as being only about the behavior of the system, in which case:
- I agree that models are likely to learn to imitate human dialogue about causality, and this will require some amount of some form of causal reasoning.
- I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling, but it certainly seems highly plausible.

I can also interpret your argument as being about the internal reasoning of the system, in which case:
- I put this in the "deep learning is magic" bucket of arguments; it's much better articulated than what we said, though, I think...
- I am quite skeptical of these arguments, but still find them plausible. I think it would be fascinating to see some proof of concept for this sort of thing (basically addressing the question "when can/do foundation models internalize explicitly stated knowledge?")
I basically agree.
I am arguing against extreme levels of pessimism (~>99% doom).
While I share a large degree of pessimism for similar reasons, I am somewhat more optimistic overall. Most of this comes from generic uncertainty and epistemic humility; I'm a big fan of the inside view, but it's worth noting that the post can (roughly) be read as a set of 42 statements that all need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that every one of them is true.

However, there are some more specific points where I think you are overconfident, or at least not providing good reasons for such a high level of confidence (and to my knowledge nobody has). I'll focus on two disagreements which I think are closest to my true disagreements.

1) I think safe pivotal "weak" acts likely do exist. It seems likely that we can access vastly superhuman capabilities without inducing huge x-risk, using a variety of capability control methods. If we could build something that was only N << infinity times smarter than us, then intuitively it seems unlikely that it would be able to reverse engineer the details of the outside world or of other AI systems' source code (cf. 35) necessary to break out of the box or start cooperating with its AI overseers. If I am right, then the reason nobody has come up with such an act is that they aren't smart enough (in some, possibly quite narrow, sense of smart); that's why we need the superhuman AI! Of course, it could also be that someone has such an idea but isn't sharing it publicly / with Eliezer.

2) I am not convinced that any superhuman AGI we are likely to have the technical means to build in the near future is going to be highly consequentialist (although this does seem likely). I think that humans aren't actually that consequentialist, current AI systems even less so, and it seems entirely plausible that you don't just automatically get super-consequentialist things no matter what you are doing or how you are training them: if you train something to follow commands in a bounded way using something like supervised learning, maybe you actually end up with something that does something reasonably close to that. My main reason for expecting consequentialist systems at superhuman-but-not-superintelligent-level AGI is that people will build them that way because of competitive pressures, not because systems that people are trying to make non-consequentialist end up consequentialist anyway.

These two points are related: if we think (2), then we should be more skeptical of (1), although we could still hope to use capability control and incentive schemes to harness a superhuman-but-not-superintelligent consequentialist AGI to devise and help execute "weak" pivotal acts.

3) Maybe one more point worth mentioning is the "alien concepts" bit: I also suspect AIs will have alien concepts and thus generalize in weird ways. Adversarial examples and other robustness issues are evidence in favor of this, but we are also seeing that scaling makes models more robust, so it seems plausible that AGI will actually end up using concepts similar to humans', making it natural for AGI systems to generalize in the ways we intend/expect.

---------------------------------------------------------------------

The rest of my post is just picking particular places where I think the argumentation is weak, in order to illustrate why I currently think you are, on net, overconfident.
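(As an aside, the "42 conjunctive statements" point can be made concrete with a toy calculation. The per-claim credence of 95% and the independence assumption are mine, purely for illustration, and both are obviously contestable; correlated claims would shrink the discount considerably.)

```python
# Toy illustration: probability that ALL of N conjunctive claims hold,
# assuming (unrealistically) independence and a uniform credence per claim.
def conjunction_probability(per_claim: float, n_claims: int) -> float:
    return per_claim ** n_claims

# 42 claims, each held with 95% credence:
print(conjunction_probability(0.95, 42))  # ~0.116
```

Even very high confidence in each individual claim leaves a long conjunction with only modest total probability.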
7. The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.
This contains a dubious implicit assumption, namely that we cannot build safe superhuman intelligence, even if it is only slightly superhuman, or superhuman in various narrow-but-strategically-relevant areas.
19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
This is basically what CIRL (cooperative inverse reinforcement learning) aims to do. We can train for this sort of thing, and study such methods of training empirically in synthetic settings.
23. Corrigibility is anti-natural to consequentialist reasoning
Maybe I missed it, but I didn't see any argument for why we end up with consequentialist reasoning.
30. [...] There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.
It seems likely that such things exist, by analogy with complexity theory: checking a proposed solution is generally easier than finding one.
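The analogy can be made concrete with a toy subset-sum example (my choice of problem, purely illustrative): verifying a proposed certificate takes one pass over it, while naively finding one requires searching exponentially many subsets.

```python
from itertools import combinations

def check_certificate(numbers, target, indices):
    """Verifying a proposed solution is cheap: one pass over the certificate."""
    return sum(numbers[i] for i in indices) == target

def find_certificate(numbers, target):
    """Finding a solution naively means searching ~2^n candidate subsets."""
    for k in range(1, len(numbers) + 1):
        for idx in combinations(range(len(numbers)), k):
            if check_certificate(numbers, target, idx):
                return idx
    return None

nums = [3, 34, 4, 12, 5, 2]
print(find_certificate(nums, 9))           # indices (2, 4): 4 + 5 == 9
print(check_certificate(nums, 9, (2, 4)))  # True
```

The hope, on this analogy, would be for humans to play the cheap verifier role while the AGI plays the expensive proposer role, though nothing here shows that pivotal outputs in particular have this structure.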
36. AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.
I figured it was worth noting that this part doesn't explicitly say that relatively weak AGIs can't perform pivotal acts.
What about graphics? e.g. https://twitter.com/DavidSKrueger/status/1520782213175992320
This is Eliezer’s description of the core insight behind Paul’s imitative amplification proposal. I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).
I didn't understand what you mean by the line being blurrier... Is this a comment about what works in practice for imitation learning? Does a similar objection apply if we replace imitation learning with behavioral cloning?
Weight-sharing makes deception much harder.
Can I read about that somewhere? Or could you briefly elaborate?