What this post is about: I'm outlining some thoughts on what I've been calling "convergent rationality". I think this is an important core concept for AI-Xrisk, and probably a big crux for a lot of disagreements. It's going to be hand-wavy! It also ended up being a lot longer than I anticipated.
Abstract: Natural and artificial intelligences tend to learn over time, becoming more intelligent with more experience and opportunity for reflection. Do they also tend to become more "rational" (i.e. "consequentialist", i.e. "agenty" in CFAR speak)? Steve Omohundro's classic 2008 paper argues that they will, and the "traditional AI safety view" and MIRI seem to agree. But I think this assumes an AI that already has a certain sufficient "level of rationality", and it's not clear that all AIs (e.g. supervised learning algorithms) will exhibit or develop a sufficient level of rationality. Deconfusion research around convergent rationality seems important, and we should strive to understand the conditions under which it is a concern as thoroughly as possible.
I'm writing this for at least these 3 reasons:
- I think it'd be useful to have a term ("convergent rationality") for talking about this stuff.
- I want to express, and clarify, (some of) my thoughts on the matter.
- I think it's likely a crux for a lot of disagreements, and isn't widely or quickly recognized as such. Optimistically, I think this article might lead to significantly more clear and productive discussions about AI-Xrisk strategy and technical work.
- Characterizing convergent rationality
- My impression of attitudes towards convergent rationality
- Relation to capability control
- Relevance of convergent rationality to AI-Xrisk
- Conclusions, some arguments pro/con convergent rationality
Characterizing convergent rationality
Consider a supervised learner trying to maximize accuracy. The Bayes error rate is typically non-0, meaning it's not possible to get 100% test accuracy just by making better predictions. If, however, the test data(/data distribution) were modified, for example to only contain examples of a single class, the learner could achieve 100% accuracy. If the learner were a consequentialist with accuracy as its utility function, it would prefer to modify the test distribution in this way in order to increase its utility. Yet, even when given the opportunity to do so, typical gradient-based supervised learning algorithms do not seem to pursue such solutions (at least in my personal experience as an ML researcher).
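To make the accuracy argument concrete, here is a minimal sketch on a made-up toy problem (the binary task, the 10% label noise, and all function names are assumptions for illustration, not anything from an actual experiment). The best possible predictor is stuck near 90% accuracy on the original test set, while a trivial constant predictor reaches 100% once the test set is filtered down to a single class.

```python
import random

def make_data(n, p_positive, noise, rng):
    """Draw n (x, y) pairs: y is the true class, x equals y except when flipped with probability `noise`."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p_positive else 0
        x = y if rng.random() >= noise else 1 - y
        data.append((x, y))
    return data

def accuracy(predict, data):
    """Fraction of examples where the prediction matches the label."""
    return sum(predict(x) == y for x, y in data) / len(data)

def bayes_optimal(x):
    """With balanced classes and noise < 0.5, copying the feature is the best possible rule."""
    return x

def always_one(x):
    """Constant predictor that ignores its input."""
    return 1

rng = random.Random(0)
original_test = make_data(10_000, p_positive=0.5, noise=0.1, rng=rng)

# "Modifying the test distribution": keep only the examples of class 1.
single_class_test = [(x, y) for x, y in original_test if y == 1]

acc_original = accuracy(bayes_optimal, original_test)    # limited by the Bayes error rate
acc_modified = accuracy(always_one, single_class_test)   # perfect on the modified test set
print(acc_original, acc_modified)
```

Nothing in the training procedure of a typical supervised learner searches over actions like "filter the test set"; the filtering step here is done by hand, which is the point of the paragraph above.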
We can view the supervised learning algorithm as either ignorant of, or indifferent to, the strategy of modifying the test data. But we can also view this behavior as a failure of rationality, where the learner is "irrationally" averse or blind to this strategy, by construction. A strong version of the convergent rationality thesis (CRT) would then predict that given sufficient capacity and "optimization pressure", the supervised learner would "become more rational", and begin to pursue the "modify the test data" strategy. (I don't think I've formulated CRT well enough to really call it a thesis, but I'll continue using it informally).
More generally, CRT would imply that deontological ethics are not stable, and deontologists must converge towards consequentialists. (As a caveat, however, note that in general environments, deontological behavior can be described as optimizing a (somewhat contrived) utility function (grep "existence proof" in the reward modeling agenda)). The alarming implication would be that we cannot hope to build agents that will not develop instrumental goals.
I suspect this picture is wrong. At the moment, the picture I have is: imperfectly rational agents will sometimes seek to become more rational, but there may be limits on rationality which the "self-improvement operator" will not cross. This would be analogous to the limit of ω which the "add 1 operator" approaches, but does not cross, in the ordinal numbers. In other words, in order to reach "rationality level" ω+1, it's necessary for an agent to already start out at "rationality level" ω. A caveat: I think "rationality" is not uni-dimensional, but I will continue to write as if it is.
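For readers less familiar with ordinals, the analogy can be spelled out as follows. The symbol S for the "add 1 operator" is notation introduced here for illustration; the facts themselves are standard properties of the ordinals.

```latex
\[
S(\alpha) = \alpha + 1, \qquad
\forall n < \omega:\; S(n) < \omega, \qquad
\omega = \sup\{\, n : n < \omega \,\},
\]
\[
\neg\exists\, \alpha:\; S(\alpha) = \omega, \qquad
S(\omega) = \omega + 1 .
\]
```

Starting from any finite n, repeated application of S stays below ω, and ω itself is a limit ordinal rather than the successor of anything. The only way to get to ω+1 by applying S is to already be at ω.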
My impression of attitudes towards convergent rationality
Broadly speaking, MIRI seem to be strong believers in convergent rationality, but their reasons for this view haven't been very well-articulated (TODO: except the inner optimizer paper?). AI safety people more broadly seem to have a wide range of views, with many people disagreeing with MIRI's views and/or not feeling confident that they understand them well/fully.
Again, broadly speaking, machine learning (ML) people often seem to think it's a confused viewpoint bred out of anthropomorphism, ignorance of current/practical ML, and paranoia. People who are more familiar with evolutionary/genetic algorithms and artificial life communities might be a bit more sympathetic, and similarly for people who are concerned with feedback loops in the context of algorithmic decision making.
I think a lot of people working on ML-based AI safety consider convergent rationality to be less relevant than MIRI does, because 1) so far it is more of a hypothetical/theoretical concern, whereas we've done a lot of and 2) current ML (e.g. deep RL with bells and whistles) seems dangerous enough because of known and demonstrated specification and robustness problems (e.g. reward hacking and adversarial examples).
In the many conversations I've had with people from all these groups, I've found it pretty hard to find concrete points of disagreement that don't reduce to differences in values (e.g. regarding long-termism), time-lines, or bare intuition. I think "level of paranoia about convergent rationality" is likely an important underlying crux.
Relation to capability control
A plethora of naive approaches to solving safety problems by limiting what agents can do have been proposed and rejected on the grounds that advanced AIs will be smart and rational enough to subvert them. Hyperbolically, the traditional AI safety view is that "capability control" is useless. Irrationality can be viewed as a form of capability control.
Naively, approaches which deliberately reduce an agent's intelligence or rationality should be an effective method of capability control (I'm guessing that's a proposal in the Artificial Stupidity paper, but I haven't read it). If this were true, then we might be able to build very intelligent and useful AI systems, but control them by, e.g. making them myopic, or restricting the hypothesis class / search space. This would reduce the "burden" on technical solutions to AI-Xrisk, making it (even) more of a global coordination problem.
But CRT suggests that these methods of capability control might fail unexpectedly. There is at least one example (I've struggled to dig up) of a memory-less RL agent learning to encode memory information in the state of the world. More generally, agents can recruit resources from their environments, implicitly expanding their intellectual capabilities, without actually "self-modifying".
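As a rough illustration of what "encoding memory in the state of the world" can look like, here is a hand-written toy (not the learned example alluded to above, and all names are hypothetical). The policy is a pure function of the current observation, yet it can still report a cue it saw earlier, because it wrote that cue into a marker that belongs to the environment.

```python
def memoryless_policy(observation):
    """Pure function of the current observation: no hidden state is kept between calls."""
    cue, marker = observation
    if cue is not None:
        return ("set_marker", cue)   # write the cue into the world
    return ("answer", marker)        # later, read it back from the world

def run_episode(cue, delay=3):
    """Show the cue once, hide it for `delay` steps, then check the agent's last answer."""
    marker = None                    # part of the environment, not of the agent
    kind, value = memoryless_policy((cue, marker))
    if kind == "set_marker":
        marker = value
    answer = None
    for _ in range(delay):
        # The cue is no longer observable; only the marker is.
        kind, value = memoryless_policy((None, marker))
        if kind == "answer":
            answer = value
    return answer == cue

results = [run_episode(cue) for cue in ("left", "right")]
print(results)
```

Here the strategy is hard-coded; the worry in the text is that a sufficiently flexible learner under enough optimization pressure might find such a strategy on its own.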
Relevance of convergent rationality to AI-Xrisk
Believing CRT should lead to higher levels of "paranoia". Technically, I think this should lead to more focus on things that look more like assurance (vs. robustness or specification). Believing CRT should make us concerned that non-agenty systems (e.g. trained with supervised learning) might start behaving more like agents.
Strategically, it seems like the main implication of believing in CRT pertains to situations where we already have fairly robust global coordination and a sufficiently concerned AI community. CRT implies that these conditions are not sufficient for a good prognosis: even if everyone using AI makes a good-faith effort to make it safe, if they mistakenly don't believe CRT, they can fail. So we'd also want the AI community to behave as if CRT were true unless or until we had overwhelming evidence that it was not a concern.
On the other hand, disbelief in CRT shouldn't allay our fears overly much; AIs need not be hyperrational in order to pose significant Xrisk. For example, we might be wiped out by something more "grey goo"-like, i.e. an AI that is basically a policy hyperoptimized for the niche of the Earth, and doesn't even have anything resembling a world(/universe) model, planning procedure, etc. Or we might create AIs that are like superintelligent humans: having many cognitive biases, but still agenty enough to thoroughly outcompete us, and considering lesser intelligences of dubious moral significance.
Conclusions, some arguments pro/con convergent rationality
My impression is that intelligence (as in IQ/g) and rationality are considered to be only loosely correlated. My current model is that ML systems become more intelligent with more capacity/compute/information, but not necessarily more rational. If this is true, it creates exciting prospects for forms of capability control. On the other hand, if CRT is true, this supports the practice of modelling all sufficiently advanced AIs as rational agents.
I think the main argument against CRT is that, from an ML perspective, it seems like "rationality" is more or less a design choice: we can make agents myopic, we can hard-code flawed environment models or reasoning procedures, etc. The main counter-arguments arise from VNMUT, which can be interpreted as saying "rational agents are more fit" (in an evolutionary sense). At the same time, it seems like the complexity of the real world (e.g. physical limits of communication and information processing) makes this a pretty weak argument. Humans certainly seem highly irrational, and distinguishing biases and heuristics can be difficult.
A special case of this is the "inner optimizers" idea. The strongest argument for inner optimizers I'm aware of goes like: "the simplest solution to a complex enough task (and therefore the easiest for weakly guided search, e.g. by SGD) is to instantiate a more agenty process, and have it solve the problem for you". The "inner" part comes from the postulate that a complex and flexible enough class of models will instantiate such an agenty process internally (i.e. using a subset of the model's capacity). I currently think this picture is broadly speaking correct, and is the third major (technical) pillar supporting AI-Xrisk concerns (along with Goodhart's law and instrumental goals).
The issues with tiling agents also suggest that the analogy with ordinals I made might be stronger than it seems; it may be impossible for an agent to rationally endorse a qualitatively different form of reasoning. Similarly, while "CDT wants to become UDT" (supporting CRT), my understanding is that it is not actually capable of doing so (opposing CRT) because "you have to have been UDT all along" (thanks to Jessica Taylor for explaining this stuff to me a few years back).
While I think MIRI's work on idealized reasoners has shed some light on these questions, I think in practice, random(ish) "mutation" (whether intentionally designed or imposed by the physical environment) and evolutionary-like pressures may push AIs across boundaries that the "self-improvement operator" will not cross, making analyses of idealized reasoners less useful than they might naively appear.
This article is inspired by conversations with Alex Zhu, Scott Garrabrant, Jan Leike, Rohin Shah, Micah Carrol, and many others over the past year and years.
Something which seems missing from this discussion is the level of confidence we can have for/against CRT. It doesn't make sense to just decide whether CRT seems more true or false and then go from there. If CRT seems at all possible (ie, outside-view probability at least 1%), doesn't that have most of the strategic implications of CRT itself? (Like the ones you list in the relevance-to-xrisk section.) [One could definitely make the case for probabilities lower than 1%, too, but I'm not sure where the cutoff should be, so I said 1%.]
My personal position isn't CRT (although inner-optimizer considerations have brought me closer to that position), but rather, not-obviously-not-CRT. Strategies which depend on not-CRT should go along with actually-quite-strong arguments against CRT, and/or technology for making CRT not true. It makes sense to pursue those strategies, and I sometimes think about them. But achieving confidence in not-CRT is a big obstacle.
Another obstacle to those strategies is, even if future AGI isn't sufficiently strategic/agenty/rational to fall into the "rationality attractor", it seems like it would be capable enough that someone could use it to create something agenty/rational enough for CRT. So even if CRT-type concerns don't apply to super-advanced image classifiers or whatever, the overall concern might stand because at some point someone applies the same technology to RL problems, or asks a powerful GAN to imitate agentic behavior, etc.
Of course it doesn't make sense to generically argue that we should be concerned about CRT in absence of a proof of its negation. There has to be some level of background reason for thinking CRT might be a concern. For example, although atomic weapons are concerning in many ways, it would not have made sense to raise CRT concerns about atomic weapons and ask for a proof of not-CRT before testing atomic weapons. So there has to be something about AI technology which specifically raises CRT as a concern.
One "something" is, simply, that natural instances of intelligence are associated with a relatively high degree of rationality/strategicness/agentiness (relative to non-intelligent things). But I do think there's more reasoning to be unpacked.
I also agree with other commenters about CRT not being quite the right thing to point at, but, this issue of the degree of confidence in doubt-of-CRT was the thing that struck me as most critical. The standard of evidence for raising CRT as a legitimate concern seems like it should be much lower than the standard of evidence for setting that concern aside.
I basically agree with your main point (and I didn't mean to suggest that it "[makes] sense to just decide whether CRT seems more true or false and then go from there").
But I think it's also suggestive of an underlying view that I disagree with, namely: (1) "we should aim for high-confidence solutions to AI-Xrisk". I think this is a good heuristic, but from a strategic point of view, I think what we should be doing is closer to: (2) "aim to maximize the rate of Xrisk reduction".
Practically speaking, a big implication of favoring (2) over (1) is giving a relatively higher priority to research aimed at making unsafe-looking approaches (e.g. reward modelling + DRL) safer (in expectation).
I recall an example of a Mujoco agent whose memory was periodically wiped storing information in the position of its arms. I'm also having trouble digging it up though.
In OpenAI's Roboschool blog post:
I haven't seen anyone argue for CRT the way you describe it. I always thought the argument was that we are concerned about "rational AIs" (I would say more specifically, "AIs that run searches through a space of possible actions, in pursuit of a real-world goal"), because (1) We humans have real-world goals ("cure Alzheimer's" etc.) and the best way to accomplish a real-world goal is generally to build an agent optimizing for that goal (well, that's true right up until the agent becomes too powerful to control, and then it becomes catastrophically false), (2) We can try to build AIs that are not in this category, but screw up*, (3) Even if we here all agree to not build this type of agent, it's hard to coordinate everyone on earth to never do it forever. (See also: Rohin's two posts on goal-directedness.)
In particular, when Eliezer argued a couple years ago that we should be mainly thinking about AGIs that have real-world-anchored utility functions (e.g. here or here) I've always fleshed out that argument as: "...This type of AGI is the most effective and powerful type of AGI, and we should assume that society will keep making our AIs more and more effective and powerful until we reach that category."
*(Remember, any AI is running searches through some space in pursuit of something, otherwise you would never call it "intelligence". So one can imagine that the intelligent search may accidentally get aimed at the wrong target.)
Concerns about inner optimizers seem like a clear example of people arguing for some version of CRT (as I describe it). Would you disagree (why)?
See my response to rohin, below.
I'm potentially worried about both; let's not make a false dichotomy!
Planned summary for the Alignment Newsletter:
At a meta-level: this post might be a bit too under-developed to be worth trying to summarize in the newsletter; I'm not sure.
RE the summary:
RE the opinion:
After seeing your response, I think that's right, I'll remove it.
How is a powerful replicator / optimizer not rational? Perhaps you mean grey-goo type scenarios where we wouldn't call the replicator "intelligent", but it's nonetheless a good replicator? Are you worried about AI systems of that form? Why?
Sure, I more meant competently goal-directed.
Yes, I'm worried about systems of that form (in some sense). The reason is: I think intelligence is just one salient feature of what makes a life-form or individual able to out-compete others. I think intelligence, and fitness even more so, are multifaceted characteristics. And there are probably many possible AIs with different profiles of cognitive and physical capabilities that would pose an Xrisk for humans.
For instance, any appreciable quantity of a *hypothetical* grey goo that could use any matter on earth to replicate (i.e. duplicate itself) once per minute would almost certainly consume the earth in less than one day (I guess modulo some important problems around transportation and/or its initial distribution over the earth, but you probably get the point).
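A quick back-of-the-envelope check of the "less than one day" claim, as a sketch under stated assumptions: the starting quantity of one gram is made up, the Earth's mass is taken as roughly 5.97e24 kg, and transportation and distribution problems are ignored entirely.

```python
import math

EARTH_MASS_KG = 5.97e24      # approximate mass of the Earth
INITIAL_MASS_KG = 1e-3       # assumed starting quantity: one gram
DOUBLING_TIME_MIN = 1        # from the text: one replication per minute

def minutes_to_consume(total_kg, initial_kg, doubling_min):
    """Minutes until a mass that doubles every `doubling_min` minutes reaches `total_kg`."""
    doublings = math.ceil(math.log2(total_kg / initial_kg))
    return doublings * doubling_min

minutes = minutes_to_consume(EARTH_MASS_KG, INITIAL_MASS_KG, DOUBLING_TIME_MIN)
print(minutes, "minutes, i.e. about", round(minutes / 60, 1), "hours")
```

Because growth is exponential, the answer is not very sensitive to the starting quantity: each factor of 1000 in the initial mass changes the result by only about ten minutes, so the total stays on the order of an hour or two rather than a day.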
More realistically, it seems likely that we will have AI systems that have some significant flaws but are highly competent at strategically relevant cognitive skills, able to think much faster than humans, and have very different (probably larger but a bit more limited) arrays of sensors and actuators than humans, which may pose some Xrisk.
The point is just that intelligence and rationality are important traits for Xrisk, but we should certainly not make the mistake of believing one/either/both are the only traits that matter. And we should also recognize that they are both abstractions and simplifications that we believe are often useful but rarely, if ever, sufficient for thorough and effective reasoning about AI-Xrisk.
This is still, I think, not the important distinction. By "significantly restricted", I don't necessarily mean that it is limiting performance below a level of "competence". It could be highly competent, super-human, etc., but still be significantly restricted.
Maybe a good example (although maybe departing from the "restricted hypothesis space" type of example) would be an AI system that has a finite horizon of 1,000,000 years, but no other restrictions. There may be a sense in which this system is irrational (e.g. having time-inconsistent preferences), but it may still be extremely competently goal-directed.
Sure, but within AI, intelligence is the main feature that we're trying very hard to increase in our systems that would plausibly let the systems we build outcompete us. We aren't trying to make AI systems that replicate as fast as possible. So it seems like the main thing to be worried about is intelligence.
My main opposition to this is that it's not actionable: sure, lots of things could outcompete us; this doesn't change what I'll do unless there's a specific thing that could outcompete us that will plausibly exist in the future.
(It feels similar in spirit, though not in absurdity, to a claim like "it is possible that aliens left an ancient weapon buried beneath the surface of the Earth that will explode tomorrow, we should not make the mistake of ignoring that hypothesis".)
Idk, if it's superintelligent, that system sounds both rational and competently goal-directed to me.
Blaise Agüera y Arcas gave a keynote at this NeurIPS pushing ALife (motivated by specification problems, weirdly enough...: https://neurips.cc/Conferences/2019/Schedule?showEvent=15487).
The talk recording: https://slideslive.com/38921748/social-intelligence. I recommend it.
I think I was maybe trying to convey too much of my high-level views here. What's maybe more relevant and persuasive here is this line of thought:
Also, nitpicking a bit: to a large extent, society is trying to make systems that are as competitive as possible at narrow, profitable tasks. There are incentives for excellence in many domains. FWIW, I'm somewhat concerned about replicators in practice, e.g. because I think open-ended AI systems operating in the real-world might create replicators accidentally/indifferently, and we might not notice fast enough.
I think the main take-away from these concerns is to realize that there are extra risk factors that are hard to anticipate and for which we might not have good detection mechanisms. This should increase pessimism/paranoia, especially (IMO) regarding "benign" systems.
(non-hypothetical Q): What about if it has a horizon of 10^-8s? Or 0?
I'm leaning on "we're confused about what rationality means" here, and specifically, I believe time-inconsistent preferences are something that many would say seem irrational (prima facie). But
With 0, the AI never does anything and so is basically a rock. With 10^-8, it still seems rational and competently goal-directed to me, just with weird-to-me preferences.
Really? I feel like that at least depends on what the preference is. I could totally imagine that people have preferences to e.g. win at least one Olympic medal, but further medals are less important (which is history-dependent), be the youngest person to achieve <some achievement> (which is finite horizon), eat ice cream in the next half hour (but not care much after that).
You might object that all of these can be made state-dependent, but you can make your example state-dependent by including the current time in the state.
I agree that we are probably not going to build superintelligent AIs that have a horizon of 10^-8s, just because our preferences don't have horizons of 10^-8s, and we'll try to build AIs that optimize our preferences.
I'm trying to point at "myopic RL", which does, in fact, do things.
I do object, and still object, since I don't think we can realistically include the current time in the state. What we can include is: an impression of what the current time is, based on past and current observations. There's an epistemic/indexical problem here you're ignoring.
I'm not an expert on AIXI, but my impression from talking to AIXI researchers and looking at their papers is: finite-horizon variants of AIXI have this "problem" of time-inconsistent preferences, despite conditioning on the entire history (which basically provides an encoding of time). So I think the problem I'm referring to exists regardless.
Ah, an off-by-one miscommunication. Sure, it's both rational and competently goal-directed.
I mean, if you want to go down that route, then "win at least one medal" is also not state-dependent, because you can't realistically include "whether Alice has won a medal" in the state: you can only include an impression of whether Alice has won a medal, based on past and current observations. So I still have the same objection.
Oh, I see. You probably mean AI systems that act as though they have goals that will only last for e.g. 5 seconds. Then, 2 seconds later, they act as though they have goals that will last for 5 more seconds, i.e. 7 seconds after the initial time. (I was thinking of agents that initially care about the next 5 seconds, and then after 2 seconds, they care about the next 3 seconds, and after 7 seconds, they don't care about anything.)
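The two readings can be written down as a small sketch (function names are hypothetical, and time is in seconds as in the example above). The first agent's window of concern slides forward with the clock; the second agent's window is pinned to a fixed deadline, shrinks as time passes, and eventually disappears.

```python
def receding_window(now, horizon):
    """Agent that always cares about the next `horizon` seconds from the current moment."""
    return (now, now + horizon)

def fixed_deadline_window(now, start, horizon):
    """Agent that cares only about the `horizon` seconds after `start`.

    Returns the remaining window, or None once the deadline has passed.
    """
    end = start + horizon
    return (now, end) if now < end else None

for t in (0, 2, 7):
    print(t, receding_window(t, 5), fixed_deadline_window(t, start=0, horizon=5))
```

At t = 2 the receding-horizon agent cares about seconds 2 through 7, while the fixed-deadline agent cares only about seconds 2 through 5; at t = 7 the latter no longer cares about anything. It is the receding-horizon agent whose plans at different times can conflict, which is the time-inconsistency being discussed.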
I agree that the preferences you were talking about are time-inconsistent, and such agents seem both less rational and less competently goal-directed to me.
While I generally agree with CRT as applied to advanced agents, the VNM theorem is not the reason why, because it is vacuous in this setting. I agree with steve that the real argument for it is that humans are more likely to build goal-directed agents because that's the only way we know how to get AI systems that do what we want. But we totally could build non-goal-directed agents that CRT doesn't apply to, e.g. Google Maps.
I definitely want to distinguish CRT from arguments that humans will deliberately build goal-directed agents. But let me emphasize: I think incentives for humans to build goal-directed agents are a larger and more significant and important source of risk than CRT.
RE VNMUT being vacuous: this is a good point (and also implied by the caveat from the reward modeling paper). But I think that in practice we can meaningfully identify goal-directed agents and infer their rationality/bias "profile", as suggested by your work ( http://proceedings.mlr.press/v97/shah19a.html ), and Laurent Orseau's ( https://arxiv.org/abs/1805.12387 ).
I guess my position is that CRT is only true to the extent that you build a goal-directed agent. (Technically, the inner optimizers argument is one way that CRT could be true even without building an explicitly goal-directed agent, but it seems like you view CRT as broader and more likely than inner optimizers, and I'm not sure how.)
Maybe another way to get at the underlying misunderstanding: do you see a difference between "convergent rationality" and "convergent goal-directedness"? If so, what is it? From what you've written they sound equivalent to me. ETA: Actually it's more like "convergent rationality" and "convergent competent goal-directedness".
That's a reasonable position, but I think the reality is that we just don't know. Moreover, it seems possible to build goal-directed agents that don't become hyper-rational by (e.g.) restricting their hypothesis space. Lots of potential for deconfusion, IMO.
EDIT: the above was in response to your first paragraph. I think I didn't respond RE the 2nd paragraph because I don't know what "convergent goal-directedness" refers to, and was planning to read your sequence but never got around to it.
I would guess that Chapter 2 of that sequence would be the most (relevant + important) piece of writing for you (w.r.t this post in particular), though I'm not sure about the relevance.
I would say that there are some kinds of irrationality that will be self modified or subagented away, and others that will stay. A CDT agent will not make other CDT agents. A myopic agent, one that only cares about the next hour, will create a subagent that only cares about the first hour after it was created. (Aeons later it will have taken over the universe and put all the resources into time-travel and worrying that its clock is wrong.)
I am not aware of any irrationality that I would consider to make an agent safe and useful while also being stable under self-modification and subagent creation.
" I would say that there are some kinds of irrationality that will be self modified or subagented away, and others that will stay. "
^ I agree; this is the point of my analogy with ordinal numbers.
A completely myopic agent (that doesn't directly do planning over future time-steps, but only seeks to optimize its current decision) probably shouldn't make any sub-agents in the first place (except incidentally).