Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Ben Pace

An actual debate about instrumental convergence, in a public space! Major respect to all involved, especially Yoshua Bengio for great facilitation.

For posterity (i.e. having a good historical archive) and further discussion, I've reproduced the conversation here. I'm happy to make edits at the request of anyone in the discussion who is quoted below. I've improved formatting for clarity and fixed some typos. For people who are not researchers in this area who wish to comment, see the public version of this post here. For people who do work on the relevant areas, please sign up in the top right. It will take a day or so to confirm membership.

Original Post

Yann LeCun: "don't fear the Terminator", a short opinion piece by Tony Zador and me that was just published in Scientific American.

"We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance. [...] But intelligence per se does not generate the drive for domination, any more than horns do."

https://blogs.scientificamerican.com/observations/dont-fear-the-terminator/

Comment Thread #1

Elliot Olds: Yann, the smart people who are very worried about AI seeking power and ensuring its own survival believe it's a big risk because power and survival are instrumental goals for almost any ultimate goal.

If you give a generally intelligent AI the goal to make as much money in the stock market as possible, it will resist being shut down because that would interfere with tis goal. It would try to become more powerful because then it could make money more effectively. This is the natural consequence of giving a smart agent a goal, unless we do something special to counteract this.

You've often written about how we shouldn't be so worried about AI, but I've never seen you address this point directly.

Stuart Russell: It is trivial to construct a toy MDP in which the agent's only reward comes from fetching the coffee. If, in that MDP, there is another "human" who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee. No hatred, no desire for power, no built-in emotions, no built-in survival instinct, nothing except the desire to fetch the coffee successfully. This point cannot be addressed because it's a simple mathematical observation.

Comment Thread #2

Yoshua Bengio: Yann, I'd be curious about your response to Stuart Russell's point.

Yann LeCun: You mean, the so-called "instrumental convergence" argument by which "a robot can't fetch you coffee if it's dead. Hence it will develop self-preservation as an instrumental sub-goal."

It might even kill you if you get in the way.

1. Once the robot has brought you coffee, its self-preservation instinct disappears. You can turn it off.

2. One would have to be unbelievably stupid to build open-ended objectives in a super-intelligent (and super-powerful) machine without some safeguard terms in the objective.

3. One would have to be rather incompetent not to have a mechanism by which new terms in the objective could be added to prevent previously-unforeseen bad behavior. For humans, we have education and laws to shape our objective functions and complement the hardwired terms built into us by evolution.

4. The power of even the most super-intelligent machine is limited by physics, and its size and needs make it vulnerable to physical attacks. No need for much intelligence here. A virus is infinitely less intelligent than you, but it can still kill you.

5. A second machine, designed solely to neutralize an evil super-intelligent machine will win every time, if given similar amounts of computing resources (because specialized machines always beat general ones).

Bottom line: there are lots and lots of ways to protect against badly-designed intelligent machines turned evil.

Stuart has called me stupid in the Vanity Fair interview linked below for allegedly not understanding the whole idea of instrumental convergence.

It's not that I don't understand it. I think it would only be relevant in a fantasy world in which people would be smart enough to design super-intelligent machines, yet ridiculously stupid to the point of giving it moronic objectives with no safeguards.

Here is the juicy bit from the article where Stuart calls me stupid:

Russell took exception to the views of Yann LeCun, who developed the forerunner of the convolutional neural nets used by AlphaGo and is Facebook’s director of A.I. research. LeCun told the BBC that there would be no Ex Machina or Terminator scenarios, because robots would not be built with human drives—hunger, power, reproduction, self-preservation. “Yann LeCun keeps saying that there’s no reason why machines would have any self-preservation instinct,” Russell said. “And it’s simply and mathematically false. I mean, it’s so obvious that a machine will have self-preservation even if you don’t program it in because if you say, ‘Fetch the coffee,’ it can’t fetch the coffee if it’s dead. So if you give it any goal whatsoever, it has a reason to preserve its own existence to achieve that goal. And if you threaten it on your way to getting coffee, it’s going to kill you because any risk to the coffee has to be countered. People have explained this to LeCun in very simple terms.”

https://www.vanityfair.com/news/2017/03/elon-musk-billion-dollar-crusade-to-stop-ai-space-x

Tony Zador: I agree with most of what Yann wrote about Stuart Russell's concern.

Specifically, I think the flaw in Stuart's argument is the assertion that "switching off the human is the optimal solution"---who says that's an optimal solution?

I guess if you posit an omnipotent robot, destroying humanity might be a possible solution. But if the robot is not omnipotent, then killing humans comes at considerable risk, ie that they will retaliate. Or humans might build special "protector robots" whose value function is solely focused on preventing the killing of humans by other robots. Presumably these robots would be at least as well armed as the coffee robots. So this really increases the risk to the coffee robots of pursuing the genocide strategy.

And if the robot is omnipotent, then there are an infinite number of alternative strategies to ensure survival (like putting up an impenetrable forcefield around the off switch) that work just as well.

So i would say that killing all humans is not only not likely to be an optimal strategy under most scenarios, the set of scenarios under which it is optimal is probably close to a set of measure 0.

Stuart Russell: Thanks for clearing that up - so 2+2 is not equal to 4, because if the 2 were a 3, the answer wouldn't be 4? I simply pointed out that in the MDP as I defined it, switching off the human is the optimal solution, despite the fact that we didn't put in any emotions of power, domination, hate, testosterone, etc etc. And your solution seems, well, frankly terrifying, although I suppose the NRA would approve. Your last suggestion, that the robot could prevent anyone from ever switching it off, is also one of the things we are trying to avoid. The point is that the behaviors we are concerned about have nothing to do with putting in emotions of survival, power, domination, etc. So arguing that there's no need to put those emotions in is completely missing the point.

Yann LeCun: Not clear whether you are referring to my comment or Tony's.

The point is that behaviors you are concerned about are easily avoidable by simple terms in the objective. In the unlikely event that these safeguards somehow fail, my partial list of escalating solutions (which you seem to find terrifying) is there to prevent a catastrophe. So arguing that emotions of survival etc will inevitably lead to dangerous behavior is completely missing the point.

It's a bit like saying that building cars without brakes will lead to fatalities.

Yes, but why would we be so stupid as to not include brakes?

That said, instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

Francesca Rossi: @Yann Indeed it would be odd to design an AI system with a specific goal, like fetching coffee, and capabilities that include killing humans or disallowing being turned off, without equipping it also with guidelines and priorities to constrain its freedom, so it can understand for example that fetching coffee is not so important that it is worth killing a human being to do it. Value alignment is fundamental to achieve this. Why would we build machines that are not aligned to our values? Stuart, I agree that it would easy to build a coffee fetching machine that is not aligned to our values, but why would we do this? Of course value alignment is not easy, and still a research challenge, but I would make it part of the picture when we envision future intelligent machines.

Richard Mallah: Francesca, of course Stuart believes we should create value-aligned AI. The point is that there are too many caveats to explicitly add each to an objective function, and there are strong socioeconomic drives for humans to monetize AI prior to getting it sufficiently right, sufficiently safe.

Stuart Russell: "Why would be build machines that are not aligned to our values?" That's what we are doing, all the time. The standard model of AI assumes that the objective is fixed and known (check the textbook!), and we build machines on that basis - whether it's clickthrough maximization in social media content selection or total error minimization in photo labeling (Google Jacky Alciné) or, per Danny Hillis, profit maximization in fossil fuel companies. This is going to become even more untenable as machines become more powerful. There is no hope of "solving the value alignment problem" in the sense of figuring out the right value function offline and putting it into the machine. We need to change the way we do AI.

Yoshua Bengio: All right, we're making some progress towards a healthy debate. Let me try to summarize my understanding of the arguments. Yann LeCun and Tony Zadorr argue that humans would be stupid to put in explicit dominance instincts in our AIs. Stuart Russell responds that it needs not be explicit but dangerous or immoral behavior may simply arise out of imperfect value alignment and instrumental subgoals set by the machine to achieve its official goals. Yann LeCun and Tony Zador respond that we would be stupid not to program the proper 'laws of robotics' to protect humans. Stuart Russell is concerned that value alignment is not a solved problem and may be intractable (i.e. there will always remain a gap, and a sufficiently powerful AI could 'exploit' this gap, just like very powerful corporations currently often act legally but immorally). Yann LeCun and Tony Zador argue that we could also build defensive military robots designed to only kill regular AIs gone rogue by lack of value alignment. Stuart Russell did not explicitly respond to this but I infer from his NRA reference that we could be worse off with these defensive robots because now they have explicit weapons and can also suffer from the value misalignment problem.

Yoshua Bengio: So at the end of the day, it boils down to whether we can handle the value misalignment problem, and I'm afraid that it's not clear we can for sure, but it also seems reasonable to think we will be able to in the future. Maybe part of the problem is that Yann LeCun and Tony Zador are satisfied with a 99.9% probability that we can fix the value alignment problem while Stuart Russell is not satisfied with taking such an existential risk.

Yoshua Bengio: And there is another issue which was not much discussed (although the article does talk about the short-term risks of military uses of AI etc), and which concerns me: humans can easily do stupid things. So even if there are ways to mitigate the possibility of rogue AIs due to value misalignment, how can we guarantee that no single human will act stupidly (more likely, greedily for their own power) and unleash dangerous AIs in the world? And for this, we don't even need superintelligent AIs, to feel very concerned. The value alignment problem also applies to humans (or companies) who have a lot of power: the misalignment between their interests and the common good can lead to catastrophic outcomes, as we already know (e.g. tragedy of the commons, corruption, companies lying to have you buy their cigarettes or their oil, etc). It just gets worse when more power can be concentrated in the hands of a single person or organization, and AI advances can provide that power.

Francesca Rossi: I am more optimistic than Stuart about the value alignment problem. I think that a suitable combination of symbolic reasoning and various forms of machine learning can help us to both advance AI’s capabilities and get closer to solving the value alignment problem.

Tony Zador: @Stuart Russell "Thanks for clearing that up - so 2+2 is not equal to 4, because if the 2 were a 3, the answer wouldn't be 4? "

hmm. not quite what i'm saying.

If we're going for the math analogies, then i would say that a better analogy is:

Find X, Y such that X+Y=4.

The "killer coffee robot" solution is {X=642, Y = -638}. In other words: Yes, it is a solution, but not a particularly natural or likely or good solution.

But we humans are blinded but our own warped perspective. We focus on the solution that involves killing other creatures because that appears to be one of the main solutions that we humans default to. But it is not a particularly common solution in the natural world, nor do i think it's a particularly effective solution in the long run.

Yann LeCun: Humanity has been very familiar with the problem of fixing value misalignments for millenia.

We fix our children's hardwired values by teaching them how to behave.

We fix human value misalignment by laws. Laws create extrinsic terms in our objective functions and cause the appearance of instrumental subgoals ("don't steal") in order to avoid punishment. The desire for social acceptance also creates such instrumental subgoals driving good behavior.

We even fix value misalignment for super-human and super-intelligent entities, such as corporations and governments.

This last one occasionally fails, which is a considerably more immediate existential threat than AI.

Tony Zador: @Yoshua Bengio I agree with much of your summary. I agree value alignment is important, and that it is not a solved problem.

I also agree that new technologies often have unintended and profound consequences. The invention of books has led to a decline in our memories (people used to recite the entire Odyssey). Improvements in food production technology (and other factors) have led to a surprising obesity epidemic. The invention of social media is disrupting our political systems in ways that, to me anyway, have been quite surprising. So improvements in AI will undoubtedly have profound consequences for society, some of which will be negative.

But in my view, focusing on "killer robots that dominate or step on humans" is a distraction from much more serious issues.

That said, perhaps "killer robots" can be thought of as a metaphor (or metonym) for the set of all scary scenarios that result from this powerful new technology.

Yann LeCun: @Stuart Russell you write "we need to change the way we do AI". The problems you describe have nothing to do with AI per se.

They have to do with designing (not avoiding) explicit instrumental objectives for entities (e.g. corporations) so that their overall behavior works for the common good. This is a problem of law, economics, policies, ethics, and the problem of controlling complex dynamical systems composed of many agents in interaction.

What is required is a mechanism through which objectives can be changed quickly when issues surface. For example, Facebook stopped maximizing clickthroughs several years ago and stopped using the time spent in the app as a criterion about 2 years ago. It put in place measures to limit the dissemination of clickbait, and it favored content shared by friends rather than directly disseminating content from publishers.

We certainly agree that designing good objectives is hard. Humanity has struggled with designing objectives for itself for millennia. So this is not a new problem. If anything, designing objectives for machines, and forcing them to abide by them will be a lot easier than for humans, since we can physically modify their firmware.

There will be mistakes, no doubt, as with any new technology (early jetliners lost wings, early cars didn't have seat belts, roads didn't have speed limits...).

But I disagree that there is a high risk of accidentally building existential threats to humanity.

Existential threats to humanity have to be explicitly designed as such.

Yann LeCun: It will be much, much easier to control the behavior of autonomous AI systems than it has been for humans and human organizations, because we will be able to directly modify their intrinsic objective function.

This is very much unlike humans, whose objective can only be shaped through extrinsic objective functions (through education and laws), that indirectly create instrumental sub-objectives ("be nice, don't steal, don't kill, or you will be punished").

As I have pointed out in several talks in the last several years, autonomous AI systems will need to have a trainable part in their objective, which would allow their handlers to train them to behave properly, without having to directly hack their objective function by programmatic means.

Yoshua Bengio: Yann, these are good points, we indeed have much more control over machines than humans since we can design (and train) their objective function. I actually have some hopes that by using an objective-based mechanism relying on learning (to inculcate values) rather than a set of hard rules (like in much of our legal system), we could achieve more robustness to unforeseen value alignment mishaps. In fact, I surmise we should do that with human entities too, i.e., penalize companies, e.g. fiscally, when they behave in a way which hurts the common good, even if they are not directly violating an explicit law. This also suggests to me that we should try to avoid that any entity (person, company, AI) have too much power, to avoid such problems. On the other hand, although probably not in the near future, there could be AI systems which surpass human intellectual power in ways that could foil our attempts at setting objective functions which avoid harm to us. It seems hard to me to completely deny that possibility, which thus would beg for more research in (machine-) learning moral values, value alignment, and maybe even in public policies about AI (to minimize the events in which a stupid human brings about AI systems without the proper failsafes) etc.

Yann LeCun: @Yoshua Bengio if we can build "AI systems which surpass human intellectual power in ways that could foil our attempts at setting objective functions", we can also build similarly-powerful AI systems to set those objective functions.

Sort of like the discriminator in GANs....

Yann LeCun: @Yoshua Bengio a couple direct comments on your summary:

designing objectives for super-human entities is not a new problem. Human societies have been doing this through laws (concerning corporations and governments) for millennia.
the defensive AI systems designed to protect against rogue AI systems are not akin to the military, they are akin to the police, to law enforcement. Their "jurisdiction" would be strictly AI systems, not humans.

But until we have a hint of a beginning of a design, with some visible path towards autonomous AI systems with non-trivial intelligence, we are arguing about the sex of angels.

Yuri Barzov: Aren't we overestimating the ability of imperfect humans to build a perfect machine? If it will be much more powerful than humans its imperfections will be also magnified. Cute human kids grow up into criminals if they get spoiled by reinforcement i.e. addiction to rewards. We use reinforcement and backpropagation (kind of reinforcement) in modern golden standard AI systems. Do we know enough about humans to be able to build a fault-proof human friendly super intelligent machine?

Yoshua Bengio: @Yann LeCun, about discriminators in GANs, and critics in Actor-Critic RL, one thing we know is that they tend to be biased. That is why the critic in Actor-Critic is not used as an objective function but instead as a baseline to reduce the variance. Similarly, optimizing the generator wrt a fixed discriminator does not work (you would converge to a single mode - unless you balance that with entropy maximization). Anyways, just to say, there is much more research to do, lots of unknown unknowns about learning moral objective functions for AIs. I'm not afraid of research challenges, but I can understand that some people would be concerned about the safety of gradually more powerful AIs with misaligned objectives. I actually like the way that Stuart Russell is attacking this problem by thinking about it not just in terms of an objective function but also about uncertainty: the AI should avoid actions which might hurt us (according to a self-estimate of the uncertain consequences of actions), and stay the conservative course with high confidence of accomplishing the mission while not creating collateral damage. I think that what you and I are trying to say is that all this is quite different from the terminator scenarios which some people in the media are brandishing. I also agree with you that there are lots of unknown unknowns about the strengths and weaknesses of future AIs, but I think that it is not too early to start thinking about these issues.

Yoshua Bengio: @Yuri Barzov the answer to your question: no. But we don't know that it is not feasible either, and we have reasons to believe that (a) it is not for tomorrow such machines will exist and (b) we have intellectual tools which may lead to solutions. Or maybe not!

Stuart Russell: Yann's comment "Facebook stopped maximizing clickthroughs several years ago and stopped using the time spent in the app as a criterion about 2 years ago" makes my point for me. Why did they stop doing it? Because it was the wrong objective function. Yann says we'd have to be "extremely stupid" to put the wrong objective into a super-powerful machine. Facebook's platform is not super-smart but it is super-powerful, because it connects with billions of people for hours every day. And yet they put the wrong objective function into it. QED. Fortunately they were able to reset it, but unfortunately one has to assume it's still optimizing a fixed objective. And the fact that it's operating within a large corporation that's designed to maximize another fixed objective - profit - means we cannot switch it off.

Stuart Russell: Regarding "externalities" - when talking about externalities, economists are making essentially the same point I'm making: externalities are the things not stated in the given objective function that get damaged when the system optimizes that objective function. In the case of the atmosphere, it's relatively easy to measure the amount of pollution and charge for it via taxes or fines, so correcting the problem is possible (unless the offender is too powerful). In the case of manipulation of human preferences and information states, it's very hard to assess costs and impose taxes or fines. The theory of uncertain objectives suggests instead that systems be designed to be "minimally invasive", i.e., don't mess with parts of the world state whose value is unclear. In particular, as a general rule it's probably best to avoid using fixed-objective reinforcement learning in human-facing systems, because the reinforcement learner will learn how to manipulate the human to maximize its objective.

Stuart Russell: @Yann LeCun Let's talk about climate change for a change. Many argue that it's an existential or near-existential threat to humanity. Was it "explicitly designed" as such? We created the corporation, which is a fixed-objective maximizer. The purpose was not to create an existential risk to humanity. Fossil-fuel corporations became super-powerful and, in certain relevant senses, super-intelligent: they anticipated and began planning for global warming five decades ago, executing a campaign that outwitted the rest of the human race. They didn't win the academic argument but they won in the real world, and the human race lost. I just attended an NAS meeting on climate control systems, where the consensus was that it was too dangerous to develop, say, solar radiation management systems - not because they might produce unexpected disastrous effects but because the fossil fuel corporations would use their existence as a further form of leverage in their so-far successful campaign to keep burning more carbon.

Stuart Russell: @Yann LeCun This seems to be a very weak argument. The objection raised by Omohundro and others who discuss instrumental goals is aimed at any system that operates by optimizing a fixed, known objective; which covers pretty much all present-day AI systems. So the issue is: what happens if we keep to that general plan - let's call it the "standard model" - and improve the capabilities for the system to achieve the objective? We don't need to know today *how* a future system achieves objectives more successfully, to see that it would be problematic. So the proposal is, don't build systems according to the standard model.

Yann LeCun: @Stuart Russell the problem is that essentially no AI system today is autonomous.

They are all trained *in advance* to optimize an objective, and subsequently execute the task with no regards to the objective, hence with no way to spontaneously deviate from the original behavior.

As of today, as far as I can tell, we do *not* have a good design for an autonomous machine, driven by an objective, capable of coming up with new strategies to optimize this objective in the real world.

We have plenty of those in games and simple simulation. But the learning paradigms are way too inefficient to be practical in the real world.

Yuri Barzov: @Yoshua Bengio yes. If we frame the problem correctly we will be able to resolve it. AI puts natural intelligence into focus like a magnifying mirror

Yann LeCun: @Stuart Russell in pretty much everything that society does (business, government, of whatever) behaviors are shaped through incentives, penalties via contracts, regulations and laws (let's call them collectively the objective function), which are proxies for the metric that needs to be optimized.

Because societies are complex systems, because humans are complex agents, and because conditions evolve, it is a requirement that the objective function be modifiable to correct unforeseen negative effects, loopholes, inefficiencies, etc.

The Facebook story is unremarkable in that respect: when bad side effects emerge, measures are taken to correct them. Often, these measures eliminate bad actors by directly changing their economic incentive (e.g. removing the economic incentive for clickbaits).

Perhaps we agree on the following:

(0) not all consequences of a fixed set of incentives can be predicted.

(1) because of that, objectives functions must be updatable.

(2) they must be updated to correct bad effect whenever they emerge.

(3) there should be an easy way to train minor aspects of objective functions through simple interaction (similar to the process of educating children), as opposed to programmatic means.

Perhaps where we disagree is the risk of inadvertently producing systems with badly-designed and (somehow) un-modifiable objectives that would be powerful enough to constitute existential threats.

Yoshua Bengio: @Yann LeCun this is true, but one aspect which concerns me (and others) is the gradual increase in power of some agents (now mostly large companies and some governments, potentially some AI systems in the future). When it was just weak humans the cost of mistakes or value misalignment (improper laws, misaligned objective function) was always very limited and local. As we build more and more powerful and intelligent tools and organizations, (1) it becomes easier to cheat for 'smarter' agents (exploit the misalignment) and (2) the cost of these misalignments becomes greater, potentially threatening the whole of society. This then does not leave much time and warning to react to value misalignment.

There's a dynamic that's a normal part of cognitive specialization of labor, where the work other people are doing is "just X"; imagine trying to create a newspaper, for example. Most people will think of writing articles as "just journalism"; you pay journalists whatever salary, they do whatever work, and you get articles for your newspaper. Similarly the accounting is "just accounting," and so on. But the journalist can't see journalism as "just journalism"; if their model of how to write articles is "money goes in, article comes out" they won't be able to write any articles. Instead they have lots of details about how to write articles, which includes what articles are and aren't easy.

You could view both sides as doing something like this: the person who's trying to make safeguards is saying "look, you can't say 'just add safeguards', these things are really difficult" and the person who's trying to make something worth safeguarding is saying "look, you can't just 'just build an autonomous superintelligence', these things are really difficult." (Especially since I think LeCun views them as too difficult to try to do, and instead is just trying to get some subcomponents.)

I think that's part of what's going on, but mostly in how it seems to obscure the core issue (according to me), which is related to Yoshua's last point: "what safeguards we need when" is part of the safeguard science that we haven't done yet. I think we're in a situation where many people say "yes, we'll need safeguards, but it'll be easy to notice when we need them and implement them when we notice" and the people trying to build those safeguards respond with "we don't think either of those things will be easy." But notice how, in the backdrop of "everyone thinks their job is hard," this statement provides very little ability to distinguish between worlds where this actually is a crisis and worlds where things will be fine!

I see this in a different light: as far as I can tell, Yann LeCun believes that the way to advance AI is to tinker around, take opportunities to make advances when it seems feasible, find ways of fixing problems that come up in an ad-hoc, atheoretic manner (see e.g. this link), and then form some theory to explain what happened; while Stuart Russell thinks that it's important to have a theory that you really believe in drive future work. As a result, I read LeCun as saying that when problems come up, we'll see them and fix them by tinkering around, while Russell thinks that it's important to have a theory in place before-hand to ensure that bad enough problems don't come up and/or ensure that we already know how to solve them when they do.

It seems like this is the sort of deep divide that is hard to cross, since I would expect people to have strong opinions based on what they've seen work elsewhere. It has an echo of the previous concern, where Russell needs to somehow point out "look, this time it actually is important to have a theory instead of doing things ad-hoc" in a way that depends on the features of this particular issue rather than the way he likes doing work.

For reference, LeCun discussed his atheoretic/experimentalist views in more depth in this FB debate with Ali Rahimi and also this lecture. But maybe we should distinguish some distinct axes of the experimentalist/theorist divide in DL:

1) Experimentalism/theorism is a more appropriate paradigm for thinking about AI safety

2) Experimentalism/theorism is a more appropriate paradigm for making progress in AI capabilities

Where the LeCun/Russell debate is about (1) and LeCun/Rahimi is about (2). And maybe this is oversimplifying things, since "theorism" may be an overly broad way of describing Russell/Rahimi's views on safety/capabilities, but I suspect LeCun is "seeing the same ghost", or in his words (to Rahimi), seeing the same:

kind of attitude that lead the ML community to abandon neural nets for over 10 years, *despite* ample empirical evidence that they worked very well in many situations.

And whether or not Rahimi should be lumped into that "kind of attitude", I think LeCun is right (from a certain perspective) to want to push back against that attitude.

I'd even go further: given that LeCun has been more successful than Rahimi/Russell in AI research this century, all else equal I would weight the former's intuitions on research progress more. (I think the best counterargument is that while experimentalism might be better in the short-term, theorism has better payoff in the long-term, but I'm not sure about this.)

In fact, one of my major fears is that LeCun is right about this, because even if he is right about (2), I don't think that's good evidence he's right about (1) since these seem pretty orthogonal. But they don't look orthogonal until you spend a lot of time reading/thinking about AI safety, which you're not inclined to do if you already know a lot about AI and assume that knowledge transfers to AI safety.

In other words, the "correct" intuitions (on experimentalism/theorism) for modern AI research might be the opposite of the "correct" intuitions for AI safety. (I would, for instance, predict that if Superintelligence were published during the era of GOFAI, all else equal it would've made a bigger splash because AI researchers then were more receptive to abstract theorizing.)

Good comment. I disagree with this bit:

I would, for instance, predict that if Superintelligence were published during the era of GOFAI, all else equal it would've made a bigger splash because AI researchers then were more receptive to abstract theorizing.

And then it would probably have been seen as outmoded and thrown away completely when AI capabilities research progressed into realms that vastly surpassed GOFAI. I don't know that there's an easy way to get capabilities researchers to think seriously about safety concerns that haven't manifested on a sufficient scale yet.

But notice how, in the backdrop of "everyone thinks their job is hard," this statement provides very little ability to distinguish between worlds where this actually is a crisis and worlds where things will be fine!

It sounds like you have a model that "person works in a job" causes "person believes job is hard" regardless of what the job is, but the causality can go the other way: if I thought AI safety were trivial, I wouldn't be working on trying to make it safe.

On this model, you don't observe this argument because everyone is biased towards thinking their job is hard: you observe it because people formed opinions some other way and then self-selected into the jobs they thought were impactful / nontrivial.

In practice, it will be a combination of both. For this discussion in particular, I'd lean more towards the selection explanation, as opposed to the bias explanation.

It looks to me like this conversation is to some extent repeating a pattern which I've seen in AI safety conversations before:

Safety advocate: AI might destroy us if it doesn't have the right safeguards.
Safety skeptic: That's stupid, because why would anyone build it without those safeguards.

It feels like people keep talking past each other, since both essentially agree about the need for safeguards. Rather the disagreement seems to be over something more like... "does the default path of AI development involve existential risks or not", where the safety advocate argues that we should be thinking about this a lot beforehand, much more than with other technologies. On the other hand, the skeptic sees AI as being much more comparable to any other technology, in that there are risks and there will probably be accidents until we figure out how to do it safely, but we will do that figuring out as a normal part of developing the technology and we can't really do much of that figuring out until we actually have the technology.

Yann's core argument for why AGI safety is easy is interesting, and actually echoes ongoing AGI safety research. I'll paraphrase his list of five reasons that things will go well if we're not "ridiculously stupid":

We'll give AGIs non-open-ended objectives like fetching coffee. These are task-limited and therefore there's no more instrumental subgoals after the task is complete.
We will put "simple terms in the objective" to prevent obvious problems, presumably things like "don't harm people", "don't violate laws", etc.
We will put in "a mechanism" to edit the objective upon observing bad behavior;
We can physically destroy a computer housing AGI;
We can build a second AGI whose sole purpose is to destroy the first AGI if the first AGI has gotten out of control, and the latter will succeed because it's more specialized.

All of these are reasonable ideas on their face, and indeed they're similar to ongoing AGI safety research programs: (1) is myopic or task-limited AGI, (2) is related to AGI limiting and norm-following, (3) is corrigibility, (4) is boxing, and (5) is in the subfield of AIs-helping-with-AGI-safety (other things in this area include IDA, adversarial testing, recursive reward modeling, etc.).

The problem, of course, is that all five of these things, when you look at them carefully, are much harder and more complicated than they appear, and/or less likely to succeed. And meanwhile he's discouraging people from doing the work to solve those problems.. :-(

I don’t know that his arguments “echo”, it’s more like “can be translated into existing discourse”. For example, the leap from his 5) to IDA is massive, and I don’t understand why he imagines tackling the “we can’t align AGIs” problem with “build another AGI to stop the bad AGI”.

I think 5 is much closer to the "look, the first goal is to build a system that prevents anyone else from building unaligned AGI" claim, and there's a separate claim 6 of the form "more generally, we can use AGI to police AGI" that is similar to debate or IDA. And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).

And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).

You mean in the sense of stabilizing the whole world? I'd be surprised if that's what Yann had in mind. I took him just to mean building a specialized AI to be a check on a single other AI.

That's how I interpreted:

the defensive AI systems designed to protect against rogue AI systems are not akin to the military, they are akin to the police, to law enforcement. Their "jurisdiction" would be strictly AI systems, not humans.

To be clear, I think he would mean it more in the way that there's currently an international police order that is moderately difficult to circumvent, and that the same would be true for AGI, and not necessarily the more intense variants of stabilization (which are necessarily primarily if you think offense is highly advantaged over defense, which I don't know his opinion on).

Promoted to curated: This seems like it was a real conversation, and I also think it's particularly valuable for LessWrong to engage with more outside perspectives like the ones above.

I also in general want to encourage people to curate discussion and contributions that happen all around the web, and archive them in formats like this.

I commented on the thread (after seeing this) in order to add a link to my paper that addresses Bengio's last argument;

@Yoshua Bengio I attempted to formalize this argument somewhat in a recent paper. I don't think the argument there is particularly airtight, but I think it provides a significantly stronger argument for why we should believe that interaction between optimizing systems is fundamentally hard.

https://www.mdpi.com/2504-2289/3/2/21/htm

Paper abstract: "An important challenge for safety in machine learning and artificial intelligence systems is a set of related failures involving specification gaming, reward hacking, fragility to distributional shifts, and Goodhart’s or Campbell’s law. This paper presents additional failure modes for interactions within multi-agent systems that are closely related. These multi-agent failure modes are more complex, more problematic, and less well understood than the single-agent case, and are also already occurring, largely unnoticed. After motivating the discussion with examples from poker-playing artificial intelligence (AI), the paper explains why these failure modes are in some senses unavoidable. Following this, the paper categorizes failure modes, provides definitions, and cites examples for each of the modes: accidental steering, coordination failures, adversarial misalignment, input spoofing and filtering, and goal co-option or direct hacking. The paper then discusses how extant literature on multi-agent AI fails to address these failure modes, and identifies work which may be useful for the mitigation of these failure modes."

I think part of what may be going on here is that the approach to AI that Yann advocates happens to be one that is unusually amenable to alignment. Some discussion here:

https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety

Note 1: This review is also a top-level post.

Note 2: I think that 'robust instrumentality' is a more apt name for 'instrumental convergence.' That said, for backwards compatibility, this comment often uses the latter.

In the summer of 2019, I was building up a corpus of basic reinforcement learning theory. I wandered through a sun-dappled Berkeley, my head in the clouds, my mind bent on a single ambition: proving the existence of instrumental convergence.

Somehow.

I needed to find the right definitions first, and I couldn't even imagine what the final theorems would say. The fall crept up on me... and found my work incomplete.

Let me tell you: if there's ever been a time when I wished I'd been months ahead on my research agenda, it was September 26, 2019: the day when world-famous AI experts debated whether instrumental convergence was a thing, and whether we should worry about it.

The debate unfolded below the link-preview: an imposing robot staring the reader down, a title containing 'Terminator', a byline dismissive of AI risk:

**Scientific American**
**Don’t Fear the Terminator**
"Artificial intelligence never needed to evolve, so it didn’t develop the survival instinct that leads to the impulse to dominate others."

The byline seemingly affirms the consequent: "evolution survival instinct" does not imply "no evolution $⟹$ no survival instinct." That said, the article raises at least one good point: we choose the AI's objective, and so why must that objective incentivize power-seeking?

I wanted to reach out, to say, "hey, here's a paper formalizing the question you're all confused by!" But it was too early.

Now, at least, I can say what I wanted to say back then:

This debate about instrumental convergence is really, really confused. I heavily annotated the play-by-play of the debate in a Google doc, mostly checking local validity of claims. (Most of this review's object-level content is in that document, by the way. Feel free to add comments of your own.)

This debate took place in the pre-theoretic era of instrumental convergence. Over the last year and a half, I've become a lot less confused about instrumental convergence. I think my formalisms provide great abstractions for understanding "instrumental convergence" and "power-seeking." I think that this debate suffers for lack of formal grounding, and I wouldn't dream of introducing someone to these concepts via this debate.

While the debate is clearly historically important, I don't think it belongs in the LessWrong review. I don't think people significantly changed their minds, I don't think that the debate was particularly illuminating, and I don't think it contains the philosophical insight I would expect from a LessWrong review-level essay.

Rob Bensinger's nomination reads:

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).
I don't think the discussion stands great on its own, but it may be helpful for:
people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
people new to AI alignment who want to use the views of leaders in the field to help them orient.

I certainly agree with Rob's first bullet point. The debate did show us what certain famous AI researchers thought about instrumental convergence, circa 2019.

However, I disagree with the second bullet point: reading this debate may disorient a newcomer! While I often found myself agreeing with Russell and Bengio, while LeCun and Zador sometimes made good points, confusion hangs thick in the air: no one realizes that, with respect to a fixed task environment (representing the real world) and their beliefs about what kind of objective function the agent may have, they should be debating the probability that seeking power is optimal (or that power-seeking behavior is learned, depending on your threat model).

Absent such an understanding, the debate is needlessly ungrounded and informal. Absent such an understanding, we see reasoning like this:

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

I'm glad that this debate happened, but I think it monkeys around too much to be included in the LessWrong 2019 review.

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

What's your specific critique of this? I think it's an interesting and insightful point.

LeCun claims too much. It's true that the case of animals like orangutans points to a class of cognitive architectures which seemingly don't prioritize power-seeking. It's true that this is some evidence against power-seeking behavior being common amongst relevant cognitive architectures. However, it doesn't show that instrumental subgoals are much weaker drives of behavior than hardwired objectives.

One reading of this "drives of behavior" claim is that it has to be tautological; by definition, instrumental subgoals are always in service of the (hardwired) objective. I assume that LeCun is instead discussing "all else equal, will statistical instrumental tendencies ('instrumental convergence') be more predictive of AI behavior than its specific objective function?".

But "instrumental subgoals are much weaker drives of behavior than hardwired objectives" is not the only possible explanation of "the lack of domination behavior in non-social animals"! Maybe the orangutans aren't robust to scale. Maybe orangutans do implement non power-seeking cognition, but maybe their cognitive architecture will be hard or unlikely for us to reproduce in a machine - maybe the distribution of TAI cognitive architectures we should expect, is far different from what orangutans are like.

I do agree that there's a very good point in the neighborhood of the quoted argument. My steelman of this would be:

Some animals, like humans, seem to have power-seeking drives. Other animals, like orangutans, do not. Therefore, it's possible to design agents of some intelligence which do not seek power. Obviously, we will be trying to design agents which do not seek power. Why, then, should we expect such agents to be more like humans than like orangutans?

(This is loose for a different reason, in that it presupooses a single relevant axis of variation between humans and orangutans. Is a personal computer more like a human, or more like an orangutan? But set that aside for the moment.)

I think he's overselling the evidence. However, on reflection, I wouldn't pick out the point for such strong ridicule.

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).

I don't think the discussion stands great on its own, but it may be helpful for:

people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
people new to AI alignment who want to use the views of leaders in the field to help them orient.

1) Experimentalism/theorism is a more appropriate paradigm for thinking about AI safety

2) Experimentalism/theorism is a more appropriate paradigm for making progress in AI capabilities

kind of attitude that lead the ML community to abandon neural nets for over 10 years, *despite* ample empirical evidence that they worked very well in many situations.

And whether or not Rahimi should be lumped into that "kind of attitude", I think LeCun is right (from a certain perspective) to want to push back against that attitude.

Good comment. I disagree with this bit:

I would, for instance, predict that if Superintelligence were published during the era of GOFAI, all else equal it would've made a bigger splash because AI researchers then were more receptive to abstract theorizing.

But notice how, in the backdrop of "everyone thinks their job is hard," this statement provides very little ability to distinguish between worlds where this actually is a crisis and worlds where things will be fine!

In practice, it will be a combination of both. For this discussion in particular, I'd lean more towards the selection explanation, as opposed to the bias explanation.

It looks to me like this conversation is to some extent repeating a pattern which I've seen in AI safety conversations before:

Safety advocate: AI might destroy us if it doesn't have the right safeguards.
Safety skeptic: That's stupid, because why would anyone build it without those safeguards.

We'll give AGIs non-open-ended objectives like fetching coffee. These are task-limited and therefore there's no more instrumental subgoals after the task is complete.
We will put "simple terms in the objective" to prevent obvious problems, presumably things like "don't harm people", "don't violate laws", etc.
We will put in "a mechanism" to edit the objective upon observing bad behavior;
We can physically destroy a computer housing AGI;
We can build a second AGI whose sole purpose is to destroy the first AGI if the first AGI has gotten out of control, and the latter will succeed because it's more specialized.

And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).

You mean in the sense of stabilizing the whole world? I'd be surprised if that's what Yann had in mind. I took him just to mean building a specialized AI to be a check on a single other AI.

That's how I interpreted:

the defensive AI systems designed to protect against rogue AI systems are not akin to the military, they are akin to the police, to law enforcement. Their "jurisdiction" would be strictly AI systems, not humans.

Promoted to curated: This seems like it was a real conversation, and I also think it's particularly valuable for LessWrong to engage with more outside perspectives like the ones above.

I also in general want to encourage people to curate discussion and contributions that happen all around the web, and archive them in formats like this.

I commented on the thread (after seeing this) in order to add a link to my paper that addresses Bengio's last argument;

@Yoshua Bengio I attempted to formalize this argument somewhat in a recent paper. I don't think the argument there is particularly airtight, but I think it provides a significantly stronger argument for why we should believe that interaction between optimizing systems is fundamentally hard.

https://www.mdpi.com/2504-2289/3/2/21/htm

I think part of what may be going on here is that the approach to AI that Yann advocates happens to be one that is unusually amenable to alignment. Some discussion here:

https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety

Note 1: This review is also a top-level post.

Note 2: I think that 'robust instrumentality' is a more apt name for 'instrumental convergence.' That said, for backwards compatibility, this comment often uses the latter.

Somehow.

I needed to find the right definitions first, and I couldn't even imagine what the final theorems would say. The fall crept up on me... and found my work incomplete.

The debate unfolded below the link-preview: an imposing robot staring the reader down, a title containing 'Terminator', a byline dismissive of AI risk:

I wanted to reach out, to say, "hey, here's a paper formalizing the question you're all confused by!" But it was too early.

Now, at least, I can say what I wanted to say back then:

Rob Bensinger's nomination reads:

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).
I don't think the discussion stands great on its own, but it may be helpful for:
people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
people new to AI alignment who want to use the views of leaders in the field to help them orient.

I certainly agree with Rob's first bullet point. The debate did show us what certain famous AI researchers thought about instrumental convergence, circa 2019.

Absent such an understanding, the debate is needlessly ungrounded and informal. Absent such an understanding, we see reasoning like this:

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

I'm glad that this debate happened, but I think it monkeys around too much to be included in the LessWrong 2019 review.

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

What's your specific critique of this? I think it's an interesting and insightful point.

I do agree that there's a very good point in the neighborhood of the quoted argument. My steelman of this would be:

Some animals, like humans, seem to have power-seeking drives. Other animals, like orangutans, do not. Therefore, it's possible to design agents of some intelligence which do not seek power. Obviously, we will be trying to design agents which do not seek power. Why, then, should we expect such agents to be more like humans than like orangutans?

I think he's overselling the evidence. However, on reflection, I wouldn't pick out the point for such strong ridicule.

May be useful to include in the review with some of the comments, or with a postmortem and analysis by Ben (or someone).

I don't think the discussion stands great on its own, but it may be helpful for:

people familiar with AI alignment who want to better understand some human factors behind 'the field isn't coordinating or converging on safety'.
people new to AI alignment who want to use the views of leaders in the field to help them orient.