Here are some concerns which have been raised about the development of advanced AI:
- Power might become concentrated with agentic AGIs which are highly misaligned with humanity as a whole (the second species argument).
- AI might allow power to become concentrated to an unprecedented extent with elites who are misaligned with humanity as a whole.
- Competitive pressures to use narrow, non-agentic AIs trained on easily-measurable metrics might become harmful enough to cause a “slow-rolling catastrophe”. [Edit: it seems like this is not the intended interpretation of Paul's argument in What Failure Looks Like; see discussion in the comments section. So I no longer fully endorse this section, but I've left it up for reference purposes.]
- AI might make catastrophic conflicts easier or more likely; in other words, the world might become more vulnerable with respect to available technology.
- AIs might be morally relevant, but be treated badly.
I’ve already done a deep dive on the second species argument, so in this post I’m going to focus on the others - the risks which don’t depend on thinking of AIs as autonomous agents with general capabilities. Warning: this is all very speculative; I’m mainly just trying to get a feeling for the intellectual terrain, since I haven’t seen many explorations of these concerns so far.
Inequality and totalitarianism
One key longtermist concern about inequality is that certain groups might get (semi)permanently disenfranchised; in other words, suboptimal values might be locked in. Yet this does not seem to have happened in the past: moral progress has improved the treatment of slaves, women, non-Europeans, and animals over the last few centuries, despite those groups starting off with little power. It seems to me that most of these changes were driven by the moral concerns of existing elites, backed by public sentiment in wealthy countries, rather than improvements in the bargaining position of the oppressed groups which made it costlier to treat them badly (although see here for an opposing perspective). For example, ending the slave trade was very expensive for Britain; the Civil War was very expensive for the US; and so on. Perhaps the key exception is the example of anti-colonialist movements - but even then, public moral pressure (e.g. opposition to harming non-violent protesters) was a key factor.
What would reduce the efficacy of public moral pressure? One possibility is dramatic increases in economic inequality. Currently, one limiting factor on inequality is the fact that most people have a significant amount of human capital, which they can convert to income. However, AI automation will make most forms of human capital much less valuable, and therefore sharply increase inequality. This didn’t happen to humans after the industrial revolution, because human intellectual skills ended up being more valuable in absolute terms after a lot of physical labour was automated. But it did happen to horses, who lost basically all their equine capital.
Will any human skills remain valuable after AGI, or will we end up in a similar position to horses? I expect that human social skills will become more valuable even if they can be replicated by AIs, because people care about human interaction for its own sake. And even if inequality increases dramatically, we should expect the world to also become much richer, making almost everyone wealthier in absolute terms in the medium term. In particular, as long as the poor have levels of political power comparable to what they have today, they can use that power to push the rich to redistribute wealth. This will be easiest on a domestic level, but it also seems that citizens of wealthy countries are currently sufficiently altruistic to advocate for transfers of wealth to poorer countries, and will do so even more if international inequality grows.
So to a first approximation, we can probably think about concerns about inequality as a subset of concerns about preventing totalitarianism: mere economic inequality within a (somewhat democratic) rule of law seems insufficient to prevent the sort of progress that is historically standard, even if inequality between countries dramatically increases for a time. By contrast, given access to AI technology which is sufficiently advanced to confer a decisive strategic advantage, a small group of elites might be able to maintain power indefinitely. The more of the work of maintaining control is outsourced to AI, the smaller that group can be; the most extreme case would be permanent global totalitarianism under a single immortal dictator. Worryingly, if there’s no realistic chance of them being overthrown, they could get away with much worse behaviour than most dictators - North Korea is a salient example. Such scenarios seem more likely in a world where progress in AI is rapid, and leads to severe inequality. In particular, economic inequality makes subversion of our political systems easier; and inequality between countries makes it more likely for an authoritarian regime to gain control of the world.
In terms of direct approaches to preventing totalitarianism, I expect it will be most effective to apply existing approaches (e.g. laws against mass surveillance) to new applications powered by AI; but it’s likely that there will also be novel and valuable approaches. Note, finally, that these arguments assume a level of change comparable to the industrial revolution; however, eventually we’ll get far beyond that (e.g. by becoming posthuman). I discuss some of these long-term considerations later on.
A slow-rolling catastrophe
I’ll start this section by introducing the key idea of Paul Christiano’s “slow-rolling catastrophe” (even though it’s not framed directly in terms of competitive pressures; I’ll get to those later): “Right now humans thinking and talking about the future they want to create are a powerful force that is able to steer our trajectory. But over time human reasoning will become weaker and weaker compared to new forms of reasoning honed by trial-and-error. Eventually our society’s trajectory will be determined by powerful optimization with easily-measurable goals rather than by human intentions about the future.”
In what different domains might we see this sort of powerful optimisation? Paul identifies a few:
- Corporations will focus on “manipulating consumers, capturing regulators, extortion and theft”.
- “Instead of actually having an impact, [investors] will be surrounded by advisors who manipulate them into thinking they’ve had an impact.”
- Law enforcement will be driven by “creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.”
- Legislation may be optimized for “undermining our ability to actually perceive problems and constructing increasingly convincing narratives about where the world is going and what’s important.”
I’m skeptical of this argument for a few reasons. The main one is that these are highly exceptional claims, which demand concomitantly exceptional evidence. So far we have half a blog post by Paul, plus this further investigation by Sam Clarke. Anecdotally it seems that a number of people find this line of argument compelling, but unless they have seen significant non-public evidence or arguments, I cannot see how this is justifiable. (And even then, the lack of public scrutiny should significantly reduce our confidence in these arguments.)
In particular, it seems that such arguments need to overcome two important “default” expectations about the deployment of new technology. The first is that such technologies are often very useful, with their benefits significantly outweighing their harms. So if narrow AI becomes very powerful, we should expect it to improve humanity’s ability to steer our trajectory in many ways. As one example, even though Google search is primarily optimised for easily-measurable metrics, it still gives us much better access to information than we had before it. Some more general ways in which I expect narrow AI to be useful:
- Search engines and knowledge aggregators will continue to improve our ability to access, filter and organise information.
- AI-powered collaboration tools such as automated translation and better virtual realities will allow us to coordinate and communicate more effectively.
- Narrow AI will help us carry out scientific research, by finding patterns in data, automating tedious experimental steps, and solving some computationally-intensive problems (like protein folding).
- AI tools will help us to analyse our habits and behaviour and make interventions to improve them; they’ll also aggregate this information to allow better high-level decision-making (e.g. by economic policymakers).
- AI will help scale up the best educational curricula and make them more personalised.
- Narrow AI will generally make humanity more prosperous, allowing us to dedicate more time and effort to steering society’s trajectory in beneficial directions.
The second default expectation about technology is that, if using it in certain ways is bad for humanity, we will stop people from doing so. This is a less reliable extrapolation - there are plenty of seemingly-harmful applications of technology which are still occurring. But note that we’re talking about a slow-rolling catastrophe - that is, a situation which is unprecedentedly harmful. And so we should expect an unprecedented level of support for preventing whatever is causing it, all else equal.
So what grounds do we have to expect that, despite whatever benefits these systems have, their net effect on our ability to control the future will be catastrophic, in a way that we’re unable to prevent? I’ll break down Paul’s argument into four parts, and challenge each in turn:
- The types of problems discussed in the examples above are plausible.
- Those problems will “combine with and amplify existing institutional and social dynamics that already favor easily-measured goals”.
- “Some states may really put on the brakes, but they will rapidly fall behind economically and militarily.”
- “As the system becomes more complex, [recognising and overcoming individual problems] becomes too challenging for human reasoning to solve directly and requires its own trial and error, and at the meta-level the process continues to pursue some easily measured objective (potentially over longer timescales). Eventually large-scale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.”
To be clear, I don’t dispute that there are some harmful uses of AI. Development of more destructive weapons is the most obvious; I also think that emotional manipulation is a serious concern, and discuss it in the next section. But many of the things Paul mentions go well beyond basic manipulation - instead they seem to require quite general reasoning and planning capabilities applied over long time periods. This is well beyond what I expect narrow systems trained on easily-measurable metrics in specific domains will be capable of achieving. How do you train on a feedback signal about whether a certain type of extortion will get you arrested a few months later? How do law enforcement AIs prevent surveys and studies from revealing the true extent of law enforcement failures? If it’s via a deliberate plan to suppress them while also overcoming human objections to doing so, then that seems less like a narrow system “optimising for an easily measured objective” and more like an agentic and misaligned AGI. In other words, the real world is so complex and interlinked that in order to optimise for an easily measured objective to an extreme extent, in the face of creative human opposition, AIs will need to take into account a broad range of factors that aren’t easily measured, which we should expect to require reasoning roughly as powerful and general as humans’. (There are a few cases in which this might not apply - I’ll discuss them in the next section.)
Now, we should expect that AIs will eventually be capable of human-level reasoning. But these don’t seem like the types of AIs which Paul is talking about here: he only raises the topic of “systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals” after he’s finished discussing slow-rolling catastrophes. Additionally, the more general our AIs are, the weaker the argument that they will pursue easily-measurable targets, because it’ll be easier to train them to pursue vague targets. For example, human intelligence is general enough that we require very few examples to do well on novel tasks; and we can follow even vague linguistic instructions. So Paul needs to explain why AIs that are capable of optimisation which is flexible enough to be worth worrying about won’t be able to acquire behaviour which is nuanced enough to be beneficial. If his answer is that they’ll learn the wrong goals at first and then deceive us or resist our attempts to correct them, then it seems like we’re back to the standard concerns about misaligned agentic AGIs, rather than a separate concern about optimisation for easily-measurable metrics.
Secondly, let’s talk about existing pressures towards easily-measured goals. I read this as primarily referring to competitive political and economic activity - because competition is a key force pushing people towards tradeoffs which are undesirable in the long term. Perhaps the simplest example is the common critique of capitalism: money is the central example of an easily-measured objective, and under capitalism other values lose ground to the pursuit of it. If corporations don’t make use of AIs which optimise very well for easily-measurable metrics like short-term profits, then they might be driven bankrupt by competitors, or else face censure from their shareholders (who tend to feel less moral responsibility for the behaviour of the corporations they own than other stakeholders do). And so we might think that, even if corporations or institutions have some suspicions about specific AI tools, they’ll just have to use those ones rather than alternatives which optimise less hard for those metrics but are more beneficial for society overall.
What I’m skeptical about, though, is: how strong will these pressures be? Realistically speaking, what will make law enforcement departments hand over enough power to AIs that they can implement behaviour that isn’t desired by those law enforcement departments, such as widespread deception and coercion? (Presumably not standard types of crime, in an era where it’ll be so easy to track and measure behaviour.) Will it be legislative pressure? Will voters really be looking so closely at political activity that politicians have no choice but to pass laws which are ruthlessly optimised for getting reelected? Right now I’d be very happy if the electorate started scrutinising laws at all! Similarly, I’d be happy if public opinion had any reasonable ability to change the police, or the military, or government bureaucracies. Far from being threatened by over-optimisation, most of our key institutions seem to me near-invulnerable to it. And consider that at this point, we’ll also be seeing great prosperity from the productive uses of AI - what Drexler calls Paretotopia. I expect sufficient abundance that it’ll be easy to overlook massive inefficiencies in all sorts of places, and that the pressure to hand over power to blindly-optimising algorithms will be very low.
Even if there’s great abundance in general, though, and little pressure on most societal institutions, we might worry that individual corporations will be forced to make such tradeoffs in order to survive (or just to make more profit). This seems a little at odds with our current situation, though - the biggest tech companies have a lot of slack (often in the form of hundreds of billions of dollars in cash reserves). Additionally, it seems fairly easy for governments to punish companies for causing serious harm; even the biggest companies have very little ability to resist punitive legislative measures if politicians and the public swing against them (especially since anti-corporate sentiment is already widespread). The largest sectors in modern economies - including financial systems, healthcare systems, educational systems, and so on - are very heavily regulated to prevent risks much smaller than the ones Paul predicts here.
What might make the economy so much more competitive, that such companies are pushed to take actions which are potentially catastrophic, and risk severe punishment? I think there are plausibly some important domains on which we’ll see races to the bottom (e.g. newsfeed addictiveness), but I don’t think most companies will rely heavily on these domains. Rather, I’d characterise the economy as primarily driven by innovation and exploration rather than optimisation of measurable metrics - and innovation is exactly the type of thing which can’t be measured easily. Perhaps if we ended up in a general technological stagnation, companies would have to squeeze every last drop of efficiency out of their operations, and burn a lot of value in the process. And perhaps if the world stagnates in this way, we’ll have time to collect enough training data that AI-powered “trial and error” will become very effective. But this scenario is very much at odds with the prospect of dramatic innovation in AI, which instead will open up new vistas of growth, and lead to far-reaching changes which render much previous data obsolete. So internal pressure on our societies doesn’t seem likely to cause this slow-rolling catastrophe, absent specific manipulation of certain pressure points (as I’ll discuss in the next section).
Thirdly: what about external pressure, from the desire not to fall behind economically or militarily? I’m pretty uncertain about how strong an incentive this provides - the US, for example, doesn’t seem very concerned right now about falling behind China. And in a world where change is occurring at a rapid and bewildering pace, I’d expect that many will actually want to slow that change down. But to be charitable, let’s consider a scenario of extreme competition, e.g. a Cold War between the US and China. In that case, we might see each side willing to throw away some of their values for a temporary advantage. Yet military advantages will likely rely on the development of a few pivotal technologies, which wouldn’t require large-scale modifications to society overall (although I’ll discuss dangers from arms races in the next section). Or perhaps the combatants will focus on boosting overall economic growth. Yet this would also incentivise increased high-level oversight and intervention into capitalism, which actually helps prevent a lot of the harmful competitive dynamics I discussed previously (e.g. increasingly addictive newsfeeds).
Fourthly: Paul might respond by saying that the economy will by then have become some “incomprehensible system” which we can’t steer, and which “pursues some easily measured objective” at the meta level. But there are also a lot of comprehensibility benefits to a highly automated economy compared with our current economy, in which a lot of the key variables are totally opaque to central planners. Being able to track many more metrics, and analyse them with sophisticated machine learning tools, seems like a big step up. And remember that, despite this, Paul is predicting catastrophic outcomes from a lack of measurability! So the bar for incomprehensibility is very high - not just that we’ll be confused by some parts of the automated economy, but that we won’t have enough insight into it to notice a disaster occurring. This may be possible if there are whole chunks of it which involve humans minimally or not at all. But as I’ve argued already, if AIs are capable and general enough to manage whole chunks of the economy in this way, they’re unlikely to be restricted to pursuing short-term metrics.
Further, I don’t see where the meta-level optimisation for easily measured objectives comes from. Is it based on choices made by political leaders? If only they optimised harder on easily-measured objectives like GDP or atmospheric carbon concentrations! Instead it seems that politicians are very slow to seize opportunities to move towards extreme outcomes. Additionally, we have no good reason to think that “the collective optimization of millions of optimizers pursuing simple goals” will be pushing in a unified direction, rather than mostly cancelling each other out.
Overall I am highly skeptical about the “slow-rolling catastrophe” in the abstract. I think that in general, high-level general-purpose reasoning skills (whether in humans or AIs) will remain much more capable of producing complex results than AI trial and error, even if people don’t try very hard to address problems arising from the latter (we might call this position “optimism of the intellect, pessimism of the will”). I’m more sympathetic to arguments identifying specific aspects of the world which may become increasingly important and increasingly vulnerable; let’s dig into these now.
A vulnerable world
This section is roughly in line with Bostrom’s discussion of the vulnerable world hypothesis, although at the end I also talk about some ways in which new technologies might lead to problematic structural shifts rather than direct vulnerabilities. Note that I discuss some of these only briefly; I’d encourage others to investigate them in greater detail.
It may be the case that human psychology is very vulnerable to manipulation by AIs. This is the type of task on which a lot of data can be captured (because there are many humans who can give detailed feedback); the task is fairly isolated (manipulating one human doesn’t depend much on the rest of the world); and the data doesn’t become obsolete as the world changes (because human psychology is fairly stable). Even assuming that narrow AIs aren’t able to out-argue humans in general, they may nevertheless be very good at emotional manipulation and subtle persuasion, especially against humans who aren’t on their guard. So we might be concerned that some people will train narrow AIs which can be used to manipulate people’s beliefs or attitudes. We can also expect that there will be a spectrum of such technologies: perhaps the most effective will be direct interaction with an AI able to choose an avatar and voice for itself. AIs might also be able to make particularly persuasive films, or ad campaigns. One approach I expect to be less powerful, but perhaps relevant early on, is an AI capable of instructing a human on how to be persuasive to another human.
How might this be harmful to the long-term human trajectory? I see two broad possibilities. The first is large-scale rollouts of weaker versions of these technologies, for example by political campaigns in order to persuade voters, which harms our ability to make good collective decisions; I’ll call this the AI propaganda problem. (This might also be used by corporations to defend themselves from the types of punishments I discussed in the previous section.) The second is targeted rollouts of more powerful versions of this technology, for example aimed at specific politicians by special interest groups, which will allow the attackers to persuade or coerce the targets into taking certain actions; I’ll call this the AI mind-hacking problem. I expect that, if mind-hacking is a real problem we will face, then the most direct forms of it will quickly become illegal. But in order to enforce that, detection of it will be necessary. So tools which can distinguish an AI-generated avatar from a video stream of a real human would be useful; but I expect that they will tend to be one step behind the most sophisticated generative tools (as is currently the case for adversarial examples, and cybersecurity). Meanwhile it seems difficult to prevent AIs being trained to manipulate humans by making persuasive videos, because by then I expect AIs to be crucial in almost every step of video production.
However, this doesn’t mean that detection will be impossible. Even if there’s no way to differentiate between a video stream of a real human versus an AI avatar, in order to carry out mind-hacking the AI will need to display some kind of unusual behaviour; at that point it can be flagged and shut down. Such detection tools might also monitor the mental states of potential victims. I expect that there would also be widespread skepticism about mind-hacking at first, until convincing demonstrations help muster the will to defend against it. Eventually, if humans are really vulnerable in this way, I expect protective tools to be as ubiquitous as spam filters - although it’s not clear whether the offense-defense balance will be as favourable to defense as it is in the case of spam. Yet because elites will be the most valuable targets for the most extreme forms of mind-hacking, I expect prompt action against it.
AI propaganda, by contrast, will be less targeted and therefore likely have weaker effects on average than mind-hacking (although if it’s deployed more widely, it may be more impactful overall). I think the main effect here would be to make totalitarian takeovers more likely, because propaganda could provoke strong emotional reactions and political polarisation, and use them to justify extreme actions. It would also be much more difficult to clamp down on than direct mind-hacking; and it’d target an audience which is less informed and less likely to take protective measures than elites.
One closely-related possibility is that of AI-induced addiction. We’re already seeing narrow AI used to make various social media more addictive. However, even if such media become as addictive as heroin, note that plenty of people manage to avoid heroin, because its addictiveness is so widely known. Even though certain AI applications are much easier to start using than heroin, I expect similar widespread knowledge to arise, and tools (such as website blockers) to help people avoid addiction. So it seems plausible that AI-driven addiction will be a large public health problem, but not a catastrophic threat.
The last possibility along these lines I’ll discuss is AI-human interactions replacing human-human interactions - for example, if AI friends and partners become more satisfying than human friends and partners. Whether this would actually be a bad outcome is a tricky moral question; but either way, it definitely opens up more powerful attack vectors for other forms of harmful manipulation, such as the ones previously discussed.
Centralised control of important services
It may be the case that our reliance on certain services - e.g. the Internet, the electrical grid, and so on - becomes so great that their failure would cause a global catastrophe. If these services become more centralised - e.g. because it’s efficient to have a single AI system which manages them - then we might worry that a single bug or virus could wreak havoc.
I think this is a fairly predictable problem that normal mechanisms will handle, though, especially given widespread mistrust of AI, and skepticism about its robustness.
Structural risks and destructive capabilities
Zwetsloot and Dafoe have argued that AI may exacerbate (or be exacerbated by) structural problems. The possibility which seems most pressing is AI increasing the likelihood of great power conflict. As they identify, the cybersecurity dilemma is a relevant consideration; and so is the potential insecurity of second-strike capabilities. Novel weapons may also have very different offense-defense balances, or costs of construction; we currently walk a fine line between nuclear weapons being sufficiently easy to build to allow Mutually Assured Destruction, and being sufficiently hard to build to prevent further proliferation. If those weapons are many times more powerful than nuclear weapons, then preventing proliferation becomes correspondingly more important. However, I don’t have much to say right now on this topic, beyond what has already been said.
A digital world
We should expect that we will eventually build AIs which are moral patients, and which are capable of suffering. If these AIs are more economically useful than other AIs, we may end up exploiting them at industrial scales, in a way analogous to factory farming today.
This possibility relies on several confusing premises. First is the question of moral patienthood. It seems intuitive to give moral weight to any AIs that are conscious, but if anything this makes the problem thornier. How can we determine which AIs are conscious? And what does it even mean, in general, for AIs very different from current sentient organisms to experience positive or negative hedonic states? Shulman and Bostrom discuss some general issues in the ethics of digital minds, but largely skim over these most difficult questions.
It’s easier to talk about digital minds which are very similar to human minds - in particular, digital emulations of humans (aka ems). We should expect that ems differ from humans mainly in small ways at first - for example, they will likely feel more happiness and less pain - and then diverge much more later on. Hanson outlines a scenario where ems, for purposes of economic efficiency, are gradually engineered to lack many traits we consider morally valuable in our successors, and then end up dominating the world. Although I’m skeptical about the details of his scenario, it does raise the crucial point that the editability and copyability of ems undermine many of the safeguards which prevent dramatic value drift in our current civilisation.
Even aside from resource constraints, though, other concerns arise in a world containing millions or billions of ems. Because it’s easy to create and delete ems, it will be difficult to enforce human-like legal rights for them, unless the sort of hardware they can run on is closely monitored. But centralised control over hardware comes with other problems - in particular, physical control over hardware allows control over all the ems running on it. And although naturally more robust than biological humans in many ways, ems face other vulnerabilities. For example, once most humans are digital ems, computer viruses will be a much larger (and potentially existential) threat.
Based on this preliminary exploration, I’m leaning towards thinking about risks which might arise from the development of advanced narrow, non-agentic AI primarily in terms of the following four questions:
- What makes global totalitarianism more likely?
- What makes great power conflict more likely?
- What makes misuse of AIs more likely or more harmful?
- What vulnerabilities may arise for morally relevant AIs or digital emulations?
Appendix: replies from the comments section
I didn't mean to make any distinction of this kind. I don't think I said anything about narrowness or agency. The systems I describe do seem to be optimizing for easily measurable objectives, but that seems mostly orthogonal to these other axes.
I'm pretty agnostic on whether AI will in fact be optimizing for the easily measured objectives used in training or for unrelated values that arise naturally in the learning process (or more likely some complicated mix), and part of my point is that it doesn't seem to much matter.
I'm saying: it's easier to pursue easily-measured goals, and so successful organizations and individuals tend to do that and to outcompete those whose goals are harder to measure (and to get better at / focus on the parts of their goals that are easy to measure, etc.). I'm not positing any change in the strength of competition, I'm positing a change in the extent to which goals that are easier to measure are in fact easier to pursue.
Regarding the extent and nature of competition I do think I disagree with you fairly strongly but it doesn't seem like a central point.
I think this is in fact quite high on the list of concerns for US policy-makers and especially the US defense establishment.
Firms and governments and people pursue a whole mix of objectives, some of which are easily measured. The ones pursuing easily-measured objectives are more successful, and so control an increasing fraction of resources.
I don't disagree with this at all. The point is that right now human future-steering is basically the only game in town. We are going to introduce inhuman reasoning that can also steer the future, and over time human reasoning will lose out in relative terms. (If you classify all future-steering machinery as "agentic" then evidently I'm talking about agents and I agree with the informal claim that "non-agentic" reasoning isn't concerning.) That's compatible with us benefiting enormously, if all of those benefits also accrue to automated reasoners---as they do in your examples. We will try to ensure that all this new reasoning will benefit humanity, but I describe two reasons why that might be difficult and say a little bit about how that difficulty might materialize.
I don't really know if or how this is distinct from what you call the second species argument. It feels like you are objecting to a distinction I'm not intending to make.
In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?
If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", because "systems that have a detailed understanding of the world" don't need to operate by trial and error; they understand what they're doing.
We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops. (Analogously: when you put humans in a new domain, and give them tasks and feedback via verbal instructions, then we can quickly learn sophisticated concepts in that new domain, and optimise for those, not just the easily-measured concepts in that new domain.)
Why is your scenario called "You get what you measure" if you're agnostic about whether we actually get what we measure, even on the level of individual AIs?
Or do you mean part 1 to be the case where we do get what we measure, and part 2 to be the case where we don't?
Firstly, I think this is only true for organisations whose success is determined by people paying attention to easily-measured metrics, and not by reality. For example, an organisation which optimises for its employees having beliefs which are correct in easily-measured ways will lose out to organisations where employees think in useful ways. An organisation which optimises for revenue growth is more likely to go bankrupt than an organisation which optimises for sustainable revenue growth. An organisation which optimises for short-term customer retention loses long-term customer retention. Etc.
The case in which this is more worrying is when an organisation's success is determined by (for example) whether politicians like it, and politicians only pay attention to easily-measurable metrics. In this case, organisations which pursue easily-measured goals will be more successful than ones which pursue the goals the politicians actually want to achieve. This is why I make the argument that actually the pressure on politicians to pursue easily-measurable metrics is pretty weak (hence why they're ignoring most economists' recommendations on how to increase GDP).
I agree that you've described some potential harms; but in order to make this a plausible long-term concern, you need to give reasons to think that the harms outweigh the benefits of AI enhancing (the effective capabilities of) human reasoning. If you'd written a comparable post a few centuries ago talking about how human physical power will lose out to inhuman physical power, I would have had the same complaint.
I classify Facebook's newsfeed as future-steering in a weak sense (it steers the future towards political polarisation), but non-agentic. Do you agree with this? If so, do you agree that if FB-like newsfeeds became prominent in many ways that would not be very concerning from a longtermist perspective?
I think this is the key point and it's glossed over in my original post, so it seems worth digging in a bit more.
I think there are many plausible models that generalize successfully to longer horizons, e.g. from 100 days to 10,000 days:
This is roughly why I'm afraid that models we train will ultimately be able to plan over longer horizons than those that appear in training.
But many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons (and in particular the first 4 above seem like they'd all be undesirable if generalizing from easily-measured goals, and would lead to the kinds of failures I describe in part I of WFLL).
I think one reason that my posts about this are confusing is that I often insist that we don't rely on generalization because I don't expect it to work reliably in the way we hope. But that's about what assumptions we want to make when designing our algorithms---I still think that the "generalizes in the natural way" model is important for getting a sense of what AI systems are going to do, even if I think there is a good chance that it's not a good enough approximation to make the systems do exactly what we want. (And of course I think if you are relying on generalization in this way you have very little ability to avoid the out-with-a-bang failure mode, so I have further reasons to be unhappy about relying on generalization.)
I agree that it's only us who are operating by trial and error---the system understands what it's doing. I don't think that undermines my argument. The point is that we pick the system, and so determine what it's doing, by trial and error, because we have no understanding of what it's doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn't a plausible way to do that.
To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?
I consider those pretty wide open empirical questions. The view that we can get good generalization of this kind is fairly common within ML.
I do agree that once you generalize motivations from easily measurable tasks with short feedback loops to tasks with long feedback loops, you may also be able to get "good" generalizations, and this is a way that you can solve the alignment problem. It seems to me that there are lots of plausible ways to generalize to longer horizons without also generalizing to "better" answers (according to humans' idealized reasoning).
(Another salient way in which you get long horizons is by doing something like TD learning, i.e. train a model that predicts its own judgment in 1 minute. I don't know if it's important to get into the details of all the ways people can try to get things to generalize over longer time horizons, it seems like there are many candidates. I agree that there are analogous candidates for getting models to optimize the things we want even if we can't measure them easily, and as I've said I think it's most likely those techniques will be successful, but this is a post about what happens if we fail, and I think it's completely unclear that "we can generalize to longer horizons" implies "we can generalize from the measurable to the unmeasurable".)
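The TD-style bootstrapping mentioned above can be illustrated with a toy example. This is my own minimal sketch (tabular TD(0) on a hypothetical five-state chain, not anything from the original discussion): every update only ever looks one step ahead at the model's own prediction, yet the learned values come to reflect a reward that only arrives at the end of the episode.

```python
# Toy chain MDP: states 0..4, the agent moves right each step, and the only
# reward is 1.0 on entering the final state. (Hypothetical setup, for illustration.)
N_STATES = 5
GAMMA = 0.9   # discount factor
ALPHA = 0.1   # learning rate

def step(s):
    """Deterministically move one state to the right."""
    s_next = min(s + 1, N_STATES - 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

V = [0.0] * N_STATES  # value estimates, initialised to zero
for _ in range(1000):
    s = 0
    while s < N_STATES - 1:
        s_next, r = step(s)
        # TD(0): update toward the model's own one-step-ahead judgment.
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

# Each update only ever looked one step ahead, but V[0] converges toward
# GAMMA ** 3 = 0.729, a value determined by a reward at the end of the episode.
print([round(v, 3) for v in V])
```

This is of course just the classical mechanism; the open question in the thread is whether anything like it yields the *motivational* generalization to long horizons, not whether the value-propagation math works.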
When we deploy humans in the real world they do seem to have many desires resembling various plausible generalizations of evolutionary fitness (e.g. to intrinsically want kids even in unfamiliar situations, to care about very long-term legacies, etc.). I totally agree that humans also want a bunch of kind of random spandrels. This is related to the basic uncertainty discussed in the previous paragraphs. I think the situation with ML may well differ because, if we wanted to, we can use training procedures that are much more likely to generalize than evolution.
I don't think it's relevant to my argument that humans can learn sophisticated concepts in a new domain, the question is about the motivations of humans.
Yes, I'm saying that part 1 is where you are able to get what you measure and part 2 is where you aren't.
Also, as I say, I expect the real world to be some complicated mish-mash of these kinds of failures (and for real motivations to be potentially influenced both by natural generalizations of what happens at training time and also by randomness / architecture / etc., as seems to be the case with humans).
Wanting to earn more money or grow users or survive over the long term is also an easily measured goal, and in practice firms crucially exploit the fact that these goals are contiguous with their shorter easily measured proxies. Non-profits that act in the world often have bottom-line metrics that they use to guide their action and seem better at optimizing goals that can be captured by such metrics (or metrics like donor acquisition).
The mechanism by which you are better at pursuing easily-measurable goals is primarily via internal coherence / stability.
I've said that previously human world-steering is the only game in town but soon it won't be, so the future is more likely to be steered in ways that a human wouldn't steer it, and that in turn is more likely to be a direction humans don't like. This doesn't speak to whether the harms on balance outweigh the benefits, which would require an analysis of the benefits but is also pretty irrelevant to my claim (especially given that all of the world-steerers enjoy these benefits and we are primarily concerned with relative influence over the very long term). I'm trying to talk about how the future could get steered in a direction that we don't like if AI development goes in a bad direction, I'm not trying to argue something like "Shorter AI timelines are worse" (which I also think is probably true but about which I'm more ambivalent).
I don't see a plausible way that humans can use physical power for some long-term goals and not other long-term goals, whereas I've suggested two ways in which automated reasoning may be more easily applied to certain long-term goals (namely the goals that are natural generalizations of training objectives, or goals that are most easily discovered in neural networks).
If Facebook's news feed would generate actions chosen to have the long-term consequence of increasing political polarization then I'd say it was steering the future towards political polarization. (And I assume you'd say it was an agent.)
As is, I don't think Facebook's newsfeed steers the future towards political polarization in a meaningful sense (it's roughly the same as a toaster steering the world towards more toast).
Maybe that's quantitatively just the same kind of thing but weak, since after all everything is about generalization anyway. In that case the concern seems like it's about world-steering that scales up as we scale up our technology/models improve (such that they will eventually become competitive with human world-steering), whereas the news feed doesn't scale up since it's just relying on some random association about how short-term events X happen to lead to polarization (and nor will a toaster if you make it better and better at toasting). I don't really have views on this kind of definitional question, and my post isn't really relying on any of these distinctions.
Something like A/B testing is much closer to future-steering, since scaling it up in the obvious way (and scaling to effects across more users and longer horizons rather than independently randomizing) would in fact steer the future towards whatever selection criteria you were using. But I agree with your point that such systems can only steer the very long-term future once there is some kind of generalization.
This makes sense to me, and seems to map somewhat onto Parts 1 and 2 of WFLL.
However, you also call those parts "going out with a whimper" and "going out with a bang", which seems to be claims about the impacts of bad generalizations. In that post, are you intending to make claims about possible kinds of bad generalizations that ML models could make, or possible ways that poorly-generalizing ML models could lead to catastrophe (or both)?
Personally, I'm pretty on board with the two types of bad generalizations as plausible things that could happen, but less on board with "going out with a whimper" as leading to catastrophe. It seems like you at least need to explain why in that situation we can't continue to work on the alignment problem and replace the agents with better-aligned AI systems in the future. (Possibly the answer is "the AI systems don't allow us to do this because it would interfere with their continued operation".)
There are a bunch of things that differ between part I and part II, I believe they are correlated with each other but not at all perfectly. In the post I'm intending to illustrate what I believe some plausible failures look like, in a way intended to capture a bunch of the probability space. I'm illustrating these kinds of bad generalizations and ways in which the resulting failures could be catastrophic. I don't really know what "making the claim" means, but I would say that any ways in which the story isn't realistic are interesting to me (and we've already discussed many, and my views have---unsurprisingly!---changed considerably in the details over the last 2 years), whether they are about the generalizations or the impacts.
I do think that the "going out with a whimper" scenario may ultimately transition into something abrupt, unless people don't have their act together enough to even put up a fight (which I do think is fairly likely conditioned on catastrophe, and may be the most likely failure mode).
We can continue to work on the alignment problem and continue to fail to solve it, e.g. because the problem is very challenging or impossible or because we don't end up putting in a giant excellent effort (e.g. if we spent a billion dollars a year on alignment right now it seems plausible it would be a catastrophic mess of people working on irrelevant stuff, generating lots of noise while we continue to make important progress at a very slow rate).
The most important reason this is possible is that change is accelerating radically, e.g. I believe that it's quite plausible we will not have massive investment in these problems until we are 5-10 years away from a singularity and so just don't have much time.
If you are saying "Well why not wait until after the singularity?" then yes, I do think that eventually it doesn't look like this. But that can just look like people failing to get their act together, and then eventually when they try to replace deployed AI systems they fail. Depending on how generalization works that may look like a failure (as in scenario 2) or everything may just look dandy from the human perspective because they are now permanently unable to effectively perceive or act in the real world (especially off of earth). I basically think that all bets are off if humans just try to sit tight while an incomprehensible AI world-outside-the-gates goes through a growth explosion.
I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. And moreover, the only reason it looks that way to me is because I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in growth is quite likely), and the story will justifiably sound very different to someone who isn't coming in with those assumptions.
Fwiw I think I didn't realize you weren't making claims about what post-singularity looked like, and that was part of my confusion about this post. Interpreting it as "what's happening until the singularity" makes more sense. (And I think I'm mostly fine with the claim that it isn't that important to think about what happens after the singularity.)
In most of the cases you've discussed, trying to do tasks over much longer time horizons involves doing a very different task. Reducing reported crime over 10 minutes and reducing reported crime over 100 minutes have very little to do with reducing reported crime over a year or 10 years. The same is true for increasing my wealth, or increasing my knowledge (which over 10 minutes involves telling me things, but over a year might involve doing novel scientific research). I tend to be pretty optimistic about AI motivations generalising, but this type of generalisation seems far too underspecified. "Making predictions" is perhaps an exception, insofar as it's a very natural concept, and also one which transfers very straightforwardly from simulations to reality. But it probably depends a lot on what type of predictions we're talking about.
On meta-learning: it doesn't seem realistic to think about an AI "trying to get high rewards" on tasks where the time horizon is measured in months or years. Instead it'll try to achieve some generalisation of the goals it learned during training. But as I already argued, we're not going to be able to train on single tasks which are similar enough to real-world long-term tasks that motivations will transfer directly in any recognisable way.
Insofar as ML researchers think about this, I think their most common position is something like "we'll train an AI to follow a wide range of instructions, and then it'll generalise to following new instructions over longer time horizons". This makes a lot of sense to me, because I expect we'll be able to provide enough datapoints (mainly simulated datapoints, plus language pre-training) to pin down the concept "follow instructions" reasonably well, whereas I don't expect we can provide enough datapoints to pin down a motivation like "reduce reports of crime". (Note that I also think that we'll be able to provide enough datapoints to incentivise influence-seeking behaviour, so this isn't a general argument against AI risk, but rather an argument against the particular type of task-specific generalisation you describe.)
In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn't have a strong bias towards easily-measurable goals.
I think that throughout your post there's an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we've deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap. So if we train an AI on type one measurements, we'll usually be able to use type two measurements to evaluate whether it's doing a good job post-deployment. And that AI won't game those type two measurements even if it generalises its training signal to much longer time horizons, because it will never have been trained on type two measurements.
These seem like the key disagreements, so I'll leave off here, to prevent the thread from branching too much. (Edited one out because I decided it was less important).
I feel like a very natural version of "follow instructions" is "Do things that the instruction-giver would rate highly." (Which is the generalization I'm talking about.) I don't think any of the arguments about "long horizon versions of tasks are different from short versions" tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).
Other versions like "Follow instructions (without regards to what the training process cares about)" seem quite likely to perform significantly worse on the training set. It's also not clear to me that "follow the spirit of the instructions" is better-specified than "do things the instruction-giver would rate highly if we asked them"---informally I would say the latter is better-specified, and it seems like the argument here is resting crucially on some other sense of well-specification.
I've trained in simulation on tasks in a wide variety of environments, each with a reward signal, and I am taught to learn the dynamics of the environment and the reward and then take actions that lead to a lot of reward. In simulation my tasks can have reasonably long time horizons (as measured by how long I think), though that depends on open questions about scaling behavior. I don't agree with the claim that it's unrealistic to imagine such models generalizing to reality by wanting something-like-reward.
Trying to maximize wealth over 100 minutes is indeed very different from maximizing wealth over 1 year, and is also almost completely useless for basically the same reason (except in domains like day trading where mark to market acts as a strong value function).
My take is that people will be pushed to optimizing over longer horizons because these qualitatively different tasks over short horizons aren't useful. The useful tasks in fact do involve preparing for the future and acquiring flexible influence, and so time horizons long enough to be useful will also be long enough to be relevantly similar to yet longer horizons.
Developers will be incentivized to find any way to get good behavior over long horizons, and it seems like we have many candidates that I regard as plausible and which all seem reasonably likely to lead to the kind of behavior I discuss. To me it feels like you are quite opinionated about how that generalization will work.
It seems like your take is "consequences over long enough horizons to be useful will be way too expensive to use for training," which seems close to 50/50 to me.
I agree that this is a useful distinction and there will be some gap. I think that quantitatively I expect the gap to be much smaller than you do (e.g. getting 10k historical examples of 1-year plans seems quite realistic), and I expect people to work to design training procedures that get good performance on type two measures (roughly by definition), and I guess I'm significantly more agnostic about the likelihood of generalization from the longest type one measures to type two measures.
I'm imagining systems generalizing much more narrowly to the evaluation process used during training. This is still underspecified in some sense (are you trying to optimize the data that goes into SGD, or the data that goes into the dataset, or the data that goes into the sensors?) and in the limit that basically leads to influence-maximization and continuously fades into scenario 2. It's also true that e.g. I may be able to confirm at test-time that there is no training process holding me accountable, and for some of these generalizations that would lead to a kind of existential crisis (where I've never encountered anything like this during training and it's no longer clear what I'm even aiming at). It doesn't feel like these are the kinds of underspecification you are referring to.
The type 1 vs. type 2 feedback distinction here seems really central. I'm interested if this seems like a fair characterization to both of you.
Type 1: Feedback which we use for training (via gradient descent)
Type 2: Feedback which we use to decide whether to deploy a trained agent.
(There's a bit of gray between Type 1 and 2, since choosing whether to deploy is another form of selection, but I'm assuming we're okay stating that gradient descent and model selection operate in qualitatively distinct regimes.)
The key disagreement is whether we expect type 1 feedback will be closer to type 2 feedback, or whether type 2 feedback will be closer to our true goals. If the former, our agents generalizing from type 1 to type 2 is relatively uninformative, and we still have Goodhart. In the latter case, the agent is only very weakly optimizing the type 2 feedback, and so we don't need to worry much about Goodhart, and should expect type 2 feedback to continue to track our true goals well.
Main argument for type 1 ~ type 2: by definition, we design type 1 feedback (+associated learning algorithm) so that resulting agents perform well under type 2
Main argument for type 1 !~ type 2: type 2 feedback can be something like 1000-10000x more expensive, since we only have to evaluate it once, rather than enough times to be useful for gradient descent
I'd also be interested to discuss this disagreement in particular, since I could definitely go either way on it. (I plan to think about it more myself.)
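To make the distinction concrete, here is a toy numerical sketch (all names hypothetical, and the quadratic proxy is just a stand-in for a learned short-horizon reward): type 1 feedback is cheap enough to be queried hundreds of times as a training signal, while type 2 feedback is queried exactly once, as a deployment gate.

```python
# Hypothetical sketch of the type 1 / type 2 split discussed above.
calls = {"type1": 0, "type2": 0}

def type1_feedback(q):
    """Cheap short-horizon proxy, usable as a training signal."""
    calls["type1"] += 1
    return -(q - 1.0) ** 2  # maximized at q = 1

def type2_feedback(q):
    """Expensive long-horizon evaluation, affordable only pre-deployment."""
    calls["type2"] += 1
    return q > 0.9  # stand-in for e.g. a year-long holdout trial

# "Training": finite-difference ascent on the cheap type 1 signal (SGD stand-in).
q, lr, eps = 0.0, 0.1, 1e-4
for _ in range(200):
    grad = (type1_feedback(q + eps) - type1_feedback(q - eps)) / (2 * eps)
    q += lr * grad

# "Deployment gate": a single type 2 measurement decides whether to ship.
deploy = type2_feedback(q)
print(round(q, 3), deploy, calls)
```

The call counter makes the cost asymmetry explicit: hundreds of type 1 queries against a single type 2 query. The disagreement in the thread is then about whether anything learned under the cheap signal carries over to the expensive one, which this sketch deliberately does not settle.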
I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like really all you need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and that the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.
Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month] then it's not actually very far to get from [1 month, 2 years]. It seems like we've already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.
At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand "what we really want" is a completely different thing (that we basically can't even define cleanly). So prima facie it feels to me like if models generalize "well" then we can get them to generalize from type 1 to type 2, whereas no such thing is true for "what we really care about."
A couple of clarifications:
Let's also include feedback which we can use to decide whether to stop deploying an agent; the central example in my head is an agent which has been deployed for some time before we discover that it's doing bad things.
Relatedly, another argument for type 1 !~ type 2 which seems important to me: type 2 feedback can look at long time horizons, which I expect to be very useful. (Maybe you included this in the cost estimate, but idk how to translate between longer times and higher cost directly.)
This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals. But if that's the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.
In practice, I expect that misaligned agents which perform well on type 2 feedback will do so primarily by deception, for instrumental purposes. But it's hard to picture agents which carry out this type of deception, but which don't also decide to take over the world directly.
But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for "does what we care about" goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it's not clear that's an important part of the default plan (whereas I think we will clearly extensively leverage "try several strategies and see what works").
"Do things that look to a human like you are achieving X" is closely related to X, but that doesn't mean that learning to do the one implies that you will learn to do the other.
Maybe it’s helpful to imagine the world where type 1 feedback is “human evals after 1 week horizon”, type 2 feedback is “human evals after 1 year horizon,” and “what we really care about” is the "human evals after a 100 year horizon." I think that’s much better than the actual situation, but even in that case I’d have a significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we do selection for working on 2 years + are able to use a small amount of 2 year data). Do you feel pretty confident that something that generalizes from 1 week to 1 year will go indefinitely, or is your intuition predicated on something about the nature of “be helpful” and how that’s a natural motivation for a mind? (Or maybe that we will be able to identify some other similar “natural” motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to be getting into more details about why “be helpful” is a particularly natural (or else why we should be able to pick out something else like that). In the other cases I think I haven't fully internalized your view.
I agree with the two questions you've identified as the core issues, although I'd slightly rephrase the former. It's hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I'd rephrase the first option you mention as "feeling pretty confident that something that generalises from 1 week to 1 year won't become misaligned enough to cause disasters". This point seems more important than the second point (the nature of “be helpful” and how that’s a natural motivation for a mind), but I'll discuss both.
I think the main disagreement about the former is over the relative strength of "results-based selection" versus "intentional design". When I said above that "we design type 1 feedback so that resulting agents perform well on our true goals", I was primarily talking about "design" as us reasoning about our agents, and the training process they undergo, not the process of running them for a long time and picking the ones that do best. The latter is a very weak force! Almost all of the optimisation done by humans comes from intentional design plus rapid trial and error (on the timeframe of days or weeks). Very little of the optimisation comes from long-term trial and error (on the timeframe of a year) - by necessity, because it's just so slow.
So, conditional on our agents generalising from "one week" to "one year", we should expect that it's because we somehow designed a training procedure that produces scalable alignment (or at least scalable non-misalignment), or because they're deceptively aligned (as in your influence-seeking agents scenario), but not because long-term trial and error was responsible for steering us towards getting what we can measure.
Then there's the second question, of whether "do things that look to a human like you're achieving X" is a plausible generalisation. My intuitions on this question are very fuzzy, so I wouldn't be surprised if they're wrong. But, tentatively, here's one argument. Consider a policy which receives instructions from a human, talks to the human to clarify the concepts involved, then gets rewarded and updated based on how well it carries out those instructions. From the policy's perspective, the thing it interacts with, and which its actions are based on, is human instructions. Indeed, for most of the training process the policy plausibly won't even have the concept of "reward" (in the same way that humans didn't evolve a concept of fitness). But it will have this concept of human intentions, which is a very good proxy for reward. And so it seems much more natural for the policy's goals to be formulated in terms of human intentions and desires, which are the observable quantities that it responds to; rather than human feedback, which is the unobservable quantity that it is optimised with respect to. (Rewards can be passed as observations to the policy, but I claim that it's both safer and more useful if rewards are unobservable by the policy during training.)
This argument is weakened by the fact that, when there's a conflict between them (e.g. in cases where it's possible to fool the humans), agents aiming to "look like you're doing X" will receive more reward. But during most of training the agent won't be very good at fooling humans, and so I am optimistic that its core motivations will still be more like "do what the human says" than "look like you're doing what the human says".
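The setup described above — a policy that observes human instructions but never observes its own reward — can be sketched concretely. This is a toy illustration only (all names hypothetical, not anyone's actual training code): the point is just that the reward signal lives entirely inside the training loop and is never passed back to the policy as an observation.

```python
import random

def policy(observation, weights):
    # Toy stand-in for the trained policy: maps an observation
    # (the human's instruction) to an action. The observation
    # contains the instruction but NOT any reward signal.
    return observation["instruction"]

def human_feedback(instruction, action):
    # Toy stand-in for human evaluation of how well the action
    # carried out the instruction.
    return 1.0 if action == instruction else 0.0

def train(num_steps=100):
    weights = 0.0
    for _ in range(num_steps):
        instruction = random.choice(["fetch", "sort", "summarise"])
        # The policy only ever observes the instruction...
        observation = {"instruction": instruction}
        action = policy(observation, weights)
        # ...while the reward exists only inside the training loop,
        # and is never added to the policy's observations:
        reward = human_feedback(instruction, action)
        weights += 0.01 * reward  # stand-in for a gradient update
    return weights
```

In this sketch the policy's inputs are built entirely from human instructions, which is the structural fact the argument rests on: the observable quantity the policy responds to is the instruction, while the reward remains an unobservable quantity it is merely optimised with respect to.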
Cool, thanks for the clarifications. To be clear, overall I'm much more sympathetic to the argument as I currently understand it, than when I originally thought you were trying to draw a distinction between "new forms of reasoning honed by trial-and-error" in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and "systems that have a detailed understanding of the world" in part 2.
Let me try to sum up the disagreement. The key questions are:
1. How large a role will long-horizon real-world data and feedback play in training?
2. Which generalisations of an agent's motivations are most plausible?
3. How much harder is it to fool our long-term (type 2) evaluations than our training feedback?
On 1: you think long-horizon real-world data will play a significant role in training, because we'll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won't be able to find rewards that are given over long time horizons). And I don't think that this training will be extensive enough to reshape those motivations to a large degree (whereas I recall that in an earlier discussion on amplification, you argued that small amounts of training could potentially reshape motivations significantly). Our disagreement about question 1 affects questions 2 and 3, but it affects question 2 less than I previously thought, as I'll discuss.
On 2: previously I thought you were arguing that we should expect very task-specific generalisations like being trained on "reduce crime" and learning "reduce reported crime", which I was calling underspecified. However, based on your last comment it seems that you're actually mainly talking about broader generalisations, like being trained on "follow instructions" and learning "do things that the instruction-giver would rate highly". This seems more plausible, because it's a generalisation that you can learn in many different types of training; and so our disagreement on 1 becomes less consequential.
I don't have a strong opinion on the likelihood of this type of generalisation. I guess your argument is that, because we're doing a lot of trial and error, we'll keep iterating until we either get something aligned with our instructions, or something which optimises for high ratings directly. But it seems to me that, by default, during early training periods the AI won't have much information about the overseer's knowledge (or even the overseer's existence), and may not even have the concept of rewards, making alignment with instructions much more natural. Above, you disagree; in either case, my concern is that this underlying concept of "natural generalisation" is doing a lot of work, despite not having been explored in your original post (or anywhere else, to my knowledge). We could go back and forth about where the burden of proof is, but it seems more important to develop a better characterisation of natural generalisation; I might try to do this in a separate post.
On 3: it seems to me that the resources which we'll put into evaluating a single deployment are several orders of magnitude higher than the resources we'll put into evaluating each training data point - e.g. we'll likely have whole academic disciplines containing thousands of people working full-time for many years on analysing the effects of the most powerful AIs' behaviour.
You say that you expect people to work to design training procedures that get good performance on type 2 measurements. I agree with this - but if you design an AI that gets good performance on type 2 measurements despite never being trained on them, then that rules out the most straightforward versions of the "do things that the instruction-giver would rate highly" motivation. And since the trial and error to find strategies which fool type 2 measurements will be carried out over years, the direct optimisation for fooling type 2 measurements will be weak.
I guess the earlier disagreement about question 1 is also relevant here. If you're an AI trained primarily on data and feedback which are very different from real-world long-term evaluations, then there are very few motivations which lead you to do well on real-world long-term evaluations. "Follow instructions" is one of them; some version of "do things that the instruction-giver would rate highly" is another, but it would need to be quite a specific version. In other words, the greater the disparity between the training regime and the evaluation regime, the fewer ways there are for an AI's motivations to score well on both, but also score badly on our idealised preferences.
In another comment, you give a bunch of ways in which models might generalise successfully to longer horizons, and then argue that "many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons". I agree with this, but note that "aligned goals" are also closely related to the goals pursued over short time horizons. So it comes back to whether motivations will generalise in a way which prioritises the "obedience" aspect or the "produces high scores" aspect of the short-term goals.
I agree that the core question is about how generalization occurs. My two stories involve particular kinds of generalization, and I think there are also ways generalization could work that would lead to good behavior.
It is important to my intuition that not only can we never train for the "good" generalization, we can't even evaluate techniques to figure out which ones generalize "well" (since both of the bad generalizations would lead to behavior that looks good over long horizons).
If there is a disagreement it is probably that I have a much higher probability of the kind of generalization in story 1. I'm not sure if there's actually a big quantitative disagreement though rather than a communication problem.
I also think it's quite likely that the story in my post is unrealistic in a bunch of ways and I'm currently thinking more about what I think would actually happen.
Some more detailed responses that feel more in-the-weeds:
I might not understand this point. For example, suppose I'm training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has "robust motivations" then they would most likely be to predict accurately, but I'm not sure about why the model necessarily has robust motivations.
I feel similarly about goals like "plan to get high reward (defined as signals on channel X, you can learn how the channel works)." But even if prediction was a special case, if you learn a model then you can use it for planning/RL in simulation.
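The 1-day predictor example above can be made concrete with a toy model (purely illustrative; all names hypothetical): a slightly miscalibrated one-step predictor looks accurate over a single step, but when iterated to longer horizons its error compounds, which is why such predictors initially fail over long horizons before fine-tuning.

```python
def true_dynamics(x):
    # Toy environment: the real one-step ("one-day") transition.
    return 0.9 * x + 1.0

def one_step_model(x, a, b):
    # Learned linear predictor of a single step; (a, b) are its
    # fitted parameters, here assumed slightly miscalibrated.
    return a * x + b

def rollout(x, a, b, horizon):
    # Extend the one-step predictor to longer horizons by iterating it.
    for _ in range(horizon):
        x = one_step_model(x, a, b)
    return x

def long_horizon_error(a, b, horizon, x0=0.0):
    # Compare the iterated prediction against the true trajectory.
    target = x0
    for _ in range(horizon):
        target = true_dynamics(target)
    return abs(rollout(x0, a, b, horizon) - target)
```

With a = 0.9 and b = 1.01, the one-step error is 0.01, but the ten-step error is several times larger because the per-step errors accumulate — the sense in which long-horizon fine-tuning has something left to fix even when the short-horizon model is nearly right.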
It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process.
Definitely, I think it's critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure might look like).
That said, a major part of my view is that it's pretty likely that we get either arbitrary motivations or reward-maximization (or something in between), and it's not a big deal which since they both seem bad and seem averted in the same way.
I think the really key question is how likely it is that we get some kind of "intended" generalization like friendliness. I'm frequently on the opposite side of this disagreement, arguing that the probability that people will get some nice generalization if they really try is at least 25% or 50%, but I'm also happy being on the pessimistic side and saying that the probability we can get nice generalizations is at most 50% or 75%.
Two kinds of generalization is an old post on this question (though I wish it had used more tasteful examples).
Turning reflection up to 11 touches on the issue as well, though coming from a very different place than you.
I think there are a bunch of Arbital posts where Eliezer tries to articulate some of his opinions on this but I don't know pointers offhand. I think most of my sense is
I haven't written that much about why I think generalizations like "just be helpful" aren't that likely. I agree with the point that these issues are underexplored by people working on alignment, and even more underdiscussed, given how important they are.
There are some google doc comment threads with MIRI where I've written about why I think those are plausible (namely that it seems plausible-but-challenging for breeding of animals, and that seems like one of our best anchors overall, suggesting that plausible-but-challenging is a good anchor). I think in those cases the key argument was about whether you need this to generalize far, since both MIRI and I think it's a kind of implausible generalization to go out to infinity rather than becoming distorted at some point along the way, but I am more optimistic about making a series of "short hops" where models generalize helpfully to being moderately smarter and then they can carry out the next step of training for you.
I agree that this is probably the key point; my other comment ("I think this is the key point and it's glossed over...") feels very relevant to me.