Some thoughts on risks from narrow, non-agentic AI

paulfchristiano (5y)

If it’s via a deliberate plan to suppress them while also overcoming human objections to doing so, then that seems less like a narrow system “optimising for an easily measured objective” and more like an agentic and misaligned AGI

I didn't mean to make any distinction of this kind. I don't think I said anything about narrowness or agency. The systems I describe do seem to be optimizing for easily measurable objectives, but that seems mostly orthogonal to these other axes.

I'm pretty agnostic on whether AI will in fact be optimizing for the easily measured objectives used in training or for unrelated values that arise naturally in the learning process (or more likely some complicated mix), and part of my point is that it doesn't seem to matter much.

Secondly, let’s talk about existing pressures towards easily-measured goals. I read this as primarily referring to competitive political and economic activity - because competition is a key force pushing people towards tradeoffs which are undesirable in the long term.

I'm saying: it's easier to pursue easily-measured goals, and so successful organizations and individuals tend to do that and to outcompete those whose goals are harder to measure (and to get better at / focus on the parts of their goals that are easy to measure, etc.). I'm not positing any change in the strength of competition, I'm positing a change in the extent to which goals that are easier to measure are in fact easier to pursue.

Regarding the extent and nature of competition I do think I disagree with you fairly strongly but it doesn't seem like a central point.

the US, for example, doesn’t seem very concerned right now about falling behind China.

I think this is in fact quite high on the list of concerns for US policy-makers and especially the US defense establishment.

Further, I don’t see where the meta-level optimisation for easily measured objectives comes from.

Firms and governments and people pursue a whole mix of objectives, some of which are easily measured. The ones pursuing easily-measured objectives are more successful, and so control an increasing fraction of resources.
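As a rough illustration of the selection claim above, here is a minimal sketch in which two groups compound their resources at different rates. The growth rates and starting resources are hypothetical numbers chosen only for illustration; nothing in the discussion pins them down.

```python
def resource_shares(growth_rates, initial, steps):
    """Return each group's share of total resources after `steps` rounds of compounding."""
    resources = list(initial)
    for _ in range(steps):
        resources = [r * (1.0 + g) for r, g in zip(resources, growth_rates)]
    total = sum(resources)
    return [r / total for r in resources]


# Hypothetical numbers (assumptions, not from the discussion):
# group 0 pursues easily-measured objectives and grows 5% per round,
# group 1 pursues harder-to-measure objectives and grows 3% per round.
GROWTH = [0.05, 0.03]
START = [1.0, 9.0]  # the easily-measured group starts with 10% of resources

for t in (0, 50, 100, 200):
    shares = resource_shares(GROWTH, START, t)
    print(t, [round(s, 3) for s in shares])
```

Under these assumptions the faster-growing group's share rises steadily even though the absolute resources of both groups grow, which is the sense of "control an increasing fraction of resources" here. Whether easily-measured objectives really confer a growth advantage is the point under dispute.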

So if narrow AI becomes very powerful, we should expect it to improve humanity’s ability to steer our trajectory in many ways.

I don't disagree with this at all. The point is that right now human future-steering is basically the only game in town. We are going to introduce inhuman reasoning that can also steer the future, and over time human reasoning will lose out in relative terms. (If you classify all future-steering machinery as "agentic" then evidently I'm talking about agents and I agree with the informal claim that "non-agentic" reasoning isn't concerning.) That's compatible with us benefiting enormously, if all of those benefits also accrue to automated reasoners---as your examples seem to. We will try to ensure that all this new reasoning will benefit humanity, but I describe two reasons that might be difficult and say a little bit about how that difficulty might materialize.

I don't really know if or how this is distinct from what you call the second species argument. It feels like you are objecting to a distinction I'm not intending to make.

Richard_Ngo (5y)

In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?

If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", because "systems that have a detailed understanding of the world" don't need to operate by trial and error; they understand what they're doing.

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops. (Analogously: when you put humans in a new domain, and give them tasks and feedback via verbal instructions, then we can quickly learn sophisticated concepts in that new domain, and optimise for those, not just the easily-measured concepts in that new domain.)

I'm pretty agnostic on whether AI will in fact be optimizing for the easily measured objectives used in training or for unrelated values that arise naturally in the learning process (or more likely some complicated mix), and part of my point is that it doesn't seem to matter much.

Why is your scenario called "You get what you measure" if you're agnostic about whether we actually get what we measure, even on the level of individual AIs?

Or do you mean part 1 to be the case where we do get what we measure, and part 2 to be the case where we don't?

I'm saying: it's easier to pursue easily-measured goals, and so successful organizations and individuals tend to do that and to outcompete those whose goals are harder to measure (and to get better at / focus on the parts of their goals that are easy to measure, etc.). I'm not positing any change in the strength of competition, I'm positing a change in the extent to which goals that are easier to measure are in fact easier to pursue.

Firstly, I think this is only true for organisations whose success is determined by people paying attention to easily-measured metrics, and not by reality. For example, an organisation which optimises for its employees having beliefs which are correct in easily-measured ways will lose out to organisations where employees think in useful ways. An organisation which optimises for revenue growth is more likely to go bankrupt than an organisation which optimises for sustainable revenue growth. An organisation which optimises for short-term customer retention loses long-term customer retention. Etc.

The case in which this is more worrying is when an organisation's success is determined by (for example) whether politicians like it, and politicians only pay attention to easily-measurable metrics. In this case, organisations which pursue easily-measured goals will be more successful than ones which pursue the goals the politicians actually want to achieve. This is why I make the argument that actually the pressure on politicians to pursue easily-measurable metrics is pretty weak (hence why they're ignoring most economists' recommendations on how to increase GDP).

I don't disagree with [AI improving our ability to steer our future] at all. The point is that right now human future-steering is basically the only game in town. We are going to introduce inhuman reasoning that can also steer the future, and over time human reasoning will lose out in relative terms. That's compatible with us benefiting enormously, if all of those benefits also accrue to automated reasoners---as your examples seem to. We will try to ensure that all this new reasoning will benefit humanity, but I describe two reasons that might be difficult.

I agree that you've described some potential harms; but in order to make this a plausible long-term concern, you need to give reasons to think that the harms outweigh the benefits of AI enhancing (the effective capabilities of) human reasoning. If you'd written a comparable post a few centuries ago talking about how human physical power will lose out to inhuman physical power, I would have had the same complaint.

(If you classify all future-steering machinery as "agentic" then evidently I'm talking about agents and I agree with the informal claim that "non-agentic" reasoning isn't concerning.)

I classify Facebook's newsfeed as future-steering in a weak sense (it steers the future towards political polarisation), but non-agentic. Do you agree with this? If so, do you agree that if FB-like newsfeeds became prominent in many ways that would not be very concerning from a longtermist perspective?

paulfchristiano (5y)

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.

I think this is the key point and it's glossed over in my original post, so it seems worth digging in a bit more.

I think there are many plausible models that generalize successfully to longer horizons, e.g. from 100 days to 10,000 days:

Acquire money and other forms of flexible influence, and then tomorrow switch to using a 99-day (or 9999-day) horizon policy.
Have a short-term predictor, and apply it over more and more steps to predict longer horizons (if your predictor generalizes then there are tons of approaches to acting that would generalize).
Deductively reason about what actions are good over 100 days (vs 10,000 days), since deduction appears to generalize well from a big messy set of facts to new very different facts.
If I've learned to abstract seconds into minutes, minutes into hours, hours into days, days into weeks, and then plan over weeks, it's pretty plausible that the same procedure can abstract weeks into months and months into years. (It's kind of like I'm now working on a log scale and asking the model to generalize from 1, 2, ..., 10 to 11, 12, 13.)
Most possible ways of reasoning are hard to write down in a really simple list, but I expect that many hard-to-describe models also generalize. If some generalize and some do not, then training my model over longer and longer horizons (3 seconds, 30 seconds, 5 minutes...) will gradually knock out the non-generalizing modes of reasoning and leave me with the modes that do generalize to longer horizons.
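The "short-term predictor applied over more and more steps" item in the list above can be sketched as follows. This is only a toy: `one_day_predictor` is a hypothetical stand-in for a learned one-step model, and the 0.1% daily growth figure is an assumption made up for the example.

```python
def rollout(step_predictor, state, n_steps):
    """Predict n_steps ahead by applying a one-step predictor repeatedly."""
    for _ in range(n_steps):
        state = step_predictor(state)
    return state


def one_day_predictor(wealth):
    # Hypothetical learned one-day model (assumption): wealth grows 0.1% per day.
    return wealth * 1.001


# The same one-step model is reused for both horizons; only the step count changes.
forecast_100_days = rollout(one_day_predictor, 100.0, 100)
forecast_10000_days = rollout(one_day_predictor, 100.0, 10_000)
print(forecast_100_days, forecast_10000_days)
```

Nothing in the rollout itself depends on the horizon, so the long-horizon forecast is only as good as the one-step predictor's generalization, which is the empirical question at issue.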

This is roughly why I'm afraid that models we train will ultimately be able to plan over longer horizons than those that appear in training.

But many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons (and in particular the first 4 above seem like they'd all be undesirable if generalizing from easily-measured goals, and would lead to the kinds of failures I describe in part I of WFLL).

I think one reason that my posts about this are confusing is that I often insist that we don't rely on generalization because I don't expect it to work reliably in the way we hope. But that's about what assumptions we want to make when designing our algorithms---I still think that the "generalizes in the natural way" model is important for getting a sense of what AI systems are going to do, even if I think there is a good chance that it's not a good enough approximation to make the systems do exactly what we want. (And of course I think if you are relying on generalization in this way you have very little ability to avoid the out-with-a-bang failure mode, so I have further reasons to be unhappy about relying on generalization.)

paulfchristiano (5y)

In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?

Yes.

If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", because "systems that have a detailed understanding of the world" don't need to operate by trial and error; they understand what they're doing.

I agree that it's only us who are operating by trial and error---the system understands what it's doing. I don't think that undermines my argument. The point is that we pick the system, and so determine what it's doing, by trial and error, because we have no understanding of what it's doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn't a plausible way to do that.

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.

To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?

I consider those pretty wide open empirical questions. The view that we can get good generalization of this kind is fairly common within ML.

I do agree that once you generalize motivations from easily measurable tasks with short feedback loops to tasks with long feedback loops, you may also be able to get "good" generalizations, and this is a way that you can solve the alignment problem. It seems to me that there are lots of plausible ways to generalize to longer horizons without also generalizing to "better" answers (according to humans' idealized reasoning).

(Another salient way in which you get long horizons is by doing something like TD learning, i.e. train a model that predicts its own judgment in 1 minute. I don't know if it's important to get into the details of all the ways people can try to get things to generalize over longer time horizons; it seems like there are many candidates. I agree that there are analogous candidates for getting models to optimize the things we want even if we can't measure them easily, and as I've said I think it's most likely those techniques will be successful, but this is a post about what happens if we fail, and I think it's completely unclear that "we can generalize to longer horizons" implies "we can generalize from the measurable to the unmeasurable.")
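For readers unfamiliar with the TD learning mentioned above, here is a minimal tabular TD(0) sketch on a toy chain. The environment, discount factor, and learning rate are all assumptions chosen for illustration; the point is only that each state's estimate is trained toward the model's own estimate one step later, so reward information propagates back over a horizon longer than any single update looks at.

```python
def td0_chain(n_states=5, gamma=0.9, alpha=0.1, episodes=2000):
    """Tabular TD(0) on a deterministic chain: state s always moves to s + 1.

    Leaving the last state yields reward 1 and ends the episode; every other
    transition yields reward 0. (Toy environment, assumed for illustration.)
    """
    # values[n_states] is the terminal state and stays at 0.
    values = [0.0] * (n_states + 1)
    for _ in range(episodes):
        for s in range(n_states):
            next_s = s + 1
            reward = 1.0 if next_s == n_states else 0.0
            # Bootstrapped target: immediate reward plus the model's own
            # discounted estimate of the next state.
            target = reward + gamma * values[next_s]
            values[s] += alpha * (target - values[s])
    return values[:n_states]


learned = td0_chain()
print([round(v, 4) for v in learned])
```

In this toy case the estimate for state `s` should approach `gamma ** (n_states - 1 - s)`, even though no update ever compares a prediction against an outcome more than one step away.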

(Analogously: when you put humans in a new domain, and give them tasks and feedback via verbal instructions, then we can quickly learn sophisticated concepts in that new domain, and optimise for those, not just the easily-measured concepts in that new domain.)

When we deploy humans in the real world they do seem to have many desires resembling various plausible generalizations of evolutionary fitness (e.g. to intrinsically want kids even in unfamiliar situations, to care about very long-term legacies, etc.). I totally agree that humans also want a bunch of kind of random spandrels. This is related to the basic uncertainty discussed in the previous paragraphs. I think the situation with ML may well differ because, if we wanted to, we can use training procedures that are much more likely to generalize than evolution.

I don't think it's relevant to my argument that humans can learn sophisticated concepts in a new domain, the question is about the motivations of humans.

Why is your scenario called "You get what you measure" if you're agnostic about whether we actually get what we measure, even on the level of individual AIs?
Or do you mean part 1 to be the case where we do get what we measure, and part 2 to be the case where we don't?

Yes, I'm saying that part 1 is where you are able to get what you measure and part 2 is where you aren't.

Also, as I say, I expect the real world to be some complicated mish-mash of these kinds of failures (and for real motivations to be potentially influenced both by natural generalizations of what happens at training time and also by randomness / architecture / etc., as seems to be the case with humans).

The case in which this is more worrying is when an organisation's success is determined by (for example) whether politicians like it, and politicians only pay attention to easily-measurable metrics. In this case, organisations which pursue easily-measured goals will be more successful than ones which pursue the goals the politicians actually want to achieve. This is why I make the argument that actually the pressure on politicians to pursue easily-measurable metrics is pretty weak (hence why they're ignoring most economists' recommendations on how to increase GDP).

Wanting to earn more money or grow users or survive over the long term is also an easily measured goal, and in practice firms crucially exploit the fact that these goals are contiguous with their shorter easily measured proxies. Non-profits that act in the world often have bottom-line metrics that they use to guide their action and seem better at optimizing goals that can be captured by such metrics (or metrics like donor acquisition).

The mechanism by which you are better at pursuing easily-measurable goals is primarily via internal coherence / stability.

I agree that you've described some potential harms; but in order to make this a plausible long-term concern, you need to give reasons to think that the harms outweigh the benefits of AI enhancing (the effective capabilities of) human reasoning.

I've said that previously human world-steering is the only game in town but soon it won't be, so the future is more likely to be steered in ways that a human wouldn't steer it, and that in turn is more likely to be a direction humans don't like. This doesn't speak to whether the harms on balance outweigh the benefits, which would require an analysis of the benefits but is also pretty irrelevant to my claim (especially so given that all of the world-steerers enjoy these benefits and we are primarily concerned with relative influence over the very long term). I'm trying to talk about how the future could get steered in a direction that we don't like if AI development goes in a bad direction; I'm not trying to argue something like "Shorter AI timelines are worse" (which I also think is probably true but about which I'm more ambivalent).

If you'd written a comparable post a few centuries ago talking about how human physical power will lose out to inhuman physical power, I would have had the same complaint.

I don't see a plausible way that humans can use physical power for some long-term goals and not other long-term goals, whereas I've suggested two ways in which automated reasoning may be more easily applied to certain long-term goals (namely the goals that are natural generalizations of training objectives, or goals that are most easily discovered in neural networks).

I classify Facebook's newsfeed as future-steering in a weak sense (it steers the future towards political polarisation), but non-agentic. Do you agree with this?

If Facebook's news feed generated actions chosen to have the long-term consequence of increasing political polarization then I'd say it was steering the future towards political polarization. (And I assume you'd say it was an agent.)

As is, I don't think Facebook's newsfeed steers the future towards political polarization in a meaningful sense (it's roughly the same as a toaster steering the world towards more toast).

Maybe that's quantitatively just the same kind of thing but weak, since after all everything is about generalization anyway. In that case the concern seems like it's about world-steering that scales up as our technology/models improve (such that they will eventually become competitive with human world-steering), whereas the news feed doesn't scale up since it's just relying on some random association about how short-term events X happen to lead to polarization (and nor will a toaster if you make it better and better at toasting). I don't really have views on this kind of definitional question, and my post isn't really relying on any of these distinctions.

Something like A/B testing is much closer to future-steering, since scaling it up in the obvious way (and scaling to effects across more users and longer horizons rather than independently randomizing) would in fact steer the future towards whatever selection criteria you were using. But I agree with your point that such systems can only steer the very long-term future once there is some kind of generalization.
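To make the A/B testing point concrete, here is a minimal sketch of repeated A/B selection as a search process. The `engagement` function is a hypothetical selection criterion invented for the example, and the variant-generation rule (a small random perturbation) is likewise an assumption.

```python
import random


def ab_test_loop(metric, start, rounds, step=0.5, seed=0):
    """Repeatedly propose a random variant and keep whichever scores higher on `metric`."""
    rng = random.Random(seed)
    current = start
    for _ in range(rounds):
        variant = current + rng.uniform(-step, step)
        # The "A/B test": keep the variant only if it beats the incumbent.
        if metric(variant) > metric(current):
            current = variant
    return current


def engagement(x):
    # Hypothetical selection criterion (assumption), peaking at x = 3.
    return -(x - 3.0) ** 2


final = ab_test_loop(engagement, start=0.0, rounds=500)
print(final, engagement(final))
```

No single comparison looks ahead, but the loop as a whole moves the system toward whatever the selection criterion rewards. In this toy the criterion is measured instantly; steering long-run outcomes would additionally require the criterion itself to track long-horizon effects, which is where generalization comes in.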

Rohin Shah (5y)

I've suggested two ways in which automated reasoning may be more easily applied to certain long-term goals (namely the goals that are natural generalizations of training objectives, or goals that are most easily discovered in neural networks).

This makes sense to me, and seems to map somewhat onto Parts 1 and 2 of WFLL.

However, you also call those parts "going out with a whimper" and "going out with a bang", which seem to be claims about the impacts of bad generalizations. In that post, are you intending to make claims about possible kinds of bad generalizations that ML models could make, or possible ways that poorly-generalizing ML models could lead to catastrophe (or both)?

Personally, I'm pretty on board with the two types of bad generalizations as plausible things that could happen, but less on board with "going out with a whimper" as leading to catastrophe. It seems like you at least need to explain why in that situation we can't continue to work on the alignment problem and replace the agents with better-aligned AI systems in the future. (Possibly the answer is "the AI systems don't allow us to do this because it would interfere with their continued operation".)

paulfchristiano (5y)

There are a bunch of things that differ between part I and part II, I believe they are correlated with each other but not at all perfectly. In the post I'm intending to illustrate what I believe some plausible failures look like, in a way intended to capture a bunch of the probability space. I'm illustrating these kinds of bad generalizations and ways in which the resulting failures could be catastrophic. I don't really know what "making the claim" means, but I would say that any ways in which the story isn't realistic are interesting to me (and we've already discussed many, and my views have---unsurprisingly!---changed considerably in the details over the last 2 years), whether they are about the generalizations or the impacts.

I do think that the "going out with a whimper" scenario may ultimately transition into something abrupt, unless people don't have their act together enough to even put up a fight (which I do think is fairly likely conditioned on catastrophe, and may be the most likely failure mode).

It seems like you at least need to explain why in that situation we can't continue to work on the alignment problem and replace the agents with better-aligned AI systems in the future

We can continue to work on the alignment problem and continue to fail to solve it, e.g. because the problem is very challenging or impossible or because we don't end up putting in a giant excellent effort (e.g. if we spent a billion dollars a year on alignment right now it seems plausible it would be a catastrophic mess of people working on irrelevant stuff, generating lots of noise while we continue to make important progress at a very slow rate).

The most important reason this is possible is that change is accelerating radically, e.g. I believe that it's quite plausible we will not have massive investment in these problems until we are 5-10 years away from a singularity and so just don't have much time.

If you are saying "Well why not wait until after the singularity?" then yes, I do think that eventually it doesn't look like this. But that can just look like people failing to get their act together, and then eventually when they try to replace deployed AI systems they fail. Depending on how generalization works that may look like a failure (as in scenario 2) or everything may just look dandy from the human perspective because they are now permanently unable to effectively perceive or act in the real world (especially off of earth). I basically think that all bets are off if humans just try to sit tight while an incomprehensible AI world-outside-the-gates goes through a growth explosion.

I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. And moreover, the only reason it looks that way to me is because I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in growth is quite likely), and the story will justifiably sound very different to someone who isn't coming in with those assumptions.

Rohin Shah (5y)

I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. And moreover, the only reason it looks that way to me is because I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in growth is quite likely), and the story will justifiably sound very different to someone who isn't coming in with those assumptions.

Fwiw I think I didn't realize you weren't making claims about what post-singularity looked like, and that was part of my confusion about this post. Interpreting it as "what's happening until the singularity" makes more sense. (And I think I'm mostly fine with the claim that it isn't that important to think about what happens after the singularity.)

Richard_Ngo (5y)

To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?

In most of the cases you've discussed, trying to do tasks over much longer time horizons involves doing a very different task. Reducing reported crime over 10 minutes and reducing reported crime over 100 minutes have very little to do with reducing reported crime over a year or 10 years. The same is true for increasing my wealth, or increasing my knowledge (which over 10 minutes involves telling me things, but over a year might involve doing novel scientific research). I tend to be pretty optimistic about AI motivations generalising, but this type of generalisation seems far too underspecified. "Making predictions" is perhaps an exception, insofar as it's a very natural concept, and also one which transfers very straightforwardly from simulations to reality. But it probably depends a lot on what type of predictions we're talking about.

On meta-learning: it doesn't seem realistic to think about an AI "trying to get high rewards" on tasks where the time horizon is measured in months or years. Instead it'll try to achieve some generalisation of the goals it learned during training. But as I already argued, we're not going to be able to train on single tasks which are similar enough to real-world long-term tasks that motivations will transfer directly in any recognisable way.

Insofar as ML researchers think about this, I think their most common position is something like "we'll train an AI to follow a wide range of instructions, and then it'll generalise to following new instructions over longer time horizons". This makes a lot of sense to me, because I expect we'll be able to provide enough datapoints (mainly simulated datapoints, plus language pre-training) to pin down the concept "follow instructions" reasonably well, whereas I don't expect we can provide enough datapoints to pin down a motivation like "reduce reports of crime". (Note that I also think that we'll be able to provide enough datapoints to incentivise influence-seeking behaviour, so this isn't a general argument against AI risk, but rather an argument against the particular type of task-specific generalisation you describe.)

In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn't have a strong bias towards easily-measurable goals.

I agree that it's only us who are operating by trial and error---the system understands what it's doing. I don't think that undermines my argument. The point is that we pick the system, and so determine what it's doing, by trial and error, because we have no understanding of what it's doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn't a plausible way to do that.

I think that throughout your post there's an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we've deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap. So if we train an AI on type one measurements, we'll usually be able to use type two measurements to evaluate whether it's doing a good job post-deployment. And that AI won't game those type two measurements even if it generalises its training signal to much longer time horizons, because it will never have been trained on type two measurements.

These seem like the key disagreements, so I'll leave off here, to prevent the thread from branching too much. (Edited one out because I decided it was less important).

paulfchristiano (5y)

I feel like a very natural version of "follow instructions" is "Do things that the instruction-giver would rate highly." (Which is the generalization I'm talking about.) I don't think any of the arguments about "long horizon versions of tasks are different from short versions" tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).

Other versions like "Follow instructions (without regards to what the training process cares about)" seem quite likely to perform significantly worse on the training set. It's also not clear to me that "follow the spirit of the instructions" is better-specified than "do things the instruction-giver would rate highly if we asked them"---informally I would say the latter is better-specified, and it seems like the argument here is resting crucially on some other sense of well-specification.

On meta-learning: it doesn't seem realistic to think about an AI "trying to get high rewards" on tasks where the time horizon is measured in months or years.

I've trained in simulation on tasks where I face a wide variety of environments, each with a reward signal, and I am taught to learn the dynamics of the environment and the reward and then take actions that lead to a lot of reward. In simulation my tasks can have reasonably long time horizons (as measured by how long I think), though that depends on open questions about scaling behavior. I don't agree with the claim that it's unrealistic to imagine such models generalizing to reality by wanting something-like-reward.

In most of the cases you've discussed, trying to do tasks over much longer time horizons involves doing a very different task [...]

Trying to maximize wealth over 100 minutes is indeed very different from maximizing wealth over 1 year, and is also almost completely useless for basically the same reason (except in domains like day trading where mark to market acts as a strong value function).

My take is that people will be pushed to optimizing over longer horizons because these qualitatively different tasks over short horizons aren't useful. The useful tasks in fact do involve preparing for the future and acquiring flexible influence, and so time horizons long enough to be useful will also be long enough to be relevantly similar to yet longer horizons.

Developers will be incentivized to find any way to get good behavior over long horizons, and it seems like we have many candidates that I regard as plausible and which all seem reasonably likely to lead to the kind of behavior I discuss. To me it feels like you are quite opinionated about how that generalization will work.

It seems like your take is "consequences over long enough horizons to be useful will be way too expensive to use for training," which seems close to 50/50 to me.

I think that throughout your post there's an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we've deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap.

I agree that this is a useful distinction and there will be some gap. I think that quantitatively I expect the gap to be much smaller than you do (e.g. getting 10k historical examples of 1-year plans seems quite realistic), and I expect people to work to design training procedures that get good performance on type two measures (roughly by definition), and I guess I'm significantly more agnostic about the likelihood of generalization from the longest type one measures to type two measures.

In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn't have a strong bias towards easily-measurable goals.

I'm imagining systems generalizing much more narrowly to the evaluation process used during training. This is still underspecified in some sense (are you trying to optimize the data that goes into SGD, or the data that goes into the dataset, or the data that goes into the sensors?) and in the limit that basically leads to influence-maximization and continuously fades into scenario 2. It's also true that e.g. I may be able to confirm at test-time that there is no training process holding me accountable, and for some of these generalizations that would lead to a kind of existential crisis (where I've never encountered anything like this during training and it's no longer clear what I'm even aiming at). It doesn't feel like these are the kinds of underspecification you are referring to.

[-]jon_crescent5y40

The type 1 vs. type 2 feedback distinction here seems really central. I'm interested if this seems like a fair characterization to both of you.

Type 1: Feedback which we use for training (via gradient descent)
Type 2: Feedback which we use to decide whether to deploy trained agent.

(There's a bit of gray between Type 1 and 2, since choosing whether to deploy is another form of selection, but I'm assuming we're okay stating that gradient descent and model selection operate in qualitatively distinct regimes.)

The key disagreement is whether we expect type 1 feedback will be closer to type 2 feedback, or whether type 2 feedback will be closer to our true goals. If the former, our agents generalizing from type 1 to type 2 is relatively uninformative, and we still have Goodhart. In the latter case, the agent is only very weakly optimizing the type 2 feedback, and so we don't need to worry much about Goodhart, and should expect type 2 feedback to continue to track our true goals well.

Main argument for type 1 ~ type 2: by definition, we design type 1 feedback (+associated learning algorithm) so that resulting agents perform well under type 2
Main argument for type 1 !~ type 2: type 2 feedback can be something like 1000-10000x more expensive, since we only have to evaluate it once, rather than enough times to be useful for gradient descent

I'd also be interested to discuss this disagreement in particular, since I could definitely go either way on it. (I plan to think about it more myself.)

[-]paulfchristiano5y20

I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like really all you need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and that the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.

Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month] then it's not actually very far to get to [1 month, 2 years]. It seems like we've already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.
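
As a rough check on the orders-of-magnitude arithmetic above, here is a minimal sketch. The 30-day month and 365-day year are assumptions for illustration, not figures from the comment, and `orders_of_magnitude` is a hypothetical helper name.

```python
import math

# Assumed conversions (not stated in the comment): 30-day month, 365-day year.
SECOND = 1.0
MONTH = 30 * 24 * 3600 * SECOND
YEAR = 365 * 24 * 3600 * SECOND


def orders_of_magnitude(short: float, long: float) -> float:
    """Base-10 log of the ratio between two time horizons."""
    return math.log10(long / short)


# Horizon range already covered by training: 1 second up to 1 month.
covered = orders_of_magnitude(SECOND, MONTH)
# Further range the model would need to generalize across: 1 month up to 2 years.
remaining = orders_of_magnitude(MONTH, 2 * YEAR)

print(f"1 second -> 1 month: {covered:.1f} orders of magnitude")
print(f"1 month -> 2 years:  {remaining:.1f} orders of magnitude")
```

Under these assumptions the first gap comes out at a bit over 6 orders of magnitude and the second at a bit over 1, which is consistent with the rough figures in the comment.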

At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand "what we really want" is a completely different thing (that we basically can't even define cleanly). So prima facie it feels to me like if models generalize "well" then we can get them to generalize from type 1 to type 2, whereas no such thing is true for "what we really care about."

[-]Richard_Ngo5y20

A couple of clarifications:

Type 2: Feedback which we use to decide whether to deploy trained agent.

Let's also include feedback which we can use to decide whether to stop deploying an agent; the central example in my head is an agent which has been deployed for some time before we discover that it's doing bad things.

Relatedly, another argument for type 1 !~ type 2 which seems important to me: type 2 feedback can look at long time horizons, which I expect to be very useful. (Maybe you included this in the cost estimate, but idk how to translate between longer times and higher cost directly.)

"by definition, we design type 1 feedback (+associated learning algorithm) so that resulting agents perform well under type 2"

This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals. But if that's the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.

In practice, I expect that misaligned agents which perform well on type 2 feedback will do so primarily by deception, for instrumental purposes. But it's hard to picture agents which carry out this type of deception, but which don't also decide to take over the world directly.

[-]paulfchristiano5y30

This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals.

But type 2 feedback is (by definition) our best attempt to estimate how well the model is doing what we really care about. So in practice any results-based selection for "does what we care about" goes via selecting based on type 2 feedback. The difference only comes up when we reason mechanically about the behavior of our agents and how they are likely to generalize, but it's not clear that's an important part of the default plan (whereas I think we will clearly extensively leverage "try several strategies and see what works").

But if that's the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.

"Do things that look to a human like you are achieving X" is closely related to X, but that doesn't mean that learning to do the one implies that you will learn to do the other.

Maybe it’s helpful to imagine the world where type 1 feedback is “human evals after 1 week horizon”, type 2 feedback is “human evals after 1 year horizon,” and “what we really care about” is “human evals after a 100 year horizon.” I think that’s much better than the actual situation, but even in that case I’d have a significant probability on getting systems that work on the 1 year horizon without working indefinitely (especially if we do selection for working on 2 years + are able to use a small amount of 2 year data). Do you feel pretty confident that something that generalizes from 1 week to 1 year will go indefinitely, or is your intuition predicated on something about the nature of “be helpful” and how that’s a natural motivation for a mind? (Or maybe that we will be able to identify some other similar “natural” motivation and design our training process to be aligned with that?) In the former case, it seems like we can have an empirical discussion about how generalization tends to work. In the latter case, it seems like we need to be getting into more details about why “be helpful” is a particularly natural motivation (or else why we should be able to pick out something else like that). In the other cases I think I haven't fully internalized your view.

[-]Richard_Ngo5y10

I agree with the two questions you've identified as the core issues, although I'd slightly rephrase the former. It's hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I'd rephrase the first option you mention as "feeling pretty confident that something that generalises from 1 week to 1 year won't become misaligned enough to cause disasters". This point seems more important than the second point (the nature of “be helpful” and how that’s a natural motivation for a mind), but I'll discuss both.

I think the main disagreement about the former is over the relative strength of "results-based selection" versus "intentional design". When I said above that "we design type 1 feedback so that resulting agents perform well on our true goals", I was primarily talking about "design" as us reasoning about our agents, and the training process they undergo, not the process of running them for a long time and picking the ones that do best. The latter is a very weak force! Almost all of the optimisation done by humans comes from intentional design plus rapid trial and error (on the timeframe of days or weeks). Very little of the optimisation comes from long-term trial and error (on the timeframe of a year) - by necessity, because it's just so slow.

So, conditional on our agents generalising from "one week" to "one year", we should expect that it's because we somehow designed a training procedure that produces scalable alignment (or at least scalable non-misalignment), or because they're deceptively aligned (as in your influence-seeking agents scenario), but not because long-term trial and error was responsible for steering us towards getting what we can measure.

Then there's the second question, of whether "do things that look to a human like you're achieving X" is a plausible generalisation. My intuitions on this question are very fuzzy, so I wouldn't be surprised if they're wrong. But, tentatively, here's one argument. Consider a policy which receives instructions from a human, talks to the human to clarify the concepts involved, then gets rewarded and updated based on how well it carries out those instructions. From the policy's perspective, the thing it interacts with, and which its actions are based on, is human instructions. Indeed, for most of the training process the policy plausibly won't even have the concept of "reward" (in the same way that humans didn't evolve a concept of fitness). But it will have this concept of human intentions, which is a very good proxy for reward. And so it seems much more natural for the policy's goals to be formulated in terms of human intentions and desires, which are the observable quantities that it responds to; rather than human feedback, which is the unobservable quantity that it is optimised with respect to. (Rewards can be passed as observations to the policy, but I claim that it's both safer and more useful if rewards are unobservable by the policy during training.)

This argument is weakened by the fact that, when there's a conflict between them (e.g. in cases where it's possible to fool the humans), agents aiming to "look like you're doing X" will receive more reward. But during most of training the agent won't be very good at fooling humans, and so I am optimistic that its core motivations will still be more like "do what the human says" than "look like you're doing what the human says".

[-]Richard_Ngo5y30

Cool, thanks for the clarifications. To be clear, overall I'm much more sympathetic to the argument as I currently understand it, than when I originally thought you were trying to draw a distinction between "new forms of reasoning honed by trial-and-error" in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and "systems that have a detailed understanding of the world" in part 2.

Let me try to sum up the disagreement. The key questions are:

1. What training data will we realistically be able to train our agents on?
2. What types of generalisation should we expect from that training data?
3. How well will we be able to tell that these agents are doing the wrong thing?

On 1: you think long-horizon real-world data will play a significant role in training, because we'll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won't be able to find rewards that are given over long time horizons). And I don't think that this training will be extensive enough to reshape those motivations to a large degree (whereas I recall that in an earlier discussion on amplification, you argued that small amounts of training could potentially reshape motivations significantly). Our disagreement about question 1 affects questions 2 and 3, but it affects question 2 less than I previously thought, as I'll discuss.

On 2: previously I thought you were arguing that we should expect very task-specific generalisations like being trained on "reduce crime" and learning "reduce reported crime", which I was calling underspecified. However, based on your last comment it seems that you're actually mainly talking about broader generalisations, like being trained on "follow instructions" and learning "do things that the instruction-giver would rate highly". This seems more plausible, because it's a generalisation that you can learn in many different types of training; and so our disagreement on 1 becomes less consequential.

I don't have a strong opinion on the likelihood of this type of generalisation. I guess your argument is that, because we're doing a lot of trial and error, we'll keep iterating until we either get something aligned with our instructions, or something which optimises for high ratings directly. But it seems to me by default, during early training periods the AI won't have much information about either the overseer's knowledge (or the overseer's existence), and may not even have the concept of rewards, making alignment with instructions much more natural. Above, you disagree; in either case my concern is that this underlying concept of "natural generalisation" is doing a lot of work, despite not having been explored in your original post (or anywhere else, to my knowledge). We could go back and forth about where the burden of proof is, but it seems more important to develop a better characterisation of natural generalisation; I might try to do this in a separate post.

On 3: it seems to me that the resources which we'll put into evaluating a single deployment are several orders of magnitude higher than the resources we'll put into evaluating each training data point - e.g. we'll likely have whole academic disciplines containing thousands of people working full-time for many years on analysing the effects of the most powerful AIs' behaviour.

You say that you expect people to work to design training procedures that get good performance on type two measures. I agree with this - but if you design an AI that gets good performance on type 2 measurements despite never being trained on them, then that rules out the most straightforward versions of the "do things that the instruction-giver would rate highly" motivation. And since the trial and error to find strategies which fool type 2 measurements will be carried out over years, the direct optimisation for fooling type 2 measurements will be weak.

I guess the earlier disagreement about question 1 is also relevant here. If you're an AI trained primarily on data and feedback which are very different from real-world long-term evaluations, then there are very few motivations which lead you to do well on real-world long-term evaluations. "Follow instructions" is one of them; some version of "do things that the instruction-giver would rate highly" is another, but it would need to be quite a specific version. In other words, the greater the disparity between the training regime and the evaluation regime, the fewer ways there are for an AI's motivations to score well on both, but also score badly on our idealised preferences.

In another comment, you give a bunch of ways in which models might generalise successfully to longer horizons, and then argue that "many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons". I agree with this, but note that "aligned goals" are also closely related to the goals pursued over short time horizons. So it comes back to whether motivations will generalise in a way which prioritises the "obedience" aspect or the "produces high scores" aspect of the short-term goals.

[-]paulfchristiano5y30

I agree that the core question is about how generalization occurs. My two stories involve kinds of generalization, and I think there are also ways generalization could work that could lead to good behavior.

It is important to my intuition that not only can we never train for the "good" generalization, we can't even evaluate techniques to figure out which generalize "well" (since both of the bad generalizations would lead to behavior that looks good over long horizons).

If there is a disagreement it is probably that I have a much higher probability of the kind of generalization in story 1. I'm not sure if there's actually a big quantitative disagreement though rather than a communication problem.

I also think it's quite likely that the story in my post is unrealistic in a bunch of ways and I'm currently thinking more about what I think would actually happen.

Some more detailed responses that feel more in-the-weeds:

you think long-horizon real-world data will play a significant role in training, because we'll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won't be able to find rewards that are given over long time horizons).

I might not understand this point. For example, suppose I'm training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has "robust motivations" then they would most likely be to predict accurately, but I'm not sure about why the model necessarily has robust motivations.

I feel similarly about goals like "plan to get high reward (defined as signals on channel X, you can learn how the channel works)." But even if prediction was a special case, if you learn a model then you can use it for planning/RL in simulation.

But it seems to me by default, during early training periods the AI won't have much information about either the overseer's knowledge (or the overseer's existence), and may not even have the concept of rewards, making alignment with instructions much more natural.

It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process.

my concern is that this underlying concept of "natural generalisation" is doing a lot of work, despite not having been explored in your original post

Definitely, I think it's critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure might look like).

That said, a major part of my view is that it's pretty likely that we get either arbitrary motivations or reward-maximization (or something in between), and it's not a big deal which since they both seem bad and seem averted in the same way.

I think the really key question is how likely it is that we get some kind of "intended" generalization like friendliness. I'm frequently on the opposite side of this disagreement, arguing that the probability that people will get some nice generalization if they really try is at least 25% or 50%, but I'm also happy being on the pessimistic side and saying that the probability we can get nice generalizations is at most 50% or 75%.

(or anywhere else, to my knowledge)

Two kinds of generalization is an old post on this question (though I wish it had used more tasteful examples).

Turning reflection up to 11 touches on the issue as well, though coming from a very different place than you.

I think there are a bunch of Arbital posts where Eliezer tries to articulate some of his opinions on this but I don't know pointers offhand. I think most of my sense is

I haven't written that much about why I think generalizations like "just be helpful" aren't that likely. I agree with the point that these issues are underexplored by people working on alignment, and even more underdiscussed, given how important they are.

There are some google doc comment threads with MIRI where I've written about why I think those are plausible (namely that it is plausible-but-challenging for breeding of animals, and that seems like one of our best anchors overall, suggesting that plausible-but-challenging is a good anchor). I think in those cases the key argument was about whether you need this to generalize far, since both me and MIRI think it's a kind of implausible generalization to go out to infinity rather than becoming distorted at some point along the way, but I am more optimistic about making a series of "short hops" where models generalize helpfully to being moderately smarter and then they can carry out the next step of training for you.

[-]paulfchristiano5y40

I agree that this is probably the key point; my other comment ("I think this is the key point and it's glossed over...") feels very relevant to me.
