I agree that the core question is about how generalization occurs. My two stories involve kinds of generalization, and I think there are also ways generalization could work that could lead to good behavior.
It is important to my intuition that not only can we never train for the "good" generalization, we can't even evaluate techniques to figure out which generalization "well" (since both of the bad generalizations would lead to behavior that looks good over long horizons).
If there is a disagreement it is probably that I have a much higher probability of the kind of generalization in story 1. I'm not sure if there's actually a big quantitative disagreement though rather than a communication problem.
I also think it's quite likely that the story in my post is unrealistic in a bunch of ways and I'm currently thinking more about what I think would actually happen.
Some more detailed responses that feel more in-the-weeds:
you think long-horizon real-world data will play a significant role in training, because we'll need it to teach agents to do the most valuable tasks. This seems plausible to me; but I think that in order for this type of training to be useful, the agents will need to already have robust motivations (else they won't be able to find rewards that are given over long time horizons
I might not understand this point. For example, suppose I'm training a 1-day predictor to make good predictions over 10 or 100 days. I expect such predictors to initially fail over long horizons, but to potentially be greatly improved with moderate amounts of fine-tuning. It seems to me that if this model has "robust motivations" then they would most likely be to predict accurately, but I'm not sure about why the model necessarily has robust motivations.
I feel similarly about goals like "plan to get high reward (defined as signals on channel X, you can learn how the channel works)." But even if prediction was a special case, if you learn a model then you can use it for planning/RL in simulation.
But it seems to me by default, during early training periods the AI won't have much information about either the overseer's knowledge (or the overseer's existence), and may not even have the concept of rewards, making alignment with instructions much more natural.
It feels to me like our models are already getting to the point where they respond to quirks of the labeling or evaluation process, and are basically able to build simple models of the oversight process.
my concern is that this underlying concept of "natural generalisation" is doing a lot of work, despite not having been explored in your original post
Definitely, I think it's critical to what happens and not really explored in the post (which is mostly intended to provide some color for what failure might look like).
That said, a major part of my view is that it's pretty likely that we get either arbitrary motivations or reward-maximization (or something in between), and it's not a big deal which since they both seem bad and seem averted in the same way.
I think the really key question is how likely it is that we get some kind of "intended" generalization like friendliness. I'm frequently on the opposite side of this disagreement, arguing that the probability that people will get some nice generalization if they really try is at least 25% or 50%, but I'm also happy being on the pessimistic side and saying that the probability we can get nice generalizations is at most 50% or 75%.
(or anywhere else, to my knowledge)
Two kinds of generalization is an old post on this question (though I wish it had used more tasteful examples).
Turning reflection up to 11 touches on the issue as well, though coming from a very different place than you.
I think there are a bunch of Arbital posts where Eliezer tries to articulate some of his opinions on this but I don't know pointers offhand. I think most of my sense is
I haven't written that much about why I think generalizations like "just be helpful" aren't that likely. I agree with the point that these issues are underexplored by people working on alignment, and even more underdiscussed, given how important they are.
There are some google doc comment threads with MIRI where I've written about why I think those are plausible (namely that it plausible-but-challenging for breeding of animals, and that seems like one of our best anchors overall, suggesting that plausible-but-challenging is a good anchor). I think in those cases the key argument was about whether you need this to generalize far, since both me and MIRI think it's a kind of implausible generalization to go out to infinity rather than becoming distorted at some point along the way, but I am more optimistic about making a series of "short hops" where models generalize helpfully to being moderately smarter and then they can carry out the next step of training for you.
I agree that this is probably the key point; my other comment ("I think this is the key point and it's glossed over...") feels very relevant to me.
I feel like a very natural version of "follow instructions" is "Do things that would the instruction-giver would rate highly." (Which is the generalization I'm talking about.) I don't think any of the arguments about "long horizon versions of tasks are different from short versions" tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).
Other versions like "Follow instructions (without regards to what the training process cares about)" seem quite likely to perform significantly worse on the training set. It's also not clear to me that "follow the spirit of the instructions" is better-specified than "do things the instruction-giver would rate highly if we asked them"---informally I would say the latter is better-specified, and it seems like the argument here is resting crucially on some other sense of well-specification.
On meta-learning: it doesn't seem realistic to think about an AI "trying to get high rewards" on tasks where the time horizon is measured in months or years.
I've trained in simulation on tasks where I face a wide variety of environment, each with a reward signal, and I am taught to learn the dynamics of the environment and the reward and then take actions that lead to a lot of reward. In simulation my tasks can have reasonably long time horizons (as measured by how long I think), though that depends on open questions about scaling behavior. I don't agree with the claim that it's unrealistic to imagine such models generalizing to reality by wanting something-like-reward.
In most of the cases you've discussed, trying to do tasks over much longer time horizons involves doing a very different task [...]
Trying to maximize wealth over 100 minutes is indeed very different from maximizing wealth over 1 year, and is also almost completely useless for basically the same reason (except in domains like day trading where mark to market acts as a strong value function).
My take is that people will be pushed to optimizing over longer horizons because these qualitatively different tasks over short horizons aren't useful. The useful tasks in fact do involve preparing for the future and acquiring flexible influence, and so time horizons long enough to be useful will also be long enough to be relevantly similar to yet longer horizons.
Developers will be incentivized to find any way to get good behavior over long horizons, and it seems like we have many candidates that I regard as plausible and which all seem reasonably likely to lead to the kind of behavior I discuss. To me it feels like you are quite opinionated about how that generalization will work.
It seems like your take is "consequences over long enough horizons to be useful will be way too expensive to use for training," which seems close to 50/50 to me.
I think that throughout your post there's an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we've deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap.
I agree that this is a useful distinction and there will be some gap. I think that quantitatively I expect the gap to be much smaller than you do (e.g. getting 10k historical examples of 1-year plans seems quite realistic), and I expect people to work to design training procedures that get good performance on type two measures (roughly by definition), and I guess I'm significantly more agnostic about the likelihood of generalization from the longest type one measures to type two measures.
In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn't have a strong bias towards easily-measurable goals.
I'm imagining systems generalizing much more narrowly to the evaluation process used during training. This is still underspecified in some sense (are you trying to optimize the data that goes into SGD, or the data that goes into the dataset, or the data that goes into the sensors?) and in the limit that basically leads to influence-maximization and continuously fades into scenario 2. It's also true that e.g. I may be able to confirm at test-time that there is no training process holding me accountable, and for some of these generalizations that would lead to a kind of existential crisis (where I've never encountered anything like this during training and it's no longer clear what I'm even aiming at). It doesn't feel like these are the kinds of underspecification you are referring to.
We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.
I think this is the key point and it's glossed over in my original post, so it seems worth digging in a bit more.
I think there are many plausible models that generalize successfully to longer horizons, e.g. from 100 days to 10,000 days:
This is roughly why I'm afraid that models we train will ultimately be able to plan over long horizons than those that appear in training.
But many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons (and in particular the first 4 above seem like they'd all be undesirable if generalizing from easily-measured goals, and would lead to the kinds of failures I describe in part I of WFLL).
I think one reason that my posts about this are confusing is that I often insist that we don't rely on generalization because I don't expect it to work reliably in the way we hope. But that's about what assumptions we want to make when designing our algorithms---I still think that the "generalizes in the natural way" model is important for getting a sense of what AI systems are going to do, even if I think there is a good chance that it's not a good enough approximation to make the systems do exactly what we want. (And of course I think if you are relying on generalization in this way you have very little ability to avoid the out-with-a-bang failure mode, so I have further reasons to be unhappy about relying on generalization.)
In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?
If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", because "systems that have a detailed understanding of the world" don't need to operate by trial and error; they understand what they're doing.
I agree that it's only us who are operating by trial and error---the system understands what it's doing. I don't think that undermines my argument. The point is that we pick the system, and so determine what it's doing, by trial and error, because we have no understanding of what it's doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn't a plausible way to do that.
To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?
I consider those pretty wide open empirical questions. The view that we can get good generalization of this kind is fairly common within ML.
I do agree once you generalize motivations from easily measurable tasks with short feedback loops to tasks with long feedback loops then you may also be able to get "good" generalizations, and this is a way that you can solve the alignment problem. It seems to me that there are lots of plausible ways to generalize to longer horizons without also generalizing to "better" answers (according to humans' idealized reasoning).
(Another salient way in which you get long horizons is by doing something like TD learning, i.e. train a model that predicts its own judgment in 1 minute. I don't know if it's important to get into the details of all the ways people can try to get things to generalize over longer time horizons, it seems like there are many candidates. I agree that there are analogously candidates for getting models to optimize the things we want even if we can't measure them easily, and as I've said I think it's most likely those techniques will be successful, but this is a post about what happens if we fail, and I think it's completely unclear that "we can generalize to longer horizons" implies "we can generalize from the measurable to the unmeasurable.".)
(Analogously: when you put humans in a new domain, and give them tasks and feedback via verbal instructions, then we can quickly learn sophisticated concepts in that new domain, and optimise for those, not just the easily-measured concepts in that new domain.)
When we deploy humans in the real world they do seem to have many desires resembling various plausible generalizations of evolutionary fitness (e.g. to intrinsically want kids even in unfamiliar situations, to care about very long-term legacies, etc.). I totally agree that humans also want a bunch of kind of random spandrels. This is related to the basic uncertainty discussed in the previous paragraphs. I think the situation with ML may well differ because, if we wanted to, we can use training procedures that are much more likely to generalize than evolution.
I don't think it's relevant to my argument that humans can learn sophisticated concepts in a new domain, the question is about the motivations of humans.
Why is your scenario called "You get what you measure" if you're agnostic about whether we actually get what we measure, even on the level of individual AIs?Or do you mean part 1 to be the case where we do get what we measure, and part 2 to be the case where we don't?
Why is your scenario called "You get what you measure" if you're agnostic about whether we actually get what we measure, even on the level of individual AIs?
Or do you mean part 1 to be the case where we do get what we measure, and part 2 to be the case where we don't?
Yes, I'm saying that part 1 is where you are able to get what you measure and part 2 is where you aren't.
Also, as I say, I expect the real world to be some complicated mish-mash of these kinds of failures (and for real motivations to be potentially influenced both by natural generalizations of what happens at training time and also by randomness / architecture / etc., as seems to be the case with humans).
The case in which this is more worrying is when an organisation's success is determined by (for example) whether politicians like it, and politicians only pay attention to easily-measurable metrics. In this case, organisations which pursue easily-measured goals will be more successful than ones which pursue the goals the politicians actually want to achieve. This is why I make the argument that actually the pressure on politicians to pursue easily-measurable metrics is pretty weak (hence why they're ignoring most economists' recommendations on how to increase GDP).
Wanting to earn more money or grow users or survive over the long term is also an easily measured goal, and in practice firms crucially exploit the fact that these goals are contiguous with their shorter easily measured proxies. Non-profits that act in the world often have bottom-line metrics that they use to guide their action and seem better at optimizing goals that can be captured by such metrics (or metrics like donor acquisition).
The mechanism by which you are better at pursuing easily-measureable goals is primarily via internal coherence / stability.
I agree that you've described some potential harms; but in order to make this a plausible long-term concern, you need to give reasons to think that the harms outweigh the benefits of AI enhancing (the effective capabilities of) human reasoning.
I've said that previously human world-steering is the only game in town but soon it won't be, so the future is more likely to be steered in ways that a human wouldn't steer it, and that in turn is more likely to be a direction humans don't like. This doesn't speak to whether the harms on balance outweigh the benefits, which would require an analysis of the benefits but is also pretty irrelevant to my claim (especially so given that all of the world-steerers enjoy these benefits at all and we are primarily concerned with relative influence over the very long term). I'm trying to talk about how the future could get steered in a direction that we don't like if AI development goes in a bad direction, I'm not trying to argue something like "Shorter AI timelines are worse" (which I also think is probably true but about which I'm more ambivalent).
If you'd written a comparable post a few centuries ago talking about how human physical power will lose out to inhuman physical power, I would have had the same complaint.
I don't see a plausible way that humans can use physical power for some long-term goals and not other long-term goals, whereas I've suggested two ways in which automated reasoning may be more easily applied to certain long-term goals (namely the goals that are natural generalizations of training objectives, or goals that are most easily discovered in neural networks).
I classify Facebook's newsfeed as future-steering in a weak sense (it steers the future towards political polarisation), but non-agentic. Do you agree with this?
If Facebook's news feed would generate actions chosen to have the long-term consequence of increasing political polarization then I'd say it was steering the future towards political polarization. (And I assume you'd say it was an agent.)
As is, I don't think Facebook's newsfeed steers the future towards political polarization in a meaningful sense (it's roughly the same as a toaster steering the world towards more toast).
Maybe that's quantitatively just the same kind of thing but weak, since after all everything is about generalization anyway. In that case the concern seems like it's about world-steering that scales up as we scale up our technology/models improve (such that they will eventually become competitive with human world-steering), whereas the news feed doesn't scale up since it's just relying on some random association about how short-term events X happen to lead to polarization (and nor will a toaster if you make it better and better at toasting). I don't really have views on this kind of definitional question, and my post isn't really relying on any of these distinctions.
Something like A/B testing is much closer to future-steering, since scaling it up in the obvious way (and scaling to effects across more users and longer horizons rather than independently randomizing) would in fact steer the future towards whatever selection criteria you were using. But I agree with your point that such systems can only steer the very long-term future once there is some kind of generalization.
If it’s via a deliberate plan to suppress them while also overcoming human objections to doing so, then that seems less like a narrow system “optimising for an easily measured objective” and more like an agentic and misaligned AGI
I didn't mean to make any distinction of this kind. I don't think I said anything about narrowness or agency. The systems I describe do seem to be optimizing for easily measurable objectives, but that seems mostly orthogonal to these other axes.
I'm pretty agnostic on whether AI will in fact be optimizing for the easily measured objectives used in training or for unrelated values that arise naturally in the learning process (or more likely some complicated mix), and part of my point is that it doesn't seem to much mater.
Secondly, let’s talk about existing pressures towards easily-measured goals. I read this as primarily referring to competitive political and economic activity - because competition is a key force pushing people towards tradeoffs which are undesirable in the long term.
I'm saying: it's easier to pursue easily-measured goals, and so successful organizations and individuals tend to do that and to outcompete those whose goals are harder to measure (and to get better at / focus on the parts of their goals that are easy to measure, etc.). I'm not positing any change in the strength of competition, I'm positing a change in the extent to which goals that are easier to measure are in fact easier to pursue.
Regarding the extent and nature of competition I do think I disagree with you fairly strongly but it doesn't seem like a central point.
the US, for example, doesn’t seem very concerned right now about falling behind China.
I think this is in fact quite high on the list of concerns for US policy-makers and especially the US defense establishment.
Further, I don’t see where the meta-level optimisation for easily measured objectives comes from.
Firms and governments and people pursue a whole mix of objectives, some of which are easily measured. The ones pursuing easily-measured objectives are more successful, and so control an increasing fraction of resources.
So if narrow AI becomes very powerful, we should expect it to improve humanity’s ability to steer our trajectory in many ways.
I don't disagree with this at all. The point is that right now human future-steering is basically the only game in town. We are going to introduce inhuman reasoning that can also steer the future, and over time human reasoning will lose out in relative terms. (If you classify all future-steering machinery as "agentic" then evidently I'm talking about agents and I agree with the informal claim that "non-agentic" reasoning isn't concerning.) That's compatible with us benefiting enormously, if all of those benefits also accrue to automated reasoners---as your examples seem to. We will try to ensure that all this new reasoning will benefit humanity, but I describe two reasons that might be difficult and say a little bit about how that difficulty might materialize.
I don't really know if or how this is distinct from what you call the second species argument. It feels like you are objecting to a distinction I'm not intending to make.
If it turns out not to be possible then the AI should never fade away.
If the humans in the container succeed in becoming wiser, then hopefully it is wise for us to leave this decision up to them than to preemptively make it now (and so I think the situation is even better than it sounds superficially).
But it seems to me that there are a small number of humans on this planet who have moved some way in the direction of being fit to run the world, and in time, more humans could move in this direction, and could move further.
It seems like the real thing up for debate will be about power struggles amongst humans---if we had just one human, then it seems to me like the grandparent's position would be straightforwardly incoherent. This includes, in particular, competing views about what kind of structure we should use to govern ourselves in the future.
I buy into the delegation framing, but I think that the best targets for delegation look more like "slightly older and wiser versions of ourselves with slightly more space" (who can themselves make decisions about whether to delegate to something more alien). In the sand-pit example, if the child opted into that arrangement then I would say they have effectively delegated to a version of themselves who is slightly constrained and shaped by the supervision of the adult. (But in the present situation, the most important thing is that the parent protects them from the outside the world while they have time to grow.)
I basically agree that humans ought to use AI to get space, safety and time to figure out what we want and grow into the people we want to be before making important decisions. This is (roughly) why I'm not concerned with some of the distinctions Gabriel raises, or that naturally come to mind when many people think of alignment.
That said, I feel your analogy misses a key point: while the child is playing in their sandbox, other stuff is happening in the world---people are building factories and armies, fighting wars and grabbing resources in space, and so on---and the child will inherit nothing at all unless their parent fights for it.
So without (fairly extreme) coordination, we need to figure out how to have the parent acquire resources and then ultimately "give" those resources to the child. It feels like that problem shouldn't be much harder than the parent acquiring resources for themselves (I explore this intuition some in this post on the "strategy stealing" assumption), so that this just comes down to whether we can create a parent who is competent while being motivated to even try to help the child. That's what I have in mind while working on the alignment problem.
On the other hand, given strong enough coordination that the parent doesn't have to fight for their child, I think that the whole shape of the alignment problem changes in more profound ways.
I think that much existing research on alignment, and my research in particular, is embedded in the "agency hand-off paradigm" only to the extent that is necessitated by that situation.
I do agree that my post on indirect normativity is embedded in a stronger version of the agency hand-off paradigm. I think the main reason for taking an approach like that is that a human embedded in the physical world is a soft target for would-be attackers and creates a. If we are happy handing off control to a hypothetical version of ourselves in the imagination of our AI, then we can achieve additional security by doing so, and this may be more appealing than other mechanisms to achieve a similar level of security (like uploading or retreating to a secure physical sanctuary). In some sense all of this is just about saying what it means to ultimately "give" the resources to the child, and it does so by trying to construct an ideal environment for them to become wiser after which they will be mature enough to provide more direct instructions. (But in practice I think that these proposals may involve a jarring transition that could be avoided by using a physical sanctuary instead or just ensuring that our local environments remain hospitable.)
Overall it feels to me like you are coming from a similar place to where I was when I wrote this post on corrigibility, and I'm curious if there are places where you would part ways with that perspective (given the consideration I raised in this comment).
(I do think "aligned with who?" is a real question since the parent needs to decide which child will ultimately get the resources, or else if there are multiple children playing together then it matters a lot how the parent's decisions shape the environment that will ultimately aggregate their preferences.)
France is the other country for which Our World in Data has figures going back to 1400 (I think from Maddison), here's the same graph for France:
There is more crazy stuff going on, but broadly the picture looks the same and there is quite a lot of acceleration between 1800 and the 1950s. The growth numbers are 0.7% for 1800-1850, 1.2% for 1850-1900, 1.2% for 1900-1950, 2.8% for 1950-2000.
And for the even messier case of China:
Growth averages 0 from 1800 to 1950, and then 3.8% from 1950-2000 and 6.9% from 2000-2016.