My thoughts on the social response to AI risk

Matthew Barnett

A common theme implicit in many AI risk stories has been that broader society will either fail to anticipate the risks of AI until it is too late, or do little to address those risks in a serious manner. In my opinion, there are now clear signs that this assumption is false, and that society will address AI with something approaching both the attention and diligence it deserves. For example, one clear sign is Joe Biden's recent executive order on AI safety^[1]. In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention.

While I'm not saying we should now sit back and relax, I think recent evidence has significant implications for designing effective strategies to address AI risk. Since I think substantial AI regulation will likely occur by default, I urge effective altruists to focus more on ensuring that the regulation is thoughtful and well-targeted rather than ensuring that regulation happens at all. Ultimately, I argue in favor of a cautious and nuanced approach towards policymaking, in contrast to broader public AI safety advocacy.^[2]

In the past, when I've read stories from AI-risk-adjacent people about what the future could look like, I have often noticed that the author assumes that humanity will essentially be asleep at the wheel with regard to the risks of unaligned AI, and won't put in place substantial safety regulations on the technology, unless of course EA and LessWrong-aligned researchers unexpectedly upset the gameboard by achieving a pivotal act. We can call this premise the assumption of an inattentive humanity.^[3]

While most often implicit, the assumption of an inattentive humanity was sometimes stated explicitly in people's stories about the future.

For example, in a post from Marius Hobbhahn published last year about a realistic portrayal of the next few decades, Hobbhahn outlines a series of AI failure modes that occur as AI gets increasingly powerful. These failure modes include a malicious actor using an AI model to create a virus that "kills ~1000 people but is stopped in its tracks because the virus kills its hosts faster than it spreads", an AI model attempting to escape its data center after having "tried to establish a cult to “free” the model by getting access to its model weights", and a medical AI model that "hacked a large GPU cluster and then tried to contact ordinary people over the internet to participate in some unspecified experiment". Hobbhahn goes on to say,

People are concerned about this but the news is as quickly forgotten as an oil spill in the 2010s or a crypto scam in 2022. Billions of dollars of property damage have a news lifetime of a few days before they are swamped by whatever any random politician has posted on the internet or whatever famous person has gotten a new partner. The tech changed, the people who consume the news didn’t. The incentives are still the same.

Stefan Schubert subsequently commented that this scenario seems implausible,

I expect that people would freak more over such an incident than they would freak out over an oil spill or a crypto scam. For instance, an oil spill is a well-understood phenomenon, and even though people would be upset about it, it would normally not make them worry about a proliferation of further oil spills. By contrast, in this case the harm would come from a new and poorly understood technology that’s getting substantially more powerful every year. Therefore I expect the reaction to the kind of harm from AI described here to be quite different from the reaction to oil spills or crypto scams.

I believe Schubert's point has been strengthened by recent events, including Biden's executive order that touches on many aspects of AI risk^[1], the UK AI safety summit, the recent open statement signed by numerous top AI scientists warning about "extinction" from AI, the congressional hearing about AI risk and the discussion of imminent legislation, the widespread media coverage of the rise of GPT-like language models, and the open letter to "pause" model scaling. All of this has occurred despite AI still being relatively harmless and having, so far, tiny economic impacts, especially compared to the existential threat to humanity that it poses in the long term. Moreover, the timing of these developments strongly suggests they were mainly prompted by recent impressive developments in language models, rather than any special push from EAs.

In light of these developments, it is worth taking a closer look at how the assumption of an inattentive humanity has pervaded AI risk arguments, and re-evaluate the value of existing approaches to address AI risk in light of recent evidence.

The assumption of an inattentive humanity was perhaps most apparent in stories that posited a fast and local takeoff, in which AI goes from being powerless and hidden in the background, to suddenly achieving a decisive strategic advantage over the rest of the world in a very short period of time.

In his essay from 2017, Eliezer Yudkowsky famously argued that there is "no fire alarm for artificial general intelligence", by which he meant that there will not be an event "producing common knowledge that action [on AI risk] is now due and socially acceptable".^[4] He wrote,

Multiple leading scientists in machine learning have already published articles telling us their criterion for a fire alarm. They will believe Artificial General Intelligence is imminent:
(A) When they personally see how to construct AGI using their current tools. This is what they are always saying is not currently true in order to castigate the folly of those who think AGI might be near.
(B) When their personal jobs do not give them a sense of everything being difficult. This, they are at pains to say, is a key piece of knowledge not possessed by the ignorant layfolk who think AGI might be near, who only believe that because they have never stayed up until 2AM trying to get a generative adversarial network to stabilize.
(C) When they are very impressed by how smart their AI is relative to a human being in respects that still feel magical to them; as opposed to the parts they do know how to engineer, which no longer seem magical to them; aka the AI seeming pretty smart in interaction and conversation; aka the AI actually being an AGI already.
So there isn’t going to be a fire alarm. Period.
There is never going to be a time before the end when you can look around nervously, and see that it is now clearly common knowledge that you can talk about AGI being imminent, and take action and exit the building in an orderly fashion, without fear of looking stupid or frightened.

My understanding is that this thesis was part of a more general view from Yudkowsky that AI would not have any large, visible effects on the world up until the final moments when it takes over the world. In a live debate at Jane Street with Robin Hanson in 2011 he said,

When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed “a brain in a box in a basement.” I love that phrase, so I stole it. In other words, we tend to visualize that there’s this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work in it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion.

In that type of scenario, it makes sense that society would not rush to regulate AI, since AI would mainly be a thing done by academics and hobbyists in small labs, with no outsized impacts, up until right before the intelligence explosion, which Yudkowsky predicted would take place within "weeks or hours rather than years or decades". However, this scenario—at least as it was literally portrayed—now appears very unlikely.

Personally—as I have roughly said for over a year now^[5]—I think by far the most likely scenario is that society will adopt broad AI safety regulations as increasingly powerful systems are rolled out on a large scale, just as we have done for many previous technologies. As the capabilities of these systems increase, I expect the regulations to get stricter and become wider in scope, coinciding with popular, growing fears about losing control of the technology. Overall, I suspect governments will be sympathetic to many, but not all, of the concerns that EAs have about AI, including human disempowerment. And while sometimes failing to achieve their stated objectives, I predict governments will overwhelmingly adopt reasonable-looking regulations to stop the most salient risks, such as the risk of an untimely AI coup.

Of course, it still remains to be seen whether US and international regulatory policy will adequately address every essential sub-problem of AI risk. It is still plausible that the world will take aggressive actions to address AI safety, but that these actions will have little effect on the probability of human extinction, simply because they will be poorly designed. One possible reason for this type of pessimism is that the alignment problem might just be so difficult to solve that no “normal” amount of regulation could be sufficient to make adequate progress on the core elements of the problem—even if regulators were guided by excellent advisors—and therefore we need to clamp down hard now and pause AI worldwide indefinitely. That said, I don't see any strong evidence supporting that position.

Another reason why you might still believe regulatory policy for AI risk will be inadequate is that regulators will adopt sloppy policies that totally miss the “hard bits” of the problem. When I recently asked Oliver Habryka what type of policies he still expects won’t be adopted, he mentioned "Any kind of eval system that's robust to deceptive alignment." I believe this opinion is likely shared by many other EAs and rationalists.

In light of recent events, we should question how plausible it is that society will fail to adequately address such an integral part of the problem. Perhaps you believe that policy-makers or general society simply won’t worry much about AI deception. Or maybe people will worry about AI deception, but they will quickly feel reassured by results from superficial eval tests. Personally, I'm pretty skeptical of both of these possibilities, and for basically the same reasons why I was skeptical that there won’t be substantial regulation in the first place:

1. People think ahead, and frequently—though not always—rely on the advice of well-informed experts who are at least moderately intelligent.
2. AI capabilities will increase continuously and noticeably over years rather than appearing suddenly. This will provide us time to become acquainted with the risks from individual models, concretely demonstrate failure modes, and study them empirically.
3. AI safety, including the problem of having AIs not kill everyone, is a natural thing for people to care about.

Now, I don’t know exactly what Habryka means when he says he doesn’t expect to see eval regulations that are robust to deception. Does that require that the eval tests catch all deception, no matter how minor, or is it fine if we have a suite of tests that work well at detecting the most dangerous forms of deception, most of the time? However, while I agree that we shouldn’t expect regulation to be perfect, I still expect that governments will adopt sensible regulations—roughly the type you’d expect if mainstream LessWrong-aligned AI safety researchers were put in charge of regulatory policy.

To make my prediction about AI deception regulation more precise, I currently assign a probability of 60-90%^[6] that AI safety regulations will be adopted in the United States before 2035 that include sensible requirements for uncovering deception in the most powerful models, such as rigorously testing the model in a simulation, getting the model “drunk” by modifying its weights and interrogating it under diverse circumstances, asking a separate “lie detector” model to evaluate the model’s responses, applying state-of-the-art mechanistic interpretability methods to unveil latent motives, or creating many slightly different copies of the same model in the hopes that one is honest and successfully identifies and demonstrates deception from the others. I have written a Manifold question about this prediction that specifies these conditions further.

To clarify, I am not making any strong claims about any of these methods being foolproof or robust to AI deception in all circumstances. I am merely suggesting that future AI regulation will likely include sensible precautions against risks like AI deception. If deception turns out to be an obscenely difficult problem, I expect evidence for that view will accumulate over time—for instance because people will build model organisms of misalignment, and show how deception is very hard to catch. As the evidence grows, I think regulators will likely adapt, adjusting policy as the difficulty of the problem becomes clearer.^[7]

I'm not saying we should be complacent. Instead, I’m advocating that we should raise the bar for what sub-problems of AI risk we consider worthy of special attention, versus what problems we think will be solved by default in the absence of further intervention from the AI risk community. Of course, it may still be true that AI deception is an extremely hard problem that reliably resists almost all attempted solutions in any “normal” regulatory regime, even as concrete evidence continues to accumulate about its difficulty—although I consider that claim unproven, to say the least.

Rather than asserting "everything is fine, don’t worry about AI risk", my point here is that we should think more carefully about what other people's incentives actually are, and how others will approach the problem, even without further intervention from this community. Answering these questions critically informs how valuable the actions we take now will be, since it sheds light on which problems will remain genuinely neglected in the future, and which ones won’t be. It’s still necessary for people to work on AI risk, of course. We should just try to make sure we’re spending our time wisely, and focus on improving policy and strategy along the axes on which things are most likely to go poorly.

Edited to add: To give a concrete example of an important problem I think might not be solved by default, several months ago I proposed treating long-term value drift from future AIs as a serious issue. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' argument about AI evolution. If these problems turn out to be easy or intractable, I think it may be worth turning more of our focus to other important problems, such as improving our institutions or preventing s-risks.

Nothing in this post should be interpreted as indicating that I'm incredibly optimistic about how AI policy will go. Though politicians usually don't flat-out ignore safety risks, I believe history shows that they can easily mess up tech regulation in subtler ways.

For instance, when the internet was still new, the U.S. Congress passed the Digital Millennium Copyright Act (DMCA) in 1998 to crack down on copyright violators, with strong bipartisan support. While the law had several provisions, one particularly contentious element was its anti-circumvention rule, which made it illegal to bypass digital rights management (DRM) or other technological protection measures. Perversely, this criminalized the act of circumvention even in scenarios where the underlying activity—like copying or sharing—didn't actually infringe on copyright. Some have argued that because of these provisions, there has been a chilling effect on worldwide cryptography research, arguably making our infrastructure less secure with only a minor impact on copyright infringement.

While it is unclear what direct lessons we should draw from incidents like this one, I think a basic takeaway is that it is easy for legislators to get things wrong when they don't fully understand a technology. Since it seems likely that there will be strong AI regulations in the future regardless of what the AI risk community does, I am far more concerned about making sure the regulations are thoughtful, well-targeted, and grounded in the best evidence available, rather than making sure they happen at all.

Instead of worrying that the general public and policy-makers won’t take AI risks very seriously, I tend to be more worried that we will hastily implement poorly thought-out regulations that are based on inaccurate risk models or limited evidence about our situation. These regulations might marginally reduce some aspects of AI risk, but at great costs to the world in other respects. For these reasons, I favor nuanced messaging and pushing for cautious, expert-guided policymaking, rather than blanket public advocacy.

^[1]
In response to Biden's executive order on AI safety, Aaron Bergman wrote,
Am I crazy for thinking the ex ante probability of something at least this good by the US federal government relative to AI progress, from the perspective of 5 years ago was ~1% Ie this seems 99th-percentile-in-2018 good to me
David Manheim replied,
I'm in the same boat. (In the set of worlds without near-term fast takeoff, and where safe AI is possible at all,) I'm increasingly convinced that the world is getting into position to actually address the risks robustly - though it's still very possible we fail.
Peter Wildeford also replied,
This checks out with me

AI capabilities is going faster than expected, but the policy response is much better than expected
Stefan Schubert also commented,
Yeah, if people think the policy response is "99th-percentile-in-2018", then that suggests their models have been seriously wrong.
That could have further implications, meaning these issues should be comprehensively rethought.
^[2]
To give one example of an approach I'm highly skeptical of in light of these arguments, I'll point to this post from last year, which argued that we should try to "Slow down AI with stupid regulations", apparently because the author believed that strategy may be the best hope we have to make things go well.
^[3]
Stefan Schubert calls the tendency to assume that humanity will be asleep at the wheel with regard to large-scale risks "sleepwalk bias". He wrote about this bias in 2016, making many similar points to the ones I make here.
^[4]
Further supporting my interpretation, in a 2013 essay, Yudkowsky states the following:
In general and across all instances I can think of so far, I do not agree with the part of your futurological forecast in which you reason, "After event W happens, everyone will see the truth of proposition X, leading them to endorse Y and agree with me about policy decision Z."
[...]
Example 2: "As AI gets more sophisticated, everyone will realize that real AI is on the way and then they'll start taking Friendly AI development seriously."
Alternative projection: As AI gets more sophisticated, the rest of society can't see any difference between the latest breakthrough reported in a press release and that business earlier with Watson beating Ken Jennings or Deep Blue beating Kasparov; it seems like the same sort of press release to them. The same people who were talking about robot overlords earlier continue to talk about robot overlords. The same people who were talking about human irreproducibility continue to talk about human specialness. Concern is expressed over technological unemployment the same as today or Keynes in 1930, and this is used to fuel someone's previous ideological commitment to a basic income guarantee, inequality reduction, or whatever. The same tiny segment of unusually consequentialist people are concerned about Friendly AI as before. If anyone in the science community does start thinking that superintelligent AI is on the way, they exhibit the same distribution of performance as modern scientists who think it's on the way, e.g. Hugo de Garis, Ben Goertzel, etc.
^[5]
See also this thread from me on X from earlier this year. I've made various other comments saying that I expect AI regulation for a few years now, but they've mostly been fragmented across the internet.
^[6]
Conditioning on transformative AI arriving before 2035, my credence range is somewhat higher, at around 75-94%. We can define transformative AI in the same way I defined it here.
^[7]
This points to one reason why clamping down hard now might be unjustified, and why I prefer policies that start modest but adjust their strictness according to the best evidence about model capabilities and the difficulty of alignment.

I'm still pretty skeptical of what would happen without explicit focus. The Bletchley Park declaration was a super vague and applause-lighty declaration, which fortunately mentions issues of control, but just barely. It's not clear to me yet that this will end up receiving much dedicated focus.

Regarding biosecurity and cyber, my big worry here is open-source, and it seems totally plausible that a government will pass mostly sensible regulation, then create a massive gaping hole where open-source regulation should be.

I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.

Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.

1. Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones
  1. You argue that we perhaps shouldn't invest as much in preventing deceptive alignment because "regulators will likely adapt, adjusting policy as the difficulty of the problem becomes clearer"
  2. If we are assuming that regulators will adapt and adjust regarding deception, can you provide examples of interventions that policymakers will not be able to solve themselves and why they will be less likely to notice and deal with them than deception?
  3. You say "we should question how plausible it is that society will fail to adequately address such an integral part of the problem". What things aren't integral parts of the problem but that should be worked on?
    1. I feel we would need much better evidence of things being handled competently to invest significantly less into integral parts of the problem.
2. You say: 'Of course, it may still be true that AI deception is an extremely hard problem that reliably resists almost all attempted solutions in any “normal” regulatory regime, even as concrete evidence continues to accumulate about its difficulty—although I consider that claim unproven, to say the least'
  1. If we expect some problems in AI risk to be solved by default mostly by people outside the community, it feels to me like one takeaway would be that we should shift resources to portions of the problem that we expect to be the hardest
  2. To me, intuitively, deceptive alignment might be one of the hardest parts of the problem as we scale to very superhuman systems, even if we condition on having time to build model organisms of misalignment and experiment with them for a few years. So I feel confused about why you claim a high level of difficulty is "unproven" as a dismissal; of course it's unproven but you would need to argue that in worlds where the AI risk problem is fairly hard, there's not much of a chance of it being very hard.
  3. As someone who is relatively optimistic about concrete evidence of deceptive alignment increasing substantially before a potential takeover, I think I still put significantly lower probability on it than you do due to the possibility of fairly fast takeoff.
3. I feel like this post is to some extent counting our chickens before they hatch (tbc I agree with the directional update as I said above). I'm not an expert on what's going on here but I imagine any of the following happening (non-exhaustive list) that make the current path to potentially sensible regulation in the US and internationally harder:
  1. The EO doesn't lead to as many resources dedicated to AI-x-risk-reducing things as we might hope. I haven't read it myself, just the fact sheet and Zvi's summary but Zvi says "If you were hoping for or worried about potential direct or more substantive action, then the opposite applies – there is very little here in the way of concrete action, only the foundation for potential future action."
  2. A Republican President comes to power in the US and reverses a lot of the effects in the EO
  3. Rishi Sunak gets voted out in the UK (my sense is that this is likely) and the new Prime Minister is much less gung-ho on AI risk
4. I don't have strong views on the value of AI advocacy, but this post seems overconfident in calling it out as being basically not useful based on recent shifts.
  1. It seems likely that much stronger regulations will be important, e.g. the model reporting threshold in the EO was set relatively high and many in the AI risk community have voiced support for an international pause if it were politically feasible, which the EO is far from.
  2. The public still doesn't consider AI risk to be very important. <1% of the American public considers it the most important problem to deal with. So to the extent that raising that number was good before, it still seems pretty good now, even if slightly worse.

Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones

I have three things to say here:

1. Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.
2. I mostly think people should think harder about what the hard parts of AI risk are in the first place. It would not be surprising if the "hard bits" will be things that we've barely thought about, or are hard to perceive as major problems, since their relative hiddenness would be a strong reason to believe that they will not be solved by default.
3. The problem of "make sure policies are well-targeted, informed by the best evidence, and mindful of social/political difficulties" seems like a hard problem that societies have frequently failed to get right historically, and the relative value of solving this problem seems to get higher as you become more optimistic about the technical problems being solved.

I feel like this post is to some extent counting our chickens before they hatch (tbc I agree with the directional update as I said above). [...] I don't have strong views on the value of AI advocacy, but this post seems overconfident in calling it out as being basically not useful based on recent shifts.

I want to emphasize that the current policies were crafted in an environment in which AI still has a tiny impact on the world. My expectation is that policies will get much stricter as AI becomes a larger part of our life. I am not making the claim that current policies are sufficient; instead I am making a claim about the trajectory, i.e. how well we should expect society to respond at a time, given the evidence and level of AI capabilities at that time. I believe that current evidence supports my interpretation of our general trajectory, but I'm happy to hear someone explain why they disagree and highlight concrete predictions that could serve to operationalize this disagreement.

I have three things to say here:

Thanks for clarifying.

Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.

Don't have a strong opinion here, but intuitively feels like it would be hard to find tractable angles for work on this now.

I mostly think people should think harder about what the hard parts of AI risk are in the first place. It would not be surprising if the "hard bits" will be things that we've barely thought about, or are hard to perceive as major problems, since their relative hiddenness would be a strong reason to believe that they will not be solved by default.

Maybe. In general, I'm excited about people who have the talent for it to think about previously neglected angles.

The problem of "make sure policies are well-targeted, informed by the best evidence, and mindful of social/political difficulties" seems like a hard problem that societies have frequently failed to get right historically, and the relative value of solving this problem seems to get higher as you become more optimistic about the technical problems being solved.

I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one. This is the phrase that I was hoping would get made more concrete.

I want to emphasize that the current policies were crafted in an environment in which AI still has a tiny impact on the world. My expectation is that policies will get much stricter as AI becomes a larger part of our life. I am not making the claim that current policies are sufficient; instead I am making a claim about the trajectory, i.e. how well we should expect society to respond at a time, given the evidence and level of AI capabilities at that time.

I understand this (sorry if wasn't clear), but I think it's less obvious than you do that this trend will continue without intervention from AI x-risk people. I agree with other commenters that AI x-risk people should get a lot of the credit for the recent push. I also provided example reasons that the trend might not continue smoothly or even reverse in my point (3).

There might also be disagreements around:

1. Not sharing your high confidence in slow, continuous takeoff.
2. The strictness of regulation needed to make a dent in AI risk, e.g. if substantial international coordination is required it seems optimistic to me to assume that the trajectory will by default lead to this.
3. The value in things getting done faster than they would have done otherwise, even if they would have been done either way. This indirectly provides more time to iterate and get to better, more nuanced policy.

I believe that current evidence supports my interpretation of our general trajectory, but I'm happy to hear someone explain why they disagree and highlight concrete predictions that could serve to operationalize this disagreement.

Operationalizing disagreements well is hard and time-consuming especially when we're betting on "how things would go without intervention from a community that is intervening a lot", but a few very rough forecasts, all conditional on no TAI before resolve date:

75%: In Jan 2028, less than 10% of Americans will consider AI the most important problem.
60%: In Jan 2030, Evan Hubinger will believe that if x-risk-motivated people had not worked on deceptive alignment at all, risk from deceptive alignment would be at least 50% higher, compared to a baseline of no work at all (i.e. if risk is 5% and it would be 9% with no work from anyone, it needs to have been >7% if no work from x-risk people had been done to resolve yes).
~~35%: In Jan 2028, conditional on a Republican President being elected in 2024, regulations on AI in the US will be generally less stringent than they were when the previous president left office.~~ Edit: Crossed out because not operationalized well, more want to get at the vibe of how strict the President and legislature are being on AI, and since my understanding is a lot of the stuff from the EO might not come into actual force for a while.
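
One way to read the resolution rule in the second forecast is sketched below. This is only an interpretation inferred from the worked numbers in the forecast (5%, 9%, 7%); the function name and the `share` parameter are illustrative assumptions, not part of the original forecast.

```python
def resolution_threshold(actual_risk: float, no_work_risk: float, share: float = 0.5) -> float:
    """Risk level that the 'no work from x-risk people' counterfactual must
    exceed for the forecast to resolve yes, under the assumed reading.

    actual_risk:  risk given all the work that actually happened
    no_work_risk: risk if nobody at all had worked on the problem
    share:        fraction of the total risk reduction attributed to
                  x-risk-motivated people (assumed to be 0.5 here)
    """
    total_reduction = no_work_risk - actual_risk
    return actual_risk + share * total_reduction


# The example from the forecast: risk is 5%, and would be 9% with no work from anyone.
threshold = resolution_threshold(0.05, 0.09)
print(f"Resolves yes if counterfactual risk exceeds {threshold:.0%}")
```

Under this reading, the forecast resolves yes when x-risk-motivated people account for at least half of the total reduction in risk from deceptive alignment.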

I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one.

I agree. I'm not criticizing the people who are trying to make sure that policies are well-targeted and grounded in high-quality evidence. I'm arguing in favor of their work. ~~I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum~~. [ETA, rewording: To the extent I was arguing against a single line of work, I was primarily arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum. However, as I wrote in the post, I also think that we should re-evaluate which problems will be solved by default, which means I'm not merely letting other AI governance people off the hook.]

Operationalizing disagreements well is hard and time-consuming especially when we're betting on "how things would go without intervention from a community that is intervening a lot", but a few very rough forecasts, all conditional on no TAI before resolve date:

I appreciate these predictions, but I am not as interested in predicting personal or public opinions. I'm more interested in predicting regulatory stringency, quality, and scope.

Even if fewer than 10% of Americans consider AI to be the most important issue in 2028, I don't think that necessarily indicates that regulations will have low stringency, low quality, or poor scope. Likewise, I'm not sure whether I want to predict on Evan Hubinger's opinion, since I'd probably need to understand more about how he thinks to get it right, and I'd prefer to focus the operationalization instead on predictions about large, real world outcomes. I'm not really sure what disagreement the third prediction is meant to operationalize, although I find it to be an interesting question nonetheless.

I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum.

I had the impression that it was more than just that, given the line: "In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention." and the further attention devoted to deceptive alignment.

I appreciate these predictions, but I am not as interested in predicting personal or public opinions. I'm more interested in predicting regulatory stringency, quality, and scope.

If you have any you think faithfully represent a possible disagreement between us go ahead. I personally feel it will be very hard to operationalize objective stuff about policies in a satisfying way. For example, a big issue with the market you've made is that it is about what will happen in the world, not what will happen without intervention from AI x-risk people. Furthermore it has all the usual issues with forecasting on complex things 12 years in advance, regarding the extent to which it operationalizes any disagreement well (I've bet yes on it, but think it's likely that evaluating and fixing deceptive alignment will remain mostly unsolved in 2035 conditional on no superintelligence, especially if there were no intervention from x-risk people).

I had the impression that it was more than just that

Yes, the post was about more than that. To the extent I was arguing against a single line of work, it was mainly intended as a critique of public advocacy. Separately, I asked people to re-evaluate which problems will be solved by default, to refocus our efforts on the most neglected, important problems, and went into detail about what I currently expect will be solved by default.

If you have any you think faithfully represent a possible disagreement between us go ahead.

I offered a concrete prediction in the post. If people don't think my prediction operationalizes any disagreement, then I think (1) either they don't disagree with me, in which case maybe the post isn't really aimed at them, or (2) they disagree with me in some other way that I can't predict, and I'd prefer they explain where they disagree exactly.

a big issue with the market you've made is that it is about what will happen in the world, not what will happen without intervention from AI x-risk people.

It seems relatively valueless to predict on what will happen without intervention, since AI x-risk people will almost certainly intervene.

Furthermore it has all the usual issues with forecasting on complex things 12 years in advance, regarding the extent to which it operationalizes any disagreement well (I've bet yes on it, but think it's likely that evaluating and fixing deceptive alignment will remain mostly unsolved in 2035, especially if there were no intervention from x-risk people).

I mostly agree. But I think it's still better to offer a precise prediction than to only offer vague predictions, which I perceive as the more common and more serious failure mode in discussions like this one.

However, this scenario—at least as it was literally portrayed—now appears very unlikely.

Currently I'd say it is most likely to take months, second-most likely to take weeks, third-most likely to take years, fourth-most likely to take hours, and fifth-most likely to take decades. I consider this a mild win for Yudkowsky's prediction, but only a mild one; it's basically a wash. I definitely disagree with the "very unlikely" claim you make, however.

I think you're ignoring the qualifier "literally portrayed" in Matthew's sentence, and neglecting the prior context that he's talking about AI development being something mainly driven forward by hobbyists with no outsized impacts.

He's talking about more than just the time in which AI goes from e.g. doubling the AI software R&D output of humans to some kind of singularity. The specific details Eliezer has given about this scenario have not been borne out: for example, in his 2010 debate with Robin Hanson, he emphasized a scenario in which a few people working in a basement and keeping all of their insights secret hit upon some key software innovation that enables their piece of consumer hardware to outcompete the rest of the world.

It's worth noting that Robin Hanson also said that "takeoff" is most likely to take months. He just said it for ems, and in his world, that rate of growth was being driven by the entire world economy working as a whole rather than one local part of the world having such better software that it could outcompete everyone else with vastly fewer material resources. I find it incomprehensible that you call this a "mild win" for Eliezer's prediction, given that we live in a world where individual AI labs are being valued at ~$100B and raising tens of billions of dollars in capital.

Hmm, I do agree the foom debates talk a bunch about a "box in a basement team", but the conversation was pretty explicitly not about the competitive landscape and how many people are working on this box in a basement, etc. It was about whether it would be possible for a box in a basement with the right algorithms to become superhuman in a short period of time. In particular, Eliezer says:

In other words, I’m trying to separate out the question of “How dumb is this thing (points to head); how much smarter can you build an agent; if that agent were teleported into today’s world, could it take over?” versus the question of “Who develops it, in what order, and were they all trading insights or was it more like a modern-day financial firm where you don’t show your competitors your key insights, and so on, or, for that matter, modern artificial intelligence programs?”

The key question that the debate was about was whether building AGI would require maybe 1-2 major insights about how to build it, vs. it would require the discovery of a large number of algorithms that would incrementally make a system more and more up-to-par with where humans are at. That's what the "box in a basement" metaphor was about.

Eliezer also said other things around that time that make it explicit that he wasn't intending to make any specific predictions about how smooth the on-ramp to pre-foom AGI would be, how competitive it would be, etc.

I do think there is a directional update here, but I think your summary here is approximately misleading.

The key question that the debate was about was whether building AGI would require maybe 1-2 major insights about how to build it, vs. it would require the discovery of a large number of algorithms that would incrementally make a system more and more up-to-par with where humans are at.

Robin Hanson didn't say that AGI would "require the discovery of a large number of algorithms". He emphasized instead that AGI would require a lot of "content" and would require a large "base". He said,

My opinion, which I think many AI experts will agree with at least, including say Doug Lenat who did the Eurisko program that you most admire in AI [gesturing toward Eliezer], is that it's largely about content. There are architectural insights. There are high-level things that you can do right or wrong, but they don't, in the end, add up to enough to make vast growth. What you need for vast growth is simply to have a big base. [...]
Similarly, I think that for minds, what matters is that it just has lots of good, powerful stuff in it, lots of things it knows, routines, strategies, and there isn't that much at the large architectural level.

This is all vague, but I think you can interpret his comment here as emphasizing the role of data, and making sure the model has learned a lot of knowledge, routines, strategies, and so on. That's different from saying that humans need to discover a bunch of algorithms, one by one, to incrementally make a system more up-to-par with where humans are at. It's compatible with the view that humans don't need to discover a lot of insights to build AGI. He's saying that insights are not sufficient: you need to make sure there's a lot of "content" in the AI too.

I personally find his specific view here to have been vindicated more than the alternative, even though there were many details in his general story that ended up aging very poorly (especially ems).

I agree that insofar as Yudkowsky predicted that AGI would be built by hobbyists with no outsized impacts, he was wrong.

ETA: So yes, I was ignoring the "literally portrayed" bit, my bad. I should have clarified that by "Yudkowsky's prediction" I meant the prediction about takeoff speeds.

At the point where the King of England is giving prepared remarks on the importance of AI safety to an international meeting including the US Vice-president and the world's richest man, including cabinet-or-higher-level participation from 28 of the most AI-capable countries and CEOs or executives from all the super-scalers, it's hard to claim that the world isn't taking AI safety seriously.

I think we need to move on to trying to help them make the right decisions.

I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.

Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.

1. Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones.
   1. You argue that we perhaps shouldn't invest as much in preventing deceptive alignment because "regulators will likely adapt, adjusting policy as the difficulty of the problem becomes clearer".
   2. If we are assuming that regulators will adapt and adjust regarding deception, can you provide examples of interventions that policymakers will not be able to solve themselves and why they will be less likely to notice and deal with them than deception?
   3. You say "we should question how plausible it is that society will fail to adequately address such an integral part of the problem". What things aren't integral parts of the problem but that should be worked on?
      1. I feel we would need much better evidence of things being handled competently to invest significantly less into integral parts of the problem.
2. You say: 'Of course, it may still be true that AI deception is an extremely hard problem that reliably resists almost all attempted solutions in any “normal” regulatory regime, even as concrete evidence continues to accumulate about its difficulty—although I consider that claim unproven, to say the least'
   1. If we expect some problems in AI risk to be solved by default mostly by people outside the community, it feels to me like one takeaway would be that we should shift resources to portions of the problem that we expect to be the hardest.
   2. To me, intuitively, deceptive alignment might be one of the hardest parts of the problem as we scale to very superhuman systems, even if we condition on having time to build model organisms of misalignment and experiment with them for a few years. So I feel confused about why you claim a high level of difficulty is "unproven" as a dismissal; of course it's unproven but you would need to argue that in worlds where the AI risk problem is fairly hard, there's not much of a chance of it being very hard.
   3. As someone who is relatively optimistic about concrete evidence of deceptive alignment increasing substantially before a potential takeover, I think I still put significantly lower probability on it than you do due to the possibility of fairly fast takeoff.
3. I feel like this post is to some extent counting our chickens before they hatch (tbc I agree with the directional update as I said above). I'm not an expert on what's going on here but I imagine any of the following happening (non-exhaustive list) that make the current path to potentially sensible regulation in the US and internationally harder:
   1. The EO doesn't lead to as many resources dedicated to AI-x-risk-reducing things as we might hope. I haven't read it myself, just the fact sheet and Zvi's summary but Zvi says "If you were hoping for or worried about potential direct or more substantive action, then the opposite applies – there is very little here in the way of concrete action, only the foundation for potential future action."
   2. A Republican President comes in power in the US and reverses a lot of the effects in the EO.
   3. Rishi Sunak gets voted out in the UK (my sense is that this is likely) and the new Prime Minister is much less gung-ho on AI risk.
4. I don't have strong views on the value of AI advocacy, but this post seems overconfident in calling it out as being basically not useful based on recent shifts.
   1. It seems likely that much stronger regulations will be important, e.g. the model reporting threshold in the EO was set relatively high and many in the AI risk community have voiced support for an international pause if it were politically feasible, which the EO is far from.
   2. The public still doesn't consider AI risk to be very important. <1% of the American public considers it the most important problem to deal with. So to the extent that raising that number was good before, it still seems pretty good now, even if slightly worse.

Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones

I have three things to say here:

1. Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.
2. I mostly think people should think harder about what the hard parts of AI risk are in the first place. It would not be surprising if the "hard bits" will be things that we've barely thought about, or are hard to perceive as major problems, since their relative hiddenness would be a strong reason to believe that they will not be solved by default.
3. The problem of "make sure policies are well-targeted, informed by the best evidence, and mindful of social/political difficulties" seems like a hard problem that societies have frequently failed to get right historically, and the relative value of solving this problem seems to get higher as you become more optimistic about the technical problems being solved.

I feel like this post is to some extent counting our chickens before they hatch (tbc I agree with the directional update as I said above). [...] I don't have strong views on the value of AI advocacy, but this post seems overconfident in calling it out as being basically not useful based on recent shifts.

I have three things to say here:

Thanks for clarifying.

Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.

I don't have a strong opinion here, but intuitively it feels like it would be hard to find tractable angles for work on this now.

I mostly think people should think harder about what the hard parts of AI risk are in the first place. It would not be surprising if the "hard bits" will be things that we've barely thought about, or are hard to perceive as major problems, since their relative hiddenness would be a strong reason to believe that they will not be solved by default.

Maybe. In general, I'm excited about people who have the talent for it to think about previously neglected angles.

