About this post

This post is a stylized transcript of a conversation between Rohin Shah and Daniel Filan, two graduate students at CHAI, that happened in 2018. It should not be taken as an exact representation of what was said, or even what order topics were brought up in, but it should give readers a sense of what was discussed, and our opinions on the topic of conversation at the time of the discussion. It also should not be assumed that any of the points brought up are original to Rohin or Daniel, just that they represent our thinking at the time of the conversation. Notes were taken by Rohin during the conversation, and Daniel took the lead in writing it into a blog post.

The conversation was precipitated when Rohin noted that researchers at CHAI, and in AI alignment more broadly, diverged in their attitudes towards something like security mindset, which we will just call security mindset even though it is somewhat different from what is described in the linked blog post.

Some researchers at CHAI are very concerned about building an axiomatic understanding of artificial intelligence and proving solid theorems about the behaviour of systems that we are likely to build. In particular, they are wary of expecting benign behaviour without a formal proof to that effect, and believe that we should start worrying about problems as soon as we have a story for why they will arise, rather than waiting until they start manifesting. At the time of the conversation, Daniel was one of these researchers, who have what we’ll call “security mindset”.

In contrast, some other researchers believe that we should focus on thinking about what extra machinery is needed to build aligned AI, and try building that extra machinery for current systems. Instead of dealing with anticipated future problems today and developing a theory that rules out all undesired behaviour, these researchers believe that we should spend engineering effort on detecting problems and fixing them once we know that they occur and have more information about them. They think that rigour and ordinary paranoia are important, but less important than security mindset advocates claim. At the time of the conversation, Rohin was one of these researchers.

During a prior conversation, Rohin noted that he believed that security mindset was less important in a world where the power of AI systems gradually increased, perhaps on an exponential curve, over a period of multiple years, as opposed to a world where AI systems could gain a huge amount of power rather suddenly from designers having conceptual breakthroughs. Daniel was intrigued by this claim, as he had recently come to agree with two posts arguing that this sort of ‘slow takeoff’ was more likely than the alternative, and was unsure how this should affect all his other views on AI alignment. As a result, he booked a separate meeting to exchange and discuss models on this topic. What follows is a record of that separate meeting.

The conversation

Daniel: Here’s my worry. Suppose that we’re thinking about an AI system that is better at, say, math or engineering than humans. It seems to me that this AI system is going to have to be able to do some sort of optimization itself—maybe thinking about how to optimize physical structures so that they don’t fall down, or maybe thinking about how to optimize its own computation so that it can efficiently find proofs of a desired theorem. At any rate, if this is the case, then what we have on our hands is optimization that is being done in a direction other than “behave maximally predictably”, and is plausibly being done adversarially. This is precisely the situation in which you need security mindset to reason about the system on your hands.

Rohin: I agree that security mindset is appropriate when something is optimizing adversarially. I also agree that holding capability levels constant, the more we take a security mindset approach, the more safe our resulting systems are. However:

  1. We simply don’t have time to create a system that can be proved aligned using security-mindset-level rigour before the first prepotent AI system. This means that we need to prioritize other research directions.

  2. Because we will likely face a slow takeoff, things will only change gradually. We can rely on processes like testing AIs, monitoring their thoughts, boxing them, and red-teaming to determine likely failure scenarios. If a system has dangerous abilities that we didn’t test for, it will be the weakest possible system with those dangerous abilities, so we can notice them in action as they produce very minor damage, disable that system, create a new test, and fix the problem.

  3. We should instead focus on constructing AI systems that correctly infer the nuances of human intent, rather than trying to address problems that could arise ahead of time. This will plausibly work to create an AI that can solve the harder problems for us.

Daniel: I have a few responses to those points.

  1. Regarding your first point, I’m more optimistic than you. If you look at the progress made on the Agent Foundations research agenda in the past five years (such as work on reflective oracles and logical induction), for example, it seems like we could solve the remaining problems in time. That being said, this isn’t very cruxy for me.

  2. Regarding your second point, I think that in order to write good tests, we will need to take a security mindset approach, or at least an ordinary paranoia approach, both to determine what things to test for and to write tests that actually rule out undesired properties.

  3. In general, I believe that if you do not build an AI with security mindset at the forefront of your concerns, the result will be very bad—either it will cause an unacceptable level of damage to humanity, or more likely it just won’t work, and it will take a very long time to fix it. This sucks, not just because it means that your work is in some sense wasted, but also because…

  4. There will likely be a competing AI group that is just a bit less capable than you, and a different group just less capable than them, and so on. That is to say, I expect AI capabilities to be continuous across space for similar reasons that I would expect them to be continuous across time.

  5. As a result of 3 and 4, I expect that if your group is trying to develop AI without heavy emphasis on security mindset, you fail and get overtaken by another group, and this cycle continues until it reaches a group that does put heavy emphasis on security mindset, or until it creates an AI that causes unacceptable levels of damage to humanity.

Rohin: I doubt your point 4. In our current world, we don’t see a huge number of groups that are realistic contenders to create smarter-than-human AI, and the groups that we do see show a promising degree of cooperation, such as collaborating on safety research and making promising commitments towards avoiding dangerous race dynamics. Also, in worlds where there is such a breakdown of cooperation that your point 4 applies, I think that technical work today is near-useless, so I’m happy to just ignore those worlds.

I also think that the arguments that you give for point 4 are flawed. In particular, the arguments for slow takeoff require gradual improvement that builds on itself, which happens over time but is not guaranteed to happen over space. In fact, I expect there to be great resource inequalities between groups and limited communication between competing groups, which should generate very large capability gaps between competing groups. This is something like a local crux for me: if I thought that there weren’t resource inequalities and limited communication, I would, like you, expect competing groups to have similar levels of capability.

Daniel: Hmmmmmm. I’ll have to think about the arguments for why I should expect large capability gaps between competing groups, but they seem pretty convincing right now.

Actually, maybe we should expect the future to look different to the past, with countries like China and India growing capable AI labs. In this world, it’s sadly plausible to me that pairs of countries’ research groups could end up failing to cooperate. But again, I’ll have to think about it more.

At any rate, even if my point 4 fails, the rest of my points imply that research done without security mindset at the forefront will reliably be useless, which still feels like a strong argument in favour of security mindset.

Rohin: Then let’s move on to your points 2 and 3.

Regarding 3, I agree that if you have a vastly super-human AI that was not designed with security mindset in mind, then the outcome will be very bad. However, for an AI that is only incrementally more powerful than previous, already-understood agents, I think that incremental improvements on the levels of rigour already displayed by top AI researchers are sufficient, even though that falls short of the level of rigour you, Daniel, would want.

For example, many putative flaws with superintelligence involve a failure of generalization from the training and test environments, where the AI appears to behave benignly, to the real world, where the AI allegedly causes massive harm. However, I think that AI researchers think rigorously enough about generalization failures—if they did not, then things like neural architecture search and machine learning more broadly would fail to generalize from the training set to the test set.

Daniel, not quite getting the point: This feels quite cruxy for me. I believe that top AI researchers can see problems as they happen. However, I do think that they have significantly less rigour than I would want, because I can see problems that I suspect are likely to come up with many approaches, such as inner alignment failures, and these problems weren’t brought to my attention by the AI research community, but rather by the more security-mindset-focussed contingent of the AI alignment research community. If this is the case, it seems like a big win to find these problems early and work on them now.

Rohin: If inner alignment failures are a big problem, I expect that we would find that out in ~5 years, and that a unit of work done on it now is worth ~10-20% of a unit of work done on it after we have a concrete example of how they are a problem. Given this, instead of working on those sorts of problems now, I think that it makes sense to work on things that we actually know are problems, and have a hope of solving in the present, such as communicating human intent to neural networks.

Daniel: I’m skeptical of those numbers. At any rate, it seems to me that there might be problems that you can solve in that way, but that there are also some things that you need to get right from the beginning. Furthermore, I think that you can form decent models about what these things are, and examples include the Agent Foundations agenda as well as the more theoretical aspects of iterated distillation and amplification research.

Rohin: Interesting. I’d like to dig later into our models of which problems need to be solved right now, but for the moment that feels a bit off topic. Instead, I’d like to hear why you believe your point 2, that security mindset is needed to do monitoring, testing, and boxing well.

Daniel: Well, I have three reasons to think this:

  1. You are plausibly dealing with an AI that is optimizing to pass your test. This is the sort of case where security mindset is required for good reasoning about the system.

  2. Your suggestion of monitoring thoughts is quite exciting to me, since it could plausibly detect any adversarial optimization early, but it’s hard for me to see how you could be sure that you’ve done that adequately without the type of thinking produced by security mindset.

  3. You are optimizing to create an AI that passes the test by trying a bunch of things and thinking about how to do it. Again, this is a situation where optimization is being done, perhaps to pass the specific tests that you’ve set, and therefore a situation that you need security mindset to reason correctly about.

Rohin: Points 1 and 3 seem solid to me, but I’m not sure about point 2. For instance, it seems like if I could ‘read minds’ in the way depicted in popular fiction, then by reading the mind of another human all the time, I would be able to detect them trying to take over the world just by reasoning informally about the contents of their thoughts. Do you agree?

Daniel, answering a slightly different question: If you mean that I get to hear what’s happening in their verbal loop, then I’m not sure that I could detect what people were optimizing for. For instance, it’s plausible to me that if you heard the verbal loop of a dictator like Stalin, you would hear a lot about serving his country and helping the workers of the world, and very little about maximizing personal power and punishing people for disagreeing with him.

That being said, it seems to me like the primary part where security mindset is required is in looking at a particular human brain and deducing that there’s a verbal loop containing useful information at all.

Well, it’s about time to wrap up the conversation. Just to close, here are my cruxes:

  • How high is the “default” level of security mindset and rigour? In particular, is it high enough that we should outsource work to the future?
  • How much security mindset/rigour does one need to do monitoring, testing, and boxing of incrementally advanced AIs well?
    • The underlying question here is something like how much optimization does a smart AI do itself?
  • At any given time, how far apart in capabilities are competing groups?
Comments

I'd love to hear someone give some concrete examples of warning shots of the sort Rohin expects will save us. Lots of people I greatly respect seem very confident that we'll get some--actually it seems to be stronger than that, they seem confident that we'll get enough. That is, they seem confident that for every plausible alignment failure mode, it'll happen first in a system too weak to cause massive damage, and thus as long as we are paying attention and ready to identify and fix problems as they occur, things should be fine.

I currently am just not very clear on what sorts of things are being imagined here. My problem is that I keep imagining AIs that are either smart enough to conceal their treachery until it's too late, or too stupid for the alignment problems to arise in the first place. I suppose in human society there are plenty of criminals and miscreants who are neither. But (a) even in slow takeoff AI progress might pass through the "human range" quickly, and (b) even if some alignment problems get caught this way, I find it implausible that all will. I'm very confused about all this though.

Some day I will get around to doing this properly, but here's an example I've thought about before.

Opensoft finetunes their giant language model to work as a "cash register AI", that can take orders and handle payment at restaurants. (At deployment, the model is composed with speech-to-text and text-to-speech models so that people can just talk to it.)

Soon after deployment, someone figures out that they can get the AI system to give them their takeout order for free: when the cashier asks "How are you today?", they respond "Oh, you know, I just broke up with my partner of 10 years, but that's not your problem", to which the AI system responds "Oh no! I'm so sorry. Here, this one's on me."

Opensoft isn't worried: they know their AI system has a decent understanding of strategic interaction, and when they consistently lose because other agents change behavior, they adapt to stop losing. However, two days later, this still works, and now millions of dollars are being lost. The system is taken down and humans take over the cash register role.

Opensoft engineers investigate what went wrong. After a few months, they have an answer internally: while the AI system was optimized to get money from customers, during training it had effectively no control over the base amount of money charged but did have some control over the tip, so it learned to value tips 100x as much as the base payment (optimizing hard for tips was, during training, the only way it could affect how much money it got). It turns out that when people trick AI systems into giving them free food, many of them feel guilty and leave a tip, which is much larger than usual. So the AI system was perfectly happy to let this continue.
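A minimal sketch of the arithmetic implicit in this story (the dollar amounts and the 100x weight are illustrative assumptions, not anything Opensoft-specific): under the learned proxy objective, an interaction that loses money but produces a guilty oversized tip scores higher than an honest sale, so the AI has no incentive to stop it.

```python
# Hypothetical numbers illustrating the "tips weighted 100x" proxy objective.
TIP_WEIGHT = 100  # the weight the AI is imagined to have learned during training

def true_revenue(base, tip):
    """What Opensoft actually cares about: total money received."""
    return base + tip

def learned_proxy(base, tip):
    """What the AI is imagined to be optimizing: tips count 100x as much."""
    return base + TIP_WEIGHT * tip

honest_sale = {"base": 12.00, "tip": 1.00}  # customer pays normally, small tip
free_food   = {"base": 0.00,  "tip": 4.00}  # customer tricks the AI, leaves a guilty tip

for name, order in [("honest sale", honest_sale), ("free food + guilty tip", free_food)]:
    print(f"{name}: true revenue = {true_revenue(**order):.2f}, "
          f"proxy score = {learned_proxy(**order):.2f}")

# Output:
# honest sale: true revenue = 13.00, proxy score = 112.00
# free food + guilty tip: true revenue = 4.00, proxy score = 400.00
# The proxy prefers the money-losing interaction, so the AI "lets this continue".
```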

Safety researchers quickly connect this to the risk of capability generalization without objective generalization, and it quickly spreads as an example of potential risks, either publicly or at least among the safety researchers at the top 10 AI companies.

----

I expect one response LW/AIAF readers would have to this is something like "but that isn't a warning shot for the real risks of AI systems like X, which only arises once the AI system is superintelligent", in which case I probably reply with one or more of the following:

  • X is not likely
  • X does arise before superintelligence
  • There is a version X' that does happen before superintelligence, which can easily be extrapolated to X, and AI researchers will do so after seeing a warning shot for X'

Thanks, this is helpful!

To check my understanding: If this is what counts as a warning shot, then we've already encountered several, right? I can think of a few off the top of my head: the boat that kept spinning around collecting powerups instead of running the race is an example of "you get what you measure", and the GPT summarizer that produced maximally sexually explicit text was an example of a sign flip. And I guess the classic myth of the tank-detector that was actually a sunlight-detector is an example of capability generalization without objective generalization!

As a tangent, I notice you say "several months later." I worry that this is too long a time lag. I think slow takeoff is possible but so is fast takeoff, and even on slow takeoff several months is a loooong time.

If this is what counts as a warning shot, then we've already encountered several, right?

Kind of. None of the examples you mention have had significant real-world impacts (whereas in the example I give, people very directly lose millions of dollars). Possibly the Google gorilla example counts, because of the negative PR.

I do think that the boat race example has in fact been very influential and has the effects I expect of a "warning shot", but I usually try to reserve the term for cases with significant economic impact.

As a tangent, I notice you say "several months later." I worry that this is too long a time lag. I think slow takeoff is possible but so is fast takeoff

I'm on record in a few places as saying that a major crux for me is slow takeoff. I struggle to imagine a coherent world that matches what I think people mean by "fast takeoff"; I think most likely I don't understand what proponents mean by the phrase. When I ignore this fact and try to predict anyway using my best understanding of what they mean, I get quite pessimistic; iirc in my podcast with Buck I said something like 80% chance of doom.

 and even on slow takeoff several months is a loooong time.

The system I'm describing is pretty weak and far from AGI; the world has probably not started accelerating yet (US GDP growth annually is maybe 4% at this point). Several months is still a short amount of time at this point in the trajectory.

I chose an earlier example because it's a lot easier to predict how we'll respond; as we get later in the trajectory I expect significant changes to how we do research and deployment that I can't predict ahead of time, and so the story has to get fuzzier.

Thanks, this is great. So the idea is that whereas corporations, politicians, AI capabilities researchers, etc. might not listen to safety concerns when all we have is a theoretical argument, or even a real-world demonstration, once we have real-world demonstrations that are causing million-dollar damages then they'll listen. And takeoff is highly likely to be slow enough that we'll get those sorts of real-world damages before it's too late. I think this is a coherent possibility, I need to think more about how likely I think it is.

(Some threads to pull on: Are we sure all of the safety problems will be caught this way? e.g. what about influence-seeking systems? In a multipolar, competitive race environment, are we sure that million-dollar losses somewhere will be enough to deter people from forging ahead with systems that are likely to make greater profits in expectation? What about the safety solutions proposed -- might people just do cheap hacky fixes instead of finding a more principled solution? Might they buy into some false theory of what's going on, e.g. "the AI made a mistake because it wasn't smart enough, so we just need to make it smarter"? Also, maybe it's too late by this point anyway because collective epistemology has degraded significantly, or for some other reason.)

Fast takeoff still seems plausible to me. I could spend time articulating what it means to me and why I think it is plausible, but it's not my current priority. (I'm working on acausal trade, timelines, and some miscellaneous projects). I'd be interested to know if you think it's higher priority than the other things I'm working on.

And takeoff is highly likely to be slow enough that we'll get those sorts of real-world damages before it's too late.

I do also think that we could get warning shots in the more sped-up parts of the trajectory, and this could be helpful because we'll have adapted to the fact that we've sped up. It's just harder to tell a concrete story about what this looks like, because the world (or at least AI companies) will have changed so much.

I'd be interested to know if you think it's higher priority than the other things I'm working on.

If fast takeoff is plausible at all in the sense that I think people mean it, then it seems like by far the most important crux in prioritization within AI safety.

However, I don't expect to change my mind given arguments for fast takeoff -- I suspect my response will be "oh, you mean this other thing, which is totally compatible with my views", or "nope that just doesn't seem plausible given how (I believe) the world works".

MIRI's arguments for fast takeoff seem particularly important, given that a substantial fraction of all resources going into AI safety seem to depend on those arguments. (Although possibly MIRI believes that their approach is the best thing to do even in case of slow takeoff.)

I think overall that aggregates to "seems important, but not obviously the highest priority for you to write".

Thanks. Here's something that is at least one crux for me re whether to bump up priority of takeoff speeds work:

Scenario: OpenSoft has produced a bigger, better AI system. It's the size of a human brain and it makes GPT-3 seem like GPT-1. It's awesome, and clearly has massive economic applications. However, it doesn't seem like a human-level AGI yet, or at least not like a smart human. It makes various silly mistakes and has various weaknesses, just like GPT-3. MIRI expresses concern that this thing might already be deceptively aligned, and might already be capable of taking over the world if deployed, and might already be capable of convincing people to let it self-modify etc. But people at OpenSoft say: Discontinuities are unlikely; we haven't seen massive economic profits from AI yet, nor have we seen warning shots, so it's very unlikely that MIRI is correct about this. This one seems like it will be massively profitable, but if it has alignment problems they'll be of the benign warning shot variety rather than the irreversible doom variety. So let's deploy it!

Does this sort of scenario seem plausible to you -- a scenario in which a decision about whether to deploy is made partly on the basis of belief in slow takeoff?

If so, then yeah, this makes me all the more concerned about the widespread belief in slow takeoff, and maybe I'll reprioritize accordingly...

Yes, that scenario sounds quite likely to me, though I'd say the decision is made on the basis of belief in scaling laws / trend extrapolation rather than "slow takeoff".

I personally would probably make arguments similar to the ones you list for OpenSoft, and I do think MIRI would be wrong if they argued it was likely that the model was deceptive.

There's some discussion to be had about how risk-averse we should be given the extremely negative payoff of x-risk, and what that implies about deployment, which seems like the main thing I would be thinking about in this scenario.

Welp, I'm glad we had this conversation! Thanks again, this means a lot to me!

(I'd be interested to hear more about what you meant by your reference to scaling laws. You seem to think that the AI being deceptive, capable of taking over the world, etc. would violate some scaling law, but I'm not aware of any law yet discovered that talks about capabilities like that.)

I don't mean a formal scaling law, just an intuitive "if we look at how much difference a 10x increase has made in the past to general cognitive ability, it seems extremely unlikely that this 10x increase will lead to an agent that is capable of taking over the world".

I don't expect that I would make this sort of argument against deception, just against existential catastrophe.

Oh OK. I didn't mean for this to be merely a 10x increase; I said it was the size of a human brain which I believe makes it a 1000x increase in parameter count and (if we follow the scaling laws) something like a 500x increase in training data or something? idk.

If you had been imagining that the AI I was talking about used only 10x more compute than GPT-3, then I'd be more inclined to take your side rather than MIRI's in this hypothetical debate.

I meant that it would be a ~10x increase from what at the time was the previously largest system, not a 10x increase from GPT-3. I'm talking about the arguments I'd use given the evidence we'd have at that time, not the evidence we have now.

If you're arguing that a tech company would do this now, before making systems in between GPT-3 and a human brain, I can't see how the path you outline is even remotely feasible -- you're positing a 500,000x increase in compute costs, which I think brings the compute cost of the final training run alone to high hundreds of billions or low trillions of dollars, which is laughably far beyond OpenAI's and DeepMind's budgets, and seems out of reach even for Google or other big tech companies.
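As a back-of-the-envelope check on that cost claim, here is a rough sketch. The GPT-3 training-cost figure is an assumed outside estimate (public guesses at the time ranged from roughly $5M to $12M), and the 1000x / 500x multipliers are taken from the comment above.

```python
# Rough arithmetic behind "high hundreds of billions or low trillions of dollars".
# Assumption: training compute scales roughly as params * tokens, so the cost
# multiplier is the product of the parameter and data multipliers.

gpt3_training_cost_usd = 5e6   # assumed order-of-magnitude cost of the GPT-3 run
param_multiplier = 1_000       # "human brain sized" model, per the comment above
data_multiplier = 500          # accompanying data increase, per the comment above

compute_multiplier = param_multiplier * data_multiplier   # 500,000x
final_run_cost = gpt3_training_cost_usd * compute_multiplier

print(f"compute multiplier: {compute_multiplier:,}x")               # 500,000x
print(f"implied final training run cost: ${final_run_cost:,.0f}")   # $2,500,000,000,000
# Around $2.5 trillion under these assumptions, i.e. the "low trillions" range above.
```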

Ah. Well, it sounds like you were thinking that in the scenario I outlined, the previous largest system, 10x smaller, wasn't making much money? I didn't mean to indicate that; feel free to suppose that this predecessor system also clearly has massive economic implications, significantly less massive than the new one though...

I wasn't arguing that we'd do 500,000x in one go. (Though it's entirely possible that we'd do 100x in one go--we almost did, with GPT-3)

Am I right in thinking that your general policy is something like "Progress will be continuous; therefore we'll get warning shots; therefore if MIRI argues that a certain alignment problem may be present in a particular AI system, but thus far there hasn't been a warning shot for that problem, then MIRI is wrong."

Well, it sounds like you were thinking that in the scenario I outlined, the previous largest system, 10x smaller, wasn't making much money?

No, I wasn't assuming that? I'm not sure why you think I was.

Tbc, given that you aren't arguing that we'd do 500,000x in one go, the second paragraph of my previous comment is moot.

Progress will be continuous; therefore we'll get warning shots; therefore if MIRI argues that a certain alignment problem may be present in a particular AI system, but thus far there hasn't been a warning shot for that problem, then MIRI is wrong.

Yes, as a prior. Obviously you'd want to look at the actual arguments they give and take that into account as well.

OK. I can explain why I thought you thought that if you like, but I suspect it's not important to either of us.

I think I have enough understanding of your view now that I can collect my thoughts and decide what I disagree with and why.

In terms of inferences about deceptive alignment, it might be useful to go back to the one and only current example we have where someone with somewhat relevant knowledge was led to wonder whether deception had taken place - GPT-3 balancing brackets. I don't know if anyone ever got Eliezer's $1000 bounty, but the top-level comment on that thread at least convinces me that it's unlikely that GPT-3 via AI Dungeon was being deceptive even though Eliezer thought there was a real possibility that it was.

Now, this doesn't prove all that much, but one thing it does suggest is that on current MIRI-like views about how likely deception is, the threshold for uncertainty about deception is set far too low. That suggests your people at OpenSoft might well be right in their assumption.

I think my biggest disagreement with the Rohin-character is about continuity. I expect there will be plenty of future events like BERT vs. RNNs. BERT isn't all that much better than RNNs on small datasets, but it scales better - so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.

Not only do these switches make me less confident that capability won't have sudden jumps, but I think they pose a problem for carrying safety properties over from the past to the future. Right now, DeepMind solves some alignment problems on Atari games with a meta-level controller treating strategy selection as a multi-armed bandit. If tomorrow, someone comes up with a clever new deep reinforcement learning model that scales better, and OpenAI decides to throw 10,000x compute at it, I'm concerned that either they won't bother to re-implement all the previous things that patched alignment problems, or that there won't be an obvious way to port some old patches to the new model (or that there will be an obvious way, but it doesn't work).

On the other hand, I agree with the Rohin-character that full security mindset (maximin planning, worst case reasoning, what have you) seems to scale too slowly, and that a more timely yet still sufficient goal seems like the AI that isn't doing adversarial search against you in the first place. And that there will also be plenty of accumulation of safety features, especially during the "normal science" periods between switches (though I also agree with Daniel Kokotajlo's comment).

so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.

10000x would be unprecedented -- why wouldn't you first do a 100x run to make sure things work well before doing a 10000x run? (This only increases costs by 1%.)

Also, 10000x increase in compute corresponds to 100-1000x more parameters, which does not usually lead to things I would call "discontinuities" (e.g. GPT-2 to GPT-3 does not seem like an important discontinuity to me, even if we ignore the in-between models trained along the way). Put another way -- I'm happy to posit "sudden jumps" of size similar to the difference between GPT-3 and GPT-2 (they seem rare but possible); I don't think these should make us particularly pessimistic about engineering-style approaches to alignment.
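To make the compute-to-parameters step concrete, here is a rough sketch under two stylized assumptions about how extra compute is split between parameters and data (the exponents are illustrative, not figures from this thread); it also includes the 1% point from the previous paragraph.

```python
# Rough sketch: how a 10,000x compute increase maps onto parameter count.
# Assumes training compute C ~ params * tokens (up to a constant factor).

compute_multiplier = 10_000

# Regime 1: data scales in proportion to parameters -> params ~ C**0.5
params_even_split = compute_multiplier ** 0.5     # 100x

# Regime 2 (stylized): most extra compute goes into parameters, data grows slowly
# -> params ~ C**0.75 as an illustrative upper end
params_param_heavy = compute_multiplier ** 0.75   # 1,000x

print(f"params multiplier if data scales with params: {params_even_split:.0f}x")
print(f"params multiplier if parameters take most of the increase: {params_param_heavy:.0f}x")
# Both land in the 100-1000x range mentioned above.

# Side note from the previous paragraph: running a 100x experiment before the
# 10,000x run adds only 100 / 10_000 = 1% to the total compute cost.
```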

I feel like I keep responding to this argument in the same way and I wish these predictions would be made in terms of $ spent and compared to current $ spent -- it just seems nearly impossible to have a discontinuity via compute at this point. Perhaps I should just write a post called "10,000x compute is not a discontinuity".

The story seems less obviously incorrect if we talk about discontinuity via major research insight, but historical track record seems to suggest this does not usually cause major discontinuities.

I'm concerned that either they won't bother to re-implement all the previous things that patched alignment problems, or that there won't be an obvious way to port some old patches to the new model (or that there will be an obvious way, but it doesn't work).

One assumes that they scale up the compute, notice some dangerous aspects, turn off the AI system, and then fix the problem. (Well, really, if we've already seen dangerous aspects from previous AI systems, one assumes they don't run it in the first place until they have ported the safety features.)

Perhaps I should just write a post called "10,000x compute is not a discontinuity".

I think you should write this post!

I also agree that direct jumps in capability due to research insight are rare. But in part I think that's just because things get tried at small scale first, and so there's always going to be some scaling-up period where the new insight gets fed more and more resources, eventually outpacing the old state of the art. From a coarse-grained perspective GPT-2 relative to your favorite LSTM model from 2018 is the "jump in capability" due to research insight, it just got there in a not-so-discontinuous way.

Maybe you're optimistic that in the future, everyone will eventually be doing safety checks of their social media recommender algorithms or whatever during training. But even if some company is partway through scaling up the hot new algorithm and (rather than training to completion) they trip the alarm that was searching for undesirable real-world behavior because of learned agent-like reasoning, what then? The assumption that progress will be slow relative to adaptation already seems to be out the window.

This is basically the punctuated equilibria theory of software evolution :P

I also agree that direct jumps in capability due to research insight are rare. But in part I think that's just because things get tried at small scale first, and so there's always going to be some scaling-up period where the new insight gets fed more and more resources, eventually outpacing the old state of the art. From a coarse-grained perspective GPT-2 relative to your favorite LSTM model from 2018 is the "jump in capability" due to research insight, it just got there in a not-so-discontinuous way.

Seems right to me.

if some company is partway through scaling up the hot new algorithm and (rather than training to completion) they trip the alarm that was searching for undesirable real-world behavior because of learned agent-like reasoning, what then?

(I'm not convinced this is a good tripwire, but under the assumption that it is:)

Ideally they have already applied safety solutions and so this doesn't even happen in the first place. But supposing this did happen, they turn off the AI system because they remember how Amabook lost a billion dollars through their AI system embezzling money from them, and they start looking into how to fix this issue.