AI ALIGNMENT FORUM
Comparing risk from internally-deployed AI to insider and outsider threats from humans

by Buck Shlegeris
23rd Jun 2025
3 min read

1 comment, from RobertM, on the post below:

> They imagine writing small and carefully locked-down infrastructure and allowing the AIs to interact with it.

That's surprising and concerning.  As you say, if these companies expect their AIs to do end-to-end engineering and R&D tasks internally, it seems difficult to imagine how they could do that without giving those AIs employee-level privileges.  Any place where they don't is a place where humans turn into a bottleneck.  I can imagine a few possible objections to this:

  1. They don't expect those bottlenecks to impose that much of a penalty.
    1. I'm not sure how this could be true unless you think that AI systems will be pretty weak, which is sort of fighting the premise.  What are we worried about, again?
  2. They expect the bottlenecks to impose a large penalty and plan on biting that bullet when the time comes.
    1. I currently roll to disbelieve based on both the publicly-held positions of the relevant organizations and also their historical track records.  Incentives here seem quite bad.
  3. They plan on trying to thread the needle by employing some control schemes where (for example) different "agents" have different permissions.  E.g. a "code writing" agent has read permissions for (some parts of) the codebase, the ability to write, deploy, and test changes to that code in a sandboxed dev environment, and the ability to open a pull request with those changes.  Another set of agents has permissions to review pull requests, and then request changes, approve/merge them, or flag the PR as suspicious.  Yet another set of agents acts as gatekeepers to sensitive data that might be needed for some experiments, but only with good justification.  (A toy sketch of what this might look like follows this list.)
    1. This still suffers from the incentive gradient pushing quite hard to just build end-to-end agents.  Not only will it probably work better, but it'll be straight up cheaper and easier!
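
To gesture at what I mean in option 3, here's a toy sketch of per-role permissions.  All the names (Permission, AGENT_ROLES, is_allowed) are made up for illustration; this isn't anyone's actual access-control layer.

```python
# Toy sketch of option 3: separate agent roles with deliberately disjoint
# permissions. Every name here (Permission, AGENT_ROLES, is_allowed) is invented.
from enum import Enum, auto


class Permission(Enum):
    READ_CODE = auto()            # read (some parts of) the codebase
    WRITE_SANDBOX = auto()        # write/deploy/test in a sandboxed dev environment
    OPEN_PR = auto()              # open a pull request with proposed changes
    REVIEW_PR = auto()            # request changes, approve/merge, or flag a PR
    READ_SENSITIVE_DATA = auto()  # access gated data, only with good justification


# Each role gets only what it needs; no single agent holds every permission.
AGENT_ROLES: dict[str, set[Permission]] = {
    "code_writer": {Permission.READ_CODE, Permission.WRITE_SANDBOX, Permission.OPEN_PR},
    "reviewer": {Permission.READ_CODE, Permission.REVIEW_PR},
    "data_gatekeeper": {Permission.READ_SENSITIVE_DATA},
}


def is_allowed(role: str, permission: Permission) -> bool:
    """Return True iff the given agent role holds the given permission."""
    return permission in AGENT_ROLES.get(role, set())


assert is_allowed("code_writer", Permission.OPEN_PR)
assert not is_allowed("code_writer", Permission.REVIEW_PR)  # separation of duties
```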

Like, to be clear, I would definitely prefer a world where these organizations wrote "small and carefully locked-down infrastructure" as the limited surface their AIs were allowed to interact with; I just don't expect that to actually happen in practice.


I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.

My understanding is that there's a big qualitative distinction between the two ends of a spectrum of security work that organizations do, which I’ll call “security from outsiders” and “security from insiders”.

On the “security from outsiders” end of the spectrum, you have security invariants that you try to maintain entirely by restricting affordances with static, fully automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions users take through the user interface, those users can't violate fundamental security properties. For example, no matter what text I enter into the "new post" field on Facebook, I shouldn't be able to access the private messages of an arbitrary user. And on AWS, no matter what code I run on my EC2 instance, I shouldn't be able to access someone else's private S3 bucket.
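
To make this concrete, here is a toy sketch (my own, with invented names like Request and MESSAGE_DB) of what enforcing such an invariant in code looks like; the check runs identically on every request and never waits for a human to look at anything.

```python
# Toy illustration of a statically enforced invariant: a user can only read
# conversations they participate in, regardless of what IDs or text they submit.
# The request shape and data layout here are invented for illustration.
from dataclasses import dataclass


@dataclass
class Request:
    user_id: str
    conversation_id: str


# conversation_id -> (participants, messages)
MESSAGE_DB: dict[str, tuple[set[str], list[str]]] = {
    "conv-1": ({"alice", "bob"}, ["hi", "hello"]),
}


def get_messages(req: Request) -> list[str]:
    participants, messages = MESSAGE_DB[req.conversation_id]
    # The invariant lives here, in code, not in a human review step.
    if req.user_id not in participants:
        raise PermissionError("caller is not a participant in this conversation")
    return messages
```

Whatever inputs a caller crafts, the property "non-participants never see the messages" is enforced by the same few lines of code.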

This type of security is very "static": the invariants are enforced by code, with no human in the loop, so nothing relies on a human reacting properly to something. This is typically how you mitigate risks from your users: users typically only need certain restricted types of access to your systems, and they're very numerous, so you need to handle them with automated approaches. I sometimes call systems that aim to be safe no matter what inputs a user provides "manifestly safe".
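
Another standard example of this kind of manifest safety, using Python's built-in sqlite3 module (the schema here is invented for illustration): user text is passed to the database as a bound parameter, so no possible input can change the structure of the query.

```python
# User-supplied text is passed as data, never spliced into the SQL string itself,
# so no input can alter what the query does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author TEXT, body TEXT)")


def create_post(author: str, body: str) -> None:
    # The "?" placeholders bind author/body as pure data.
    conn.execute("INSERT INTO posts (author, body) VALUES (?, ?)", (author, body))
    conn.commit()


create_post("alice", "'); DROP TABLE posts; --")  # stored as text; nothing executes
```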

In contrast, consider how Facebook and AWS relate to their engineers. These companies want to be robust to their engineers abusing their access in order to violate security properties, whether by abusing administrator privileges (e.g. direct access to machines that hold user data) or by adding malign behavior to code that they write. To mitigate these risks, companies use very different types of security techniques. They rely on multi-party authorization, where multiple engineers have to sign off before a sensitive command is executed, or on code review, where someone needs to review code before it's pushed to environments where it has access to sensitive data. These techniques fundamentally rely on human labor.
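
As a toy sketch of what multi-party authorization looks like in code (the class and command names are invented, and a real system would also need authenticated identities and durable audit logs), the key property is that no single engineer can execute the sensitive command alone.

```python
# Sketch of multi-party authorization: a sensitive command only runs once it has
# sign-off from at least N engineers other than the requester.
from dataclasses import dataclass, field


@dataclass
class SensitiveCommand:
    requested_by: str
    command: str
    approvals: set[str] = field(default_factory=set)

    def approve(self, engineer: str) -> None:
        # The requester cannot approve their own request.
        if engineer != self.requested_by:
            self.approvals.add(engineer)

    def execute(self, required_approvals: int = 2) -> None:
        if len(self.approvals) < required_approvals:
            raise PermissionError(
                f"need {required_approvals} approvals, have {len(self.approvals)}"
            )
        print(f"running: {self.command}")


cmd = SensitiveCommand(requested_by="alice", command="export_user_table --shard 7")
cmd.approve("bob")
cmd.approve("carol")
cmd.execute()  # only now does the command run
```

Note that the scarce resource here is human attention: every approval is an engineer actually looking at the request.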

Insider threats are harder to handle in that your engineers need much more fine-grained access than your users do, but easier in that your engineers are far less numerous than your users. Moreover, insiders don't have as much ability to iterate against your security measures: if you catch them trying to break those measures, you can often fire them or have them arrested.

Of course, there's a spectrum here. A big part of how you mitigate threats from external hackers is having security staff investigate weird anomalies, in the hope of catching hackers who have broken through some of your intended security boundaries. And a big part of how you mitigate insider threats is building safe administrative APIs, so that instead of your employees needing the ability to run arbitrary code on production machines, they can usually do their jobs by taking pre-specified, known-safe actions. And there’s much more along these lines.
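
Here is a minimal sketch of the "safe administrative API" idea (the action names are hypothetical, and a real version would authenticate the caller and log every invocation): routine operations go through a small allowlist of known-safe actions rather than a shell on a production machine.

```python
# Sketch of a safe administrative API: operators get a small menu of
# pre-specified, known-safe actions instead of arbitrary code on prod machines.
SAFE_ADMIN_ACTIONS = {
    "restart_service": lambda name: print(f"restarting {name} via the orchestrator"),
    "rotate_logs": lambda name: print(f"rotating logs for {name}"),
    "resync_replica": lambda name: print(f"resyncing read replica for {name}"),
}


def run_admin_action(action: str, target: str) -> None:
    handler = SAFE_ADMIN_ACTIONS.get(action)
    if handler is None:
        # Anything outside the allowlist is refused; there is no
        # "run arbitrary command" escape hatch for routine work.
        raise PermissionError(f"{action!r} is not an approved administrative action")
    handler(target)


run_admin_action("restart_service", "billing-frontend")
```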

I feel like there’s a really important distinction between “security from insiders” and “security from outsiders”, and I don’t really get why security people don’t seem to find this distinction as important as I do.

Now that I have this concept, I can make a few points about the risks posed by AIs deployed inside AI companies. (Risks from internal deployments have been discussed by me here (as well as many other places), by a Redwood/UK AISI collaboration here, and by Apollo here.)

First, when I talk to security staff at AI companies about computer security, they often seem to fail to anticipate what insider threat from AIs will be like. In particular, when discussing how they'll deploy AI engineers and researchers internally, they often seem to imagine relating to those AIs mostly as outsiders rather than as insiders. They imagine writing small and carefully locked-down infrastructure and allowing the AIs to interact with it. But my guess is that in order to do their jobs, AIs will need the kind of wide-ranging access that insiders have, which means an attack surface that is large and can’t be protected with purely static methods. (I’m not sure I’m right about this, and it seems very important to get clarity on; I plan to write more about it soon.) I think the security teams at AI companies are not putting enough work into preparing for this.
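
For concreteness, here is roughly the picture I think they have in mind, sketched as a tool allowlist (the tool names and paths are invented): the AI can only take a handful of pre-approved actions on pre-approved paths. My worry above is that the work we actually want AIs to do won’t fit inside a surface this small.

```python
# A sketch of the "small, carefully locked-down surface" picture: the agent can
# only invoke allowlisted tools, each restricted to narrow path prefixes.
# Tool names and paths are hypothetical.
ALLOWED_TOOLS: dict[str, tuple[str, ...]] = {
    "read_file": ("docs/", "src/experiments/"),
    "run_tests": ("src/experiments/",),
    "open_pull_request": (),  # takes no path argument
}


def check_tool_call(tool: str, path: str = "") -> None:
    allowed_prefixes = ALLOWED_TOOLS.get(tool)
    if allowed_prefixes is None:
        raise PermissionError(f"tool {tool!r} is not exposed to the agent")
    if allowed_prefixes and not path.startswith(allowed_prefixes):
        raise PermissionError(f"{tool!r} may not touch {path!r}")


check_tool_call("read_file", "src/experiments/sweep.py")   # allowed
# check_tool_call("read_file", "infra/prod_secrets.yaml")  # would raise
```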

And second, insider threat from AI combines the hard parts of both problems! As with insider threats, the AIs will need extremely broad, fine-grained access. And as with outsider threats, you're outnumbered by your potential adversaries. (Of course, you have a bunch of advantages too, and that's why we research AI control rather than just writing off the problem as hopeless.)
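
As one example of the kind of advantage I mean, here is a toy sketch in the spirit of the AI control work linked above (the scoring function is a stand-in, not a real monitor): an automated monitor scores each proposed agent action, automation handles the clearly benign bulk, and scarce human reviewers only see the suspicious tail.

```python
# Toy sketch of a monitor-and-escalate control pattern. A real monitor might be
# another model plus static rules; this heuristic is only a placeholder.
def suspicion_score(action: str) -> float:
    risky_markers = ("prod", "credentials", "curl", "scp")
    return sum(marker in action for marker in risky_markers) / len(risky_markers)


def route_action(action: str, threshold: float = 0.25) -> str:
    if suspicion_score(action) < threshold:
        return "auto-approve"       # handled by automation, scales to many agents
    return "escalate-to-human"      # expensive human review reserved for the tail


print(route_action("pytest src/experiments/"))        # auto-approve
print(route_action("scp model_weights prod:/tmp/"))   # escalate-to-human
```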