Zach Stein-Perlman

AI strategy & governance. Looking for new projects.


Sorted by New

Wiki Contributions

Load More


The control-y plan I'm excited about doesn't feel to me like squeeze useful work out of clearly misaligned models. It's like use scaffolding/oversight to make using a model safer, and get decent arguments that using the model (in certain constrained ways) is pretty safe even if it's scheming. Then if you ever catch the model scheming, do some combination of (1) stop using it, (2) produce legible evidence of scheming, and (3) do few-shot catastrophe prevention. But I defer to Ryan here.

It's "a unified methodology" but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.


I think there's another agenda like make untrusted models safe but useful by putting them in a scaffolding/bureaucracy—of filters, classifiers, LMs, humans, etc.—such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023). Other relevant work:

[Edit: now AI Control (Shlegeris et al. 2023) and Catching AIs red-handed (Greenblatt and Shlegeris 2024).]

[Edit: I make a bid for an expert—probably someone at Redwood—to make a public reading list on this control agenda.]

Nice. (I'm interested in figuring out what labs should do regarding security and would appreciate information/takes.)

Suggestions for what else they should think about or do?

I've collected quotes from some other sources; here are two three:

Towards best practices in AGI safety and governance (Schuett et al. 2023):

  • "Security incident response plan. AGI labs should have a plan for how they respond to security incidents (e.g. cyberattacks)."
  • "Industry sharing of security information. AGI labs should share threat intelligence and information about security incidents with each other." They should also share infosec/cyberdefense innovations and best practices.
  • "Security standards. AGI labs should comply with information security standards (e.g. ISO/IEC 27001 or NIST Cybersecurity Framework). These standards need to be tailored to an AGI context."
  • "Background checks. AGI labs should perform rigorous background checks before hiring/appointing members of the board of directors, senior executives, and key employees."
  • "Model containment. AGI labs should contain models with sufficiently dangerous capabilities (e.g. via boxing or air-gapping)."

"Appropriate security" in "Model evaluation for extreme risks" (DeepMind 2023):

Models at risk of exhibiting dangerous capabilities will require strong and novel security controls. Developers must consider multiple possible threat actors: insiders (e.g. internal staff, contractors), outsiders (e.g. users, nation-state threat actors), and the model itself as a vector of harm. We must develop new security best practices for high-risk AI development and deployment, which could include for example:

  • Red teaming: Intensive security red-teaming for the entire infrastructure on which the model is developed and deployed.
  • Monitoring: Intensive, AI-assisted monitoring of the model's behaviour, e.g. for whether the model is engaging in manipulative behaviour or making code recommendations that would lower the overall security of a system.
  • Isolation: Appropriate isolation techniques for preventing risky models from exploiting the underlying system (e.g. sole-tenant machines and clusters, and other software-based isolation). The model's network access should be tightly controlled and monitored, as well as its access to tools (e.g. code execution).
  • Rapid response: Processes and systems for rapid response to disable model actions and the model's integrations with hardware, software, and infrastructure in the event of unexpected unsafe behaviour.
  • System integrity: Formal verification that served models, memory, or infrastructure have not been tampered with. The development and serving infrastructure should require two-party authorization for any changes and auditability of all changes.

From a private doc (sharing with permission):

  1. Many common best practices are probably wise, but could take years to implement and iterate and get practice with: use multi-factor authentication, follow the principle of least privilege, use work-only laptops and smartphones with software-enforced security policies, use a Crowdstrike/similar agent and host-based intrusion detection system, allow workplace access via biometrics or devices rather than easily scanned badges, deliver internal security orientations and training and reminders, implement a standard like ISO/IEC 27001, use penetration testers regularly, do web browsing and coding in separate virtual machines, use keys instead of passwords and rotate them regularly and store them in secure enclaves, use anomaly detection software, etc.
  2. Get staff to not plug devices/cables into their work devices unless they're provided by the security team or ordered from an approved supplier.
  3. For cloud security, use AWS Nitro enclaves or gVizor or Azure confidential, create flags or blocks for large data exfiltration attempts. Separate your compute across multiple accounts. Use Terraform so that infrastructure management is more auditable and traceable. Keep code and weights on trusted cloud compute only, not on endpoints (e.g. by using Github Codespaces).
  4. Help make core packages, dependencies, and toolchains more secure, e.g. by funding the Python Software Foundation to do that. Create a mirror for the software you use and hire someone to review packages and updates (both manually and automatically) before they're added to your mirror.
  5. What about defending against zero-days, i.e. vulnerabilities that haven't been broadly discovered yet and thus haven't been patched?
    1. One idea is that every month or week, you could pick an employee at random to give a new preconfigured laptop and hand their previous laptop to a place like Citizen Lab or Mandiant that can tear the old one apart and find zero-days and notify the companies who can patch them. After several zero days are burned this way, maybe attackers with zero-days will decide you're not worth constantly burning their zero-days on.
    2. At some point in the future it'll probably be wise to run the largest training runs on air gapped compute like what Microsoft and Amazon provide to the US intelligence community already, so that you can do lots of testing and so on behind the airgap before deployment.
  6. What about insider threat? A lab could create their own version of a security clearance/personnel reliability process, and/or find a way to get most/all staff with the most privileged access to get a US security clearance (with all that entails). . . . Create a process for vetting cleaners/etc. . . .
  7. Set up stronger physical security measures than a typical tech company, e.g. armed guards and strong physical security around critical devices.

More miscellanea:

  • Insider risk program
  • Source code should live in a hardened cloud environment; it should never touch company/developer machines
  • Security audits & pentests
  • Talk about security best practices (yay Anthropic for this post) and be transparent where appropriate (yay Anthropic and OpenAI for their security portals)
  • Incident tracking and sharing/reporting; maybe a breach disclosure policy

Also OWASP Top 10 for Large Language Model Applications has a section on Model theft. (In addition to standard hacking, they include generating model outputs to use as training data to replicate the model.) They recommend:

  1. Implement strong access controls (E.G., RBAC and rule of least privilege) and strong authentication mechanisms to limit unauthorized access to LLM model repositories and training environments.
    1. This is particularly true for the first three common examples, which could cause this vulnerability due to insider threats, misconfiguration, and/or weak security controls about the infrastructure that houses LLM models, weights and architecture in which a malicious actor could infiltrate from inside[] or outside the environment.
    2. Supplier management tracking, verification and dependency vulnerabilities are important focus topics to prevent exploits of supply-chain attacks.
  2. Restrict the LLM's access to network resources, internal services, and APIs.
    1. This is particularly true for all common examples as it covers insider risk and threats, but also ultimately controls what the LLM application "has access to" and thus could be a mechanism or prevention step to prevent side-channel attacks.
  3. Regularly monitor and audit access logs and activities related to LLM model repositories to detect and respond to any suspicious or unauthorized behavior promptly.
  4. Automate MLOps deployment with governance and tracking and approval workflows to tighten access and deployment controls within the infrastructure.
  5. Implement controls and mitigation strategies to mitigate and|or reduce risk of prompt injection techniques causing side-channel attacks.
  6. Rate Limiting of API calls where applicable and|or filters to reduce risk of data exfiltration from the LLM applications, or implement techniques to detect (E.G., DLP) extraction activity from other monitoring systems.
  7. Implement adversarial robustness training to help detect extraction queries and tighten physical security measures.
  8. Implement a watermarking framework into the embedding and detection stages of an [LLM's lifecycle].

See also their wiki page on resources on LLM security.

Also has collected lots of papers.

Huh, I claim Ajeya's timelines are much more coherent if we replace 2026 with 2027.5 or 2028.* 10% between now and 2026, then 5% between 2026 and 2030, then 20% between 2030 and 2036 is really weird.

*Changing 2026 (rather than 2030) just because Ajeya's 2026 cumulative probability seems less considered than her 2030 and 2036 cumulative probabilities.

(+1. I totally agree that input growth will slow sometime if we don't get TAI soon. I just think you have to be pretty sure that it slows right around 2040 to have the specific numbers you mention, and smoothing out when it will slow down due to that uncertainty gives a smoother probability distribution for TAI.)

Good post!

I understand that the specific numbers in this post are "rough" and "volatile," but I want to note that 35% by 2036, 50% by 2040, and 60% by 2050 means 3.75% per year 2036–2040 and 1% per year 2040–2050, which is a surprisingly steep drop-off. Or as an alternative framing, conditional on TAI not having appeared by 2040, my expected credence in 2040 that TAI appears in the next 10 years is much greater than 20% (where 20% is your implied probability of TAI between 2040 and 2050, conditional on no TAI in 2040). My median timeline is somewhat shorter than yours, but my credence in TAI by 2050 is substantially higher.

(That said, I lack something like the knowledge, courage, or epistemic virtue to be more explicit about my timelines, because it's hard; strong-upvote for this useful and virtuous post, and thanks for using specific numbers so much.)

How optimistic should we be about alignment & safety for brain-like-AGI, relative to prosaic AGI?

Ask dumb questions! ... we encourage people to ask clarifying questions in the comments of this post (no matter how “dumb” they are)

ok... disclaimer: I know little about ML and I didn't read all of the report.

All of our counterexamples are based on an ontology mismatch between two different Bayes nets, one used by an ML prediction model (“the predictor”) and one used by a human.

I am confused. Perhaps the above sentence is true in some tautological sense I'm missing. But in the sections of the report listing training strategies and corresponding counterexamples, I wouldn't describe most counterexamples as based on ontology mismatch. And the above sentence seems in tension with this from the report:

We very tentatively think of ELK as having two key difficulties: ontology identification and learned optimization. ... We don’t think these two difficulties can be very precisely distinguished — they are more like genres of counterexamples

So: do some of your training strategies work perfectly in the nice-ontology case, where the model has a concept of "the diamond is in the room"? If so, I missed this in the report and this feels like quite a strong result to me; if not, there are counterexamples based on things other than ontology mismatch.

Ha, I wrote a comment like yours but slightly worse, then refreshed and your comment appeared. So now I'll just add one small note:

To the extent that (1) normatively, we care much more about the rest of the universe than our personal lives/futures, and (2) empirically, we believe that our choices are much more consequential if we are non-simulated than if we are simulated, we should in practice act as if there are greater odds that we are non-simulated than we have reason to believe for purely epistemic purposes. So in practice, I'm particularly interested in (C) (and I tentatively buy SIA doomsday as explained by Katja Grace).

Edit: also, isn't the last part of this sentence from the post wrong:

SIA therefore advises not that the Great Filter is ahead, but rather that we are in a simulation run by an intergalactic human civilization, without strong views on late filters for unsimulated reality.

Load More