The Global Call for AI Red Lines was signed by 12 Nobel Prize winners, 10 former heads of state and ministers, and over 300 prominent signatories. It was launched at the UN General Assembly and presented to the UN Security Council. There is still much to be done, so we need to...
TL;DR: The EU’s Code of Practice (CoP) requires AI companies to conduct state-of-the-art risk modelling. However, the current state of the art has severe flaws. By creating risk models and improving methodology, we can enhance the quality of risk management performed by AI companies. This is a neglected area, hence we encourage...
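As a toy illustration of what a quantitative risk model can look like (the decomposition and the numbers below are invented for illustration; they are not from the CoP or from the post), one common pattern is to break a threat scenario into conditional steps and multiply through:

```python
# Toy risk model: decompose a threat scenario into conditional steps
# and multiply probabilities. All step names and numbers here are
# illustrative placeholders, not real estimates.

scenario = {
    "model has the dangerous capability": 0.2,
    "capability is elicited despite safeguards": 0.3,
    "misuse attempt occurs": 0.5,
    "attempt causes severe harm": 0.1,
}

p_harm = 1.0
for step, p in scenario.items():
    p_harm *= p

print(f"P(severe harm) ~ {p_harm:.4f}")  # 0.2 * 0.3 * 0.5 * 0.1 = 0.003
```

Even a crude decomposition like this makes the assumptions explicit and auditable, which is the point of mandating risk modelling in the first place.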
TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it. Full paper...
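A minimal sketch of the naive LLM-monitoring setup described above. The `query_llm` helper is hypothetical, standing in for any chat-completion client; the prompt wording is illustrative, not the one from the paper:

```python
# Minimal sketch: use a frontier LLM as an input monitor.
# `query_llm` is a hypothetical helper; wire it to your provider's SDK.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real chat-completion call")

MONITOR_PROMPT = (
    "You are a safety monitor. Classify the following user request "
    "as HARMFUL or BENIGN. Answer with a single word.\n\n"
    "Request: {request}"
)

def is_flagged(request: str) -> bool:
    """Ask the LLM whether a request is harmful; True means flagged."""
    verdict = query_llm(MONITOR_PROMPT.format(request=request))
    return verdict.strip().upper().startswith("HARMFUL")

def guarded_answer(request: str) -> str:
    # The caveat from the post: flagging and answering must be separate
    # steps. The same model that labels a request HARMFUL may still
    # comply if you simply ask it to respond.
    if is_flagged(request):
        return "Request refused by monitor."
    return query_llm(request)
```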
Adapted from this Twitter thread. See this as a quick take.

Mitigation Strategies

How can we mitigate scheming?

1. Architectural choices: ex-ante mitigation
2. Control systems: post-hoc containment
3. White-box techniques: post-hoc detection
4. Black-box techniques
5. Avoiding sandbagging

We can combine all of these mitigations via a defense-in-depth system...
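To illustrate the defense-in-depth idea, here is a hedged sketch (the `Layer` type and boolean `inspect` interface are my own illustrative framing, not from the post): independent layers each get a chance to catch a scheming attempt, so the system fails only if every layer fails.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Layer:
    """One independent check: architectural, control, white-box, etc."""
    name: str
    inspect: Callable[[str], bool]  # True = action looks safe

def defense_in_depth(layers: List[Layer], action: str) -> bool:
    """Allow an action only if every layer passes it."""
    for layer in layers:
        if not layer.inspect(action):
            print(f"blocked by: {layer.name}")
            return False
    return True

# Illustrative usage with two placeholder layers:
layers = [
    Layer("black-box filter", lambda a: "delete all logs" not in a),
    Layer("control check", lambda a: len(a) < 10_000),
]
print(defense_in_depth(layers, "summarize the incident report"))  # True
```

If the layers fail roughly independently with probabilities p_i, the whole stack misses an attempt with probability about the product of the p_i, which is why stacking even individually mediocre layers helps.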
> "We shape our tools and thereafter our tools shape us." > — Marshall McLuhan TL;DR: This post is a two-page introduction to risks associated with recommendation AI. The negative externalities of recommendation AI seem neglected, and there might be comparatively effective work at improving governance and deploying better recommendation...
Here is a little Q&A.

Can you explain your position quickly?

I think autonomous replication and adaptation in the wild is under-discussed as an AI threat model. This makes me sad, because it is one of the main reasons I'm worried. I think one of AI Safety people's main...