Chipmonk: As the Conceptual Boundaries Workshop (website) is coming up, and now that we're also planning the Mathematical Boundaries Workshop in April, I want to get more clarity on what exactly it is that you want out of «boundaries»/membranes. So I just want to check: Is your goal with boundaries just...
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency...
1. There should be two thresholds on compute graph size:
   1. the Frontier threshold, beyond which oversight during execution is mandatory
   2. the Horizon threshold, beyond which execution is forbidden by default
2. Oversight during execution:
   1. should be carried out by state and/or international inspectors who specialize in evaluating...
Edited to add (2024-03): This early draft is largely outdated by my ARIA programme thesis, Safeguarded AI. I, davidad, am no longer using "OAA" as a proper noun, although I still consider Safeguarded AI to be an open agency architecture. Note: This is an early draft outlining an alignment paradigm...
Threat Model: There are many ways for AI systems to cause a catastrophe from which Earth-originating life could never recover. All of the following seem plausible to me:
* Misuse: An AI system could help a human or group of humans to destroy or to permanently take over (and lock...
This is a brief post arguing that, although "side-channels are inevitable" is pretty good common advice, actually, you can prevent attackers inside a computation from learning about what's outside. We can prevent a task-specific AI from learning any particular facts about, say, human psychology, virology, or biochemistry—if:
1. we are...
The standard frame (Evan Hubinger, 2021) is:
> * Outer alignment refers to the problem of finding a loss/reward function such that the training goal of “a model that optimizes for that loss/reward function” would be desirable.
> * Inner alignment refers to the problem of constructing a training rationale...