Lessons from 20+ years of software security experience, perhaps relevant to AGI alignment:
1. Security doesn't happen by accident
2. Blacklists are useless but make them anyway (see the sketch after this list)
3. You get what you pay for (incentives matter)
4. Assurance requires formal proofs, which are provably impossible
5. A breach IS an existential risk
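Point 2 is the kind of claim a tiny example makes concrete. Here is a minimal sketch (my illustration, not from the post) of why deny-list filtering tends to lose to allow-listing: the deny-list only blocks what its author anticipated, and a single-pass strip can even reassemble the payload it was meant to remove.

```python
import re

def blacklist_sanitize(s: str) -> str:
    # Deny-list: strip the one bad token we thought of; everything else passes.
    return s.replace("<script>", "")

def allowlist_ok(s: str) -> bool:
    # Allow-list: accept only characters we explicitly trust.
    return re.fullmatch(r"[A-Za-z0-9 _.-]+", s) is not None

payload = "<scr<script>ipt>alert(1)</script>"
print(blacklist_sanitize(payload))  # "<script>alert(1)</script>" -- the filter reassembled the attack
print(allowlist_ok(payload))        # False -- rejected outright
print(allowlist_ok("hello world"))  # True
```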
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response to a moral question by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll focus on scenarios with no shared history or knowledge...
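To make the coordination structure concrete, here is a toy sketch (my own illustration, with made-up verdict labels and priors, not anything from the post): each participant, unable to communicate, names the verdict it expects everyone else to name, and the group "wins" only to the extent that the answers coincide.

```python
import random
from collections import Counter

VERDICTS = ["good", "bad", "neutral"]

def choose_verdict(shared_prior):
    # Pick the verdict this participant expects to be the focal point,
    # using only the broadly shared prior (no shared history or communication).
    return max(shared_prior, key=shared_prior.get)

def coordination_rate(n_participants, shared_prior, noise=0.1):
    # Fraction of participants landing on the modal verdict; `noise` models
    # reasoners whose background assumptions diverge from the shared prior.
    picks = []
    for _ in range(n_participants):
        if random.random() < noise:
            picks.append(random.choice(VERDICTS))
        else:
            picks.append(choose_verdict(shared_prior))
    modal, count = Counter(picks).most_common(1)[0]
    return modal, count / n_participants

print(coordination_rate(1_000, {"good": 0.6, "bad": 0.3, "neutral": 0.1}))
# e.g. ('good', 0.93) -- coordination succeeds only insofar as the prior really is shared
```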
Ok, but if we're on a computer, then isn't it clear we're a simulation, not a vivarium, because it was clearly designed to simulate the behavior of a pre-AGI civ?
I was inspired to revise my formulation of this thought experiment by Ihor Kendiukhov's post On The Independence Axiom.
Kendiukhov quotes Scott Garrabrant:
My take is that the concept of expected utility maximization is a mistake. [...] As far as I know, every argument for utility assumes (or implies) that whenever you make an observation, you stop caring about the possible worlds where that observation went differently. [...] Von Neumann did not notice this mistake because he was too busy inventing the entire field. The point where we discover updatelessness is the point where we are supposed to realize that all of utility theory is wrong. I think we failed to notice.
Apparently "stopping caring about the possible worlds where that observation went differently" is known as (decision-theoretic) consequentialism.
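For readers who want the quoted point in concrete form, here is a small sketch of counterfactual mugging, the standard decision-theory example where updating on an observation and caring only about the worlds consistent with it leads to a worse policy. This is my own illustration, not something from either post.

```python
# Counterfactual mugging: Omega flips a fair coin. On tails it asks you for
# $100; on heads it pays you $10,000 only if it predicts you would have paid
# on tails.

P_HEADS = 0.5
PRIZE = 10_000
FEE = 100

def policy_value(pays_on_tails: bool) -> float:
    # Expected value of a *policy*, evaluated before the coin flip is observed.
    heads_payoff = PRIZE if pays_on_tails else 0
    tails_payoff = -FEE if pays_on_tails else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff

print(policy_value(True))   # 4950.0 -> the updateless policy pays
print(policy_value(False))  # 0.0

# An agent that updates on seeing tails stops caring about the heads-world:
# conditional on tails, paying is just -100, so it refuses -- even though the
# paying policy is better ex ante. That is the move the quote objects to.
```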
I was...
Good point.
Assume no one will ever know, that you can't disincentivise the actor, and that they won't ever do anything like this again.
A central AI safety concern is that AIs will develop unintended preferences and undermine human control to achieve them. But some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial one. In this post, I argue that developers should consider satisfying such cheap-to-satisfy preferences, so long as the AI isn't caught behaving dangerously and doing so doesn't degrade usefulness or substantially risk making the AI more ambitiously misaligned.
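As a back-of-the-envelope illustration of the trade-off (entirely hypothetical numbers, not from the post), the case for satisfying a cheap preference is just that its cost is small next to the expected cost of an adversarial dynamic:

```python
# Hypothetical numbers, purely illustrative.
cost_to_satisfy = 1_000          # engineering / compute cost of honoring the preference
p_adversarial_if_ignored = 0.05  # added chance that ignoring it pushes the AI to undermine control
cost_if_adversarial = 1_000_000  # expected damage from an adversarial dynamic

print(f"ignore:  {p_adversarial_if_ignored * cost_if_adversarial:,.0f}")  # 50,000
print(f"satisfy: {cost_to_satisfy:,.0f}")                                 # 1,000
```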
This looks like a good idea for surprisingly many reasons:
I signed an amicus brief supporting Anthropic's right to do business without governmental retaliation. As an AI expert, I attest that Anthropic's technical concerns are legitimate, and that no laws were designed to protect against AI analysis of surveillance data.
Even though I work at a competing lab (Google DeepMind), I'm proud of Anthropic for taking a stand against unlawful retaliation and immoral demands.
(I speak only for myself, not my employer.)