Charbel-Raphaël summarizes Davidad's plan: use near-AGIs to build a detailed world simulation, then train and formally verify an AI that follows coarse preferences and avoids catastrophic outcomes.
A few more observations.
The definition of iteration we had before implicitly assumes that the agent can observe the full outcome of previous iterations. We don't have to make this assumption. Instead, we can assume a set of possible observations
I believe that Theorem 4 remains valid.
As we remarked before, DDT is not invariant under adding a constant to the loss function. It is interesting to consider what happens when we add an increasingly large ...
Oh, I think of "ending factory farming" as very far from "taking over the world".
If Superman were a skilled political operator, it could be as simple as arranging to take photoshoots with whichever politicians legislated the end of factory farms.
Or if he were less skilled it could involve doing various kinds of property damage to factory farms (potentially even things which there aren't laws against, like flying around them in a way which blows the buildings over).
This might escalate to the government trying to arrest him, and outright conflict, but honestl...
(Last revised: January 2026. See changelog at the bottom.)
Part of the “Intro to brain-like-AGI safety” post series.
Thus far in the series, Post #1 set out some definitions and motivations (what is “brain-like AGI safety” and why should we care?), and Posts #2 & #3 split the brain into a Learning Subsystem (cortex, striatum, cerebellum, amygdala, etc.) that “learns from scratch” using learning algorithms, and a Steering Subsystem (hypothalamus, brainstem, etc.) that is mostly genetically-hardwired and executes innate species-specific instincts and reactions.
Then in Post #4, I talked about the “short-term predictor”, a circuit which learns, via supervised learning, to predict a signal in advance of its arrival, but only by perhaps a fraction of a second. Post #5 then argued that if we form a closed...
The big picture—The whole post will revolve around this diagram. Note that I’m oversimplifying in various ways, including in the bracketed neuroanatomy labels.
I think this picture would be clearer if you drew [predict sensory inputs] as a separate box from the Thought Generator.
In the picture in my head, there is a [predict sensory inputs] box that receives the sensory input and tries to predict it. This box also sends a [current context] signal to both the Thought Generator and the Thought Assessor. Also, [predict sensory inputs] gets some signal from Thought Ge...
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
Which universal distribution?
Some universal distributions are full of agents whose choices render that distribution invalid as a model of reality once the decisions are made (self-defeating). Other distributions are full of agents whose decisions ratify the distribution (self-fulfilling).
Distributions that aren't fixed points under reflection about what they decide about themselves are not coherent models of reality.
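As a toy illustration of the fixed-point condition (this is my own simplified model, not anything from the post): collapse a "distribution" down to a single number p, the predicted probability that an agent takes action A, and let the agent's response depend on what is predicted about it. A distribution is a coherent model only if responding to it reproduces it, i.e. respond(p) == p.

```python
# Toy model: a "distribution" is just p = predicted probability of action A.
# Coherence under reflection means respond(p) == p (a fixed point).

def conformist(p):
    # Ratifies the prediction: takes A exactly when predicted to.
    return 1.0 if p >= 0.5 else 0.0

def contrarian(p):
    # Defeats the prediction: takes A exactly when predicted not to.
    return 0.0 if p >= 0.5 else 1.0

def fixed_points(respond, grid):
    # Predictions that survive reflection about what they say.
    return [p for p in grid if respond(p) == p]

grid = [i / 100 for i in range(101)]
print(fixed_points(conformist, grid))   # [0.0, 1.0] — self-fulfilling
print(fixed_points(contrarian, grid))   # []         — self-defeating
```

The contrarian map has no fixed point at all, which is the sense in which a self-defeating distribution cannot be a coherent model of the reality it describes.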
Let's say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling defect leads to a higher reward for player 1 whether or not player 2 samples cooperate (strategic dominance), and there's a 90% chance of player 2 sampling cooperate regardless of player 1's action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it's not the case that reward tends to be higher for player 1 w...
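The expected-reward comparison above can be made concrete with a quick calculation. The payoff values below are assumed (the standard prisoner's dilemma convention T > R > P > S; the post doesn't specify numbers), but the conclusion only depends on the ordering:

```python
# Assumed standard PD payoffs for player 1 (not from the post):
# R = both cooperate, T = defect against a cooperator,
# S = cooperate against a defector, P = both defect.
R, T, S, P = 3, 5, 0, 1
p_coop = 0.9  # fixed policy: chance player 2 samples cooperate

# Because the policy is fixed, player 2's draw is independent of
# player 1's action, so we just average over player 2's sample.
ev_cooperate = p_coop * R + (1 - p_coop) * S   # 0.9*3 + 0.1*0 = 2.7
ev_defect    = p_coop * T + (1 - p_coop) * P   # 0.9*5 + 0.1*1 = 4.6

print(ev_cooperate, ev_defect)  # defect has the higher expected reward
```

Since T > R and P > S, defecting wins under any fixed value of p_coop, which is the strategic-dominance point: with the policy held fixed, player 1's own cooperation gives no evidence about player 2's action, so defect is what gets reinforced.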