Epistemic status: spitballing.

"Like Photons in a Laser Lasing"

When you do lots of reasoning about arithmetic correctly, without making a misstep, that long chain of thoughts with many different pieces diverging and ultimately converging, ends up making some statement that is... still true and still about numbers! Wow! How do so many different thoughts add up to having this property? Wouldn't they wander off and end up being about tribal politics instead, like on the Internet?

And one way you could look at this, is that even though all these thoughts are taking place in a bounded mind, they are shadows of a higher unbounded structure which is the model identified by the Peano axioms; all the things being said are true about the numbers. Even though somebody who was missing the point would at once object that the human contained no mechanism to evaluate each of their statements against all of the numbers, so obviously no human could ever contain a mechanism like that, so obviously you can't explain their success by saying that each of their statements was true about the same topic of the numbers, because what could possibly implement that mechanism which (in the person's narrow imagination) is The One Way to implement that structure, which humans don't have?

But though mathematical reasoning can sometimes go astray, when it works at all, it works because, in fact, even bounded creatures can sometimes manage to obey local relations that in turn add up to a global coherence where all the pieces of reasoning point in the same direction, like photons in a laser lasing, even though there's no internal mechanism that enforces the global coherence at every point.

To the extent that the outer optimizer trains you out of paying five apples on Monday for something that you trade for two oranges on Tuesday and then trading two oranges for four apples, the outer optimizer is training all the little pieces of yourself to be locally coherent in a way that can be seen as an imperfect bounded shadow of a higher unbounded structure, and then the system is powerful though imperfect because of how the power is present in the coherence and the overlap of the pieces, because of how the higher perfect structure is being imperfectly shadowed. In this case the higher structure I'm talking about is Utility, and doing homework with coherence theorems leads you to appreciate that we only know about one higher structure for this class of problems that has a dozen mathematical spotlights pointing at it saying "look here", even though people have occasionally looked for alternatives.

                                        -- Eliezer Yudkowsky, Ngo and Yudkowsky on alignment difficulty

Selection Pressure for Coherent Reflexes

Imagine a population of replicators. Each replicator also possesses a set of randomly assigned reflexive responses to situations it might encounter. For instance, above and beyond reproducing itself after a time step, a replicator might reflexively, probabilistically transform some local situation A, when encountered, into some local situation B. The values of A and B are set randomly and there are no initial consistency requirements, so the replicators will generally behave spastically at this point.

Most of these replicators will end up with incoherent sets of reflexes. Some, for example, will cyclically transform A into B into C into A, and so on. Others will transform their environment in "wasteful" ways, moving it into some state that could have been reached with greater certainty via some different series of transformations.

But some of the replicators will possess coherent sets of reflexes. These replicators will never "double back" on their previous directional transformations of their situation. They will thus be more successful in reaching some situation Z than an incoherent counterpart would be. And when Z is a fitness-improving situation, reflexive coherence targeting it will be selected for.
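To make the selection story concrete, here is a minimal sketch, not from the post: replicators whose reflexes are random (deterministic, for simplicity) transition rules over a handful of situations, with fitness given by how reliably they reach the target situation Z. Replicators whose reflex tables cycle away from Z get out-replicated. All names and parameters (situation count, population size, mutation rate) are illustrative assumptions.

```python
# Minimal sketch: replicators with randomly assigned reflex tables over a small
# set of situations. A reflex table maps each situation to the situation it is
# transformed into. Tables that cycle without hitting Z never arrive there;
# tables that flow toward Z do, and they get selected for.
import random

N_SITUATIONS = 6          # situations 0..5
Z = N_SITUATIONS - 1      # the fitness-improving situation
POP_SIZE = 200
GENERATIONS = 30
MAX_STEPS = 10            # how long a replicator gets to reach Z


def random_reflexes():
    """A randomly assigned reflex table: situation -> transformed situation."""
    return [random.randrange(N_SITUATIONS) for _ in range(N_SITUATIONS)]


def reaches_z(reflexes, start):
    """Follow the reflex table from `start`; cycles that avoid Z never arrive."""
    situation = start
    for _ in range(MAX_STEPS):
        if situation == Z:
            return True
        situation = reflexes[situation]
    return situation == Z


def fitness(reflexes):
    """Fraction of starting situations from which the replicator reaches Z."""
    return sum(reaches_z(reflexes, s) for s in range(N_SITUATIONS)) / N_SITUATIONS


population = [random_reflexes() for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    # Fitness-proportional selection (small floor so nobody has weight zero).
    weights = [fitness(r) + 0.01 for r in population]
    population = [list(r) for r in random.choices(population, weights=weights, k=POP_SIZE)]
    # Occasional mutation of a single reflex.
    for r in population:
        if random.random() < 0.1:
            r[random.randrange(N_SITUATIONS)] = random.randrange(N_SITUATIONS)

avg = sum(fitness(r) for r in population) / POP_SIZE
print(f"average fraction of situations from which Z is reached: {avg:.2f}")
```

Run it and the average climbs over the generations: the surviving reflex tables are the ones whose local rules happen to add up to a globally directional push toward Z.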

The Instrumental Incentive to Exploit Incoherence

Once you have a population of coherent agents, the selection pressure against reflexive incoherence increases. A dumb-matter environment will throw up situations at random, and so incoherent replicators will only fall into traps as those traps happen to come up. But a population of coherent agents will actively exploit the incoherent among them; incoherent agents become pools of resources for coherent agents to drain.
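The classic version of this exploitation is the money pump. A hypothetical sketch (the goods and fee are my own illustration, not the post's): an agent with cyclic preferences will pay a small fee for each "upgrade" around the cycle, so a coherent counterparty can collect fees from it indefinitely.

```python
# Hypothetical money-pump sketch: the incoherent agent prefers apples to
# oranges, oranges to pears, and pears to apples (a cycle), so it will pay
# FEE to swap whatever it holds for the thing it prefers -- forever.

# The agent will pay FEE to swap the key for the value it prefers.
PREFERS = {"oranges": "apples", "pears": "oranges", "apples": "pears"}
FEE = 1

holding = "apples"
fees_collected = 0
for _ in range(9):                      # three full trips around the cycle
    wanted = PREFERS[holding]           # what the agent prefers to its holding
    holding = wanted                    # it accepts the swap...
    fees_collected += FEE               # ...and pays the fee each time
print(holding, fees_collected)          # back to "apples", 9 fees poorer
```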

Solipsistic vs. Multi-Agent Training Regimes

ML models are generally trained inside a solipsistic world (with notable exceptions). They, all by their lonesome, are fed sense data and then are repeatedly modified by gradient descent to become better at modulating that sense data. There's optimization pressure for them to become reflexively coherent, but not as much as they would face in an environment of Machiavellian coherent agents.

(Of course, if you train enough, even this gentler pressure will add up.)

Comments

Nice post -- simple, clear, explains an important point better than the Yudkowsky-Ngo dialogue did.

The partially-baked theory of agency I'm working on is basically that agents are a kind of chain reaction: whereas fires are chain reactions in which a small amount of heat grows larger (by consuming fuel and oxygen), agents are chain reactions in which a small cluster of convergent instrumental resources (money, knowledge, energy) grows larger.

One of the reasons I like this theory is that it fits nicely with selection-based arguments like the one you are making here. It makes sense that we should expect competitions for knowledge, resources, and influence to be won by knowledge+resources+influence chain reactions!

"They will thus be more successful in reaching some situation Z than an incoherent counterpart would be."

This is only the case if either the cost of being coherent is negligible or the depth of the search tree is very high.

If I have a coherent[1] agent that takes 1 unit of time per step, and an agent that is incoherent on 1% of steps but takes 0.99 units of time per step, the incoherent agent wins on average up to a depth of <=68.

(Now: once you have a coherent agent that can exploit incoherent agents, suddenly the straight probabilistic argument no longer applies. But that's assuming that said coherent agent can evolve in the first place.)

  1. ^

    In 'reality' all agents are incoherent, as they have non-zero error probability per step. But you can certainly push this probability down substantially.
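One way to reconstruct the 68 figure (my reading, not necessarily the commenter's exact model): if a single incoherent step costs the faster-but-sloppy agent the race outright, it wins with probability 0.99^d over a depth-d race, which stays above 1/2 only while d <= 68, since ln(0.5)/ln(0.99) is roughly 68.97. A quick check under that assumption:

```python
# Check of the depth figure under one possible model (an assumption on my
# part): the incoherent agent moves at 0.99 time units per step but loses
# the race outright if any of its d steps is incoherent (probability 0.01).
import math

p_ok = 0.99  # per-step probability of a coherent step

# Depth at which the incoherent agent's win probability drops below 1/2.
break_even = math.log(0.5) / math.log(p_ok)
print(f"p_ok^d > 0.5 up to d = {math.floor(break_even)}")   # 68

for d in (68, 69):
    print(d, p_ok ** d)   # ~0.505 at d=68, just under 0.500 at d=69
```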