[Metadata: crossposted from https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html. First completed March 11, 2023.]

A voyage of novelty is fraught. If a mind takes a voyage of novelty, an observer is hard-pressed to understand what the growing mind is thinking or what effects the growing mind will have on the world.

Fraughtness

If a mind grows beyond human activity, then it thinks in a language you don't speak, using skills you don't possess, derived from sources you don't share. The mind incorporates new ideas, invalidating your predictions and specifications for how the mind affects the world.

"Fraught" is cognate with "freight": fraughtness is ladenness. "Fraught" is an intensive prefix "fra-" (cognate with English "for-", "fro", Ancient Greek "πρό-") followed by a descendant of Proto-Germanic "aiganą", meaning "to possess". From that etymon comes English "owe", "ought", and "own", and German "eigen". To be fraught is to be laden by intensely possessing.

Extended table of contents:

Understanding: If a mind takes a far-ranging voyage of novelty, it's difficult for an observer to understand what the mind thinks, knows, and does.
- Neurath's ship: Novelty involves change. Some of that change deeply changes how the mind operates and how the mind connects its capabilities to the effects it has on the world. Previous understanding of the mind, and in particular previously correct models of the mind's effects, might not stay valid across these changes.
- Alienness: A mind that gains a lot of structure not possessed by humans is alien to humans. It has structure (content, generators, contexts) not shared by humans. So it's hard for humans to understand the mind.
- Inexplicitness: Structure in the mind is often unavailable for relation within the mind. Such structure tends to also be unavailable for relation with an observer.
- Noncomprehensiveness: Creativity, inexplicitness, provisionality, noncartesianness, and integration tend to make a mind and its elements nonencompassable. The mind and its elements can't easily be understood in a way that implies conclusions that will be valid across a very wide range of possible future contexts.
Agency: Agency comes along with novelty. It comes along unannounced.
- Pressure towards agency on a voyage of novelty: If a mind takes a far-ranging voyage of novelty, then it likely also has large effects on the world.
- The murkiness of values: The existing concepts of values and agency don't explain what determines the direction of the effects that a mind has on the world or how an observer could specify that direction.

Understanding

If a mind takes a far-ranging voyage of novelty, it's difficult for an observer to understand what the mind thinks, knows, and does.

Neurath's ship

Novelty involves change. Some of that change deeply changes how the mind operates and how the mind connects its capabilities to the effects it has on the world. Previous understanding of the mind, and in particular previously correct models of the mind's effects, might not stay valid across these changes.

Creativity implies an influx of novel elements.
Some mental elements are essentially provisional: they're never complete, and there's always a further growth of the mind that will call on them to change, grow, or refactor. To continue to understand the role played by essentially provisional elements, as a mind grows, we may be required to grow in ways analogous to how the mind is growing.
Some novelty is diasystemic: it touches on many preexisting elements, and doesn't fit into preexisting elements in a way neatly analogous to how the preexisting elements fit in with each other. Diasystemic novelty may be necessary to create very capable minds, but it is also more difficult to notice, appreciate as important, understand, and integrate with the mind's ultimate effects on the world. By touching on many elements, diasystemic novelty shifts a lot about how the mind works.
To the extent that a strong mind's capabilities are applied to modify the mind, there are strong pressures changing the mind. Properties of the mind that aren't stable under reflection are likely to be changed. Without knowing which properties can be reflectively stable for a given mind, we probably can't be very confident that a given property won't change.
These effects together push against there being stable reference points for understanding how a mind works and how it affects the world.
The ship has to stay afloat while its elements are changing: the mind has to stay correctable.

Alienness

A mind that gains a lot of structure not possessed by humans is alien to humans. It has structure (content, generators, contexts) not shared by humans. So it's hard for humans to understand the mind.

Suppose that a mind takes a far-ranging voyage of novelty.

If the mind takes a voyage of novelty that goes beyond human understanding, it incorporates structure that humans don't understand.
- That is, the mind gains understanding that is ectosystemic novelty for humans, i.e. structure alien to humans.
- A mind gains understanding when it comes to possess novel elements. A mind possesses an element when the element is integrated into the mind so that the mind is empowered to use the element. See here and here.
- The mind incorporates some structure (i.e. gains understanding), but the humans neither understand the structure through explicit theory, nor do the humans possess that structure. (That is, the humans haven't empowered themselves to use that structure by integrating it into themselves.)
- For example, we could listen in on set theorists talking about "HOD mice" and "proper forcing axiom" and "constructible universe" and be bewildered. We can tell that they are having thoughts, or at least doing something that looks very much like having thoughts: they are talking about some objects, properties, relations, propositions, and proofs, and they're using words the way we use words. But we can't follow their reasoning: we can't verify whether it's valid (because they may use enthymemes with hidden premises we don't know), and we can't derive the most relevant implications from it.
- This novelty is what's potentially valuable for humans about the mind. The mind understands things that the humans don't understand, so it can give humans possession of that understanding, either by helping the humans to gain possession of that understanding for themselves, or by giving the humans "possession by proxy" (as in, hiring a plumber because you don't know how plumbing works).
The mind is on a trajectory of creativity: something about it implies that it as-yet-inexhaustibly generates novelty.
- If the mind's trajectory of creativity goes beyond human understanding, then the sources of the mind's creativity are more likely to be alien to humans.
- If the sources (the proximal ex quo) of the mind's creativity are alien to humans, then the mind is hard for humans to understand.
The mind continues its trajectory of creativity as long as it operates.
- As long as the mind operates, there's an influx of novelty, which will tend to be more and more alien to the humans.
- If the humans don't understand the sources of the mind's creativity, then they don't understand how those sources play out along the trajectory of creativity.
The alienness of the mind renders gemini modeling more difficult. That is, novelty that's ectosystemic for humans but endo/diasystemic for the mind renders other elements of the mind, in their exertion, more alien to humans.
- To gemini model an element E that unfolds in context C within the mind, a human needs available a sufficiently similar context C'.
- The mind is Neurath's ship, many elements potentially changing. Suppose that elements that contribute to the context C for the element E change. Then a human's corresponding context C' will be partially invalidated, so that it now provides a context for the human's E that is less similar to C. Then E unfolds (takes on its implications) differently in C and in C', so the human less successfully gemini models the mind's E.
- This is distinct from the direct difficulties from the mind being Neurath's ship. The direct difficulty is that some element E might change, so the human's understanding of E will no longer apply. The difficulty with gemini modeling is indirect. The human's (gemini) model of some other element E2 might be invalidated when E changes because E provided key context for E2.
- Example: the behavior of a piece of code changes if the behavior of a function that it calls also changes, and the implications of a piece of code change if the code that calls that code changes.
- Example: the behavior downstream of a goal held by an agent can be greatly changed if the agent changes which decision theory it uses.
- Example: if a mind begins to habitually judge against sets of propositions that are contradictory, the sentences that the mind judges in favor of take on a different kind of meaning than they had when the mind did not habitually reject contradictions.
- The mind's trajectory of creativity brings in novelty. This novelty tends to be alien to humans. The contexts provided by the alien novelty are therefore alien to humans. These contexts may be for both preexisting elements (which are themselves less alien to humans) and also for novel elements. In this way, the alienness of alien novelty compounds. The novel element is alien; and the contexts it provides are alien; so other elements are rendered less gemini modelable and more alien.
- The presence of change itself makes the mind alien to humans. The trajectory of creativity for a human only goes so rapidly. For the mind to take a far-ranging voyage of novelty, it has to go very rapidly. A mind very rapidly moving on a trajectory of creativity is alien to humans. Its elements take on new meanings and constraints rapidly, and so they are adapted to doing so, unlike the elements a human has. That is, the context for elements in the mind shifts more rapidly than any context in the human's mind, and therefore that context is not available for humans to use for gemini modeling.
- Diasystemic novelty especially changes contexts, and so it especially invalidates gemini modeling. In the extreme case, if the mind occupies other cognitive realms, gemini modeling may be extremely difficult or strictly infeasible.
- Therefore the presumption that gemini modeling is available, which is usually valid between humans, is not valid between a human and a mind that takes a far-ranging voyage of novelty.
- In other words: The problem is not just that the human can't extract, detect, find, root out, or demarcate elements in the growing mind. The problem is also that the human doesn't have the background understanding that would be needed to make much sense of what the mind is thinking.
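The first code example in the list above (a piece of code whose behavior changes when a function it calls changes) can be made concrete. The Python below is a toy sketch, not a model of any real mind: `decide` plays the role of an element E2 whose own code never changes, the helper passed to it plays the role of an element E that provides E2's context, and `observer_model` is a hypothetical observer's summary of E2, learned before E changed. All names are invented for this illustration.

```python
# Toy illustration (hypothetical names): an observer's summary of `decide`
# stays valid only while the helper that supplies its context stays fixed.


def score_v1(x: int) -> int:
    """Original helper (the element E): prefers larger numbers."""
    return x


def score_v2(x: int) -> int:
    """Revised helper (E after a change): prefers numbers close to 10."""
    return -abs(x - 10)


def make_decide(score):
    """Build a chooser (the element E2) whose behavior depends on `score`."""
    def decide(options: list[int]) -> int:
        # The body of `decide` never changes; only its context (`score`) does.
        return max(options, key=score)
    return decide


def observer_model(options: list[int]) -> int:
    """The observer's summary of `decide`, learned while `score_v1` was in place."""
    return max(options)


options = [3, 9, 40]
decide_before = make_decide(score_v1)
decide_after = make_decide(score_v2)

print("before:", decide_before(options), "observer predicts:", observer_model(options))
print("after: ", decide_after(options), "observer predicts:", observer_model(options))
```

Nothing in `decide` itself was edited, yet the observer's summary of it stops matching its behavior once the context supplied by the helper shifts.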

Inexplicitness

Structure in the mind is often unavailable for relation within the mind. Such structure tends to also be unavailable for relation with an observer.

Novelty in general starts out inexplicit for the mind. See "Explicitness".
Generators and diasystemic novelty especially tend to be inexplicit for the mind, and also tend to have broad consequences for the mind.
Elements that are inexplicit for a mind tend to be less available for modeling and gemini modeling for an observer. If the mind takes a far-ranging voyage of novelty, there are many elements of the mind that are inexplicit for the mind.
If the mind takes a far-ranging voyage of novelty, then it will tend to have proximal ex quos that are distinct from any original distal ex quo. For example, brains are a more proximal ex quo of human understanding, compared to the distal ex quo, evolution and genome pools. So a mind that comes from generators that are explicit for us might still itself be inexplicit for us, having "lost" its explicitness for us through the mediation of intermediate ex quos. In other words: just because you know how to make something grow, doesn't mean you know how it works or how it thinks.
A class of inexplicitnesses is conceptual Doppelgängers. Suppose there are conceptual Doppelgängers in a mind, i.e. there are elements that each deal with some fixed $X$, though maybe in a different way. Then humans can't rely on their understanding of [how the mind deals with $X$] to be anywhere near comprehensive; even if the human has understood $E_0$ and $E_1$, there are other elements $E_2, \ldots, E_k$, not in sight, that also deal with $X$. Dually, elements tend to be more general than their current apparent use: the possible applications that haven't been realized yet are witnesses of inexplicitness (this happens very often in mathematics).
Inexplicit novelty might not even be noticed by an observer. Then the observer doesn't know what ze doesn't know. This might for example lead the observer to overestimate how effective ze will be at gemini modeling elements of the mind, by underestimating the extent to which the context of those elements in the mind has shifted by novelty.
- For example, if a mind's object level explicit thoughts are alien to the human, then the mind has specific understanding that the human doesn't know about. That specific understanding has to be explained as coming from creative generators, but the human can't do that because ze can't even see the specific understanding. At best the human can see that the mind is thinking about something. So if the set theorists invent some new, general, powerful way of thinking that shows itself first in their set theorizing, we wouldn't notice.
Inexplicit elements tend to be harder to modify, so harder to exert control over.
Context shifts invalidate inexplicit understanding.
- That is: Suppose an element E of the mind M is not understood very explicitly by the human. In other words, E is not very available for elements of the human to relate to when relating would be suitable. It might nevertheless be the case that E is understood well enough by the human to proceed with understanding what M is up to overall, at least for now.
- For example, the ways that E is possessed by M (the ways that E is currently empowering M) might be understood well enough by the human. E.g., I'm familiar with the fact that the plumber can't soundproof my room, but can fix a shutoff valve that's gone bad (though I don't know how).
- But then, on M's voyage of novelty, more elements are integrated into M that relate to E. Since E is not explicit for the human, the human can't use a twin of E to model the ways that E relates to the new elements in M. The human didn't have elements that were available for relation in the same way that E was available for relation in M. So M will gain more possession of E, exerting E in more ways, rendering no longer valid the human's understanding of what M can do with E.

Noncomprehensiveness

Creativity, inexplicitness, provisionality, noncartesianness, and integration tend to make a mind and its elements nonencompassable. The mind and its elements can't easily be understood in a way that implies conclusions that will be valid across a very wide range of possible future contexts.

See "Uncontainability", "Strong uncontainability", and "Nearest unblocked strategy".
Comprehend = together-fore-seize. As in, something has been encompassed, preliminarily seized altogether, on all sides. To comprehend is to hold all of something together at once, to fully understand, contain, encompass.
- More precisely: To comprehend an element (to some extent in some ways) is to know the element well enough that some conclusions about the element will be valid across a very wide range of possible contexts the element might be in, in the future.
- A way of understanding an element is noncomprehensive when it doesn't comprehend the element.
- (Non)encompassing(ness) is an alternative term.
Comprehendibility
- If part or all of a mind (an element) is comprehended by some understanding, then that understanding is a sufficient summary of that element. The understanding can be used to draw conclusions about the mind and design manipulations of the mind without having to get into the details of that element.
- If a human doesn't comprehend an element, the human can't make conclusions about the mind that will hold up to strong forces and context shifts, without getting into the details of that element.
- See the X-and-only-X problem. Quoting from there: "X-and-only-X is what I call the issue where the property that's easy to verify and train is X, but the property you want is "this was optimized for X and only X and doesn't contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system"."
- A mind can be more or less comprehendible by having elements that are more or less comprehensive. That is, a mind can have (comprehensive) elements that contain many elements, screening off those elements from controlling other elements. Those comprehensive elements make the mind as a whole more comprehendible, e.g. by a human, because the human doesn't have to understand the elements that were screened off. For example, a search process with a unique, verifiable answer screens off the elements involved in finding the answer (assuming a perfect implementation); then, an observer can ignore those elements.
- An element might be comprehendible indirectly. A central example of this possibility: If an agent can be rightly understood as fundamentally wanting X and not fundamentally wanting anything else, then all of the agent's mind is partly comprehended. The agent will put the strength of any novelty towards X. An observer might be unable to fully comprehend the agent, because ze doesn't know what the agent's actions will be; and the observer might be also unable to comprehend much of anything about any of the elements of the agent, in that the agent might continue growing and making new use of any element; but the observer can (somehow, by assumption) draw a conclusion about the ultimate effects of the agent, (by assumption) robustly to the agent's actions and growth.
- If there are many elements of a mind that a human doesn't comprehend, whether directly or indirectly, then the human can't know much about what effects the mind will have.
- Large elements or collections of elements can be together comprehended by smaller elements. For example, a simple perfect chess playing computer program will search a vast space of possibilities, but we can see that the output of this program would win any won position and wouldn't have any other effects (given an opponent that doesn't try to make some other effect happen). See KANSI.
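The chess example can be illustrated at toy scale. The Python below substitutes a hypothetical subtraction game for chess (players alternately remove 1 or 2 stones; whoever takes the last stone wins), chosen only because exhaustive search over it is small enough to run instantly. The functions `wins`, `perfect_move`, `play`, and `random_opponent` are illustrative names, not any library's API. An observer who never traces the search can still hold a compact summary of the program (the player to move has a forced win exactly when the stone count is not a multiple of 3, and the program wins every such position), and that summary can be checked against the program's behavior.

```python
from functools import lru_cache
import random

MOVES = (1, 2)  # hypothetical game: remove 1 or 2 stones; taking the last stone wins


@lru_cache(maxsize=None)
def wins(n: int) -> bool:
    """True if the player to move with n stones can force a win (exhaustive search)."""
    return any(m <= n and not wins(n - m) for m in MOVES)


def perfect_move(n: int) -> int:
    """Return a move that keeps a forced win, or a legal move if none exists (n >= 1)."""
    for m in MOVES:
        if m <= n and not wins(n - m):
            return m
    return 1  # lost position: every move loses against perfect play


def random_opponent(n: int) -> int:
    """An arbitrary opponent: makes a random legal move."""
    return random.choice([m for m in MOVES if m <= n])


def play(n: int, opponent) -> str:
    """Play from n stones with the perfect player moving first; return the winner."""
    player = "perfect"
    while True:
        move = perfect_move(n) if player == "perfect" else opponent(n)
        n -= move
        if n == 0:
            return player
        player = "opponent" if player == "perfect" else "perfect"


# The observer's compact summary, checked without inspecting the search itself.
assert all(wins(n) == (n % 3 != 0) for n in range(100))
assert all(play(n, random_opponent) == "perfect" for n in range(1, 60) if n % 3 != 0)
print("summary holds for all tested positions")
```

The summary is far smaller than the search it describes, which is what makes it a useful (partial) comprehension of the program.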
Generators of creativity and free self-creation are hard to comprehend, and are central sources of a mind's strength.
Noncomprehensiveness compounds.
- To the extent that a strong mind's integrative creativity is exerted to add to or modify itself, the resulting novelty is noncomprehended by default. Since the novelty was brought in by something noncomprehended, there wasn't a robust understanding that has to apply to the novelty. In other words, self-modification breaks an observer's partial understanding, because it happens in a partially nonunderstood way.
- If novelty goes on to contribute to creativity, that novelty is especially noncomprehended because its downstream consequences include the downstream consequences of the elements it helps to create.
- (This isn't double counting, it's tallying two aspects of one event. When a non-understood element helps create a non-understood element, that event witnesses that the creator element was weird (because it made a weird element), and the event also creates a new weird element.)
An inexplicitness tends to also be a noncomprehensiveness.
- Suppose an element is unavailable for a relation that would be suitable. Then that relation can't be used to circumscribe the element.
- For example, conceptual Doppelgängers hide the existence of elements that are doing overlapping work X. Since the analogy between the Doppelgängers is not available, an observer isn't told that to understand how the mind deals with X, both Doppelgängers should be understood. So, how the mind deals with X "sneaks around" the observer's understanding of one of the Doppelgängers, i.e. the observer's understanding of how the mind deals with X is noncomprehensive.
- The inexplicit element also tends to hide the possibility of the unavailable but possibly suitable relation. So there is a possible relation that would strengthen the mind, but that is hidden, and so the element is not comprehended.
- For example, conceptual Doppelgängers hide the possibility of an analogy or unification. So, the observer doesn't know that the mind might gain a stronger way of thinking by unifying what's shared between the Doppelgängers. For example, the precursors to group theory became unified and unlocked a rich theory that wouldn't have been easy to read off of the precursors themselves; so the precursors weren't comprehended by thinking of them on their own terms, witnessed by the fact that on further thought the precursors spilled over into the new theory of groups with new implications.
- If an observer relies on possession, all inexplicitness (and all nonpossession) is noncomprehended. In other words, if an observer only keeps track of how a mind is currently making use of an element, then the observer will be surprised by all the ways the element later becomes more generally usable and more used.
An element is provisional if it's suitably treated as being open to future revision. See "Provisionality". A provisional element tends to be noncomprehended.
- Since the element is suitably treated as if it might be revised, a comprehension of it would suitably be somehow robust to those possible future revisions.
- An element can, though, be provisional while being comprehended.
- Example: Kasparov's understanding of a chess position is provisional, open to revision by his ongoing stream of creative insights about the position, but we can still guess that he'll win the game. Kasparov's creativity is partially circumscribed: we know something about the outcome.
- Another example: a Bayesian belief system will have beliefs that are revised, so that the beliefs are provisional, but also an observer can prove theorems about how the system's beliefs change given evidence, e.g. concluding (with caveats such as cartesianness) that the Bayesian will (probably, eventually, given a rich enough prior) end up with correct beliefs.
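The Bayesian example can be sketched in a few lines. The Python below is a minimal illustration under deliberately cartesian assumptions: the evidence is coin flips produced by a fixed process outside the believer, and the true bias is one of the hypotheses the prior covers. `update` and `run` are hypothetical helper names. Every individual belief is revised on every flip, yet an observer can predict in advance, without knowing the particular flips, that the posterior will concentrate on the true hypothesis.

```python
import random


def update(prior: dict[float, float], heads: bool) -> dict[float, float]:
    """One step of Bayes' rule over hypotheses 'the coin lands heads with probability h'."""
    unnormalized = {h: p * (h if heads else 1 - h) for h, p in prior.items()}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def run(true_bias: float, hypotheses: list[float], flips: int, seed: int = 0) -> dict[float, float]:
    """Start from a uniform prior and update on `flips` tosses of a coin with `true_bias`."""
    rng = random.Random(seed)  # the evidence source is fixed and outside the believer
    belief = {h: 1 / len(hypotheses) for h in hypotheses}
    for _ in range(flips):
        belief = update(belief, rng.random() < true_bias)
    return belief


hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]  # assumption: the true bias (0.7) is among these
for flips in (10, 100, 1000):
    belief = run(true_bias=0.7, hypotheses=hypotheses, flips=flips)
    print(flips, "flips: posterior on the true bias =", round(belief[0.7], 3))
```

If either assumption is dropped (for example, if the true bias is not among `hypotheses`, or if the believer's own actions influence the flips), the guarantee of ending up with correct beliefs no longer follows in this simple form.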
Noncartesianness tends to also be noncomprehensiveness.
- Noncartesian elements of a mind are also provisional.
- Noncartesian elements of a mind will reflect the growing structure of the mind.
- For example, the mind's reflective understanding of itself is noncartesian. Since the mind grows, what there is for the mind to understand about itself grows. So [an observer's understanding of what the mind understands about itself] will be at least partially invalidated.
- Therefore to comprehend an element that is noncartesian for a mind is to comprehend something that is comparably complex (surprising, creative, integrative, changing) to the mind.
More generally, any element that integrates input from many other elements tends to be noncomprehended.
- For example, any specific future behavior of a mind is potentially affected by any element of the mind. So to comprehend specific future behavior of a mind is to comprehend the whole mind, to some extent.
- Likewise, any propositional belief of a mind, especially one about the world (as opposed to about math), potentially depends on very many other elements (evidence, other beliefs, reasoning).
- Such elements might still be partially comprehendible. For example, an observer can't contain consequences of a strong mind's beliefs, but can know some things about the mind's beliefs. Suppose that if the mind believes the proposition P, then it does Z, and if it believes not-P, then it does A. The observer can't be confident what the mind will believe, or whether the mind will do Z or A, beyond what the observer knows about P. But the observer can be confident that the mind will, if it thinks in terms of the language of P, assign probabilities satisfying $p(P) \approx 1 - p(\neg P)$; and therefore, however the mind behaves, that behavior has to be the behavior recommended by a belief state satisfying the constraint $p(P) \approx 1 - p(\neg P)$.
- However, the invariances that give partial comprehensibility for integrative elements are themselves not only provisional, but also highly "polygenic": lots of elements influence the presence or absence of the preconditions for those invariances. For example: If a mind is well-understood as wanting something, then it has a telophore (e.g. it has a decision theory). But since the telophore densely helps to determine the ultimate effects of the agent on the world, any optimization flowing through the mind is incentivized to intervene on the telophore. A resulting change to the telophore would change how an observer should understand the mind as having effects on the world, hence invalidating the supposed comprehensiveness of the observer's previous understanding of the mind with its pre-modification telophore.
Gemini modeling, if feasible, is a source of comprehensive understanding.
- Suppose a mind M has an element E. Suppose an observer has a twin element E' of E, and has a mental context in which E' will fully unfold in the same way that E fully unfolds in M. Then the observer gemini models E.
- In this case the observer understands E in M comprehensively. Conclusions that the observer draws about E in M can be founded on the shared context, and so will be valid as E unfolds in M in that context.
- How can gemini modeling of E be supported well enough by mental contexts that are easier to acquire than the full mental context of E in M?

Agency

Agency comes along with novelty. It comes along unannounced.

The following uses "agent", "agency", and "values" pretheoretically. Roughly, an agent is a mind that pushes the world in a direction——a mind that has large effects on the world. An agent puts its understanding towards accomplishing tasks, and chooses actions that it predicts will lead to worlds that score highly on some fixed criterion. What the agent values is the direction it pushes the world. For partial, more-detailed parsings of the agent concept, see "Control", "Standard agent properties", and "Advanced agent properties".

Pressure towards agency on a voyage of novelty

If a mind takes a far-ranging voyage of novelty, then it likely also has large effects on the world.

Novel understanding might imply novel agency.
Seeking novelty is seeking possibilization, and possibilization might not be demarcatable from actualization.
For the mind to gain a lot of novelty, it has to autonomously investigate in novel contexts. Autonomous investigation maybe can't be separated out from agency.
A mind that gains a lot of structure is a mind that becomes strategically coherent.
- It may be that counting-up coherence can't be demarcated from counting-down coherence.
- The mind continues its trajectory of creativity; since the mind gains a lot of structure, by induction, the mind continues gaining structure.
- So the mind becomes coherent in more and more ways, i.e. the mind more and more pushes the world in a direction.
The operation of the mind that makes the mind's novel structure available for humans to possess is not separated out from the generators of the mind's novelty. The generators of the mind's novelty have some agency in them.
To gain some kinds of novelty, agency might be required, or at least agency might be the easiest way to get that novelty. For example, what sort of entity discovers the idea of [truth in general] as a topic to investigate and have ideas about? An agent that is free to explore any domain will discover the convergently instrumental structure that there is to discover. That structure would tend to also be useful for humans, if it could be possessed by humans. (This is maybe related to David Deutsch's idea that creativity is what happens when a mind solves problems. Agents are what have problems.) Also, gaining possession of novelty (that is, rendering the structure usable for purposes beyond its first use) is a shared goal between the agent and humans. So agents are at least one way to bring in needed and widely useful structure in needful domains. (What non-agent ways could there be?)
Even if the mind can gain a lot of understanding without much agency, a little bit of strategicness can collapse the mind into a world-optimizer.
(Besides minds being pressured to be agentic, agency is also in some ways a helpful property: it might point the way to reflective stability, and it might point the way to a comprehensive understanding.)

The murkiness of values

The existing concepts of values and agency don't explain what determines the direction of the effects that a mind has on the world or how an observer could specify that direction.

There may be no clean separation of values from understanding, especially before a mind taking a voyage has stabilized into pursuing a fixed kind of world. In that case, the fraughtness of understanding a mind's understanding is also fraughtness in understanding a mind's values: the mind's values constantly shift in any aspect, they are alien, they are inexplicit, and they are not encompassed.
It's hard to know what ultimate end the mind is pursuing, and whether the mind is pursuing an ultimate end. Ultimate ends may be easily hidable behind convergent subgoals. Further: even the presence of ultimate ends, rather than say instrumental goals serving humans corrigibly, may be easily hidable behind convergent goals.
If the voyage of novelty requires some strong kind of autonomy, then more is staked on humans' understanding of the mind's values at the beginning of the mind's journey. The mind's ultimate ends have to be already known, and already knowably stable, before it goes without correction through a wide range of coping with novelty.
What determines what effects a mind has on the world?
What even are values? That is, what is "trying to ultimately do"?
How do values sit within a mind?
- How does a preexisting ability or "trying to do" interface with a novel ability or "trying to do"?
- How can values be comprehensive? How do values control the effects of a mind?
- To be comprehensive, "the values" have to "encompass the understanding". For example, the agent has to be stable under reflective self-modification. Contrast: Solomonoff induction may give malignant outputs.
How can a comprehensive "trying to do" be specified?
