Infra-Bayesian physicalism: a formal theory of naturalized induction

[-]Vanessa Kosoy3y70Review for 2021 Review

The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven't made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven't found any new problems with the framework.

The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem evolved a little, I am still at a spot where every solution I know requires biting a strange philosophical bullet. On the other hand, IBP is still my best guess about naturalized induction, and, more generally, about the conjectured "attractor submanifold" in the space of minds, i.e. the type of mind to which all sufficiently advanced minds eventually converge.

One important development that did happen is my invention of the PreDCA alignment protocol, which critically depends on IBP. I consider PreDCA to be the most promising direction I know at present for solving alignment, and an important (informal) demonstration of the potential of the IBP formalism.

[-]Morpheus3y40

A physicalist hypothesis is a pair $(Φ, Θ)$, where $Φ$ is a finite^[4:2] set representing the physical states of the universe and $Θ \in □ (Γ \times Φ)$ represents a joint belief about computations and physics. [...] Our agent will have a prior over such hypotheses, ranging over different $Φ$.

I am confused about what the state space $Φ$ is adding to your formalism and how it is supposed to solve the ontology identification problem. Based on what I understood, if I want to use this for inference, I have this prior $ξ \in □^{c} (Φ, Θ)$, and now I can use the bridge transform to project $Φ$ out again to evaluate my loss in different counterfactuals. But when looking at your loss function, it seems like most of the hard work is actually done in your relation $C \in \mathrm{el}^{Γ}$ that determines which universes are consistent, but its definition does not seem to depend on $Φ$. How is that different from having a prior that is just over $ξ \in □^{c} (Γ)$ and taking the loss, if $Φ$ is projected out anyway and thus not involved?

[-]Vanessa Kosoy3y30

First, the notation makes no sense. The prior is over hypotheses, each of which is an element of $□ (Γ \times Φ)$ . $Θ$ is the notation used to denote a single hypothesis.

Second, having a prior just over $Γ$ doesn't work since both the loss function and the counterfactuals depend on $2^{Γ} \times Γ$ .

Third, the reason we don't just start with a prior over $2^{Γ} \times Γ$ is that it matters which prior we have. Arguably, the correct prior is the image of a simplicity prior over physicalist hypotheses by the bridge transform. But, come to think of it, it might be about the same as having a simplicity prior over $2^{Γ} \times Γ$, where each hypothesis is constrained to be invariant under the bridge transform (thanks to Proposition 2.8). So, maybe we can reformulate the framework to get rid of $Φ$ (but not of the bridge transform). Then again, finding the "ultimate prior" for general intelligence is a big open problem, and maybe in the end we will need to specify it with the help of $Φ$.

Fourth, I wouldn't say that $Φ$ is supposed to solve the ontology identification problem. The way IBP solves the ontology identification problem is by asserting that $2^{Γ} \times Γ$ is the correct ontology. And then there are tricks for translating between other ontologies and this ontology (which is what section 3 is about).

[-]Jon Garcia4y30

Could you explain what the monotonicity principle is, without referring to any symbols or operators? I gathered that it is important, that it is problematic, that it is a necessary consequence of physicalism absent from cartesian models, and that it has something to do with the min-(across copies of an agent) max-(across destinies of the agent copy) loss. But I seem to have missed the content and context that makes sense of all of that, or even in what sense and over what space the loss function is being monotonic.

Your discussion section is good. I would like to see more of the same without all the math notation.

If you find that you need to use novel math notation to convey your ideas precisely, I would advise you to explain what every symbol, every operator, and every formula as a whole means every time you reference them. With all the new notation, I forgot what everything meant after the first time they were defined. If I had a year to familiarize myself with all the symbols and operators and their interpretations and applications, I imagine that this post would be much clearer.

That being said, I appreciate all the work you put into this. I can tell there's important stuff to glean here. I just need some help gleaning it.

[-]Vanessa Kosoy4y40

Could you explain what the monotonicity principle is, without referring to any symbols or operators?

The loss function of a physicalist agent depends on which computational facts are physically manifest (roughly speaking, which computations the universe runs), and on the computational reality itself (the outputs of computations). The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of fewer facts. Roughly speaking, the more computations the universe runs, the better.

This is odd, because it implies that the total destruction of the universe is always the worst possible outcome. And, the creation of an additional, causally disconnected, world can never be net-negative. For a monotonic agent, there can be no net-negative world^[1]. In particular, for selfish monotonic agents (those that assign value only to their own observations), this means death is the worst possible outcome and the creation of additional copies of the agent can never be net-negative.

With all the new notation, I forgot what everything meant after the first time they were defined.

Well, there are the "notation" and "notation reference" subsections, which might help.

That being said, I appreciate all the work you put into this. I can tell there's important stuff to glean here.

Thank you!

^[1] At least, all of this is true if we ignore the dependence of the loss function on the other argument, namely the outputs of computations. But it seems like that doesn't qualitatively change the picture.

[-]ViktoriaMalyasova3y40

Thank you for explaining this! But then how can this framework be used to model humans as agents? People can easily imagine outcomes worse than death or destruction of the universe.

[-]Vanessa Kosoy3y20

The short answer is, I don't know.

The long answer is, here are some possibilities, roughly ordered from "boring" to "weird":

1. The framework is wrong.
2. The framework is incomplete: there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier and without further research it's hard to say whether they break important things or not.
3. Humans are just not physicalist agents; you're not supposed to model them using this framework, even if this framework can be useful for AI. This is why humans took so much time coming up with science.
4. Like #3, and also if we thought long enough we would become convinced of some kind of simulation/deity hypothesis (where the simulator/deity is a physicalist), and this is normatively correct for us.
5. Because the universe is effectively finite (since it's asymptotically de Sitter), there are only so many computations that can run. Therefore, even if you only assign positive value to running certain computations, it effectively implies that running other computations is bad. Moreover, the fact that the universe is finite is unsurprising, since infinite universes tend to have all possible computations running, which makes them roughly irrelevant hypotheses for a physicalist.
6. We are just confused about hell being worse than death. For example, maybe people in hell have no qualia. This makes some sense if you endorse the (natural for physicalists) anthropic theory that only the best-off future copy of you matters. You can imagine there always being a "dead copy" of you, so that if something worse-than-death happens to the apparent-you, your subjective experiences go into the "dead copy".
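
The opportunity-cost argument in the finite-universe possibility above can be illustrated with a toy model. This is only a sketch under strong assumptions that are not part of the framework: every computation occupies exactly one "slot", the universe has a fixed number of slots, and each computation has a fixed non-negative value. The function names and the numbers are hypothetical.

```python
def best_total_value(values, capacity):
    """Largest total value achievable by running at most `capacity` computations.

    `values` holds one non-negative value per available computation
    (a toy assumption: each computation uses exactly one slot).
    """
    return sum(sorted(values, reverse=True)[:capacity])


def opportunity_cost(values, capacity, forced_slots=1):
    """Value lost when `forced_slots` slots are spent on worthless computations."""
    remaining = max(capacity - forced_slots, 0)
    return best_total_value(values, capacity) - best_total_value(values, remaining)


# Hypothetical values of four computations, in a universe with room for three.
values = [5.0, 3.0, 2.0, 0.5]
capacity = 3

print(best_total_value(values, capacity))  # 10.0
print(opportunity_cost(values, capacity))  # 2.0
```

No computation here has negative value, yet forcing the universe to run a worthless one still costs 2.0, because it displaces a valued computation. When capacity exceeds the number of valued computations, the cost drops to zero, which is the sense in which the argument relies on the universe being effectively finite.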

[-]Jon Garcia4y00

The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of fewer facts. Roughly speaking, the more computations the universe runs, the better.

I think this is what I was missing. Thanks.

So, then, the monotonicity principle sets a baseline for the agent's loss function that corresponds to how much less stuff can happen to whatever subset of the universe it cares about, getting worse the fewer opportunities become available, due to death or some other kind of stifling. Then the agent's particular value function over universe-states gets added/subtracted on top of that, correct?

[-]Vanessa Kosoy4y30

No, it's not a baseline, it's just an inequality. Let's do a simple example. Suppose the agent is selfish and cares only about (i) the experience of being in a red room and (ii) the experience of being in a green room. And, let's suppose these are the only two possible experiences: it can't experience going from a room in one color to a room in another color or anything like that (for example, because the agent has no memory). Denote by $G$ the program corresponding to "the agent deciding on an action after it sees a green room" and by $R$ the program corresponding to "the agent deciding on an action after it sees a red room". Then, roughly speaking^[1], there are 4 possibilities:

$α_{\emptyset}$ : The universe runs neither $R$ nor $G$ .
$α_{R}$ : The universe runs $R$ but not $G$ .
$α_{G}$ : The universe runs $G$ but not $R$ .
$α_{R G}$ : The universe runs both $R$ and $G$ .

In this case, the monotonicity principle imposes the following inequalities on the loss function $L$ :

$L (α_{\emptyset}) \geq L (α_{R})$
$L (α_{\emptyset}) \geq L (α_{G})$
$L (α_{R}) \geq L (α_{R G})$
$L (α_{G}) \geq L (α_{R G})$

That is, $α_{\emptyset}$ must be the worst case and $α_{R G}$ must be the best case.
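
The inequalities above can be checked mechanically. The following is a minimal sketch in the simplified picture of this example, where a state of the universe is just the set of programs it runs (ignoring entanglement between programs). The helper names and the two loss tables are hypothetical illustrations, not taken from the post.

```python
from itertools import combinations


def powerset(programs):
    """Yield every subset of `programs` as a frozenset."""
    items = sorted(programs)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_monotone(loss, programs):
    """True if running more programs never increases the loss.

    `loss` maps each set of running programs (a frozenset) to a number.
    """
    subsets = list(powerset(programs))
    return all(
        loss[big] <= loss[small]
        for small in subsets
        for big in subsets
        if small <= big  # `small` is a subset of `big`
    )


EMPTY, R, G, RG = frozenset(), frozenset("R"), frozenset("G"), frozenset("RG")

# Hypothetical loss that likes green more than red but still satisfies all four inequalities.
prefers_green = {EMPTY: 1.0, R: 0.8, G: 0.2, RG: 0.1}

# Hypothetical loss that treats the red room as worse than nothing at all.
hates_red = {EMPTY: 0.5, R: 1.0, G: 0.2, RG: 0.6}

print(is_monotone(prefers_green, "RG"))  # True
print(is_monotone(hates_red, "RG"))      # False
```

The second table is the kind of preference the monotonicity principle rules out: it assigns a higher loss to the universe running $R$ than to the universe running nothing.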

^[1] In fact, manifesting of computational facts doesn't amount to selecting a set of realized programs, because programs can be entangled with each other, but let's ignore this for simplicity's sake.

[-]David Johnston4y00

Γ=Σ^R, it's a function from programs to what result they output. It can be thought of as a computational universe, for it specifies what all the functions do.

Should this say "elements are function... They can be thought of as...?"

Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism? If the second, is there a simple explanation of where probability theory fails?

[-]Vanessa Kosoy4y20

Should this say "elements are function... They can be thought of as...?"

Yes, the phrasing was confusing, I fixed it, thanks.

Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism?

We really need infrabayesianism. On bayesian hypotheses, the bridge transform degenerates: it says that, more or less, all programs are always running. And, the counterfactuals degenerate too, because selecting most policies would produce "Nirvana".

The idea is, you must have Knightian uncertainty about the result of a program in order to meaningfully speak about whether the universe is running it. (Roughly speaking, if you ask "is the universe running 2+2?" the answer is always yes.) And, you must have Knightian uncertainty about your own future behavior in order for counterfactuals to be meaningful.

It is not surprising that you need infrabayesianism in order to do naturalized induction: if you're thinking of the agent as part of the universe then you are by definition in the nonrealizable setting, since the agent cannot possibly have a full description of something "larger" than itself.

^[1] If we ignore ideas such as the von Neumann-Wigner interpretation of quantum mechanics.
^[2] This other direction also raises issues with counterfactuals. These issues are also naturally resolved in our formalism.
^[3] The notation $Θ \times Λ$ is reserved for a different, commutative, operation which we will not use here.
^[4] A simplifying assumption we are planning to drop in future articles.
^[5] We have considered this type of setting before with somewhat different motivation.
^[6] The careful reader will observe that programs sometimes don't halt which means that the "true" computational universe is ill-defined. This turns out not to matter much.
^[7] Previously we defined pullback s.t. it can only be applied to particular infra/ultradistributions. Here we avoid this limitation by using infra/ultracontributions as the codomain.
^[8] Meaning that, this rule is part of the definition of the state rather than a claim about physical reality.
^[9] Disintegrating a distribution into a semidirect product yields a unique result, but for contributions that's no longer true, since it's possible to move scalars between the two factors.
^[10] One can engineer loss functions for which they are not suppressed, for example if the loss only depends on the actions and not on the observations. But such examples seem contrived.
^[11] It's a self-referential definition, but we can probably resolve the self-reference by quining.
^[12] One setting in which there is an incentive: Suppose there are multiple users and the AI is trying to find a compromise between their preferences. Then, it might decide to create disconnected worlds optimized for different users. But, this solution might be much worse than the AI thinks, if Alice's world contains Bob!atrocities.

AI ALIGNMENT FORUM

0. Background

1. Formalism

Notation

Notation Reference

Setting

Bridge transform

Evaluating policies

Evaluating agents

2. Properties of the bridge transform

Sanity test

Downwards closure

Simple special case

Bound in terms of support

Idempotence

Refinement

Mixing hypotheses

Conjunction

Factoring $Γ$

Continuity

3. Constructing loss functions

Selfish agents

Unobservable states

Cellular automata

Diamond maximizer

4. Translating Cartesian to Physicalist

Ordinary laws

Turing laws

5. Discussion

Summary

What about Solomonoff's universality?

Are manifest facts objective?

Physicalism and alignment

Future research directions
