Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda. Based in Israel.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Physicalist agents see themselves as inhabiting an *unprivileged* position within the universe. However, it's unclear whether humans should be regarded as such agents. Indeed, monotonicity is highly counterintuitive for humans. Moreover, historically human civilization struggled a lot with accepting the Copernican principle (and is still confused about issues such as free will, anthropics and quantum physics which physicalist agents shouldn't be confused about). This presents a problem for superimitation.

What if humans are actually cartesian agents? Then, it makes sense to consider a variant of physicalist superimitation where instead of just seeing itself as unprivileged, the AI sees the *user* as a privileged agent. We call such agents "transcartesian". Here is how this can be formalized as a modification of IBP.

In IBP, a hypothesis is specified by choosing the state space and the belief . In the transcartesian framework, we require that a hypothesis is augmented by a mapping , where is the action set of the *reference agent* (user) and is the observation set of the reference agent. Given the source code of the reference agent, we require that is supported on the set

That is, the actions of the reference agent are indeed computed by the source code of the reference agent.

Now, instead of using a loss function of the form , we can use a loss function of the form which doesn't have to satisfy any monotonicity constraint. (More generally, we can consider hybrid loss functions of the form monotonic in the second argument.) This can also be generalized to reference agents with hidden rewards.

As opposed to physicalist agents, transcartesian agents *do* suffer from penalties associated with the description complexity of bridge rules (for the reference agent). Such an agent can (for example) come to believe in a simulation hypothesis that is unlikely from a physicalist perspective. However, since such a simulation hypothesis would be compelling for the reference agent as well, this is not an alignment problem (epistemic alignment is maintained).

Until now I believed that a straightforward bounded version of the Solomonoff prior cannot be the frugal universal prior because Bayesian inference under such a prior is NP-hard. One reason it is NP-hard is the existence of pseudorandom generators. Indeed, Bayesian inference under such a prior distinguishes between a pseudorandom and a truly random sequence, whereas a polynomial-time algorithm cannot distinguish between them. It also seems plausible that, in some sense, this is the *only* obstacle: it was established that if one-way functions don't exist (which is equivalent to pseudorandom generators not existing), computing time-bounded Kolmogorov complexity is polynomial-time in the average-case^{[1]}.

However, if pseudorandom sequences are truly the only obstacle, then this problem seems remarkably similar to the password game. Indeed, correctly predicting a pseudorandom sequence requires extracting its seed, which is a piece of completely structureless random information similar to a password. This leads to the following bold conjecture: what if **it is not only statistically, but also computationally feasible to achieve an effective epistemic regret bound for a bounded Solomonoff prior?** (Assuming some computationally bounded theory of algorithmic statistics.)

Arguably, a pseudorandom sequence with a fixed seed cannot rule this out because the seed length would count for time-bounded Kolmogorov complexity but not for time-bounded sophistication (whatever the latter means), and hence the regret bound would have a penalty exponential in the length of the seed, accounting for the computational difficulty of extracting it. A pseudorandom sequence with a random seed also cannot rule this out, because, while sampling such a sequence is easy, predicting it based on past observations is hard, so we are penalized by its superpolynomial time-bounded Kolmogorov complexity (for the right notion of "time-bounded").

^{[1]} Admittedly, the fact that it's only average-case makes the evidence a lot weaker.

I have a question about the conjecture at the end of Direction 17.5. Let be a utility function with values in and let be a strictly monotonous function. Then and have the same maxima. can be non-linear, e.g. . Therefore, I wonder if the condition should be weaker.

No, because it changes the *expected value* of the utility function under various distributions.
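To make the point about expected values concrete, here is a minimal sketch. The specific utility, the transformation `f(v) = v ** 2`, and the two lotteries are all illustrative assumptions, not taken from the original discussion. It shows a strictly increasing non-linear map that preserves the best deterministic outcome but flips the preference between two lotteries.

```python
def expected(utility, lottery):
    """Expected utility of a lottery given as {outcome: probability}."""
    return sum(p * utility(x) for x, p in lottery.items())

# Hypothetical utility with values in [0, 1] and a strictly increasing,
# non-linear transformation of it (both chosen only for illustration).
u = lambda x: x
f = lambda v: v ** 2
fu = lambda x: f(u(x))

outcomes = [0.0, 0.6, 1.0]
sure_thing = {0.6: 1.0}          # outcome 0.6 with certainty
gamble = {0.0: 0.5, 1.0: 0.5}    # outcomes 0 and 1, each with probability 1/2

# The maximizing outcome is the same for u and for f(u).
same_argmax = max(outcomes, key=u) == max(outcomes, key=fu)

# The ranking of the two lotteries is not the same.
prefers_sure_under_u = expected(u, sure_thing) > expected(u, gamble)
prefers_sure_under_fu = expected(fu, sure_thing) > expected(fu, gamble)

print(same_argmax, prefers_sure_under_u, prefers_sure_under_fu)
```

Under `u` the sure outcome is preferred, while under `f(u)` the gamble is preferred, even though both functions agree on which single outcome is best.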

Moreover, I ask myself if it is possible to modify by a small amount at a place far away from the optimal policy such that is still optimal for the modified utility function.

Good catch, the conjecture as stated is obviously false. Because, we can e.g. take to be the same as everywhere except after some action which doesn't actually take, in which case make it identically 0. Some possible ways to fix it:

- Require the utility function to be of the form (i.e. not depend on actions).
- Use (strictly) instrumental reward functions.
- Weaken the conclusion so that we're only comparing and on-policy (but this might be insufficient for superimitation).
- Require to be optimal off-policy (but it's unclear how this can generalize to finite ).


Here’s a plausible human circular preference. You won a prize! Your three options are: (A) 5 lovely plates, (B) 5 lovely plates and 10 ugly plates, (C) 5 OK plates.

No one has done this exact experiment to my knowledge, but plausibly (based on discussion of a similar situation in *Thinking, Fast and Slow*, chapter 15) this is a circular preference in at least some people: When people see just A & B, they'll pick B because "it's more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever". When they see just B & C, they'll pick C because "the average quality is higher". When they see just C & A, they'll likewise pick A because "the average quality is higher".

This makes no sense to me. Why would you pick C over B? B Pareto dominates C since it contains 5 lovely plates whereas C only has 5 OK plates.


I propose the axioms A1-A3 together with

B2. If then for any we have

B3. If and , then for any we have

I suspect that these imply C4.

Maybe I am confused by what you mean by . I thought it was the state space, but that isn't consistent with in your post which was defined over ?

I'm not entirely sure what you mean by *the* state space. is *a* state space associated specifically with the *utility function*. It has nothing to do with the state space of the environment. The reward function in the OP is , not . I slightly abused notation by defining in the parent comment. Let's say it's and is defined by using to translate the history to the (last) state and then applying .

One more question, this one about the priors: what are they a prior over exactly? ...I ask because the term will be positive infinity if is zero for any value where is non-zero.

The prior is just an environment i.e. a partial mapping defined on every history to which it doesn't itself assign probability . The expression means that we consider all possible ways to choose a Polish space , probability distributions and a mapping s.t. and (where the expected value is defined using the Bayes law and *not* pointwise, see also the definition of "instrumental states" here), and take the minimum over all of them of .

Good idea!

Fix some alphabet . Here's how you make an automaton that checks that the input sequence (an element of ) is a subsequence of some infinite periodic sequence with period . For every in , let be an automaton that checks whether the symbols in the input sequences at places s.t. are all equal (its number of states is ). We can modify it to make a transducer that produces its unmodified input sequence if the test passes and if the test fails. It also produces when the input is . We then chain and get the desired automaton. Alternatively, we can connect the in parallel and then add an automaton with boolean inputs that acts as an AND gate. is a valid multi-input automaton in our language because AND is associative and commutative (so we indeed get a functor on the product category).

Notice that the description size of this automaton in our language is polynomial in . On the other hand, a tabular description would be exponential in (the full automaton has exponentially many states). Moreover, I think that any regular expression for this language is also exponentially large.
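As a rough illustration of the parallel construction above, here is a plain Python sketch rather than anything written in the categorical language itself. The names `n` (the period), `k` (the residue class) and the function names are assumptions introduced for this example. Each residue checker only tracks the position modulo `n` and one remembered symbol, and the results are combined with an AND.

```python
def make_residue_checker(n, k):
    """Checker for one residue class: accepts iff all symbols at
    positions i with i % n == k are equal."""
    def run(word):
        seen = None
        for i, symbol in enumerate(word):
            if i % n != k:
                continue
            if seen is None:
                seen = symbol          # first symbol in this residue class
            elif symbol != seen:
                return False           # mismatch: the test fails
        return True
    return run

def is_consistent_with_period(word, n):
    """Run the n residue checkers 'in parallel' and AND their verdicts."""
    return all(make_residue_checker(n, k)(word) for k in range(n))

print(is_consistent_with_period("abcabca", 3))
print(is_consistent_with_period("abcabd", 3))
```

The sketch has `n` small components instead of one table indexed by every combination of remembered symbols, which is the informal reason the compositional description stays small.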

We only talked about describing deterministic (or probabilistic, or monadic) automata. What about nondeterministic? Here is how you can implement a nondeterministic automaton in the same language, without incurring the exponential penalty of determinization, assuming non-free categories are allowed.

Let be some category that contains an object and a morphism s.t. and . For example it can be the closed cartesian category freely generated by this datum (which I *think* is well-defined). Then, we can simulate a non-deterministic automaton on category by a deterministic transducer from to :

- The state set is always the one element set (or, it can be two elements: "accept" and "reject").
- For every state of , we have a variable of signature . This variable is intended to hold when the state is achievable with the current input and otherwise.
- We use the fact that composition of endomorphisms of behaves as an "OR" operator to implement the transition rules of .
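The following is only a loose analogue of the construction in the list above, written in ordinary Python: it keeps one boolean flag per state of the nondeterministic automaton (standing in for the per-state variables) and uses boolean OR where the construction uses composition of endomorphisms. The function name, the dictionary encoding of transitions, and the example automaton are assumptions made for illustration.

```python
def simulate_nfa(states, transitions, start, accepting, word):
    """Deterministically track which NFA states are reachable.

    transitions maps (state, symbol) to a set of successor states.
    """
    reachable = {q: (q in start) for q in states}
    for symbol in word:
        nxt = {q: False for q in states}
        for (src, sym), targets in transitions.items():
            if sym == symbol and reachable[src]:
                for t in targets:
                    nxt[t] = nxt[t] or True   # OR: any path reaching t suffices
        reachable = nxt
    return any(reachable[q] for q in accepting)

# Hypothetical example: words over {a, b} that end in "ab".
states = {0, 1, 2}
transitions = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}

print(simulate_nfa(states, transitions, {0}, {2}, "aab"))
print(simulate_nfa(states, transitions, {0}, {2}, "aba"))
```

The number of flags equals the number of NFA states, so no subset construction (and no exponential blow-up) is needed.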

However, this trick doesn't give us nondeterministic transducers. A completely different way to get nondeterminism, which works for transducers as well, is using infra-Bayesian Knightian uncertainty to represent it. We can do this via the memo monad approach, but then we get the constraint that nondeterministic transitions happen identically in e.g. identical subgraphs. This doesn't matter for ordinary automata (that process words) but it matters for tree/graph automata. If we don't want this constraint, we can probably do it in a framework that uses "infra-Markov" categories (which I don't know how to define atm, but it seems likely to be possible).

Together with Example 1, this implies that (this version of) our language is strictly more powerful than regular expressions.

Suppose we want to take two input sequences (elements of ) and check them for equality. This is actually impossible without the meta-vertex, because of the commutativity properties of category products. With the meta-vertex (the result is not an automaton anymore, we can call it a "string machine"), here is how we can do it. Denote the input sequences and . We apply a transducer to that transforms it into a string diagram. Each symbol in is replaced by a transducer that checks whether the first symbol of *its* input is and outputs the same sequence with the first symbol removed. These transducers are chained together. In the end of the chain we add a final transducer that checks whether its input is empty. Finally, we use the meta-vertex to feed into this chain of transducers.

Notice that this requires only one level of meta: the string diagram we generate doesn't contain any meta-vertices.
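The equality check described above can be sketched in plain Python, with functions standing in for transducers and a list standing in for the generated string diagram. The names `x`, `y`, `symbol_checker`, `final_checker` and the use of `None` as a failure output are assumptions for this sketch, not part of the original formalism.

```python
def symbol_checker(expected):
    """Transducer for one symbol of x: check that the first symbol of the
    input equals `expected`, then output the input with that symbol removed."""
    def step(seq):
        if seq is None or len(seq) == 0 or seq[0] != expected:
            return None               # None plays the role of a failure output
        return seq[1:]
    return step

def final_checker(seq):
    """Final transducer in the chain: accept iff the remaining input is empty."""
    return seq is not None and len(seq) == 0

def build_chain(x):
    """First stage: turn x into a chain of per-symbol checkers."""
    return [symbol_checker(a) for a in x]

def equal_via_chain(x, y):
    """Second stage (the 'meta' step): feed y through the chain built from x."""
    current = y
    for stage in build_chain(x):
        current = stage(current)
    return final_checker(current)

print(equal_via_chain("abc", "abc"))
print(equal_via_chain("abc", "abd"))
```

The chain built from `x` contains only ordinary checkers, which mirrors the remark that a single level of meta suffices.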

More generally, we can simulate any multi-tape automaton using a succinct string machine with one level of meta: First, there is a transducer that takes the tapes as separate inputs and simulates a single step of the multi-tape automaton. Here, moving the tape head forward corresponds to outputting a sequence with the first symbol omitted. Second, there is a transducer that takes the inputs and produces a string diagram that consists of copies of chained together (this transducer just adds another to the chain per every input symbol). Finally, the desired string machine runs and then uses a meta-vertex to apply the diagram produces to the input.

A string machine with unbounded meta can simulate a 1D cellular automaton, and hence any Turing machine: First, there is a transducer which simulates a single step of the automaton (this is a classical transducer, not even the stronger version we allow). Second, we already know there is an automaton that tests for equality. Third, fix some automaton whose job is to read the desired output off the final state of the cellular automaton. We modify to make a transducer , which (i) when the inputs are equal, produces a string diagram that consists only of (ii) when the inputs are different, produces this string machine. This string machine applies to the input , then applies to and , then uses the meta-vertex to apply to ...

Except that I cheated here because the description of references a string machine that contains . However, with *closed* machines such quining can be implemented: instead of producing a string diagram, the transducer produces an morphism that converts transducers to string diagrams, and then we feed the same transducer composed in the same way with itself into this morphism.

Jobst Heitzig asked me whether infra-Bayesianism has something to say about the absent-minded driver (AMD) problem. Good question! Here is what I wrote in response: