How complex are myopic imitators?

Vivek Hebbar

This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.

TL;DR: Time-limited myopic agents might be desirable for their safety properties; one such case is that of "myopic imitators", which imitate a non-myopic process like a human, but do so without considering consequences. Since DL systems are biased towards simple functions, I consider the question of whether myopic imitators are simpler or more complex than their deceptively aligned competitors.

For this analysis, we impose a strong assumption which is similar to minimal program complexity, as well as a clean division between a world model, an "internal interpreter", and a main algorithm which is executed. The analysis is structured as follows:

Operationalizations for "myopia" and "simplicity prior"
Methodology for assessing the simplicity of a candidate algorithm
Analysis of candidate algorithms under various sets of further assumptions

Under these very strong assumptions, and out of the algorithms analyzed, myopic imitators are usually favored. One of the two leading myopic imitators is a straightforward human-imitator which seems to have nice properties, but is sometimes disadvantaged by an inability to do a certain Goodhart-style exploit. The other one, though myopic, is a "training set predictor" which generalizes terribly in some cases. That being said, the specific results are to be taken with a grain of salt, since the assumptions are unlikely to hold in reality. The main value going forward is the general approach and methodology.

Assumptions, worldview, and research vision

The analysis in this post applies to a world where a very specific set of assumptions applies. To be clear, I am quite confident that this set of assumptions does not apply to current deep learning systems. Current deep learning systems seem to be characterized more by circuits and interpolation than by the types of programs I consider here. However, the analysis here is interesting as a limiting case, and the methodology might be transferable to more realistic assumption sets.

The main value of this post is to provide a particular frame and methodology for thinking about the question: "How is a system with a strong simplicity prior likely to turn out by default?"

By default, when I say "simple" or "complex", I am referring to program length, not runtime. However, note that program length is not the only possible measure of complexity; in fact, the current inductive bias literature uses a different measure, as will be discussed later.

1 - Program approximation and power: The trained model closely approximates a program in some language, which is allowed to contain such abstractions as function reuse, recursion, interpretation, etc. This may be realized in reality through things like recurrence.

2 - Internal interpreter: The model will contain some sort of "interpreter" internally which can take an algorithm represented in terms of internally defined tokens and execute it. This might follow from minimal program length; a large program in an inefficient language can be refactored as {interpreter for a more suitable language} + {much shorter program}. One effect of this assumption is that the complexity of other parts can depend on the lengths of tokens in the language of this internal interpreter.

3 - Simplicity prior: Training a model to zero loss corresponds to taking the set of all programs below a certain length and runtime which perfectly fit the training set, then choosing one at random, weighted by some simplicity prior. In other words, the probability ratio between two perfect algorithms depends on only their complexity. This may seem terribly wrong at first, since we might expect something like "reachability by gradient descent" or "buildability/evolvability from an imperfect intermediate step" to be dominant. However, research suggests that simplicity is an empirically strong predictor, and the Lottery Ticket Hypothesis gives reason to believe that models are mostly "found", not built up from partial versions. In reality, there might be significant path dependence, but for the purpose of this analysis, we will assume that simplicity is the only factor in determining the probability of a valid program.

4 - Nature of pre-training: Pre-training the model on a large dataset (but not to zero loss) will cause it to learn a world model in the form of reusable internal "functions", which will persist for free upon fine-tuning.

5 - Nature of fine-tuning: Fine-tuning will result in the formation of an "algorithm" which is wired to output, whose description length is simply added to that of the world model. The "algorithm" is make up of tokens (such as entities and functions from the world-model), is stored in compressed form, and "run" by the interpreter.

6 - Serial depth and computation time: The depth of serial computation, and the total number of operations at inference time, are both presumed to be capped at some very large amount. The cap is large enough for all algorithm discussed here to execute, but small enough to rule out things like AIXI. This allows us to potentially avoid the malign-ness of the true universal prior.

Myopia

An agent is said to be myopic (“short-sighted”) when it cares less about long-term consequences than about immediate things (such as properties of its immediate next action) or short-term consequences. Myopia naturally falls on a spectrum, but in this post I will take myopic to mean specifically “time-limited myopic” as defined in Open Problems with Myopia. A time-limited myopic agent has a discount rate of zero, meaning that it only cares about immediate properties of its next action (for example, immediate reward, or some intrinsic property of the action like “honesty”), and not the consequences of that action in future frames.

Why care about myopia?

One reason that myopic agents could be desirable is that they are less prone to instrumental incentives. Wide classes of agents display instrumental convergence to power-seeking, since power acquisition helps with any long-term goal. This power-seeking behavior is expected to include deception, since it would be in an AI’s interests to deceive their overseers into believing that the AI behaves acceptably. Time-limited myopic agents don’t have long-term goals, so they are a step in the right direction for solving these issues. However, time-limited myopic agents in general still suffer from issues like “superrationality”, which could cause them to exhibit non-myopic and deceptive behavior through trading with past instances of themselves. Further discussion of these issues can be found in Open Problems with Myopia.

Myopia for various algorithm types

In this section, I will look at different types of algorithms and what myopia means for each of them. Note that myopia is a mechanistic property, meaning that it is based on the agent’s internal algorithm rather than its observed outputs.

Candidate selection

Suppose that an AI is trying to choose an action. One way to do this is to generate a number of potential actions internally, using some kind of heuristic process, then pass forward one of them based on some criterion. The process which creates candidates (say from a random seed) can be termed the “generator”, and the process which selects a candidate to send forward can be termed the “evaluator” (denoted “”).

This type of process seems like it would be useful in a large number of contexts, and I would expect it to occur numerous times in every forward pass. For instance, a text-producing AI might produce a number of candidate completions, then select the best of these to actually output. We would expect this to be efficient if making a more perfect first-shot generator is “harder”^[1] than running a simpler one multiple times and implementing an evaluator. We would also expect to see this type of routine being used in intermediate steps of reasoning which require guessing the output of other processes, or even when guessing matters of fact which are unknown but have known properties (like consistency properties) which the evaluator can check. Agentic systems like humans could also be modeled at low resolution using a candidate selection structure, and higher resolution models would probably contain many such routines at the level of cognition (for instance, modeling part of the human mind as a semi-random generator of thoughts followed by a filtering procedure).

Decision points

Consider the following diagram:

In this diagram, each evaluator can be unique (they can each implement a different evaluation function). For the overall system to be myopic, there is a set of evaluators which must be myopic. We will call any evaluator in this set a “decision point”.

Evaluator 3 is the “final decision point” -- the evaluator which outputs what will become the final output of the AI. If it chooses using a non-myopic function, the overall system is certainly not myopic.

Evaluators 1 and 2 are also “decision points”, since they choose which candidates get to evaluator 3. If either of them are non-myopic, then the overall behavior will reflect (at least) partially non-myopic selection of outcomes.

On the other hand, we cannot reasonably expect evaluator 4 to be myopic, since it is part of an internal model of a human, and humans are highly non-myopic. For our definition to allow any sort of simulation of non-myopic entities, we must exclude evaluator 4 from the criterion. So we define a “decision point” as an evaluator whose output has a possibility of being selected as the final output of the AI. Finally, we define a candidate-selection based AI as myopic if and only if all of its decision points are myopic evaluators (evaluators which do not directly use long term consequences in their selection).

Fuzziness

Non-decision points like evaluator 4 still feed information directly into channels leading to the output. So the definition chosen here relies on a clear distinction between information vectors that represent candidate answers and information vectors which do not. If these two categories of information flow blend into each other, then it is difficult to make judgements of myopia in this paradigm. In such a case, it might be salvageable by assessing myopia in terms of percent influence by myopic evaluators, rather than a binary judgement. However, the complexity analysis carried out here uses simple, precise algorithms, so it isn’t an issue within this scope.

Other caveats

There might not actually be a discrete set of “decision points”; it is possible that selection occurs implicitly in a more decentralized way.
There is a sense in which the “generator” can be thought of as a selector over all of candidate space, narrowing it down from a huge vector space to a much smaller space. From this perspective, the generator narrows down some uninformative prior over candidate space to a much “tighter” probability distribution, whose high-probability elements or regions score highly by some relevant metric. However, the exact implicit metric is probably indistinct from the generating process in the sense that its shortest description might be “probability of being generated by {process G} on {seed space S}”. The implicit utility function which is exactly implemented by a neural network is probably hopelessly complex unless expressed in terms of the neural network itself, so it is unclear how the inherent myopic-ness of an opaque generating process could be assessed. For the purposes of this analysis, I will look only at the myopic or non-myopic nature of the internal evaluation functions, and not the generators.

Direct function calls

Algorithms in this class do not introduce any new structure beyond what is already necessary in the world model. Rather, they simply utilize functions which are already part of the world model by virtue of being necessary for inference. For instance, a network trained to imitate a human could use the following algorithm:

$h u m a n R e s p o n s e ({c o n t e x t}, {h u m P a r a m s}, i n p u t)$

A network which implements this algorithm directly calls a pre-existing $h u m a n R e s p o n s e ()$ function in its world model, passing it the context (what situation or conversational context the human is in), some information about the particular human (such as their style of speaking, areas of knowledge, etc.), and the input to which the human is responding. Some function like $h u m a n R e s p o n s e ()$ is presumed to exist in the world model of any AGI, since humans are a very prominent feature of the world, and modeling their responses to things is seems necessary for any comprehensive understanding of the world. The curly bracket notation will be explained in a later section.

Explicit decision theory

Another class of algorithms are ones which directly “call” a particular decision theory. The idea here is that the AI would already understand a variety of decision theories through learning about them in the training data, and perhaps using them to model other agents. Since this knowledge is latent in the network, the output algorithm could directly use them to decide what to output, by internally “calling” the decision theory with a particular utility function as an argument.

This is actually a sub-category of direct function calls, and I won't be analyzing any algorithms which explicitly specify a decision theory. Furthermore, it is unclear to me whether the sense of myopia satisfied by LCDT is the same as the definition I am using here.

Simplicity bias

One of the inductive biases of deep learning systems is that, all else being equal, a simple algorithm is more likely to be found than a complex one. In this analysis, we will proceed with the assumption that the relevant sense of simplicity is the description length of the algorithm which is implemented.

Assumption - Maximally compact implementation

By using the literal description length of algorithms as a measure of their complexity in a neural network, we are implicitly assuming that the network implements it in a maximally compact way. For instance, if a generator is run 10 times in parallel, we assume that the additional 9 instances take no extra complexity except for the description length of “do __ 10 times”. We will also assume that the neural network can implement programs which require function calls between arbitrary parts. Note that if necessary, a network could implement this by having a literal programming language interpreter encoded inside of it. Whether the internal structure required for maximal compactness is actually reachable by gradient descent is left aside.

Modelling the simplicity prior

In this article, Mingard describes the prior as being the volume of a function in parameter space. Specifically, it is a “volume with Gaussian measure”; this means that the parameter initialization is done using a Gaussian distribution for each parameter, and we find the effective volume by integrating the probability density function of the multivariate Gaussian over the region of parameter space which implements the function. This is equivalent to the probability of a randomly initialized network implementing the function. The (empirically backed) claim is that “simpler” functions which generalize better tend to occupy larger effective volumes in parameter space.

As far as I know, it is an open question as to which measure of simplicity is actually used in the implicit prior of current DNNs; I remember seeing both Lempel-Ziv complexity and Kolmogorov complexity mentioned in the literature, but these choices looked arbitrary.

However, for this analysis, we take the relevant sense of simplicity to be “description length” in some idealized programming language (also known as “Kolmogorov complexity”, which I will abbreviate as “k-complexity”). While this is sufficient to make ordinal comparisons of probability, I will also propose a formula to calculate actual probabilities using an analogy to random binary strings, which may or may not hold empirically.

One way to model a simplicity prior is to consider the existence of an algorithm inside a network as equivalent to the existence of a binary string inside of a large random one (this is, of course, only an approximation through rough analogy). The probability of a substring alg of length k appearing in a random binary string of length N can be approximated closely^[2] by:

For $k > l o g_{2} (N) + \frac{1}{2} l o g_{2} (\frac{1}{ϵ})$ , we can simplify this to:

(with approximate acceptable error $ϵ << 1$ and assuming $N >> l o g_{2} (N) + \frac{1}{2} l o g_{2} (\frac{1}{ϵ})$ )

The main takeaway is that, under this model, the probability of encountering a given algorithm by random chance decays about exponentially with its description length. (This should look familiar to those familiar with Solomonoff induction and Kolmogorov complexity.)

Regardless of whether this particular analogy holds for DNNs, I have a strong intuition that a prior based on k-complexity would decay roughly exponentially (going as approximately 2^-k if k-complexity is taken to be description length in binary).^[3] In practice, this is a very sharp penalty -- a 20 bit difference corresponds to an odds ratio of about a million to one!

Complexity analysis

This section will be structured in four parts:

I will first introduce one of the "myopic" algorithms for the human imitation task, to provide a concrete example for the methodology section.
Methodology: I will explain the method by which I estimate description lengths. This is probably the most important section of the post.
Main analysis: Complexity and accuracy of various algorithms, mostly on the task of imitating human responses.
Conclusions

Sections 3 and 4 have been placed in a Google doc for reasons explained later.

Methodology

I implicitly presume something like a pre-training/fine-tuning setup, or at least the presence of two separate data sets: A general data dump (like internet data), and specific training data for a task (like human imitation). For tasks where RL is used, the second is effectively generated on the fly.

What we care about in this analysis is the k-complexity required to achieve zero training loss on the specific task (fine-tuning) with each algorithm, where imperfect algorithms are brought to zero loss by adding memorized differences between its predictions and the training labels. Low but non-zero loss is assumed on the pre-training/data dump. The effects of pre-training performance requirements on the analysis are unclear; in this post, I emphasize zero fine-tuning loss in the complexity analysis. Pre-training, and the properties of the pre-training data, are important for assumption plausibility and for other parts of the analysis.

Note on penalties for memorized differences: Compression of memorized differences would be equivalent to introducing new algorithms, which could include the non-myopic ones which we wish to avoid. In general, good compression = intelligence, and deceptive optimizers are good at filling any role. This has two implications:

Compression of differences is not allowed in the complexity analysis; an algorithm will have to pay the full cost, since compressing differences could ruin any specific generalization properties it has.
Forcing a network to implement and use a safe but imperfect function is insufficient to prevent it from implementing an unsafe one to make up the difference. It may even end up implementing the same unsafe function it would have used regardless.

How long is each token?

The algorithms examined consist of three token types^[4]: Function names, pointers (internal "names" of things/concepts in the world model), and raw data. The length of raw data is assessed by estimating its true information content (Due to the compression scheme assumption introduced in later in this is section, it would be near-maximally compressed). This leaves the question of how to assess name lengths.

Claim: Different functions and pointers add different amounts of k-complexity to an algorithm. Name lengths are shorter for functions and pointers which are referenced more.

The intuition here is that functions which are accessed more frequently in the world model should get shorter “names”, since this would minimize the total length of the program. The optimal (naïve) indexing would be something like Huffman coding, which is optimal for token-by-token compression. This yields the following equation for complexity:

k-complexity of accessing x $\approx - l o g_{2} (o c c u r r e n c e s (x) / t o t a l o c c u r r e n c e s)$

where "total occurrences" is the total across all tokens.^[5]

In other words, the number of bits it takes to call a function is proportional to the logarithm of its rarity; a function which is used half as often in the "source code" will require 1 extra bit in its name.

However, this is not quite right; for a task requiring a sufficiently complex program, the truly minimal program will go to even greater lengths for compression. Some tokens are more likely to appear together; for instance, the token "+" is likely to be followed by a number. A truly minimal program will have a compression scheme which takes advantage of these correlations. A network implementing such a program could have a decoder which converts compressed representations of sub-programs into raw programs before feeding them to an internal interpreter. (This is not necessarily how it would be structured, but under our assumptions it provides an upper bound on complexity, since it could be done.) In this case, we can refer to the description length of a token in fully compressed form as its conditional complexity. For a good compression scheme, this is essentially a measure of how surprised we are to see the token in that position, given the context of the previous tokens. From information theory, the information content (and thus the approximate length) is given by:

$- l o g_{2} (P (t o k e n | p r e v i o u s t o k e n s))$

which is the equivalent of the earlier formula, but using conditional probability instead of ordinary probability (occurrence proportion). Note that the tokens and probabilities here are of occurrences in the internal program, not text tokens in the training data. However, the distribution of concepts in the training data might be important in influencing these internal occurrence probabilities.

Salience Hypothesis

The implication of this claim is that the relative simplicity of two algorithms within a larger system with pre-existing functions is not purely an intrinsic property of those algorithms, but is dependent on the frequency with which other functions are referenced. Since functions are called conditional on what processing is required for a particular input, their name lengths depend on the distribution of inputs in the training data. Therefore, the relative effective complexity of two algorithms can depend on the relative frequency (“salience”) of different tasks, contexts, and concepts in the training data. Let us call this claim the weak salience hypothesis.

I expect this hypothesis to be correct if:

The framework of highly description-length-efficient internal programs is applicable.
Architectures are permissive enough to function-call style internal programs without large redundancy.

If these assumptions don't apply, then the hypothesis probably also fails, since it relies on the idea of "internal function calls", which may not have any real-world correlate if programs are not strongly limited on description length (or if the required structure is not supported well by the architecture). If some other measure of simplicity than k-complexity is the true inductive bias, an equivalent of internal "callable" functions need not exist.

I also introduce two stronger claims about the extent to which a token's complexity is influenceable by the training data, the first of which I consider to be more doubtful:

Strong salience hypothesis: A context-dependent function/pointer/concept's internal name length can be reduced by 1 bit by doubling the number of examples which require it in the training data.

Medium-strong version: A context-dependent function/pointer/concept's internal name length can be reduced by 1 bit by doubling the number of examples which require it in the training data if those new examples use them in novel ways.

First, note that the strong salience hypothesis does not trivially follow from the length expression $- l o g_{2} (o c c u r r e n c e s (x) / t o t a l o c c u r r e n c e s)$ . The reason is that "occurrences" refers to occurrences in the program implemented by the network (including the entire world model), not the number of occurrences in the training data. It is also not referring to the number of times the function is called at execution (which, for pre-training, would be more straightforwardly proportional to the occurrences of relevant data).

The pathways by which salience in training data could affect internal occurrence are more subtle:

Since zero loss is not assumed for pre-training, all things are not modeled perfectly. Rather, we would expect parts of the world model which are used more often (in predicting the data) to be better developed. So the pathway here is: More training data requiring concept --> more developed model of concept --> more calls to related functions in the "internal code" --> shorter function name. If this pathway is sufficiently strong, the strong hypothesis is true.
The medium-strong version is motivated by this other pathway: More variety in required reasoning in a context --> more internal code for that context --> more calls to related functions --> shorter name.

If effects 1+2 have enough combined strength, but effect 1 alone does not, then the medium-strong hypothesis is true while the strong one is false.

Note that the particular scaling requirement I have set is a bit arbitrary -- the more open ended questions would be of the form "What is the scaling law for function name length as a function of usage in the pre-training data?".

Describing algorithms online can drastically change their description length

Suppose there is some algorithm $x (y (z), w)$ . (I am using $x, y, z, w$ to stand in for pointers and function names.) If each of these tokens has a description length of 15 bits, then the algorithm as a whole has a length of 60 bits. However, if the algorithm appears as a concept in the world model, then there will be a much shorter pointer to it. For instance, suppose that $x (y (z), w)$ is a well known algorithm called "xyz" which is often referenced by articles online. In the course of training, the AI will have to often predict text pertaining to xyz. At a minimum, it will know to complete the string "x(y(z)," to "x(y(z), w)". Furthermore, it will probably understand that "xyz" is an algorithm, and can reason about its properties. So we expect that there will be a concept "xyz" in the world model, which stores the literal algorithm and/or the plaintext, and any other properties of xyz which the AI has memorized/learned.

The implication of this is that there is now a much easier way for the algorithm $x (y (z), w)$ to be used inside the AI. Rather than implementing "x(y(z), w)", the AI can implement something like "execute xyz", where the world model already knows what "xyz" is. This statement will probably be shorter in bits than the literal "x(y(z), w)" -- even if "xyz" is part of the reference class "the 1 trillion most notable things", it's length will be only $l o g_{2} (10^{12}) \approx 40$ bits.

Given that things written online will end up in AI training sets (through common crawl, for instance), we are in the strange situation where simply writing out and/or describing an algorithm in a blog post could make it easier for an AI to point to internally, and thus more likely to be implemented. Furthermore, it turns out that some of the myopic algorithms actually win on complexity compared to their deceptively aligned competitors, and thus describing the deceptively aligned ones online could ruin this advantage. To reduce the probability of an issue like this, the rest of the analysis will be carried out in a google doc which is easily accessible to human readers but not to simple crawlers.

The rest of the analysis (including conclusions) is in the following google doc; to prevent crawling, I have rot13'd the link:

uggcf://qbpf.tbbtyr.pbz/qbphzrag/q/1GzbZMqVDcmYq9nQx2lxI_ogdPidx0rnD_OMvD_j70HR/rqvg?hfc=funevat

(Rot13 is a very simple cipher -- just go to https://rot13.com/ and enter the link text to decode it)

^{^}
There are many things we might mean by “harder” in this case, including k-complexity, serial runtime, total computation, and difficulty to find by gradient descent.
^{^}
“String powers” (strings which consist of n repetitions of a smaller string -- see this discussion) work a bit differently; however, the difference is small unless the base string is very short, so for strings of significant length this issue is rare).
^{^}
Technically it is not provable here through convergence arguments, since there can also be a hard cap on complexity. However, it seems natural that adding one bit of description length would cut the prior probability roughly in half.
^{^}
I am ignoring delimiters like parenthesis.
^{^}
Not exactly right; we are neglecting the fact that one of the token types was "raw data".

18