This post was written for the SERI MATS program. I thank Evan Hubinger and Leo Gao for their mentorship in the program. Further thanks go to Simon Marshall and Leo Gao (again) for specific comments regarding the content of this post.
The Eliciting Latent Knowledge (ELK) problem was first introduced by Paul Christiano, Mark Xu, and Ajeya Cotra. Arguments concerning the complexity of solving the problem and the resulting consequences of that have been made by Abram Demski and Leo Gao. This post aims to synthesize their thoughts in an accessible way, and extend them with my own original content. I assume familiarity with the initial ELK report, but not with subsequent posts.
Epistemic status: 90% confident that I am accurately representing the thoughts of Abram and Leo at the time they wrote their posts, 70% confident my own arguments do not contain major logical flaws.
We would like to solve the Eliciting Latent Knowledge (ELK) problem in the worst case scenario, or at least show that the task is practically impossible so that we can revise our expectations. One of the worst cases for ELK is when the direct reporter is arbitrarily computationally complex. As this case would still allow for approximating the direct reporter within certain tolerances, we extend it further to when the direct reporter and all other sufficiently close approximations are arbitrarily computationally complex. While Paul Christiano has suggested that the complexity of the predictor bounds the complexity of the reporter, we will show later that this only holds in a narrow formulation of the ELK problem.
Abram Demski argues in "ELK Computational Complexity: Three Levels of Difficulty" that if the direct reporter can be arbitrarily computationally complex, a training process to find it is intractable. To get around this issue, it is necessary to find an upper bound on computational complexity, conditional on the predictor, thus ruling out the worst case scenario. Leo Gao's post "Observations about ELK" focuses on a limited case of ELK and arrives at marginally more optimistic results, arguing that proposals which do not depend on the predictor are unable to find the direct translator in the worst case scenario, which rules out most current proposals but leaves open the possibility of a new approach that can somehow get around the issues.
This post summarizes and expands on Abram’s and Leo’s posts regarding the possibility and implications of arbitrarily computationally complex direct reporters. I make the case that, under relatively minor assumptions, it is possible to bound the computational complexity of answering any particular question that has an answer. However, I then go on to show that Leo’s argument against the tractability of proposals not depending on the predictor continues to apply even when computational complexity is bounded.
In the ELK report, it is assumed that both the human and the AI make predictions by doing inference in a Bayes net. While this may not be literally true, it is a useful approximation because it is often the case that there is an analogous Bayes net for however the agent is modeling the world. This is especially likely to hold true for sufficiently advanced agents, such as a human or an AI that can model the world at a superhuman level, so the Bayes net analogy is a safe one to make.
Once we acknowledge that doing inference in a Bayes net is only an analogy, we can stretch the analogy further. In particular, think of a human’s Bayes net as consisting exactly of a node for every question they could conceive to ask. The information contained at each of these question nodes is the human’s beliefs over the distribution of all possible answers.
This further abstraction from a standard Bayes net to a “question net” is done without loss of generality. Any information contained in a node of a standard Bayes net is the answer to some question, and so will be contained in the equivalent question net. Where the standard Bayes net model thinks of answering a question as applying some function to a subset of nodes, in a question net that question is instead made into a node directly downstream of those nodes relied on for answering it. This can be thought of as an application of factored cognition, where each question is decomposed into the questions needed to answer it, which become its parent nodes, until we eventually reach fundamental questions that require no background to answer. Given a question net, the act of answering a question is simply finding the node for that question and reading off the information within it.
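As a concrete (and purely illustrative) sketch, a question net can be represented as nodes whose parents are the sub-questions needed to answer them. The `QuestionNode` class and all the questions and answers below are hypothetical stand-ins, not part of any ELK proposal:

```python
# Minimal sketch of a "question net" (all names and values illustrative).
# Each node is a question; its parents are the sub-questions needed to
# answer it, and `combine` turns the parents' answers into its own.

class QuestionNode:
    def __init__(self, question, parents=(), combine=None, base_answer=None):
        self.question = question
        self.parents = list(parents)
        self.combine = combine          # function of the parents' answers
        self.base_answer = base_answer  # set only for base-level nodes

    def answer(self):
        if self.combine is None:
            return self.base_answer     # fundamental question: just read it off
        return self.combine(*[p.answer() for p in self.parents])

# Base-level questions the agent can answer directly:
diamond_pos = QuestionNode("Where is the diamond?", base_answer=(2, 3))
vault_bounds = QuestionNode("What are the vault's boundaries?",
                            base_answer=((0, 5), (0, 5)))  # (min, max) per axis

# A downstream question made into a node with those nodes as parents:
in_vault = QuestionNode(
    "Is the diamond in the vault?",
    parents=(diamond_pos, vault_bounds),
    combine=lambda pos, bounds: all(lo <= c <= hi
                                    for c, (lo, hi) in zip(pos, bounds)),
)
```

Answering a question is then just finding its node and recursing through parents until base-level nodes are reached; here `in_vault.answer()` returns `True`.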
The Bayes net for the predictor can be similarly transformed. Although the predictor may or may not be meaningfully said to “conceive” of questions in the way a human does, a question net can still be constructed that contains all the questions answered by information represented within the predictor. Many of these questions may not be of a form such that humans can understand either the question being asked or the potential answers. One question within the predictor’s question net gets a special status as the actual prediction made by the model.
Given a predictor’s question net, the reporter then builds its own question net on top of it, adding both downstream and base level nodes. These new nodes represent every question the reporter can conceive of, and since it needs to answer every question the human can ask it incorporates the human’s question net as a subset. The reporter's nodes would include questions that combine existing information from the predictor (e.g. if the predictor has nodes asking for the location of the diamond and the boundaries of the vault, the reporter adds a node asking if the diamond is in the vault), questions that could be derived without any information about the world (e.g. “Is eighty-three prime?”), and questions about information the reporter could know or guess but the predictor does not, such as those concerning human behavior (e.g. “How would a human answer the question ‘Is the diamond in the vault?’?”).
Modeling both the human and the reporter as having question nets gives natural definitions to the different types of reporters. The direct reporter responds to question X with the value of node X in the reporter’s question net, while the human simulator responds with the value of the node “How would a human answer X given the prediction?”. With this distinction, it also becomes easy to think about direct reporters and human simulators on a per-question basis, with some reporters potentially acting as either depending on the question asked. Both types of reporters would include nodes for both questions in their question nets, so it’s a matter of which answer actually gets reported in response to the question.
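To make the per-question distinction concrete, here is a toy sketch (all questions and answers are made up) in which the reporter holds both candidate nodes for the same question, and its type on that question is determined purely by which node it reads off:

```python
# Illustrative only: the reporter's question net contains both the direct
# answer and the "how would a human answer this?" answer for a question.
# In this toy state the diamond was stolen and replaced with a screen,
# so the direct answer is False while the fooled human would say True.

direct_answers = {"Is the diamond in the vault?": False}
human_answers  = {"Is the diamond in the vault?": True}

def report(question, as_direct_reporter):
    # Both nodes exist in the reporter's question net; the reporter
    # merely chooses which node's value to report.
    table = direct_answers if as_direct_reporter else human_answers
    return table[question]

q = "Is the diamond in the vault?"
```

The same reporter could act as a direct reporter on some questions and a human simulator on others just by varying this choice per node.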
Focusing on the human’s question net for a moment, what knowledge can the human be said to actually have? Due to the nature of the space of possible questions, their question net is infinitely deep, so although the information at any specific node could be determined in finite time, ensuring accurate information at all nodes would take forever. It would be useful to differentiate between questions depending on whether a human will always, sometimes, or never actually know the answer, even if it is possible in theory. Similarly, in the reporter’s question net, the nodes corresponding to the predictor have readily accessible information, while the nodes added by the reporter require further computation to answer. Out of all the answers to questions that the reporter could potentially compute, which ones do they actually “know”?
Consider three different levels of being able to answer a question:
At the first level, we know the answer instantaneously, and the only step necessary to answer the question is retrieving that information from within our brain. For example, I can answer that 2x4=8 and that Ottawa is the capital of Canada, without having to give the question any further thought. It requires effectively zero time or cognitive processing to access these facts. Questions at this level also include those that can be answered directly with sensory input, such as our field of vision. We do not need to be able to answer the question verbally, just within our mind.
The second level is when we can answer the question, but doing so requires some amount of thought. At this level, examples are answering that 26*37 = 962 or that there are eight US states whose names begin with “M”. Both of these facts required combining first level information using some algorithm, in a process that lasts between seconds and minutes.
Finally, the third level is when we can specify a process for arriving at the answer (or could specify a process eventually), but are unwilling or unable to use the time and cognitive resources to see that process through. Examples here include answering with the factorization of RSA-2048 or the contents of an encrypted document. Given sufficient time, I could use a known process to brute force the factorization or encryption, but doing so would take longer than my expected remaining lifespan.
Although the second and third level at least seem meaningfully different, it is a continuum rather than a dividing line. There is no clear demarcation point where multiplying sufficiently large numbers goes from the second to the third level. On top of this, the level at which a question can be answered is agent-dependent. A powerful AI might find the examples I gave of the third level to be questions of the second, or even first level. What we would like is a definition of knowledge that differentiates questions by how much work is required to access them, while acknowledging the continuum involved.
Abram Demski’s definition of L-knowledge is the perfect tool to split these categories apart. An agent is said to have L-knowledge of the answer to a question if there exists a program it can run to determine the correct answer, with the program’s length plus the log of the time to run it being less than L. This program has to answer the question correctly in all states, to rule out fast and simple programs such as “always output true”, which cannot be meaningfully said to be knowledge even if they are sometimes correct. This definition could easily be modified to account for probabilistic knowledge, where the agent only needs to assign sufficiently high probability to the correct answer, but that would complicate this post for no real conceptual gain. Abram’s definition of L-knowledge also has the program output a flag for whether it is even possible to answer the question, but we will simplify by focusing only on questions that can be answered.
As an example, if there exists a program with a length of six units and which takes eight units of time to determine the answer to “Is there a screen in front of the camera”, then an agent who is capable of running the program can be said to have 9-knowledge of the answer (using log base 2). This also implies L-knowledge for L greater than 9, such as 10-knowledge and 31-knowledge. However, it does not imply that the agent does not have 8-knowledge of the answer — there may be some shorter or faster program to answer the same question. Rather, the existence of a program gives an upper bound on the L-knowledge of the answer. We thus define the L-score of a question to be the lowest L-value where an agent has L-knowledge of that question’s answer.
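The bookkeeping above is simple enough to sketch directly. This minimal example assumes log base 2 and treats program length and run time as abstract units, as in the text:

```python
import math

# Sketch of the L-knowledge bookkeeping: a program of length `length`
# that runs in `time` steps witnesses L-knowledge for
# L = length + log2(time), and for every larger L.

def l_value(length, time):
    return length + math.log2(time)

def has_l_knowledge(L, length, time):
    # A witness program only gives an upper bound: it implies
    # L-knowledge for every L at or above its own l_value, and says
    # nothing about smaller L (a shorter or faster program may exist).
    return l_value(length, time) <= L

# The example from the text: length 6, running time 8 -> 9-knowledge.
score = l_value(6, 8)
```

The L-score of a question is then the infimum of `l_value` over all correct witness programs, which is generally unknown; any particular program only bounds it from above.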
Using the specific equation used to determine L-knowledge, program length plus log time, is not particularly consequential. It suggests a particular tradeoff between program length and running time, but other equations would also have the desired properties, as long as they are increasing in both of the variables. The important takeaway is that an agent can be said to have specific levels of knowledge about the answers to various questions depending on the difficulty of accessing that information.
One wrench in the gears of the definition of an L-score is that the ease of determining the answer to a question often depends on which other questions have already been answered. The sum of all prime numbers under 100,000 is much easier to calculate if you already know the sum of all prime numbers under 99,000, and trivial if you know the sum of all prime numbers under 99,999. How much background knowledge should be assumed when determining if an agent has L-knowledge of an answer?
We can start with the questions for which knowing other answers would not help and no calculations need to be done, only information retrieval. For a human, these are the questions with an immediate answer, such as the previous examples given of “What is 2 times 4?” and “What is the capital of Canada?”. For the reporter, these are the questions that can be answered directly within the predictor. Without quibbling over what exactly is the minimal program length and runtime to retrieve stored information, we will call the answers to these questions base-knowledge.
With base-knowledge being essentially free to access, it can be considered the minimum background knowledge for determining the L-score of answers to questions deeper in the net. We then update the definitions of L-knowledge and L-score so that the program may assume base-knowledge for free, with its length plus the log of its run time still required to fall below L. This is the definition used for the rest of this post.
In addition to this updated definition of L-score, it can be helpful to think of what the L-score of a question would be if all its parent nodes in the question net were base-knowledge. That is, how hard is it to answer a question with all the requisite background knowledge? We will call this hypothetical L-score the parent-L-score, and assign base-knowledge a parent-L-score of zero.
Ideally, we would like to calculate the L-score recursively: the L-score of a question would be a function of its parent-L-score and the L-scores of its parent questions, with those in turn calculated the same way. This turns out to be rather awkward, as just summing these inputs will severely overestimate the true L-score. Even with some tricks to increase efficiency, the best we can do is an upper bound on the L-score, but fortunately this is all we need for our purposes.
The L-score of a question is bounded by the sum of the parent-L-scores of all its ancestors. There may be easier ways to answer a question than determining all possibly relevant background information first, but calculating the answers to parent questions recursively down to base-knowledge is a guaranteed way to answer it. In fact, even this approach would give a lower L-score than the sum of parent-L-scores, as it would take the log of the sum of running times rather than the sum of the logs.
Tracking program length and run time separately and calculating the L-score at the end sharpens the bound somewhat, but it remains loose. The tradeoff between program length and run time will not be jointly optimized between the various levels. In addition, the recursion will ignore efficiency gains from reusing calculations for multiple nodes, and from cases where answering a question only ever requires the answers to a strict subset of parent questions. Still, the important thing is not how tight this bound is, but rather that a bound on the L-score exists at all.
Determining the exact L-score of a question without knowing how to answer the question can be an impossible task. Recursively checking parent-L-scores until reaching base-knowledge gives a loose upper bound rather than an exact answer, but greatly increases the tractability.
Bounding the L-score for a particular question does not require a model of the question net. Rather, starting from the question, all that is needed is factored cognition. Identify a set of sub-questions whose answers give sufficient background knowledge to answer the main question (not necessarily the parent questions in the question net), and define a program for answering the main question based on those answers. Then recurse, doing the same to answer each of the sub-questions, until all sub-questions are answered with base-knowledge. When this process ends, add up the lengths and running times of all defined programs and use them to calculate an upper bound on the main question’s L-score. Depending on which question is being answered, this process is potentially something that could be run entirely by humans. Neither the programs nor the choice of sub-questions needs to be efficient; missing opportunities for improvement only loosens the upper bound.
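A rough sketch of this recursive bounding procedure, under the assumption that each question comes with an illustrative (program length, run time, sub-questions) decomposition; all the numbers and question names here are made up. Following the earlier observation, it tracks lengths and run times separately and takes the log only at the end:

```python
import math

# Each question maps to (program_length, run_time, sub_questions), where
# the program answers the question given its sub-questions' answers.
# All figures are illustrative placeholders.
decomposition = {
    "is_diamond_in_vault": (4, 16, ["diamond_location", "vault_boundaries"]),
    "diamond_location": (2, 4, []),    # answered from base-knowledge
    "vault_boundaries": (2, 4, []),
}

def l_score_bound(question):
    # Sum program lengths and run times separately over the whole
    # recursion, then take the log of the total time at the end; this
    # is tighter than summing length + log(time) level by level.
    def totals(q):
        length, time, subs = decomposition[q]
        for s in subs:
            sub_length, sub_time = totals(s)
            length += sub_length
            time += sub_time
        return length, time

    length, time = totals(question)
    return length + math.log2(time)
```

For the toy decomposition above, the bound is 8 + log2(24), roughly 12.6; a cleverer program for the main question could only lower the true L-score, never raise it.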
The upper bound on computational complexity for a human to answer a question also serves as the upper bound for a reporter to answer the same question. This is because each node in the human question net is necessarily a node in the reporter question net, since the reporter can answer questions the human asks it. In addition, we assume that the base-knowledge of the reporter (which is the information in the predictor) is greater than the base-knowledge of the human, so less computation is needed to reach downstream nodes. Although this assumption may seem strict, it will not be a big deal if it does not strictly hold, as long as the reporter's L-score for questions where the human has base-knowledge is low.
This upper bound on the L-score for a question is exactly what we are looking for. This bound also works as a bound on the complexity of a reporter that is only required to be a direct reporter for that question, since all that reporter needs to do is calculate the answer and then provide it. This can be easily extended to bounding the complexity of a reporter required to be a direct reporter on any finite subset of questions.
Abram notes at the beginning of his post that if the direct translator is arbitrarily computationally complex, ELK is intractable. While the justification for this statement is not explained, I would guess that his reasoning is straightforward. Any given training process will only search over models up to a certain level of complexity, so if the direct translator is above that level it cannot be the output of the training process.
Paul argues in the comments of Abram's post that the direct translator cannot be arbitrarily complex relative to the predictor (which has a finite complexity). His logic is that the reporter’s complexity is bounded by recreating the predictor from scratch, so that the reporter knows everything the predictor knows, plus a constant term depending on the human’s model so that it can interact with the human appropriately.
Unfortunately, I believe this argument will only hold for extracting information explicitly represented within the predictor, which falls short of the cases we care about. For example, the predictor might have representations of “Where is the diamond located?” and “What are the boundaries of the vault?”. For the purposes of ELK, we would like to say that the predictor knows the answer to “Is the diamond inside the vault?”, but for the reporter to answer that question requires it to do an additional calculation. These additional calculations to answer questions can get arbitrarily complex, plus there are arbitrarily many of them. Restricting the reporter to exactly the information in the predictor could very well make it useless, but not restricting it allows it to be arbitrarily complex.
Paul might respond to this criticism with the definition of knowledge from the ELK report: the predictor knows some information if it acts differently based on that information to obtain a lower loss. This definition would limit the amount the predictor could know, allowing the reporter to answer all necessary questions using that knowledge plus its constant-complexity human model.
I’m skeptical of relying on a specific definition of knowledge to bound the direct reporter’s complexity, and there are major issues with this definition. For starters, this definition could imply the predictor has more knowledge than we think of it having, as it may reduce its loss probabilistically, using heuristics, or based on assumptions. More importantly, the definition means that the predictor doesn’t know things even if they are represented in its model, unless it uses them to obtain a lower loss. That would mean, for example, that once the predictor predicts a screen in front of the vault’s camera, it doesn’t “know” anything going on behind it because that doesn’t affect its loss function. The whole point of ELK is that the predictor has latent knowledge beyond the loss-determining prediction it makes, and a definition of knowledge that incorporates that re-opens the possibility for knowledge of arbitrary complexity.
So, if ELK is intractable when the direct translator is arbitrarily computationally complex, what changes if we know that we can bound the complexity of answering any particular question? Well, nothing. Bounding the complexity of any question doesn’t matter if there’s always a more complex question the direct translator is required to answer. Even if it were possible to bound the maximum possible complexity of answering a question (or restrict the direct reporter to only answering questions below some complexity threshold), the direct reporter would still be arbitrarily complex unless we could also bound the number of possible questions.
What a bound on the complexity of answering particular questions does give us is the possibility of a good enough solution to the ELK problem. We can’t find a direct reporter that answers all questions honestly. We can’t even find a direct reporter that answers all questions below some complexity threshold honestly. What we can do is specify some finite sample of questions, and find a direct reporter that answers those honestly. The bound won’t give us this limited direct reporter by itself; it just allows for the possibility of finding it through a clever training process defined by other ELK proposals.
Answering only a limited number of questions may be totally fine, depending on the application. For example, with the SmartVault, we are primarily concerned about answering “Is the diamond in the vault”, and may be able to clear up the edge cases with a few more questions. In practice, for any application, humans would only ask some finite number of questions to the reporter. If we can cleverly anticipate what these will be in advance, we could then find a reporter that gives identical answers to the direct reporter over the set of questions it will be asked.
The following is based on the argument in Leo Gao's "Observations about ELK" post:
One way to think about the training process for a model is to abstract it as a prior over possible models plus a dataset. The training process updates the prior based on the dataset, and selects the model with the highest posterior probability. This abstraction does not necessarily tell us how to actually implement a training process, but it can allow us to show that the training process has certain properties or limits.
We can alternatively think about the abstraction in reverse, as a dataset plus a posterior (conditional on the dataset). First, the dataset eliminates all models that are incompatible with the data, then the posterior chooses from among the models that were not eliminated. Whichever model has the highest posterior probability assigned will be chosen, so without loss of generality we can restrict the posterior to degenerate distributions that place full probability on a single model.
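This reverse view is easy to sketch with toy stand-ins (the models, prior values, and dataset below are all made up): first eliminate models inconsistent with the data, then take the prior's argmax among the survivors:

```python
# Toy sketch of the "dataset then posterior" view of training.

def train(prior, dataset):
    # prior: dict mapping model (a function) -> prior probability
    # dataset: list of (input, answer) pairs
    # Step 1: the dataset eliminates all models incompatible with it.
    consistent = {m: p for m, p in prior.items()
                  if all(m(x) == y for x, y in dataset)}
    # Step 2: the argmax of the posterior over consistent models is just
    # the argmax of the prior restricted to them.
    return max(consistent, key=consistent.get)

def always_yes(x):
    return True

def parity(x):
    return x % 2 == 0

prior = {always_yes: 0.6, parity: 0.4}
dataset = [(2, True), (3, False)]
chosen = train(prior, dataset)
```

Here `always_yes` is favored by the prior but contradicted by the data point `(3, False)`, so `parity` is chosen; this is the sense in which the posterior can be restricted to a degenerate distribution on a single surviving model.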
Let us focus on the ELK case where we only care about training the reporter to answer a single question honestly (such as “Is the diamond in the vault?”) and the predictor is specified outside the training process. Given the predictor, we choose a training process as a prior over reporters and a set of action-answer pairs, where the answer part is the answer to the desired question after the action (or, without loss of generality, a series of actions) occurs. We will also assume that there exists some default state before actions are taken that can be restored between data points.
The consequence of an action is to change the sensory inputs to an agent, which are included as part of base-knowledge. Once that update occurs, an agent can invest computational resources to determine the answer to further downstream questions.
The dataset is provided to the training process by a human, which gives it two limitations: structural and computational. Structurally, actions that affect a human’s base-knowledge in the same way are indistinguishable from each other, which limits the possible actions for the action-answer pairs. Computationally, for each action the human needs to determine the correct answer, which requires computational resources (though there may be efficiency gains in calculating the answers for multiple actions jointly). Having limited resources bounds the number of action-answer pairs that can be provided.
If the complexity of the direct translator is not bounded, the prior could assign positive probability to infinitely many models. If it does, the probability assigned has to be declining on average with complexity so that the total probability sums to one. This means that an arbitrarily complex direct translator has arbitrarily low probability assigned to it by the prior. Updating the prior based on data will then only give the direct translator if the data eliminates the arbitrarily many models assigned higher probability. But guaranteeing this result would require knowing which model is the direct translator, in which case why not assign it a probability of one in the prior?
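A toy numerical illustration of why the prior must decline, using the purely illustrative prior p(k) = 2^(-k) over models indexed by complexity k = 1, 2, 3, ...:

```python
# Any prior over infinitely many models must assign probabilities that
# decline on average. Under the illustrative prior p(k) = 2**-k, the
# total mass sums to 1, yet the mass on any sufficiently complex model
# is arbitrarily small.

def prior_mass(k):
    return 2.0 ** -k

# Partial sums approach 1 as more models are included:
total = sum(prior_mass(k) for k in range(1, 60))
```

Under such a prior, a direct translator of complexity 100 starts with mass below 10^-30, so the data would have to rule out essentially all of the simpler models before it could be selected.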
In the reverse ordered abstraction of the training process, with unbounded complexity there are infinite possible models. Removing models inconsistent with the training data still leaves infinite possible models from which the direct translator must be picked out. While this is not strictly impossible, it requires knowing what the direct translator looks like, at least down to a class where all but one is inconsistent with the data.
The final wrench in the gears here is that which model is the direct translator depends on the predictor. Two predictors could make the same prediction but have different latent knowledge about the world state, for example one knowing that the diamond was stolen and replaced with a screen while the other does not. A model which worked as a direct translator for the second predictor would be a human simulator for the first. More generally, unless there is only a single direct translator that works for all possible predictors, two models which are direct translators for different predictors may give the same answers on all points in the training data provided by the human.
What this means is that picking out the direct translator from all models consistent with the data must depend on the predictor. Otherwise, if the same training process is used for all predictors, it could give the human simulator on some even while giving the direct translator for others. The majority of proposals included in the initial ELK report and prize results do not meet this criterion, such as those that add a regularization term to the search process based on the reporter’s structure.
We can upper bound the complexity of the direct translator through the process laid out in the previous section: factoring the question we wish to answer, then summing the complexity of answering each sub-question conditional on the answers to its factors. Bounded complexity means that only finitely many models can receive positive probability in the prior, so the prior no longer needs to be decreasing on average with complexity.
With a bound on the direct translator’s complexity, a possible training process for finding it suggests itself. First, with only finitely many possible models, we provide a prior and enough data points to ensure that all models besides the direct translator and human simulator are ruled out. This assumes that every model outside those two classes and below the complexity threshold will either make some prediction that contradicts the data, or can be rejected by inspection. Next, increase the complexity of the human simulator by amplifying the human to make the human simulation process more complicated, until the lower bound on the complexity of the human simulator is outside the search space. At that point neither the human simulator nor any model besides the direct translator remains in the search space, so the direct translator must be chosen.
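The two-step process can be caricatured as follows. Every model, complexity value, and data point is a made-up stand-in, and "amplification" is modeled as simply adding to the human simulator's complexity:

```python
# Toy caricature of the two-step selection process described above.

def select_reporter(models, dataset, translator_bound, amplification):
    survivors = []
    for m in models:
        # Amplifying the human raises the human simulator's complexity,
        # eventually pushing it out of the bounded search space.
        complexity = m["complexity"]
        if m["kind"] == "human_simulator":
            complexity += amplification
        if complexity > translator_bound:
            continue  # outside the bounded search space
        if not all(m["answer"](x) == y for x, y in dataset):
            continue  # step 1: eliminated by the training data
        survivors.append(m)
    return survivors

models = [
    {"kind": "direct_translator", "complexity": 10,
     "answer": lambda x: x % 2 == 0},
    {"kind": "human_simulator", "complexity": 5,
     "answer": lambda x: x % 2 == 0},   # matches the data by construction
    {"kind": "other", "complexity": 3,
     "answer": lambda x: True},         # contradicted by the data
]
dataset = [(1, False), (2, True)]
survivors = select_reporter(models, dataset,
                            translator_bound=12, amplification=10)
```

In this toy setup the "other" model is eliminated by the data and amplification pushes the human simulator past the bound, leaving only the direct translator; the issues raised next are about whether either step can actually be guaranteed.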
I believe there are issues with both of these steps. The first assumes that it is possible to eliminate all models besides the direct translator and human simulator using only identifiable properties and data provided by a human, but this seems unlikely. For example, there are models that match the answers of both the direct translator and the human simulator after simple actions, but that, after complicated actions for which the human cannot provide an answer in the training data, give an answer that is simply incorrect (neither the direct translation nor what a human thinks happened).
In the second step, it may not be plausible to make the human’s decision process arbitrarily complex through amplification. Doing so requires either expanding the human’s Bayes/question net (a.k.a. “doing science”), or running more complex calculations within it, in a way that actually impacts the answer. Even if this could be done, it would be impossible to know how much complexity is added to the lower bound of the human simulator, since the human simulator might be able to run the same calculations in a more efficient way. It also raises the question, if the human’s process for answering the question is more complex than the direct translator, why do we need a reporter at all?
If this process for identifying the direct reporter under bounded complexity cannot be shown to work, then the main result from the unbounded case holds. The process of choosing a prior or selecting the direct translator from among models not contradicting the data must depend on the predictor, which rules out many current proposals for finding the direct reporter. Although bounding the complexity of the direct reporter does not get around this requirement, it may greatly increase the tractability of meeting it. Rather than using information about the predictor to select the direct reporter out of infinitely many options, it is only necessary to select it from a finite set of alternatives. This is certainly still a challenge, but choosing from a limited number of options allows for many approaches that are useless when facing unlimited options. I am optimistic that progress will be made on using aspects of the predictor to determine the direct translator, and that when it happens bounding complexity will be an important step.
To reiterate, the main points of this post are the following (if nothing else, let these be the takeaways you remember months from now):

- Modeling agents as having question nets, and using Abram's definition of L-knowledge, we can upper bound the computational complexity of answering any particular question via factored cognition.
- This does not bound the complexity of the full direct reporter, but it opens the possibility of finding a reporter that honestly answers a finite set of anticipated questions.
- Even with bounded complexity, Leo's argument still applies: a training process that does not depend on the predictor cannot reliably select the direct translator over the human simulator.
(Regarding the earlier simplification to answerable questions: alternatively, we could get around this by saying the answer to a question with no answer is "The question has no answer".)