Financial status: This is independent research. I welcome financial support to make further posts like this possible.

Epistemic status: These ideas are still being developed.


  • I am interested in recognizing entities that might exert significant power over the future.

  • My current hypothesis is that knowledge of one’s environment is a prerequisite to power over one’s environment.

  • I would therefore like a good definition of what it means for an entity to accumulate knowledge over time.

  • However, I have not found a good definition for the accumulation of knowledge. In this sequence I describe the definitions I’ve tried and the counterexamples that I’ve come up against.


The entities that currently exert greatest influence over the future of our planet — humans — seem to do so in part by acquiring an understanding of their environment, then using that understanding to select actions that are likely to achieve a goal. Humans accumulate knowledge in this way as individuals, and are also able to share this understanding with others, which has led to the accumulation of cultural knowledge over time. This has allowed humankind to exert significant influence over the future.

More generally, life forms on this planet are distinguished from non-life in part by the accumulation of genetic knowledge over time. This knowledge is accumulated in such a way that the organisms it gives rise to have a capacity for goal-directed action that is optimized for features of the environment that have been discovered by the process of natural selection and are encoded into the genome.

Even though genetic knowledge accumulates over many lifetimes and cognitive knowledge accumulates during a single lifetime, for our present purposes there is no particular need to distinguish "outer knowledge accumulation" from "inner knowledge accumulation" as we do when distinguishing outer optimization from inner optimization in machine learning. Instead, there are simply processes in the world that accumulate knowledge and which we recognize by the capacity this knowledge conveys for effective goal-directed action. Examples of such processes are natural selection, culture, and cognition.

In AI alignment, we seek to build machines that have a capacity for effective goal-directed action, and that use that capacity in a way that is beneficial to all life. We would particularly like to avoid building machines that do have a capacity for effective goal-directed action, but do not use that capacity in a way that is beneficial to all life. At an extreme minimum, we would like to have a theory of effective goal-directed action that allows us to recognize the extent to which our creations have the capacity to influence the future, so that we might make informed choices about whether to deploy them into the world.

The detection of entities that have a greater-than-expected capacity to influence the future is particularly relevant in the context of the prosaic AI regime, in which contemporary machine learning systems eventually produce entities with a capacity for effective goal-directed action that exceeds that of human society, without any new insights into the fundamental nature of intelligence or autonomy. In this regime, large-scale search processes working mostly by black-box optimization eventually produce very powerful policies, and we have relatively little understanding of how these policies work internally, so there is a risk that we deploy policies that exert greater influence over the future than we expect.

If we had a robust theory of the accumulation of knowledge, we might be able to determine whether a policy produced in such a way has the capacity to accumulate unexpectedly detailed knowledge about itself or its environment, such as a robot vacuum that unexpectedly accumulates knowledge about the behavior of its human cohabitants. Alternatively, with such a theory we might be able to detect the "in-flight" accumulation of unexpected knowledge after deploying a policy, and shut it down. Or we might be able to limit the accumulation of knowledge by deployed entities as a way to limit the power of those entities.

Understanding the accumulation of knowledge could be particularly helpful in dealing with policies that, while being trained, come to understand the training process in which they are embedded and then produce outputs selected to convince the overseers of that process to deploy them into the external world ("deceptive alignment" in the terminology of Hubinger et al.). In order to behave in such a deceptive way, a policy would first need to accumulate knowledge about the training process in which it is embedded. Interrogating a policy about its knowledge using its standard input and output channels won’t work if we are concerned that our policies are deliberately deceiving us, but recognizing and perhaps limiting the accumulation of knowledge at the level of mechanism might help to detect or avoid deception.

Interestingly, in a world where we do not get prosaic AI but instead are forced to develop new deep insights into the nature of intelligence before we can build machines with the capacity for highly effective goal-directed action, investigating the accumulation of knowledge might also be fruitful. Among processes that converge towards a small set of target configurations despite perturbations along the way (say, a ball rolling down a hill, a computer computing the square root of two by gradient descent, and a team of humans building a house), it is only the team of humans building a house that does so in a way that involves the accumulation of knowledge. It might be that the central difference between systems that exhibit broad "optimizing" behavior, and the subset of those systems that do so due to the agency of an entity embedded within them, is the accumulation of knowledge. Furthermore, we might be able to understand the accumulation of knowledge without reference to the problematic agent model in which the agent and environment are separated, and the agent is assumed to behave according to an immutable internal decision algorithm.
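To make the second of these examples concrete, here is a minimal sketch of computing the square root of two by gradient descent, with a perturbation injected partway through. The loss function, step size, starting point, and perturbation are all illustrative assumptions rather than anything specified above.

```python
def sqrt_two_by_gradient_descent(steps=200, learning_rate=0.05, perturb_at=50):
    """Minimise f(x) = (x**2 - 2)**2 by plain gradient descent.

    All parameter values here are illustrative assumptions.
    """
    x = 1.0  # arbitrary positive starting guess
    for step in range(steps):
        if step == perturb_at:
            x += 0.5  # an external perturbation knocks the state off course
        gradient = 4.0 * x * (x * x - 2.0)  # derivative of (x**2 - 2)**2
        x -= learning_rate * gradient
    return x


estimate = sqrt_two_by_gradient_descent()
```

The process returns to the same target after being disturbed, so it counts as converging despite perturbations, yet the only thing it carries from step to step is a single number being pulled toward a fixed point.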

In summary, investigating the accumulation of knowledge could be a promising line of attack on both the problem of understanding agency without presupposing a dualistic agent model and the problem of detecting dangerous patterns of cognition in agents engineered via large-scale search processes. The key question seems to be: is knowledge real? Is knowledge a fundamental aspect of all systems that have the capacity for effective goal-directed action, or is it a fuzzy intermediate quantity acquired by some intelligent systems and not others?

This sequence, unfortunately, does not give any final answers to these questions. The next four posts will explore four failed definitions of the accumulation of knowledge and go over counterexamples to each one.

Problem statement

Suppose I show you a physically closed system — say, for the sake of concreteness, a shipping container with various animals and plants and computer systems moving about and doing things inside — and tell you that knowledge is accumulating within a certain physical region within the system. What does this mean, at the level of physics?

Or suppose that I show you a cellular automaton (say, a snapshot of Conway’s Game of Life) and I point to a region within the overall game state and claim that knowledge is accumulating within this region. Without any a priori knowledge of the encoding of this hypothesized knowledge, nor of the boundary between any hypothesized agent and environment, nor of the mechanism by which any hypothesized computation is happening, can you test my claim?

Or even more abstractly, if I show you a state vector evolving from one time step to the next according to a transition function and I claim that knowledge is accumulating within some particular subset of the dimensions of this state vector, can you say what it would mean for my claim to be true?
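As a concrete picture of the objects in this question, here is a minimal sketch of a state vector, a transition function, and a region given as a subset of dimensions, using Conway's Game of Life on a small torus. The names `life_step`, `restrict`, and `trajectory`, the toroidal boundary, and the grid size are assumptions made for illustration. The sketch only sets up the things a definition would have to talk about; it does not itself say whether knowledge is accumulating.

```python
from typing import List, Sequence, Tuple

State = Tuple[int, ...]  # flat state vector; each entry is 0 (dead) or 1 (alive)


def life_step(state: State, width: int, height: int) -> State:
    """Transition function: one step of Conway's Game of Life on a torus."""

    def cell(x: int, y: int) -> int:
        # Wrap coordinates so the grid has no edges (an assumption of this sketch).
        return state[(y % height) * width + (x % width)]

    nxt = []
    for y in range(height):
        for x in range(width):
            neighbours = sum(
                cell(x + dx, y + dy)
                for dx in (-1, 0, 1)
                for dy in (-1, 0, 1)
                if (dx, dy) != (0, 0)
            )
            alive = cell(x, y) == 1
            nxt.append(1 if neighbours == 3 or (alive and neighbours == 2) else 0)
    return tuple(nxt)


def restrict(state: State, dims: Sequence[int]) -> State:
    """Project the state vector onto the dimensions that make up a region."""
    return tuple(state[i] for i in dims)


def trajectory(state: State, width: int, height: int, steps: int) -> List[State]:
    """Return the initial state followed by `steps` successor states."""
    states = [state]
    for _ in range(steps):
        states.append(life_step(states[-1], width, height))
    return states


# A 5x5 grid holding a horizontal "blinker" in the middle row.
WIDTH, HEIGHT = 5, 5
blinker = tuple(1 if i in (11, 12, 13) else 0 for i in range(WIDTH * HEIGHT))
history = trajectory(blinker, WIDTH, HEIGHT, 2)

# The claimed region: here, just the middle column of the grid.
middle_column = [2, 7, 12, 17, 22]
region_history = [restrict(s, middle_column) for s in history]
```

Any candidate definition would then be a predicate on something like `region_history` together with the full `history`, with no further hints about encodings or agent boundaries.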

I have been seeking a definition of knowledge as a correspondence between the configuration of a region and the configuration of the overall system, but I have not found a satisfying definition. In this sequence I will describe the attempts I've made and the challenges that I've come up against.

What a definition should accomplish

The desiderata that I’ve been working with are as follows. I’ve chosen these based on the AI-related motivations described above.

  • A definition should provide necessary and sufficient conditions for the accumulation of knowledge such that any entity that exerts goal-directed influence over the future must accumulate knowledge according to the definition.

  • A definition should be expressed at the level of physics, which means that it should address what it means for knowledge to accumulate within a given spatial region, without presupposing any particular structure to the system inside or outside of that region.

  • In particular there should not be reference to "agent" or "computer" as ontologically fundamental concepts within the definition. However, a definition of the accumulation of knowledge might include sub-definitions of "agent" or "computer", and of course it’s fine to use humans, robots and digital computers as examples and counterexamples.

The following are non-goals:

  • Practical means for detecting the accumulation of knowledge in a system.

  • Practical means for limiting the accumulation of knowledge in a system.


The failed definitions of the accumulation of knowledge that I will explore in the ensuing posts in this sequence are as follows. I will be posting one per day this week.

Direct map/territory resemblance

Attempted definition: Knowledge is accumulating whenever a region within the territory bears closer and closer resemblance to the overall territory over time, such as when drawing a physical map with markings that correspond to the locations of objects in the world.

Problem: Maps might be represented in non-trivial ways that make it impossible to recognize a map/territory resemblance when examining the system at a single point in time, such as a map that is represented within computer memory rather than on a physical sheet of paper.

Mutual information between region and environment

Attempted definition: Knowledge is accumulating whenever a region within the territory and the remainder of the territory are increasing in mutual information over time.

Problem: The constant interaction between nearby physical objects means that even a rock orbiting the Earth is acquiring enormous mutual information with the affairs of humans due to the imprinting of subatomic information onto the surface of the rock by photons bouncing off the Earth, yet this does not constitute knowledge.
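The quantity in this attempted definition can be estimated from samples. Below is a minimal sketch, assuming we can observe many paired configurations of the region and of the rest of the system; the function name `mutual_information` and the toy data are hypothetical. It uses the standard plug-in estimate of I(X; Y) in bits.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    total = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
        total += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return total


# A region that passively mirrors one bit of its environment (like the rock).
passive_copy = [(0, 0), (1, 1)] * 50
# A region whose state is unrelated to its environment.
unrelated = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

copy_bits = mutual_information(passive_copy)
unrelated_bits = mutual_information(unrelated)
```

The passive copy scores a full bit even though nothing in the region could act on what it has recorded, which is the shape of the problem described above.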

Mutual information over digital abstraction layers

Attempted definition: Knowledge is accumulating whenever a digital abstraction layer exists and there is an increase over time in mutual information between its high-level and low-level configurations. A digital abstraction layer is a grouping of low-level configurations into high-level configurations such that transitions between high-level configurations are predictable without knowing the low-level configurations.

Problem: A digital computer that is merely recording everything it observes is acquiring more knowledge, on this definition, than a human who cannot recall their observations but can construct models and act on them.
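The notion of a digital abstraction layer used in this definition can be sketched for a small deterministic system. The code below checks whether a proposed grouping of low-level states yields high-level transitions that are predictable without looking at the low-level state. The function name, the toy "voltage" system, and the restriction to deterministic dynamics are assumptions for illustration, and the sketch covers only the abstraction-layer half of the definition, not the mutual-information half.

```python
def induced_transitions(low_states, step, group):
    """Return the high-level transition map induced by `group`, or None.

    `step` maps a low-level state to its successor; `group` maps a low-level
    state to its high-level label. The grouping is a valid abstraction layer
    (in this deterministic toy sense) only if every low-level state with the
    same label leads to the same successor label.
    """
    high = {}
    for s in low_states:
        src, dst = group(s), group(step(s))
        if high.setdefault(src, dst) != dst:
            return None  # same label, different successor labels
    return high


# Toy low-level system: a "voltage" in 0..99 that is inverted at each step.
voltages = range(100)
invert = lambda v: 99 - v

# Thresholding at 50 gives a clean one-bit layer that behaves like a NOT gate.
good_layer = induced_transitions(voltages, invert, lambda v: v >= 50)

# Thresholding at 30 does not: "high" voltages can go to either label.
bad_layer = induced_transitions(voltages, invert, lambda v: v >= 30)
```

Here `good_layer` is a complete high-level transition table, while `bad_layer` is `None` because predicting the next label would require knowing the exact voltage.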

Precipitation of action

Attempted definition: Knowledge is accumulating when an entity’s actions are becoming increasingly fine-tuned to a particular configuration of the environment over time.

Problem: A sailing ship that is drawing a map of a coastline but sinks before the map is ever used by anyone to take action would not be accumulating knowledge by this definition, yet does in fact seem to be accumulating knowledge.

Literature review

The final post in the sequence reviews some of the philosophical literature on the subject of defining knowledge, as well as a few related posts here on LessWrong.


Planned summary for the Alignment Newsletter:

Probability theory can tell us about how we ought to build agents that have knowledge (start with a prior, and perform Bayesian updates as evidence comes in). However, this is not the only way to create knowledge: for example, humans are not ideal Bayesian reasoners. As part of our quest to _describe_ existing agents (see "Theory of Ideal Agents, or of Existing Agents?"), could we have a theory of knowledge that specifies when a particular physical region within a closed system is “creating knowledge”? We want a theory that works in the Game of Life (see "Agency in Conway’s Game of Life") as well as the real world.

This sequence investigates this question from the perspective of defining the accumulation of knowledge as increasing correspondence between a map and the territory, and concludes that such definitions are not tenable. In particular, it considers four possibilities, and demonstrates counterexamples to all of them:

1. Direct map-territory resemblance: Here, we say that knowledge accumulates in a physical region of space (the “map”) if that region of space looks more like the full system (the “territory”) over time.

Problem: This definition fails to account for cases of knowledge where the map is represented in a very different way that doesn’t resemble the territory, such as when a map is represented by a sequence of zeros and ones in a computer.

2. Map-territory mutual information: Instead of looking at direct resemblance, we can ask whether there is increasing mutual information between the supposed map and the territory it is meant to represent.

Problem: In the real world, nearly _every_ region of space will have high mutual information with the rest of the world. For example, by this definition, a rock accumulates lots of knowledge as photons incident on its face affect the properties of specific electrons in the rock, giving it lots of information.

3. Mutual information of an abstraction layer: An abstraction layer is a grouping of low-level configurations into high-level configurations such that transitions between high-level configurations are predictable without knowing the low-level configurations. For example, the zeros and ones in a computer are the high-level configurations of a digital abstraction layer over low-level physics. Knowledge accumulates in a region of space if that space has a digital abstraction layer, and the high-level configurations of the map have increasing mutual information with the low-level configurations of the territory.

Problem: A video camera that constantly records would accumulate much more knowledge by this definition than a human, even though the human is much more able to construct models and act on them.

4. Precipitation of action: The problem with our previous definitions is that they don’t require the knowledge to be _useful_. So perhaps we can instead say that knowledge is accumulating when it is being used to take action. To make this mechanistic, we say that knowledge accumulates when an entity’s actions become more fine-tuned to a specific environment configuration over time. (Intuitively, the entity has learned more about the environment, and so can condition its actions on that knowledge, which it previously could not do.)

Problem: This definition requires the knowledge to actually be used to count as knowledge. However, if someone makes a map of a coastline, but that map is never used (perhaps it is quickly destroyed), it seems wrong to say that during the map-making process knowledge was not accumulating. 

This was a great summary, thx.

Your summaries are excellent Rohin. This looks good to me.

I think that part of the problem is that talking about knowledge requires adopting an interpretative frame. We can only really say whether a collection of particles represents some particular knowledge from within such a frame, although it would be possible to determine the frame of minimum complexity that interprets a system as representing certain facts. In practice, though, whether or not a particular piece of storage contains knowledge will depend on the interpretative frames in the environment, although we need to remember that interpretative frames can emulate other interpretative frames, for example a human experimenting with multiple codes in order to decode a message.

Regarding the topic of partial knowledge, it seems that the importance of various facts will vary wildly from context to context and also depending on the goal. I'm somewhat skeptical that goal independent knowledge will have a nice definition.

Well yes I agree that knowledge exists with respect to a goal, but is there really no objective difference between an alien artifact inscribed with deep facts about the structure of the universe and set up in such a way that it can be decoded by any intelligent species that might find it, and an ordinary chunk of rock arriving from outer space?

Well, taking the simpler case of exactly reproducing a certain string, you could find the simplest program that produces the string, in the manner of Kolmogorov complexity, and use its length as a measure of complexity.
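Kolmogorov complexity itself is uncomputable, but the length of a compressed encoding gives a crude, computable upper-bound proxy for it. Here is a minimal sketch using Python's standard `zlib` module; the helper name `compressed_length` and the two example strings are assumptions for illustration, and a general-purpose compressor is a much weaker description language than "the simplest program".

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data: a rough complexity proxy."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)  # fixed seed so the example is repeatable
random_string = bytes(rng.getrandbits(8) for _ in range(1000))
patterned_string = b"ab" * 500  # same length, but highly regular

random_cost = compressed_length(random_string)
patterned_cost = compressed_length(patterned_string)
```

The patterned string compresses to a small fraction of its length while the random one barely compresses at all, so a measure of this kind ranks noise as the most "complex" input regardless of whether it is useful to anyone.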

A slightly more useful way of modelling things may be to have a bunch of different strings with different points representing levels of importance. And perhaps we produce a metric combining the Kolmogorov complexity of a decoder with the sum of the points produced, where points are obtained by concatenating the desired strings with a predefined separator. For example, we might find the quotient.

One immediate issue with this is that some of the strings may contain overlapping information. And we'd still have to produce a metric to assign importances to the strings. Perhaps a simpler case would be where the strings represent patterns in a stream via encoding a Turing machine, with the Turing machines being able to output sets of symbols instead of just symbols, representing the possible symbols at each location. And the number of points they provide would be equal to how much of the stream they allow you to predict. (This would still require producing a representation of the universe where the amount of the stream predicted would be roughly equivalent to how useful the predictions are.)

Any thoughts on this general approach?

Well here is a thought: a random string would have high Kolmogorov complexity, as would a string describing the most fundamental laws of physics. What are the characteristics of the latter that convey power over one's environment to an agent that receives it, and that are not conveyed by the former? This is the core question I'm most interested in at the moment.

Is this sequence complete? I was expecting a final literature review post before summarizing for the newsletter, but it's been a while since the last update and you've posted something new, so maybe you decided to skip it?

The sequence is now complete.

It's actually written, just need to edit and post. Should be very soon. Thanks for checking on it.

I think grappling with this problem is important because it leads you directly to understanding that what you are talking about is part of your agent-like model of systems, and how this model should be applied depends both on the broader context and your own perspective.

Au contraire, I think that "mutual information between the object and the environment" is basically the right definition of "knowledge", at least for knowledge about the world (as it correctly predicts that all four attempted "counterexamples" are in fact forms of knowledge), but that the knowledge of an object also depends on the level of abstraction of the object which you're considering.

For example, for your rock example: A rock, as a quantum object, is continually acquiring mutual information with the affairs of humans by the imprinting of subatomic information onto the surface of rock by photons bouncing off the Earth. This means that, if I was to examine the rock-as-a-quantum-object for a really long time, I would know the affairs of humans (due to the subatomic imprinting of this information on the surface of the rock), and not only that, but also the complete workings of quantum gravity, the exact formation of the rock, the exact proportions of each chemical that went into producing the rock, the crystal structure of the rock, and the exact sequence of (micro-)chips/scratches that went into making this rock into its current shape. I feel perfectly fine counting all this as the knowledge of the rock-as-a-quantum-object, because this information about the world is stored in the rock. 

(Whereas, if I were only allowed to examine the rock-as-a-macroscopic-object, I would still know roughly what chemicals it was made of and how they came to be, and the largest fractures of the rock, but I wouldn't know about the affairs of humans; hence, such is the knowledge held by the rock-as-a-macroscopic-object. This makes sense because the rock-as-a-macroscopic-object is an abstraction of the rock-as-a-quantum-object, and abstractions always throw away information except that which is "useful at a distance".)

For more abstract kinds of knowledge, my intuition defaults to question-answering/epistemic-probability/bet-type definitions, at least for sufficiently agent-y things. For example, I know that 1+1=2. If you were to ask me, "What is 1+1?", I would respond "2". If you were to ask me to bet on what 1+1 was, in such a way that the bet would be instantly decided by Omega, the omniscient alien, I would bet with very high probability (maybe 40:1 odds in favor, if I had to come up with concrete numbers?) that it would be 2 (not probability 1, because of Cromwell's law, and also because maybe my brain's mental arithmetic functions are having a bad day). However, I do not know whether the Riemann Hypothesis is true, false, or independent of ZFC. If you asked me, "Is the Riemann Hypothesis true, false, or independent of ZFC?", I would answer, "I don't know" instead of choosing one of the three possibilities, because I don't know. If you asked me to bet on whether the Riemann Hypothesis was true, false, or independent of ZFC, with the bet to be instantly decided by Omega, I might bet 70% true, 20% false, and 10% independent (totally made-up semi-plausible figures that have no bearing on the heart of the argument; I haven't really tested my probabilistic calibration), but I wouldn't put >95% implied probability on anything because I'm not that confident in any one possibility. Thusly, for abstract kinds of knowledge, I think I would say that an agent (or a sufficiently agent-y thing) knows an abstract fact X if it tells you about this fact when prompted with a suitably phrased question, and/or if it places/would place a bet in favor of fact X with very high implied probability if prompted to bet about it.
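The bet-based criterion can be written down as a small sketch. The function names are hypothetical, and the 95% cut-off is simply the figure mentioned above used as an assumed threshold; the numbers are the made-up ones from this comment.

```python
def implied_probability(odds_for: float, odds_against: float = 1.0) -> float:
    """Convert odds of odds_for:odds_against in favor into a probability."""
    return odds_for / (odds_for + odds_against)


def known_answer(bets, threshold=0.95):
    """Return the outcome bet on above `threshold`, or None if there is none."""
    best = max(bets, key=bets.get)
    return best if bets[best] > threshold else None


# 40:1 odds in favor of "2" is an implied probability of 40/41, about 0.976.
arithmetic = {"2": implied_probability(40), "not 2": implied_probability(1, 40)}
riemann = {"true": 0.70, "false": 0.20, "independent": 0.10}

arithmetic_verdict = known_answer(arithmetic)
riemann_verdict = known_answer(riemann)
```

Under this criterion the agent counts as knowing that 1+1=2, since 40/41 exceeds the threshold, and as not knowing the status of the Riemann Hypothesis, since no outcome does.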

(One problem with this definition is that, intuitively, when I woke up today, I had no idea what 384384*20201 was; the integers here are also completely arbitrary. However, after I typed it into a calculator and got 7764941184, I now know that 384384*20201 = 7764941184. I think this is also known as the problem of logical omniscience; Scott Aaronson once wrote a pretty nice essay about this topic and others from the perspective of computational complexity.)

I have basically no intuition whatsoever on what it means for a rock* to know that the Riemann Hypothesis is true, false, or independent of ZFC. My extremely stupid and unprincipled guess is that, unless a rock is physically inscribed with a proof of the true answer, it doesn't know, and that otherwise it does.

*I'm using a rock here as a generic example of a clearly-non-agentic thing. Obviously, if a rock was an agent, it'd be a very special rock, at least in the part of the multiverse that I inhabit. Feel free to replace "rock" with other words for non-agents.

Thank you for this comment duck_master.

I take your point that it is possible to extract knowledge about human affairs, and about many other things, from the quantum structure of a rock that has been orbiting the Earth. However, I am interested in a definition of knowledge that allows me to say what a given AI does or does not know, insofar as it has the capacity to act on this knowledge. For example, I would like to know whether my robot vacuum has acquired sophisticated knowledge of human psychology, since if it has, and I wasn't expecting it to, then I might choose to switch it off. On the other hand, if I merely discover that my AI has recorded some videos of humans then I am less concerned, even if these videos contain the basic data necessary to construct sophisticated knowledge of human psychology, as is the case with the rock. Therefore I am interested not just in information, but something like action-readiness. I am referring to that which is both informative and action-ready as "knowledge", although this may be stretching the standard use of this term.

Now you say that we might measure more abstract kinds of knowledge by looking at what an AI is willing to bet on. I agree that this is a good way to measure knowledge if it is available. However, if we are worried that an AI is deceiving us, then we may not be willing to trust its reports of its own epistemic state, or even of the bets it makes, since it may be willing to lose money now in order to convince us that it is not particularly intelligent, in order to make a treacherous turn later. Therefore I would very much like to find a definition that does not require me to interact with the AI through its input/output channels in order to find out what it knows, but rather allows me to look directly at its internals. I realize this may be impossible, but this is my goal.

So as you can see, my attempt at a definition of knowledge is very much wrapped up with the specific problem I'm trying to solve, and so any answers I arrive at may not be useful beyond this specific AI-related question. Nevertheless, I see this as an important question and so am content to be a little myopic in my investigation.