Epistemic status: my own thoughts I've thought up in my own time. They may be quite or very wrong! I am likely not the first person to come to these ideas. All of my main points here are just hypotheses which I've come to by the reasoning stated below. Most of it is informal mathematical arguments about likely phenomena and none is rigorous proof. I might investigate them if I had the time/money/programming skills. Lots of my hypotheses are really long and difficult-to-parse sentences.

What is knowledge?

I think this question is bad.

It's too great of a challenge. It asks us (implicitly) for a mathematically rigorous definition which fits all of our human feelings about a very loaded word. This is often a doomed endeavour from the start, as human intuitions don't neatly map onto logic. Also, humans might disagree on what things count as or do not count as knowledge. So let's attempt to right this wrong question:

Imagine a given system is described as "knowing" something. What is the process that leads to the accumulation of said knowledge likely to look like?

I think this is much better.

We limit ourselves to systems which can definitely be said to "know" something. This allows us to pick a starting point. This might be a human, GPT-3, or a neural network which can tell apart dogs and fish. In fact the last of these will be my go-to example from here on. We also don't need to perfectly specify the process which generates knowledge all at once, only comment on its likely properties.

Properties of "Learning"

Say we have a very general system with parameters θ, with t representing time during learning. Let's say they're initialized as θ0 according to some random distribution. Now it interacts with the dataset, which we will represent with X, taken from some distribution over possible datasets. The learning process will update θ0, so we can represent the parameters after some amount of time as θ(θ0;X;t). This reminds us that the set of parameters depends on three things: the initial parameters, the dataset, and the amount of training.

Consider θ(θ0;X;0). This is trivially equal to θ0, and so it depends only on the choice of θ0. The dataset has had no chance to affect the parameters in any way.

So what about as t→∞? We would expect that θ∞(θ0;X)=θ(θ0;X;∞) depends mostly on the choice of X and much less strongly on θ0. There will presumably be some dependency on initial conditions, especially for very complex models like a big neural network with many local minima. But mostly it's X which influences θ.
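As a minimal sketch of the notation, here is θ(θ0;X;t) for a toy one-parameter model trained by gradient descent. The function name `train`, the two small datasets and the learning rate are all made up for illustration. Because this toy loss is convex, the dependence on θ0 vanishes completely as t grows, which a big neural network with many local minima would not do.

```python
def train(theta0, X, t, lr=0.1):
    """Return theta(theta0; X; t): t steps of gradient descent on mean squared error.

    X is a list of (x, y) pairs and the model is y ~ theta * x (one parameter).
    """
    theta = theta0
    for _ in range(t):
        grad = sum(2 * (theta * x - y) * x for x, y in X) / len(X)
        theta -= lr * grad
    return theta

# Two hypothetical datasets drawn from different "worlds".
X_a = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]     # roughly y = 2x
X_b = [(1.0, -1.0), (2.0, -2.1), (3.0, -2.9)]  # roughly y = -x

for theta0 in (5.0, -3.0):
    print("theta0 =", theta0,
          "| t=0:", train(theta0, X_a, 0),
          "| t=200 on X_a:", train(theta0, X_a, 200),
          "| t=200 on X_b:", train(theta0, X_b, 200))
```

At t=0 the result is just θ0. After many steps the two initializations agree on each dataset, while the two datasets give clearly different parameters.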

So far this is just writing out basic sequences stuff. To make a map of the city you have to look at it, and to learn, your model has to causally entangle itself with the dataset. But let's think about what happens when X is slightly different.

Changes in the world

So far we've represented the whole dataset with a single letter X, as if it were just a number or something. But in reality it will have many, many independent parts. Most datasets which are used as inputs to learning processes are also highly structured.

Consider the dog-fish discriminator, trained on the dataset Xdog/fish. The system θ∞(θ0;Xdog/fish) could be said to have "knowledge" that "dogs have two eyes". One thing this means is that if we instead fed it an X which was identical except that every dog had three eyes (TED), then the final values of θ would be different. The same is true of facts like "fish have scales" and "dogs have one tail". We could express this as follows:

θ∞(θ0;Xdog/fish+ΔXTED)

Where ΔXTED is the modification of "photoshopping the dogs to have three eyes". We now have:

θ∞(θ0;Xdog/fish+ΔXTED)=θ∞(θ0;Xdog/fish)+Δθ∞(θ0;Xdog/fish;ΔXTED)

Now let's consider how Δθ∞(θ0;X;ΔX) behaves. For lots of choices of ΔX it might just be a series of random changes tuning the whole set of θ values. But from my knowledge of neural networks, it might not be. Lots of image-recognition networks have been found to contain neurons with specific functions which relate to structures in the data, from simple line detectors all the way up to "cityscape" detectors.

For this reason I suggest the following hypothesis:

Structured and localized changes in the dataset that a parameterized learning system is exposed to will cause localized changes in the final values of the parameters.
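To make the hypothesis concrete, here is a toy case in which it holds. It is a sketch under strong assumptions: a linear model, three uncorrelated input features, and a made-up change ΔX that only alters how the labels depend on feature 1. All names and numbers are hypothetical. The localization here follows from the uncorrelated features, so it illustrates what the hypothesis claims rather than giving evidence that real networks behave this way.

```python
from itertools import product

def train(theta0, data, steps=200, lr=0.1):
    """Full-batch gradient descent for a linear model y ~ sum_j theta[j] * x[j]."""
    theta = list(theta0)
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in data:
            err = sum(t * xj for t, xj in zip(theta, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * err * xj / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

inputs = list(product([-1.0, 1.0], repeat=3))  # 8 rows, uncorrelated features
X = [(x, 2 * x[0] + 1 * x[1] - 1 * x[2]) for x in inputs]
# Localized, structured change: only the dependence on feature 1 is altered.
X_changed = [(x, y + 3 * x[1]) for x, y in X]

theta0 = [0.5, -0.5, 0.25]
delta = [b - a for a, b in zip(train(theta0, X), train(theta0, X_changed))]
print("delta theta_inf:", delta)
```

Only the middle entry of `delta` is meaningfully different from zero. With correlated features or a nonlinear model the change could smear across many parameters, which is exactly what the hypothesis bets against.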

Impracticalities and Solutions

Now it would be lovely to train all of GPT-3 twice, once with the original dataset, and once in a world where dogs are blue. Then we could see the exact parameters that lead it to return sentences like "the dog had [chocolate rather than azure] fur". Unfortunately rewriting the whole training dataset around this is just not going to happen.

Finding the flow of information and influence in a system is easy if you have a large distribution of different inputs and outputs (and a good idea of the direction of causality). If you have just a single example, you can't use any statistical tools at all.

So what else can we do? Well we don't just have access to θ∞. In principle we could look at the course of the entire training process and how θ changes over time. For each timestep, and each element of the dataset X, we could record how much each element of θ is changed. We'll come back to this.
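The bookkeeping described above might look something like the following sketch. It assumes per-example SGD on a small linear model. The names `sgd_with_log` and `total_update` and the three-element dataset are hypothetical. For a real network the log would be enormous, so this only shows the shape of the idea.

```python
def sgd_with_log(theta0, data, epochs=50, lr=0.05):
    """Per-example SGD on squared error for a linear model.

    Returns the final parameters and a log of (timestep, element index, delta),
    where delta[j] is the change applied to theta[j] at that step.
    """
    theta = list(theta0)
    log = []
    t = 0
    for _ in range(epochs):
        for i, (x, y) in enumerate(data):
            err = sum(p * xj for p, xj in zip(theta, x)) - y
            delta = [-lr * 2 * err * xj for xj in x]
            theta = [p + d for p, d in zip(theta, delta)]
            log.append((t, i, delta))
            t += 1
    return theta, log

def total_update(log, n_elements, n_params):
    """Collapse the log into U[i][j]: summed |change| to theta[j] caused by element i."""
    U = [[0.0] * n_params for _ in range(n_elements)]
    for _, i, delta in log:
        for j, d in enumerate(delta):
            U[i][j] += abs(d)
    return U

# Hypothetical dataset: elements 0 and 2 only involve feature 0, element 1 only feature 1.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 0.0), 2.0)]
theta, log = sgd_with_log([0.0, 0.0], data)
U = total_update(log, len(data), 2)
for i, row in enumerate(U):
    print("element", i, "update sizes per parameter:", row)
```

Each row of `U` says which parameters a given dataset element actually moved during training. That per-element record is the raw material the rest of the argument relies on.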

Let's consider the dataset as a function of the external world: X(Ω). All the language we've been using about knowledge has previously only applied to the dataset. Now we can describe how it applies to the world as a whole.

For some things the equivalence of knowledge of X and Ω is pretty obvious. If the dataset is being used for a self-driving car and it's just a bunch of pictures and videos then basically anything the resulting parameterised system knows about X it also knows about Ω. But for obscure manufactured datasets like [4000 pictures of dogs photoshopped to have three eyes] it's really not clear.

Either way, we can think about Ω as having influence over X the same way as we can think about X as having influence over θ∞. So we might be able to form hypotheses about this whole process. Let's go back to Xdog/fish. First off imagine a change Ωnew=Ω+ΔΩ, such as "dogs have three eyes". This will change some elements of X more than others. Photos taken from certain angles, or of certain breeds, will be changed more. Photos of fish will stay the same!

Now we can imagine a function Δθ(θ0;X(Ω);ΔX(Ω;ΔΩ)). This represents some propagation of influence from Ω→X→θ. Note that the influence of Ω on X is independent of our training process or θ0. This makes sense because different bits of the training dataset contain information about different bits of the world. How different training methods extract this information might be less obvious.
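One way to picture this propagation is as a first-order chain rule. This is only a sketch. It pretends that θ∞ is a smooth function of X and that X is a smooth function of Ω, which is an assumption not justified anywhere in this post, and certainly not for discrete things like photos.

```latex
\Delta X \approx \frac{\partial X}{\partial \Omega}\,\Delta\Omega,
\qquad
\Delta\theta_\infty \approx \frac{\partial \theta_\infty}{\partial X}\,\Delta X
\quad\Longrightarrow\quad
\Delta\theta_\infty \approx
\frac{\partial \theta_\infty}{\partial X}\,
\frac{\partial X}{\partial \Omega}\,\Delta\Omega
```

Under that assumption, the factor ∂X/∂Ω involves neither θ0 nor the training procedure, while ∂θ∞/∂X carries all of the dependence on how the system learns.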

The Training Process

During training, θ(t) is exposed to various elements of X and updated. Different elements of X will update θ(t) by different amounts. Since the learning process is about transferring influence over θ from θ0 to Ω (acting via X), we might expect that a given element of X has more "influence" over the final values of those elements of θ which were changed the most by exposure to that particular element during training.

This leads us to a second hypothesis:

The degree to which an element of the dataset causes an element of the parameters to be updated during training is correlated with the degree to which a change to that dataset element would have caused a change in the final value of the parameter.

Which is equivalent to:

Knowledge of a specific property of the dataset is disproportionately concentrated in the elements of the final parameters that have been updated the most during training when "exposed" to certain dataset elements that have a lot of mutual information with that property.

For the dog-fish example: elements of parameter space which have updated disproportionately when exposed to photos of dogs that contain the dogs' heads (and therefore show just two eyes), will be more likely to contain "knowledge" of the fact that "dogs have two eyes".
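The second hypothesis can at least be checked on a toy problem where retraining is cheap. The sketch below assumes a two-parameter linear model, full-batch gradient descent and a made-up four-element dataset. It compares U[i][j], the total size of element i's contributions to updates of θ[j], against C[i][j], how far the final θ[j] moves when element i's label is nudged and the model is retrained. Every name here is hypothetical, and a positive correlation in a model this simple says very little about large networks.

```python
def train(theta0, data, steps=500, lr=0.1):
    """Full-batch gradient descent on squared error for a linear model.

    Also returns U[i][j]: summed size of element i's contribution to updates of theta[j].
    """
    theta = list(theta0)
    n = len(data)
    U = [[0.0] * len(theta) for _ in data]
    for _ in range(steps):
        grads = []
        for x, y in data:
            err = sum(p * xj for p, xj in zip(theta, x)) - y
            grads.append([2 * err * xj for xj in x])
        for i, g in enumerate(grads):
            for j, gij in enumerate(g):
                U[i][j] += abs(lr * gij / n)
        theta = [p - lr * sum(g[j] for g in grads) / n for j, p in enumerate(theta)]
    return theta, U

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

data = [((1.0, 0.0), 2.0), ((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
theta0 = [0.0, 0.0]
theta_inf, U = train(theta0, data)

# Counterfactual influence C[i][j]: nudge the label of element i, retrain, compare.
C = []
for i, (x, y) in enumerate(data):
    changed = data[:i] + [(x, y + 1.0)] + data[i + 1:]
    theta_cf, _ = train(theta0, changed)
    C.append([abs(p - q) for p, q in zip(theta_cf, theta_inf)])

flat_U = [u for row in U for u in row]
flat_C = [c for row in C for c in row]
corr = pearson(flat_U, flat_C)
print("correlation between update size and counterfactual influence:", corr)
```

The retraining loop is the expensive part that is impractical at scale. The hypothesis amounts to hoping that `U`, which falls out of a single training run, is a usable stand-in for `C`.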

This naturally leads us to a final hypothesis:

Correlating update-size as a function of dataset-element across two models will allow us to identify subsets of parameters which contain the same knowledge across two very different models.

Therefore

Access to a simple interpreted model of a system will allow us to rapidly infer information about a much larger model of the same system if they are trained on the same datasets, and we have access to both training histories.
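Here is a rough sketch of what "correlating update-size as a function of dataset-element across two models" could mean in practice. It is an assumption-heavy toy. Both models are linear and both are trained on the same four raw elements. The "larger" one sees the inputs in a different order with one redundant copy. The function `update_profile`, the feature maps and the data are all invented for illustration. Real models that differ in architecture would not line up this cleanly.

```python
def update_profile(features, raw_data, theta0, steps=200, lr=0.1):
    """Train a linear model by full-batch gradient descent on squared error.

    `features` maps a raw dataset element to that model's input vector.
    Returns P[j][i]: summed size of element i's contribution to updates of theta[j].
    """
    theta = list(theta0)
    n = len(raw_data)
    P = [[0.0] * n for _ in theta]
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for i, (raw, y) in enumerate(raw_data):
            x = features(raw)
            err = sum(p * xj for p, xj in zip(theta, x)) - y
            for j, xj in enumerate(x):
                g = 2 * err * xj / n
                grad[j] += g
                P[j][i] += abs(lr * g)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return P

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

# Raw elements are (u, v) pairs with label 2u - v (hypothetical data).
raw_data = [((1.0, 0.0), 2.0), ((1.0, 0.0), 2.0),
            ((0.0, 1.0), -1.0), ((0.0, 1.0), -1.0)]

# Small "interpreted" model: parameter 0 reads u, parameter 1 reads v.
P_small = update_profile(lambda r: [r[0], r[1]], raw_data, [0.0, 0.0])
# Larger model: parameter 0 reads v, parameters 1 and 2 both read u.
P_large = update_profile(lambda r: [r[1], r[0], r[0]], raw_data, [0.3, -0.2, 0.1])

# For each small-model parameter, find the large-model parameter whose
# update profile across dataset elements correlates best with it.
matches = {
    j: max(range(len(P_large)), key=lambda k: pearson(P_small[j], P_large[k]))
    for j in range(len(P_small))
}
print(matches)
```

The small model's u parameter gets paired with one of the large model's u parameters, and its v parameter with the large model's v parameter, using nothing but the two training histories. Whether anything like this survives contact with two genuinely different networks is the open question.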

Motivation

I think an AI which takes over the world will have a very accurate model of human morality, it just won't care about it. I think that one way of getting the AI to not kill us is to extract parts of the human utility-function-value-system-decision-making-process-thing from its model and tell the AI to do those. I think that to do this we need to understand more about where exactly the "knowledge" is in an inscrutable model. I also find thinking about this very interesting.
