Status: a brief distillation of Wittgenstein's book On Certainty, using examples from deep learning and GOFAI, plus discussion of AI alignment and interpretability.
"That is to say, the questions that we raise and our doubts depend on the fact that some propositions are exempt from doubt, are as it were like hinges on which those turn."
— Ludwig Wittgenstein, On Certainty
1. Deep Learning
Suppose we want a neural network to detect whether two children are siblings based on photographs of their faces. The network will receive two n-dimensional vectors v1 and v2 representing the pixels in each image, and will return a value y(v1,v2) ∈ R which we interpret as the probability that the children are siblings. So the model has type-signature R^(n+n) → R.
There are two ways we can do this.
We could use an architecture yA(v1,v2) = σ(v1ᵀ A v2 + b), where —
σ is the sigmoid function
A is an n×n matrix of learned parameters,
b∈R is a learned bias.
This model has n² + 1 free parameters.
Alternatively, we could use an architecture yU(v1,v2) = σ(v1ᵀ (U + Uᵀ) v2 + b), where —
σ is the sigmoid function
U is an n×n upper-triangular matrix of learned parameters
b∈R is a learned bias
This model has n²/2 + n/2 + 1 free parameters.
Each model has a vector of free parameters θ ∈ Θ. If we train the model via SGD on a dataset (or via some other method), we end up with a trained model yθ : R^(n+n) → R, where y_ : Θ → (R^(n+n) → R) is the architecture.
Anyway, we now have two different NN models, and we want to ascribe beliefs to each of them. Consider the proposition ϕ that siblingness is symmetric, i.e. every person is the sibling of their siblings. What does it mean to say that a model knows or believes that ϕ?
Let's start with a black-box definition of knowledge or belief: when we say that a model knows or believes that ϕ, we mean that yθ(v1,v2)=yθ(v2,v1) for all v1,v2∈Rn which look sufficiently like faces. According to this black-box definition, both trained models believe ϕ.
But if we peer inside the black box, we can see that NN Model 1 believes ϕ in a very different way than how NN Model 2 believes ϕ.
For NN Model 1, the belief is encoded in the learned parameters θ∈Θ.
For NN Model 2, the belief is encoded in the architecture itself y_.
These are two different kinds of belief.
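The contrast can be sketched in a few lines of NumPy. This is a hand-rolled illustration, not anyone's actual training code: the dimension n = 8 and the random "face" vectors are stand-ins for real image data. The point is that Model 2 passes the black-box symmetry check at random initialisation, before any training, while Model 1 generically does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # illustrative image-vector dimension

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# NN Model 1: a full n-by-n matrix A of free parameters. Symmetry of
# the output is not guaranteed by the architecture; it would have to
# be learned during training.
A = rng.normal(size=(n, n))
b = 0.1

def y_A(v1, v2):
    return sigmoid(v1 @ A @ v2 + b)

# NN Model 2: an upper-triangular matrix U of free parameters, used
# via the symmetric effective matrix U + U^T. Here y_U(v1, v2) equals
# y_U(v2, v1) for *every* setting of the parameters.
U = np.triu(rng.normal(size=(n, n)))

def y_U(v1, v2):
    return sigmoid(v1 @ (U + U.T) @ v2 + b)

v1, v2 = rng.normal(size=n), rng.normal(size=n)
# Model 2 is symmetric even at random initialisation:
assert abs(y_U(v1, v2) - y_U(v2, v1)) < 1e-10
# Model 1's bilinear form is generically asymmetric before training:
assert abs(v1 @ A @ v2 - v2 @ A @ v1) > 1e-8
```

Note that Model 2's symmetry assertion holds for any θ ∈ Θ whatsoever, which is exactly the sense in which the belief lives in the architecture rather than the parameters.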
2. Symbolic Logic
Suppose we use GOFAI/symbolic logic to determine whether two children are siblings.
Our model consists of three things —
A language L consisting of names and binary familial relations.
A knowledge-base Γ consisting of L-formulae.
A deductive system ⊢ which takes a set of L-formulae (premises) to a larger set of L-formulae (conclusions).
There are two ways we can do this.
We could use a system (L,Γ,⊢) , where —
The language L has names for every character and familial relations parent,child,sibling,grandparent,grandchild,cousin
The knowledge-base Γ has axioms {sibling(Jack,Jill), ∀x∀y(sibling(x,y) → sibling(y,x))}
The deductive system ⊢ corresponds to first-order predicate logic.
Alternatively, we could use a system (L,Γ,⊢), where —
The language L has names for every character and familial relations parent,child,sibling,grandparent,grandchild,cousin
The knowledge-base Γ has axioms {sibling(Jack,Jill)}
The deductive system ⊢ corresponds to first-order predicate logic with an additional logical rule sibling(x,y)⊢sibling(y,x).
In this situation, we have two different SL models, and we want to ascribe beliefs to each of them. Consider the proposition ϕ that siblingness is symmetric, i.e. every person is the sibling of their siblings.
Let's start with a black-box definition of knowledge or belief: when we say that a model knows or believes that ϕ, we mean that Γ⊢sibling(τ1,τ2)→sibling(τ2,τ1) for every pair of closed L-terms τ1,τ2. According to this black-box definition, both models believe ϕ.
But if we peer inside the black box, we can see that SL Model 1 believes ϕ in a very different way than how SL Model 2 believes ϕ.
For SL Model 1, the belief is encoded in the knowledge-base Γ.
For SL Model 2, the belief is encoded in the deductive system ⊢ itself.
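To make the contrast concrete, here is a toy forward-chaining sketch in Python. It is an illustrative stand-in for a real first-order prover: the tuple encoding of formulae and both engines are my own assumptions, not a standard library. Model 1 stores the symmetry axiom as data in Γ and uses a generic engine; Model 2 stores only the particular fact, with the swap rule hard-coded into the engine itself.

```python
def generic_closure(gamma):
    """A generic engine: repeatedly apply every implication found in the
    knowledge-base itself to the ground facts (a toy stand-in for
    first-order deduction)."""
    facts = {f for f in gamma if f[0] != "implies"}
    impls = [f for f in gamma if f[0] == "implies"]
    changed = True
    while changed:
        changed = False
        for _, (rel, x, y), (rel2, x2, y2) in impls:
            for f in list(facts):
                if f[0] == rel:
                    binding = {x: f[1], y: f[2]}
                    new = (rel2, binding[x2], binding[y2])
                    if new not in facts:
                        facts.add(new)
                        changed = True
    return facts

def swap_closure(gamma):
    """An engine with the symmetry rule built into |- itself."""
    facts = set(gamma)
    changed = True
    while changed:
        changed = False
        for (rel, a, b) in list(facts):
            if rel == "sibling" and (rel, b, a) not in facts:
                facts.add((rel, b, a))
                changed = True
    return facts

# SL Model 1: symmetry is an axiom sitting inside Gamma.
gamma1 = {("sibling", "Jack", "Jill"),
          ("implies", ("sibling", "x", "y"), ("sibling", "y", "x"))}
# SL Model 2: Gamma holds only the particular fact.
gamma2 = {("sibling", "Jack", "Jill")}

assert ("sibling", "Jill", "Jack") in generic_closure(gamma1)
assert ("sibling", "Jill", "Jack") in swap_closure(gamma2)
```

Both engines derive sibling(Jill, Jack), but only Model 2 would still derive it if the axiom were deleted from Γ: the belief is immune to revisions of the knowledge-base.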
These are two different kinds of belief. Can you see how they map onto the distinction in the previous section?
3. Wittgenstein
In On Certainty, Wittgenstein contrasts two different kinds of belief.
Humans have free beliefs and hinge beliefs.
A human's free beliefs are similar to how NN Model 1 and SL Model 1 believe ϕ. In other words, these are beliefs encoded in our learned parameters θ∈Θ, or in the knowledge-base Γ.
In contrast, a human's hinge beliefs are similar to how NN Model 2 and SL Model 2 believe ϕ. In other words, these are beliefs encoded in the architecture itself y_, or in the deductive system ⊢ .
Here are some of my free beliefs:
Cairo is the capital of Egypt.
101 is a prime number.
There are eight planets in the Solar System.
Today is a Thursday.
Here are some of my hinge beliefs:
I am currently on Earth.
Today is not 1943.
Here is my hand.
The external world exists.
My memory is at least somewhat reliable over short timespans.
Let's use LessWrong's favourite analogy — the map and the territory.
We might say the map knows that Manchester is north of Portsmouth because that's what's shown on the map. This would count as a free belief.
We might also say the map knows that England is roughly two dimensional — that's also shown on the map. But this would count as a hinge belief, because it's not a free parameter.
Wittgenstein calls these "hinge beliefs" because they must stay fixed, allowing our world-model to "swing like a door" through the remaining possibilities.
Hinge beliefs are not like axioms. They aren't foundational, but instead pre-foundational. They are the presuppositions for our conceptual map to connect with the external world whatsoever.
Hinge beliefs are not subject to rational evaluation or empirical testing, but they can be evaluated in other ways.
It's somewhat defective to say "I know ϕ" or "I doubt ϕ" when ϕ is a hinge belief.
|              | Perception               | Judgement           |
|--------------|---------------------------|---------------------|
| Free belief  | This cat is furry         | Today is a Thursday |
| Hinge belief | There are three colours   | ϕ, ϕ→ψ ⊢ ψ          |
4. Alignment relevance
Depending on the architecture, randomly initialised neural networks will "know" things.
Determining which hinge beliefs are induced by a neural network architecture is (in general) non-trivial.
Whether a belief is a hinge belief or a free belief will affect —
Capabilities
Safety
Interpretability
The general trend of ML over the past ten years has been towards free beliefs rather than hinge beliefs. If there are fewer hinges, then the door can swing through a wider space, i.e. the model is more general.
Nonetheless, even the most general architecture must induce some hinge beliefs, because otherwise the models couldn't correspond to any external territory whatsoever.
As a rough rule-of-thumb, I expect that swapping free beliefs with hinge beliefs would make AI more safe and less capable. I'm not sure whether this would be worthwhile on the safety-capabilities trade-off, and I'm not sure whether it would make AI more interpretable (but my guess is slightly yes).
If mechanistic interpretability goes well, then we should be able to take a trained neural network with free beliefs, identify certain symmetries/regularities within the parameters, and then convert the model into an equivalent model where those beliefs are now hinges. In other words, we should be able to turn knowledge stuck in the parameters into knowledge stuck in the architecture.
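Here is a minimal sketch of that conversion for the bilinear models from Section 1. Since we have no real trained model, the near-symmetric matrix A is constructed by hand, simulating a Model-1 network that has learned the symmetry of siblinghood as a free belief; the tolerance 1e-4 is an arbitrary illustrative threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# Simulate a trained NN Model 1 whose matrix A came out nearly
# symmetric, i.e. the symmetry belief is sitting in the learned
# parameters. (Constructed by hand for illustration.)
A = rng.normal(size=(n, n))
A = (A + A.T) / 2 + 1e-6 * rng.normal(size=(n, n))

# Step 1: identify the regularity in the parameters.
asymmetry = np.linalg.norm(A - A.T) / np.linalg.norm(A)
assert asymmetry < 1e-4

# Step 2: re-express the model in NN Model 2's parameterization: an
# upper-triangular U whose effective matrix U + U^T reproduces the
# symmetric part of A (off-diagonal entries go in the upper triangle,
# diagonal entries are halved so they are not double-counted).
S = (A + A.T) / 2
U = np.triu(S, k=1) + np.diag(np.diag(S)) / 2

# The belief is now a hinge: U + U^T is symmetric for every U.
assert np.allclose(U + U.T, S)
```

After the rewrite, the symmetry belief can no longer be unlearned by gradient updates to U, which is one concrete sense in which hinging a belief trades capability for safety.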