Towards a Less Bullshit Model of Semantics

johnswentworth; David Lorell

Or: Towards Bayesian Natural Language Semantics In Terms Of Interoperable Mental Content

Or: Towards a Theory of Interoperable Semantics

You know how natural language “semantics” as studied in e.g. linguistics is kinda bullshit? Like, there’s some fine math there, it just ignores most of the thing which people intuitively mean by “semantics”.

When I think about what natural language “semantics” means, intuitively, the core picture in my head is:

I hear/read some words, and my brain translates those words into some kind of internal mental content.
The mental content in my head somehow “matches” the mental content typically evoked in other peoples’ heads by the same words, thereby allowing us to communicate at all; the mental content is “interoperable” in some sense.

That interoperable mental content is “the semantics of” the words. That’s the stuff we’re going to try to model.

The main goal of this post is to convey what it might look like to “model semantics for real”, mathematically, within a Bayesian framework.

But Why Though?

There’s lots of reasons to want a real model of semantics, but here’s the reason we expect readers here to find most compelling:

The central challenge of ML interpretability is to faithfully and robustly translate the internal concepts of neural nets into human concepts (or vice versa). But today, we don’t have a precise understanding of what “human concepts” are. Semantics gives us an angle on that question: it’s centrally about what kind of mental content (i.e. concepts) can be interoperable (i.e. translatable) across minds.

Later in this post, we give a toy model for the semantics of nouns and verbs of rigid body objects. If that model were basically correct, it would give us a damn strong starting point on what to look for inside nets if we want to check whether they’re using the concept of a teacup or free-fall or free-falling teacups. This potentially gets us much of the way to calculating quantitative bounds on how well the net’s internal concepts match humans’, under conceptually simple (though substantive) mathematical assumptions.

Then compare that to today: Today, when working on interpretability, we’re throwing darts in the dark, don’t really understand what we’re aiming for, and it’s not clear when the darts hit something or what, exactly, they’ve hit. We can do better.

Overview

In the first section, we will establish the two central challenges of the problem we call Interoperable Semantics. The first is to characterize the stuff within a Bayesian world model (i.e. mental content) to which natural-language statements resolve; that’s the “semantics” part of the problem. The second aim is to characterize when, how, and to what extent two separate models can come to agree on the mental content to which natural language resolves, despite their respective mental content living in two different minds; that’s the “interoperability” part of the problem.

After establishing the goals of Interoperable Semantics, we give a first toy model of interoperable semantics based on the “words point to clusters in thingspace” mental model. As a concrete example, we quantify the model’s approximation errors under an off-the-shelf gaussian clustering algorithm on a small-but-real dataset. This example emphasizes the sort of theorems we want as part of the Interoperable Semantics project, and the sorts of tools which might be used to prove those theorems. However, the example is very toy.

Our second toy model sketch illustrates how to construct higher level Interoperable Semantics models using the same tools from the first model. This one is marginally less toy; it gives a simple semantic model for rigid body nouns and their verbs. However, this second model is more handwavy and has some big gaps; its purpose is more illustration than rigor.

Finally, we have a call to action: we’re writing this up in the hopes that other people (maybe you!) can build useful Interoperable Semantics models and advance our understanding. There’s lots of potential places to contribute.

What’s The Problem?

Central Problem 1: How To Model The Magic Box?

So we have two agents, Alice and Carol. (Yes, Bob is also hanging around, we’ll get to him shortly). We’re in a Bayesian framework, so they each have a probabilistic world model. Carol is telling a story and says “... and then the fox jumped!”. Alice hears these words, and updates her internal world model to include the fox jumping.^[1]

Somewhere in the middle there, some magic box in Alice’s brain needed to turn the words “and then the fox jumped” into some stuff which she can condition on - e.g. things like , where $X$ is some random variable in Alice's world model. That magic box is the semantics box: words go in, mental content comes out. The first central problem of Interoperable Semantics, in a Bayesian frame, is how to model the box. The semantics of ‘and then the fox jumped’ is the stuff which Alice conditions on, like $X = 12.3$ : i.e. the Bayesian “mental content” into which natural language is translated by the box.

Subproblem: What’s The Range Of Box Outputs?

A full Bayesian model of semantics would probably involve a full human-like world model. After all, the model would need to tell us exactly which interoperable-variable values are spit out by a magic semantics box for any natural language text.

A shorter-term intermediate question is: what even is the input set (i.e. domain) and output set (i.e. range) of the semantics box? Its inputs are natural language, but what about the outputs?

We can already give a partial answer: because we’re working in a Bayesian frame, the outputs of the semantics box need to be assignments of values to random variables in the world model, like $X = 12.3$ . (Note that functions of random variables and data structures over random variables are themselves random variables, so the semantics box could e.g. output assignments of values to a whole list of functions of random variables.)

… but that’s only a partial answer, because the set of semantic outputs must be far smaller than the set of assignments of values to random variables. We must think of the children.

SubSubProblem: What Can Children Attach Words To?

Key observation: children need only a handful of examples - sometimes even just one - in order to basically-correctly learn the meaning of a new word (or short phrase^[2].) So there can’t be that many possible semantic targets for a word.

For instance, we noted earlier that any function of random variables in Alice’s model is itself a random variable in Alice’s model - e.g. if $X$ is a (real-valued) random variable, then so is $X^{2} + 3$ . Now imagine that a child’s model includes their visual input as a 1 megabit random variable, and any function of that visual input is a candidate semantic target. Well, there are $2^{2^{1000000}}$ functions of 1 million bits, so the child would need $2^{1000000}$ bits in order to pick out one of them. In other words: the number of examples that hypothetical child would need in order to learn a new word would be quite dramatically larger than the number of atoms in the known universe.

Takeaway: there can’t be that many possible semantic targets for words. The set of semantic targets for words (in humans) is at least exponentially smaller than the set of random variables in an agent’s world model.

That’s the main problem of interest to us, for purposes of this post: what’s the set of possible semantic targets for a word?^[3]

Summary So Far

Roughly speaking, we want to characterize the set of things which children can attach words to, in a Bayesian frame. Put differently, we want to characterize the set of possible semantic targets for words. Some constraints on possible answers:

Since we’re in a Bayesian frame, any semantic targets should be assignments of values (e.g. $X = 12.3$ ) to random variables in an agent’s model. (Note that this includes functions of random variables in the agent’s model, and data structures of random variables in the agent’s model.)
… but the set of possible semantic targets for words must be exponentially smaller than the full set of possible assignments of values to random variables.

So we know the set we’re looking for is a relatively-small subset of the whole set of assignments to random variables, but we haven’t said much yet about which subset it is. In order to narrow down the possibilities, we’ll need (at least) one more criterion… which brings us to Central Problem 2.

Central Problem 2: How Do Alice and Bob “Agree” On Semantics?

Let’s bring Bob into the picture.

Bob is also listening to Carol’s story. He also hears “... and then the fox jumped!”, and a magic semantics box in his brain also takes in those words and spits out some stuff which Bob can condition on - e.g. things like $Y = 1$ , where $Y$ is some random variable in Bob’s world model.

Now for the key observation which will constrain our semantic model: in day-to-day practice, Alice and Bob mostly agree, in some sense, on what sentences mean. Otherwise, language couldn’t work at all.

Notice that in our picture so far, the output of Alice’s semantics-box consists of values of some random variables in Alice’s model, and the output of Bob’s semantics-box consists of values of some random variables in Bob’s model. With that picture in mind, it’s unclear what it would even mean for Alice and Bob to “agree” on the semantics of sentences. For instance, imagine that Alice and Bob are both Solomonoff inductors with a special module for natural language. They both find some shortest program to model the world, but the programs they find may not be exactly identical; maybe Alice and Bob are running slightly different Turing machines, so their shortest programs have somewhat different functions and variables internally. Their semantics-boxes then output values of variables in those programs. If those are totally different programs, what does it even mean for Alice and Bob to “agree” on the values of variables in these totally different programs?

The second central problem of Interoperable Semantics is to account for Alice and Bob’s agreement. In the Bayesian frame, this means that we should be able to establish some kind of (approximate) equivalence between at least some of the variables in the two agents’ world models, and the outputs of the magic semantics box should only involve those variables for which we can establish equivalence.

In other words: not only must our model express the semantics of language in terms of mental content (i.e. values of random variables, in a Bayesian frame), it must express the semantics of language in terms of interoperable mental content - mental content which has some kind of (approximate) equivalence to other agents’ mental content.

Summary: Interoperable Semantics

In technical terms: we’d ultimately like a model of (interoperable mental content)-valued semantics, for Bayesian agents. The immediate challenge which David and I call Interoperable Semantics is to figure out a class of random variables suitable for the “interoperable mental content” in such a model, especially for individual words. Specifically, we want a class of random variables which

is rich enough to reasonably represent the semantics of most words used day-to-day in natural language, but
small enough for each (word -> semantics) mapping to be plausibly learnable with only a few examples, and
allows us to establish some kind of equivalence of variables across Bayesian agents in the same environment.

Beyond that, we of course want our class of random variables to be reasonably general and cognitively plausible as an approximation - e.g. we shouldn’t assume some specific parametric form.

At this point, we’re not even necessarily looking for “the right” class of random variables, just any class which satisfies the above criteria and seems approximately plausible.

The rest of this post will walk through a couple initial stabs at the problem. They’re pretty toy, but will hopefully illustrate what a solution could even look like in principle, and what sort of theorems might be involved.

First (Toy) Model: Clustering + Naturality

As a conceptual starting point, let’s assume that words point to clusters in thingspace in some useful sense. As a semantic model, the “clusters in thingspace” conceptual picture is both underspecified and nowhere near rich enough to support most semantics - or even the semantics of any single everyday word. But it will serve well as a toy model to illustrate some foundational constraints and theorems involved in Interoperable Semantics. Later on, we’ll use the “clusters in thingspace” model as a building block for a richer class of variables.

With that in mind: suppose Alice runs a bog-standard Bayesian clustering algorithm on some data. (Concrete example: below we’ll use an off-the-shelf expectation-maximization algorithm for a mixture of Gaussians with diagonal covariance, on ye olde iris dataset.) Out pop some latents: estimated cluster labels and cluster parameters. Then Bob runs his Bayesian clustering algorithm on the data - but maybe he runs a different Bayesian clustering algorithm than Alice, or maybe he runs the same algorithm with different initial conditions, or maybe he uses a different random subset of the data for training. (In the example below, it’s the ‘different-initial-conditions’ option.)

Insofar as words point to clusters in thingspace and the project of Interoperable Semantics is possible at all, we should be able to establish some sort of equivalence between the clusters found by Alice and Bob, at least under some reasonably-permissive conditions which Alice and Bob can verify. In other words, we should be able to take the cluster labels and/or parameters as the “set of random variables” representing interoperable mental content.

Equivalence Via Naturality

We want a set of random variables which allows for some kind of equivalence between variables in Alice’s model and variables in Bob’s model. For that, we’ll use the machinery of natural latents.

Here are the preconditions we need:

Predictive Agreement: once Alice and Bob have both trained their clustering algorithms on the data, they must agree on the distribution of new data points. (This assumption does not require that they train on the same data, or that they use the same clusters to model the distribution of new data points.)
Mediation: under both Alice and Bob’s trained models, (some subset of) the “features” must be independent within each cluster, i.e. the cluster label mediates between features.
Redundancy: under both Alice and Bob’s trained models, the cluster label must be estimable to high confidence while ignoring any one feature.

The second two conditions (mediation and redundancy) allow for approximation via KL-divergence; see Natural Latents: The Math for the details. Below, we’ll calculate the relevant approximation errors for an example system.^[4] We do not currently know how to handle approximation gracefully for the first condition; the first thing we tried didn’t work for that part.

So long as those conditions hold (approximately, where relevant), the cluster label is a(n approximate) natural latent, and therefore Alice’s cluster label is (approximately) isomorphic to Bob’s cluster label. (Quantitatively: each agent’s cluster label has bounded entropy given the other agent’s cluster label, with the bound going to zero linearly as the approximation error for the preconditions goes to zero.)

So, when the preconditions hold, we can use assignments of values to the cluster label (like e.g. “cluster_label = 2”) as semantic targets for words, and have an equivalence between the mental content which Alice and Bob assign to the relevant words. Or, in English: words can point to clusters.

A Quick Empirical Check

In order for cluster equivalence to apply, we needed three preconditions:

Predictive Agreement: Alice and Bob must agree on the distribution of new data points.
Mediation: under both Alice and Bob’s trained models, (some subset of) the “features” must be independent within each cluster, i.e. the cluster label mediates between features.
Redundancy: under both Alice and Bob’s trained models, the cluster label must be estimable to high confidence while ignoring any one feature.

We don’t yet know how to usefully quantify approximation error for the first precondition. But we can quantify the approximation error for the mediation and redundancy conditions under a small-but-real model. So let’s try that for a simple clustering model: mixture of gaussians, with diagonal covariance, trained on ye olde iris dataset.

The iris dataset contains roughly 150 points in a 4 dimensional space of flower-attributes^[5]. For the mixture of gaussians clustering, we David used the scikit implementation. Github repo here, which consists mainly of the methods to estimate the approximation errors for the mediation and redundancy conditions.

How do we estimate those approximation errors? (Notation: $X$ is one sample, $Λ$ is the cluster label for that sample.)

Mediation: under this model, mediation holds exactly. The covariance is diagonal, so (under the model) the features are exactly independent within each cluster. Approximation error is 0.
Redundancy: as somewhat-sneakily stated, our redundancy condition includes two pieces
- Actual Redundancy Condition: How many bits of information about $Λ$ are lost when dropping variable $X_{i}$ : $D_{K L} (P [X, Λ] | | P [Λ | X_{¯ i}] P [X])$ , which can be rewritten as $E_{X} [D_{K L} (P [Λ | X] | | P [Λ | X_{¯ i}])]$ . Since the number of values of $Λ$ is the number of clusters, that last expression is easy to estimate by sampling a bunch of $X$ -values and averaging the $D_{K L}$ for each of them.
- Determinism: Entropy of $Λ$ given $X$ ^[6]. Again, we sample a bunch of $X$ -values, and average the entropy of $Λ$ conditional on each $X$ -value.

Here are the redundancy approximation errors, in bits, under dropping each of the four components of $X$ , after two different training runs of the model:

Redundancy Error (bits)	Drop (0,)	Drop (1,)	Drop (2,)	Drop (3,)
First run (“Alice”)	0.0211	0.011	0.048	0.089
Second run (“Bob”)	0.034	0.004	0.031	0.177

Recall that we use these approximation errors to bound the approximation error of isomorphism between the two agents’ cluster labels. Specifically, if we track the $ϵ$ ’s through the proofs in Natural Latents: The Math, we’ll find that:

Alice’s natural latent is a deterministic function of Bob’s to within (sum of Alice’s redundancy errors) + (entropy of Alice’s label given $X$ ) + (entropy of Bob’s label given $X$ )
Bob’s natural latent is a deterministic function of Alice’s to within (sum of Bob’s redundancy errors) + (entropy of Alice’s label given $X$ ) + (entropy of Bob’s label given $X$ )

(Note: the proofs as-written in Natural Latents: The Math assume that the redundancy error for each component of $X$ is the same; dropping that assumption is straightforward and turns $n ϵ_{r e d}$ into $\sum ϵ_{r e d}^{i}$ ; thus the sum of redundancy errors.) In the two runs of our clustering algorithm above, we find:

Sum of redundancy errors in the first run is 0.168 bits
Sum of redundancy errors in the second run is 0.246 bits
Entropy of label given $X$ in the first run is 0.099 bits
Entropy of label given $X$ in the second run is 0.099 bits

So, ignoring the differing distribution over new data points under the two models, we should find:

The first model’s cluster label is a deterministic function of the second to within 0.366 bits (i.e. entropy of first label given second is at most 0.366 bits)
The second model’s cluster label is a deterministic function of the first to within 0.444 bits.

Though the differing distribution over new data points is still totally unaccounted-for, we can estimate those conditional entropies by averaging over the data, and we actually find:

Entropy of first model’s cluster label given second model’s: 0.222 bits
Entropy of second model’s cluster label given first model’s, is also: 0.222 bits

The entropies of the cluster labels under the two models are 1.570 and 1.571 bits, respectively, so indeed each model’s cluster label is approximately a deterministic function of the other, to within reasonable error (~0.22 bits of entropy out of ~1.57 bits).

Strengths and Shortcomings of This Toy Model

First, let’s walk through our stated goals for Interoperable Semantics. We want a class of (assignments of values to) random variables which:

is rich enough to reasonably represent the semantics of most words used day-to-day in natural language, but
small enough for each (word -> semantics) mapping to be plausibly learnable with only a few examples, and
allows us to establish some kind of equivalence of variables across Bayesian agents in the same environment.

When the preconditions hold, how well does the cluster label variable fit these requirements?

We already discussed the equivalence requirement; that one works (as demonstrated numerically above), insofar as the preconditions hold to within reasonable approximation. The main weakness is that we don’t yet know how to handle approximation in the requirement that our two agents have the same distribution over new data points.

Can the (word -> semantics) mapping plausibly be learned with only a few examples? Yes! Since each agent already calculated the clusters from the data (much like a child), all that’s left to learn is which cluster gets attached to which word. So long as the clusters don’t overlap much (which turns out to be implied by the mediation and redundancy conditions), that’s easy: we just need ~1 example from each cluster with the corresponding word attached to it.

Is the cluster label rich enough to reasonably represent the semantics of most words used day-to-day in natural language? Lol no. What’s missing?

Well, we implicitly assumed that the two agents are clustering data in the same (i.e. isomorphic) space, with the same (i.e. isomorphic) choice of features (axes.) In order to “do semantics right”, we’d need to recurse: find some choice of features which comes with some kind of equivalence between the two agents.

Would equivalence between choice of features require yet another assumption of equivalence, on which we also need to recurse? Probably. Where does it ground out? Usually, I (John) am willing to assume that agents converge on a shared notion of spacetime locality, i.e. what stuff is “nearby” other stuff in the giant causal web of our universe. So insofar as the equivalence grounds out in a shared notion of which variables are local in space and time, I’m happy with that. Our second toy model will ground out at that level, though admittedly with some big gaps in the middle.

Aside: What Does “Grounding In Spacetime Locality” Mean?

The sort of machinery we’re using (i..e natural latents) needs to start from some random variable $X$ which is broken into components $X_{1}, \dots, X_{n}$ . The machinery of natural latents doesn’t care about how each component is represented; replacing $X_{1}$ with something isomorphic to $X_{1}$ doesn’t change the natural latents at all. But it does matter which components we break $X$ into.

I’m generally willing to assume that different agents converge on a shared notion of spacetime locality, i.e. which stuff is “near” which other stuff in space and time. With that assumption, we can break any random variable $X$ into spacetime-local components - i.e. each component represents the state of the world at one point in space and time. Thanks to the assumed convergent notion of spacetime locality, different agents agree on that decomposition into components (though they may have different representations of each component).

So when we talk about “grounding in spacetime locality”, we mean that our argument for equivalence of random variables between the two agents should start, at the lowest level, from the two agents having equivalent notions of how to break up their variables into components each representing state of the world at a particular place and time.

Second (Toy) Model Sketch: Rigid Body Objects

Our first model was very toy, but we walked through the math relatively thoroughly (including highlighting the “gaps” which our theorems don’t yet cover, like divergence of the agents’ predictive distributions). In this section we’ll be less rigorous, but aim to sketch a more ambitious model. In particular, we’ll sketch an Interoperable Semantic model for words referring to rigid-body objects - think “teacup” or “pebble”. We still don’t expect this model to be fully correct, even for rigid-body objects, but it will illustrate how to build higher-level semantic structures using building blocks similar to the clustering model.

First we’ll sketch out the model and argument for interoperability (i.e. naturality of the latents) for just one rigid-body object - e.g. a teacup. We’ll see that the model naturally involves both a “geometry” and a “trajectory” of the object. Then, we’ll introduce clusters of object-geometries as a model of (rigid body) noun semantics, and clusters of object-trajectories as a model of verb semantics.

Note that there will be lots of handwaving and some outright gaps in the arguments in this model sketch. We’re not aiming for rigor here, or even trying to be very convincing; we’re just illustrating how one might build up higher-level semantic structures.

The Teacup

Imagine a simple relatively low-level simulation of a teacup moving around. The teacup is represented as a bunch of particles. In terms of data structures, the code tracks the position and orientation of each particle at each time.

There’s a lot of redundancy in that representation; an agent can compress it a lot while still maintaining reasonable accuracy. For instance, if we approximate away vibrations/bending (i.e. a rigid body approximation), then we can represent the whole set of particle-trajectories using only:

The trajectory of one reference particle’s position and orientation
The initial position and orientation of each particle, relative to the reference particle

We’ll call these two pieces “trajectory” and “geometry”. Because the teacup’s shape stays the same, all the particles are always in the same relative positions, so this lower-dimensional factorization allows an agent to compress the full set of particle-trajectories.

Can we establish naturality (i.e. mediation + redundancy) for the geometry and trajectory, much like we did for clusters earlier?

Here’s the argument sketch for the geometry:

Under the rigid body approximation, the geometry is conserved over time. So, if we take $X_{t}$ to be the state of all the particles at time $t$ , the geometry is redundantly represented over $X_{t}, X_{t + 1}, \dots$ ; it should approximately satisfy the redundancy condition.
If the teacup moves around randomly enough for long enough, the geometry will be the only information about the cup at an early time which is relevant to much later times. In that case, the geometry mediates between $X_{t}$ and $X_{t^{'}}$ for times $t$ and $t^{'}$ sufficiently far apart; it approximately satisfies the mediation condition.

So, intuitively, the geometry should be natural over the time-slices of our simulation.

Next, let’s sketch the argument for the trajectory:

Under the rigid body approximation, if I know the trajectory of any one particle, that tells me the trajectory of the whole teacup, since the particles always maintain the same relative positions and orientations. So, the trajectory should approximately satisfy the redundancy condition over individual particle-trajectories.
If I know the trajectory of the teacup overall, then under the rigid body approximation I can calculate the trajectory of any one particle, so the particle trajectories are technically all independent given the teacup trajectory. So, the trajectory should approximately satisfy the mediation condition over individual particle-trajectories.

That means the trajectory should be natural over individual particle trajectories.

Now, there’s definitely some subtlety here. For instance: our argument for naturality of the trajectory implicitly assumes that the geometry was also known; otherwise I don’t know where the reference particle is relative to the particle of interest. We could tweak it a little to avoid that assumption, or we could just accept that the trajectory is natural conditional on the geometry. That’s the sort of detail which would need to be nailed down in order to turn this whole thing into a proper Interoperable Semantics model.

… but we’re not aiming to be that careful in this section, so instead we’ll just imagine that there’s some way to fill in all the math such that it works.

Assuming there’s some way to make the math work behind the hand-wavy descriptions above, what would that tell us?

We get a class of random variables in our low-level simulation: “geometry” (natural latent over time-slices), and “trajectory” (natural latent over individual particle-trajectories). Let’s say that the simulation is Alice’s model. Then for the teacup’s geometry, naturality says:

If two Alice and Bob both make the same predictions about the low-level particles constituting the teacup…
and those predictions match a rigid body approximation reasonably well, but have lots of uncertainty over long-term motion…
and there’s some variable in Alice’s model which both mediates between the particle-states at far-apart times and can be reconstructed from particle-states at any one time…
and there’s some variable in Bob’s model which also both mediates between the particle-states at far-apart times and can be reconstructed from particle-states at any one time…
… then Alice’s variable and Bob’s variable give the same information about the particles. Furthermore, if the two variables’ values can both be approximated reasonably well from the full particle-trajectories, then they’re approximately isomorphic.

In other words: we get an argument for some kind of equivalence between “geometry” in Alice’s model and “geometry” in Bob’s model, insofar as their models match predictively. Similarly, we get an argument for some kind of equivalence between “trajectory” in Alice’s model and “trajectory” in Bob’s model.

So:

We have a class of random variables in Alice’s model and Bob’s model…
… which isn’t too big (roughly speaking, natural latents are approximately unique, so there’s approximately just the two variables)
… and for which we have some kind of equivalence between Alice and Bob’s variables.

What we don’t have, yet, is enough expressivity for realistic semantics. Conceptually, we have a class of interoperable random variables which includes e.g. any single instance of a teacup, and the trajectory of that teacup. But our class of interoperable random variables doesn’t contain a single variable representing the whole category of teacups (or any other category of rigid body objects), and it’s that category which the word “teacup” itself intuitively points to.

So let’s add a bit more to the model.

Geometry and Trajectory Clusters

Now we imagine that Alice’s model involves lots of little local models running simulations like the teacup, for different rigid-body objects around her. So there’s a whole bunch of geometries and trajectories which are natural over different chunks of the world.

Perhaps Alice notices that there’s some cluster-structure to the geometries, and some cluster-structure to the trajectories. For instance, perhaps there’s one cluster of similarly-shaped rigid-body geometries which one might call “teacups”. Perhaps there’s also one cluster of similar trajectories which one might call “free fall”. ~~Perhaps this particular geometry-cluster and trajectory-cluster are particularly fun when combined.~~

Hopefully you can guess the next move: we have clusters, so let’s apply the naturality conditions for clusters from the first toy model. For both the geometry-clusters and the trajectory-clusters, we ask:

Are (some subset of) the “features” approximately independent within each single cluster?
Can the cluster-label of a point be estimated to high confidence ignoring any one of (the same subset of) the “features”?

If yes to both of these, then we have naturality, just like in the first toy model. Just like the first toy model, we then have an argument for approximate equivalence between Alice and Bob’s clusters, assuming they both satisfy the naturality conditions and make the same predictions.

(Note that we’ve said nothing at all about what the “features” are; that’s one of the outright gaps which we’re handwaving past.)

With all that machinery in place, we make the guess:

(rigid body) nouns typically refer to geometry clusters; example: “teacup”
(rigid body) verbs typically refer to trajectory clusters; example: “free fall”

With that, we have (hand-wavily)

a class of random variables in each agent’s world model…
which we expect to typically be small enough for each (word -> semantics) mapping to be learnable with only a handful of examples (i.e. a word plus one data point in a cluster is enough to label the cluster with the word)...
and we have a story for equivalence of these random variables across two agents’ models…
and we have an intuitive story on how this class of random variables is rich enough to capture the semantics of many typical rigid body nouns and verbs, like “teacup” or “free fall”.

Modulo handwaving and (admittedly large) gaps, we have hit all the core requirements of an Interoperable Semantics model.

Strengths and Shortcomings of This Toy Model

The first big strength of this toy model is that it starts from spacetime locality: the lowest-level natural latents are over time-slices of a simulation and trajectories of particles. (The trajectory part is not quite fully grounded, since “particles” sure are an ontological choice, but you could imagine converting the formulation from Lagrangian to Eulerian; that’s an already-reasonably-common move in mathematical modeling. In the Eulerian formulation, the ontological choices would all be grounded in spacetime locality.)

The second big strength is expressivity. We have a clear notion of individual (rigid body) objects and their geometries and trajectories, nouns and verbs point to clusters of geometries and trajectories, this all intuitively matches our day-to-day rigid-body models relatively nicely.

The shortcomings are many.

First, we reused the clustering machinery from the first toy model, so all the shortcomings of that model (other than limited expressivity) are inherited. Notably, that includes the “what features?” question. We did ground geometries in spacetime locality and trajectories in particles (which are a relatively easy step up from spacetime locality), so the “what features” question is handled for geometries and trajectories. But then we cluster geometries, and cluster trajectories. What are the “features” of a rigid body object’s geometry, for clustering purposes? What are the “features” of a rigid body object’s trajectory, for clustering purposes? We didn’t answer either of those questions. The underdetermination of feature choice at the clustering stage is probably the biggest “gap” which we’ve ignored in this model.

Second, when defining geometries and trajectories, note that we defined the geometry to be a random variable which “both mediates between the particle-states at far-apart times and can be reconstructed from particle-states at any one time”. That works fine if there’s only one rigid body object, but if there’s multiple rigid body objects in the same environment, then the “geometry” under that definition would be a single variable summarizing the geometry of all the rigid body objects. That’s not what we want; we want distinct variables for the geometry (and trajectory) of each object. So the model needs to be tweaked to handle multiple rigid bodies in the same environment.

Third, obviously we were very handwavy and didn’t prove anything in this section.

Fourth, obviously the model is still quite limited in expressivity. It doesn’t handle adjectives or adverbs or non-rigid-body nouns or …

Summary and Call To Action

The problem we call Interoperable Semantics is to find some class of random variables (in a Bayesian agent’s world model) which

is rich enough to reasonably represent the semantics of most words used day-to-day in natural language, but
small enough for each (word -> semantics) mapping to be plausibly learnable with only a few examples, and
allows us to establish some kind of equivalence of variables across Bayesian agents in the same environment.

Beyond that, we of course want our class of random variables to be reasonably general and cognitively plausible as an approximation - e.g. we shouldn’t assume some specific parametric form.

At this point, we’re not even necessarily looking for “the right” class of random variables, just any class which satisfies the above criteria and seems approximately plausible.

That, we claim, is roughly what it looks like to “do semantics for real” - or at least to start the project.

Call To Action

We’re writing up this post now because it’s maybe, just barely, legible enough that other people could pick up the project and make useful progress on it. There’s lots of potential entry points:

Extend the methods used in our toy models to handle more kinds of words and phrases:
- other kinds of nouns/verbs
- adjectives and adverbs
- subject/object constructions
- etc.
Fill some of the gaps in the arguments
Find some other arguments to establish equivalence across agents
Take the toy models from this post, or some other Interoperable Semantics models, and go look for the relevant structures in real models and/or datasets (either small scale or large scale)
Whatever other crazy stuff you come up with!

We think this sort of project, if it goes well, could pretty dramatically accelerate AI interpretability, and probably advance humanity’s understanding of lots of other things as well. It would give a substantive, quantitative, and non-ad-hoc idea of what stuff interpretability researchers should look for. Rather than just shooting in the dark, it would provide some actual quantitative models to test.

Thank you to Garret Baker, Jeremy Gillen, and Alexander Gietelink-Oldenziel for feedback on a draft of this post.

^{^}
In this post, we’ll ignore Gricean implicature; our agents just take everything literally. Justification for ignoring it: first, the cluster-based model in this post is nowhere near the level of sophistication where lack of Gricean implicature is the biggest problem. Second, when it does come time to handle Gricean implicature, we do not expect that the high-level framework used here - i.e. Bayesian agents, isomorphism between latents - will have any fundamental trouble with it.
^{^}
When we say “word” or “short phrase”, what we really mean is “atom of natural language.”
^{^}
A full characterization of interoperable mental content / semantics requires specifying the possible mappings of larger constructions, like sentences, into interoperable mental content, not just words. But once we characterize the mental content which individual words can map to (i.e. their ‘semantic targets’,) we are hopeful that the mental content mapped to by larger constructions (e.g. sentences,) will usually be straightforwardly constructable from those smaller pieces. So if we can characterize “what children can attach words to”, then we’d probably be most of the way to characterizing the whole range of outputs of the magic semantics box.
Notably, going from words to sentences and larger constructs is the focus of the existing academic field of “semantics”. What linguists call “semantics” is mostly focused on constructing semantics of sentences and larger constructs from the semantics of individual words (“atoms”). From their standpoint, this post is mostly about characterizing the set of semantic values of atoms, assuming Bayesian agents.
^{^}
For those who read Natural Latents: The Math before this post, note that we added an addendum shortly before this post went up. It contains a minor-but-load-bearing step for establishing approximate isomorphism between two agents’ natural latents.
^{^}
Sepal length, sepal width, petal length, and petal width in case you were wondering, presumably collected from a survey of actual flowers last century.
^{^}
Remember that addendum we mentioned in an earlier footnote? The determinism condition is for that part.

Notice that in our picture so far, the output of Alice’s semantics-box consists of values of some random variables in Alice’s model, and the output of Bob’s semantics-box consists of values of some random variables in Bob’s model. With that picture in mind, it’s unclear what it would even mean for Alice and Bob to “agree” on the semantics of sentences. For instance, imagine that Alice and Bob are both Solomonoff inductors with a special module for natural language. They both find some shortest program to model the world, but the programs they find may not be exactly identical; maybe Alice and Bob are running slightly different Turing machines, so their shortest programs have somewhat different functions and variables internally. Their semantics-boxes then output values of variables in those programs. If those are totally different programs, what does it even mean for Alice and Bob to “agree” on the values of variables in these totally different programs?

This importantly understates the problem. (You did say "for instance" -- I don't think you are necessarily ignoring the following point, but I think it is a point worth making.)

Even if Alice and Bob share the same universal prior, Solomonoff induction comes up with agent-centric models of the world, because it is trying to predict perceptions. Alice and Bob may live in the same world, but they will perceive different things. Even if they stay in the same room and look at the same objects, they will see different angles.

If we're lucky, Alice and Bob will both land on two-part representations which (1) model the world from a 3rd person perspective, and (2) then identify the specific agent whose perceptions are being predicted, providing a 'phenomonological bridge' to translate the 3rd-person view of reality into a 1st person view. Then we're left with the problem which you mention: Alice and Bob could have slightly different 3rd-person understandings of the universe.

If we could get there, great. However, I think we imagine Solomonoff induction arriving at such a two-part model largely because we think it is smart, and we think smart people understand the world in terms of physics and other 3rd-person-valid concepts. We think the physicalist/objective conception of the world is true, and therefore, Solomonoff induction will figure out that it is the best way.

Maybe so. But it seems pretty plausible that a major reason why humans arrive at these 'objective' 3rd-person world-models is because humans have a strong incentive to think about the world in ways that make communication possible. We come up with 3rd-person descriptions of the words because they are incredibly useful for communicating. Solomonoff induction is not particularly designed to respect this incentive, so it seems plausible that it could arrange its ontology in an entirely 1st-person manner instead.

But it seems pretty plausible that a major reason why humans arrive at these 'objective' 3rd-person world-models is because humans have a strong incentive to think about the world in ways that make communication possible.

This is an interesting point which I had not thought about before, thank you. Insofar as I have a response already, it's basically the same as this thread: it seems like understanding of interoperable concepts falls upstream of understanding non-interoperable concepts on the tech tree, and also there's nontrivial probability that non-interoperable concepts just aren't used much even by Solomonoff inductors (in a realistic environment).

Ah, don't get me wrong: I agree that understanding interoperability is the thing to focus on. Indeed, I think perhaps "understanding" itself has something to do with interoperability.

The difference, I think, is that in my view the whole game of interoperability has to do with translating between 1st person and 3rd person perspectives.

Your magic box takes utterances and turns them into interoperable mental content.

My magic box takes non-interoperable-by-default^[1] mental content and turns them into interoperable utterances.

The language is the interoperable thing. The nature of the interoperable thing is that it has been optimized so as to easily translate between many not-so-easily-interoperable (1st person, subjective, idiosyncratic) perspectives.

^{^}
"Default" is the wrong concept here, since we are raised from little babies to be highly interoperable, and would die without society. What I mean here is something like, it is relatively easy to spell out non-interoperable theories of learning / mental content, EG solomonoff's theory, or neural nets.

Takeaway: there can’t be that many possible semantic targets for words. The set of semantic targets for words (in humans) is at least exponentially smaller than the set of random variables in an agent’s world model.

I don't think this follows. The set of semantic targets could be immense, but children and adults could share sufficiently similar priors, such that children land on adequately similar concepts to those that adults are trying to communicate with very little data.

Think of it like a modified Schelling-point game, where some communication is possible, but sending information is expensive. Alice is trying to find Bob in the galaxy, and Bob has been able to communicate only a little information for Alice to go on. However, Alice and Bob are both from Earth, so they share a lot of context. Bob can say "the moon" and Alice knows which moon Bob is probably talking about, and also knows that there is only one habitable moon-base on the moon to check.

Bob could find a way to point Alice to any point in the galaxy, but Bob probably won't need to. So the set of possibilities appears to be small, from the perspective of someone who only sees a few rounds of this game.

So really, rather than "the set of semantic targets is small", I should say something like "the set of semantic targets with significant prior probability is small", or something like that. Unclear exactly what the right operationalization is there, but I think I buy the basic point.

That’s the main problem of interest to us, for purposes of this post: what’s the set of possible semantic targets for a word?

From the way you've defined things so far, it seems relatively clear what it would mean to solve this problem for sentences; translating from "X" to X has been operationalized as what you condition on if you take "X" literally.

However, the jump you are making to the meaning of a word seems surprising and unclear. If Carol shouts "Ball!" it is unclear what it would mean to condition on the literal content; it seems to be all pragmatics. Since Carol didn't bother to form a valid sentence, she is not making a claim which can be true or false. It could mean "there is a ball coming at your head" or it could mean "We forgot the basketball at the court" or any number of other things, depending on context.

So, while it does indeed seem meaningful to talk about the semantics of words, the picture you have drawn so far of the "magic box" does not seem to fit the case of individual words. We do not condition on the literal meaning of individual words; those meanings have the wrong type signature to condition on.

We can already give a partial answer: because we’re working in a Bayesian frame, the outputs of the semantics box need to be assignments of values to random variables in the world model, like

Why random variables, rather than events? In terms of your sketched formalism so far, it seems like events are the obvious choice -- events are the sort of thing we can condition on. Assigning a random variable to a value is just an indirect way to point out an event; and, this indirect method creates a lot of redundancy, since there are many many assignments-of-random-variables-to-values which would point out the same event.

First: if the random variables include latents which extend some distribution, then values of those latents are not necessarily representable as events over the underlying distribution. Events are less general. (Related: updates allowed under radical probabilism can be represented by assignments of values to latents.)

Second: I want formulations which feel like they track what's actually going on in my head (or other peoples' heads) relatively well. Insofar as a Bayesian model makes sense for the stuff going on in my head at all, it feels like there's a whole structure of latent variables, and semantics involves assignments of values to those variables. Events don't seem to match my mental structure as well. (See How We Picture Bayesian Agents for the picture in my head here.)

The two perspectives are easily interchangeable, so I don't think this is a big disagreement. But the argument about extending a distribution seems... awful? I could just as well say that I can extend my event algebra to include some new events which cannot be represented as values of random variables over the original event algebra, "so random variables are less general".

The second central problem of Interoperable Semantics is to account for Alice and Bob’s agreement. In the Bayesian frame, this means that we should be able to establish some kind of (approximate) equivalence between at least some of the variables in the two agents’ world models, and the outputs of the magic semantics box should only involve those variables for which we can establish equivalence.

To me, this seems like a strange way to go about it, if your hope is to address AI safety concerns. If Alice is trying to understand Bob, and Alice sees that Bob uses a weird blob of incomprehensible gibberish as a key step in his reasoning, then Alice should think she has failed, rather than thinking she should ignore that part.

In some sense, agents come equipped with a 1st-person perspective (a set of cognitive tools which is useful for predicting their own sense-data and managing their own actions), and the challenge we face is one of translating that 1st-person perspective to a 3rd-person perspective (an interoperable language which can readily be translated into many different 1st person perspectives, ie, understood by many different agents).

That particular paragraph was intended to be about two humans. The application to AI safety is less direct than "take Alice to be a human, and Bob to be an AI" or something like that.

That makes sense. But, effectively, you are deferring the question of how it relates to AI safety. If I have my intuition (roughly, that the most important part of the problem is how to understand alien concepts which AIs might have) and you have your intuition (roughly, that the most important part of the problem is how to understand human concepts) then presumably we can try and articulate some reasons.

I've said something about why I think it seems important not to give up on mental content that seems hard to translate. Perhaps you could say a bit more about why you are interested in a thingy that only looks for easily translatable content and ignores hard-to-translate content?

I definitely have substantial probability on the possibility that AIs will use a bunch of alien (i.e. non-interoperable or hard-to-interoperate) concepts. And in worlds where that's true, I largely agree that those are the most important (i.e. hardest/rate-limiting) part of the technical problems of AI safety.

That said:

I have substantial probability that AIs basically don't use a bunch of non-interoperable concepts (or converge to more interoperable concepts as capabilities grow, or ...). In those worlds, I expect that "how to understand human concepts" is the rate-limiting part of the problem.
Even in worlds where AIs do use lots of alien concepts, it feels like understanding human concepts is "earlier on the tech tree" than figuring out what to do with those alien concepts. Like, it is a hell of a lot easier to understand those alien concepts by first understanding human concepts and then building on that understanding, than by trying to jump straight to alien concepts.

What would constitute "understanding human concepts" in the relevant sense?

In another comment, I suggested that human concepts can be represented in human language. This might miss out on some important human mental content, but it would not miss out on anything that the magic box spits out, since the magic box is specifically dealing with language.

This trivializes the magic box; it becomes the identity function, or at best, a paraphrasing function. But what, exactly, is wrong with such a trivial understanding of the magic box? Where does it fall short of the sort of understanding you seek to achieve?

It frames things in terms of events (each event labeled with a natural-language sentence) rather than random variables, like you want, but I can trivially reframe it in terms of random variables by considering the truth value of the sentences as 0,1 instead of true,false.

Yes, I intuitively feel that this is a dumb trivial proposal that contributes nothing to our understanding of concepts. But, I quote:

At this point, we’re not even necessarily looking for “the right” class of random variables, just any class which satisfies the above criteria and seems approximately plausible.

One example: you know that thing where I point at a cow and say "cow", and then the toddler next to me points at another cow and is like "cow?", and I nod and smile? That's the thing we want to understand. How the heck does the toddler manage to correctly point at a second cow, on their first try, with only one example of me saying "cow"? (Note that same question still applies if they take a few tries, or have heard me use the word a few times.)

The post basically says that the toddler does a bunch of unsupervised structure learning, and then has a relatively small set of candidate targets, so when they hear the word once they can assign the word to the appropriate structure. And then we're interested in questions like "what are those structures?", and interoperability helps narrow down the possibilities for what those structures could be.

... and I don't think I've yet fully articulated the general version of the problem here, but the cow example is at least one case where "just take the magic box to be the identity function" fails to answer our question.

Since we’re in a Bayesian frame, any semantic targets should be assignments of values (e.g. ) to random variables in an agent’s model. (Note that this includes functions of random variables in the agent’s model, and data structures of random variables in the agent’s model.)
… but the set of possible semantic targets for words must be exponentially smaller than the full set of possible assignments of values to random variables.

I commented about why I disagree with the first bullet point here, and the second bullet point here.

A shorter-term intermediate question is: what even is the input set (i.e. domain) and output set (i.e. range) of the semantics box? Its inputs are natural language, but what about the outputs?

Because you are making the assumption that the important semantic content is inter-operable, and you're assuming this interoperable content is mediated entirely through language (Carol doesn't get to EG demonstrate how to tie shoelaces visually), It seems like Alice should be able to tell people what she understood Carol to mean.

In other words, it seems like your framework implies that you can use language itself as the representation without losing anything. Yes, an utterance will have many equivalent paraphrasings; IE the magic box is not a 1-1 function. It can be very lossy. However, the magic box should not add information. So the semantic content Y can be represented by one of the utterances X which would land on it.

If the magic box does add information (EG, if Carol says 'apple', Alice always imagines a specific color of apple, and does so randomly so that information is really added in an infotheoretic sense) ... well, I suppose that can happen, but we've violated the assumption that the magic box is a function, and also I think something has gone wrong in terms of Alice trying to understand Carol (Alice should understand that the apple could be any color).

I don't think this is quite right? Most of the complexity of the box is supposed to be learned in an unsupervised way from non-language data (like e.g. visual data). If someone hasn't already done all that unsupervised learning, then they don't "know what's in the box", so they don't know how to extract semantics from words.

I don't disagree with this point. I don't see how it undermines the idea that all of the semantic content of language can be represented via language. (I'm not sure what you understood me to be saying, such that this objection of yours felt relevant.)

I'm not claiming that our mental representations of semantic content "are" linguistic, or that they "come from" language. I'm just saying that we can use language to represent them.

Importantly, it is also possible that there are forms of mental content which are very difficult or even impossible to communicate with language alone, like perhaps thoughts about knot-tying. I am only claiming that the output of the magic box described here can necessarily be represented linguistically.

The central challenge of ML interpretability is to faithfully and robustly translate the internal concepts of neural nets into human concepts (or vice versa). But today, we don’t have a precise understanding of what “human concepts” are. Semantics gives us an angle on that question: it’s centrally about what kind of mental content (i.e. concepts) can be interoperable (i.e. translatable) across minds.

It seems to me like there's an important omission here: we also don't understand what we really want to point at whet we say "the internal concepts of neural nets".

One might say that understanding "human concepts" is more the central difficulty here, because the human concepts are what we're trying to translate into.

However, we also need to understand what we're translating out of. For example, we might find a translation from NN activations to human concepts which is highly satisfying by some metric, but, which fails to uncover deceptive cognition within the NN. One idea for how to avoid this: ignoring content which we do not know how to translate into human concepts needs to count as a failure, rather than a success. Notice how this requires a notion of 'content' which we are trying to translate.

We can perhaps understand this as a 'strategy-stealing' requirement: to fully understand the content of an NN means to be able to replicate all of its capabilities using the translated content (importantly, including hidden capabilities which we don't see on our test data).

In this post, we’ll ignore Gricean implicature; our agents just take everything literally. Justification for ignoring it: first, the cluster-based model in this post is nowhere near the level of sophistication where lack of Gricean implicature is the biggest problem. Second, when it does come time to handle Gricean implicature, we do not expect that the high-level framework used here - i.e. Bayesian agents, isomorphism between latents - will have any fundamental trouble with it.

A naive reader may think that "ignoring Gricean implicature" means pretending that it doesn't exist; to be more precise: pretending that the semantics and pragmatics of an utterance are equal.

(I will use 'pragmatics' to mean all implications a listener can draw from an utterance, including Gricean implicature, and 'semantics' to mean only the literal implications. For example, if I say "you left the door open" then (depending on context) I probably am implying that you should close it; this is pragmatics and gricean implicature, but is not a literal implication of what I said. This is also a near-synonym of a connotation/denotation distinction, where connotationpragmatics, denotation $\approx$ semantics.)

However, the way you frame the problem actually critically relies on a semantics/pragmatics distinction. You define "the magic box" to be what translates from the utterance to what you would condition on if you took the sentence literally: the difference between $C a r o l S a y s (‘ ‘ X ")$ vs $X$ .

IE, when Alice hears Carol say something, she conditions on the full sensory experience, and reaches the full range of pragmatic conclusions: $P_{A l i c e} (. | C a r o l S a y s (‘ ‘ X "))$ . But, for the purpose of sussing out semantics, what you want to do in the post is pretend that Alice takes Carol literally, and conditions only on the semantic content of what Carol says: $P_{A l i c e} (. | X)$ .

Hence, the magic box is a function relating pragmatics to semantics; it takes the event which we would condition on to get the pragmatics (namely: the full sensory experience) and maps it to the semantics (the literal meaning of what was said).

I suppose I didn't draw out the critical implication I'm trying to point to:

If you buy my argument that, far from ignoring semantics vs pragmatics, your way of framing the problem relies critically on the distinction...

...then you should be more curious about what is going on with the distinction, rather than writing it off as a less important detail to be figured out later.

I take pragmatics to be easy to understand (so long as we take it to include semantics, rather than be exclusive): the pragmatics of an utterance is just what a Bayesian listener would infer from it. (We can, if we like, also point to the pragmatic intent: what the speaker was trying to get the listener to infer.)

What seems hard is, how do we point out only the semantic content, when in conversation we always need to think about the full pragmatics?

Why do we even believe that utterances have literal content, rather than only a cloud of probabilistic implications? How could such a belief be grounded in linguistic behavior, aside from the brute fact that people talk about this distinction as if it is a thing? What singles out some inferences as semantic? What makes those inferences different from other pragmatic inferences?

It seems like it has something to do with always-valid inferences vs context-sensitive inferences, for one thing.

Notice that in our picture so far, the output of Alice’s semantics-box consists of values of some random variables in Alice’s model, and the output of Bob’s semantics-box consists of values of some random variables in Bob’s model. With that picture in mind, it’s unclear what it would even mean for Alice and Bob to “agree” on the semantics of sentences. For instance, imagine that Alice and Bob are both Solomonoff inductors with a special module for natural language. They both find some shortest program to model the world, but the programs they find may not be exactly identical; maybe Alice and Bob are running slightly different Turing machines, so their shortest programs have somewhat different functions and variables internally. Their semantics-boxes then output values of variables in those programs. If those are totally different programs, what does it even mean for Alice and Bob to “agree” on the values of variables in these totally different programs?

This importantly understates the problem. (You did say "for instance" -- I don't think you are necessarily ignoring the following point, but I think it is a point worth making.)

But it seems pretty plausible that a major reason why humans arrive at these 'objective' 3rd-person world-models is because humans have a strong incentive to think about the world in ways that make communication possible.

Ah, don't get me wrong: I agree that understanding interoperability is the thing to focus on. Indeed, I think perhaps "understanding" itself has something to do with interoperability.

The difference, I think, is that in my view the whole game of interoperability has to do with translating between 1st person and 3rd person perspectives.

Your magic box takes utterances and turns them into interoperable mental content.

My magic box takes non-interoperable-by-default^[1] mental content and turns them into interoperable utterances.

^{^}
"Default" is the wrong concept here, since we are raised from little babies to be highly interoperable, and would die without society. What I mean here is something like, it is relatively easy to spell out non-interoperable theories of learning / mental content, EG solomonoff's theory, or neural nets.

Takeaway: there can’t be that many possible semantic targets for words. The set of semantic targets for words (in humans) is at least exponentially smaller than the set of random variables in an agent’s world model.

That’s the main problem of interest to us, for purposes of this post: what’s the set of possible semantic targets for a word?

We can already give a partial answer: because we’re working in a Bayesian frame, the outputs of the semantics box need to be assignments of values to random variables in the world model, like

The second central problem of Interoperable Semantics is to account for Alice and Bob’s agreement. In the Bayesian frame, this means that we should be able to establish some kind of (approximate) equivalence between at least some of the variables in the two agents’ world models, and the outputs of the magic semantics box should only involve those variables for which we can establish equivalence.

That particular paragraph was intended to be about two humans. The application to AI safety is less direct than "take Alice to be a human, and Bob to be an AI" or something like that.

That said:

I have substantial probability that AIs basically don't use a bunch of non-interoperable concepts (or converge to more interoperable concepts as capabilities grow, or ...). In those worlds, I expect that "how to understand human concepts" is the rate-limiting part of the problem.
Even in worlds where AIs do use lots of alien concepts, it feels like understanding human concepts is "earlier on the tech tree" than figuring out what to do with those alien concepts. Like, it is a hell of a lot easier to understand those alien concepts by first understanding human concepts and then building on that understanding, than by trying to jump straight to alien concepts.

What would constitute "understanding human concepts" in the relevant sense?

Yes, I intuitively feel that this is a dumb trivial proposal that contributes nothing to our understanding of concepts. But, I quote:

At this point, we’re not even necessarily looking for “the right” class of random variables, just any class which satisfies the above criteria and seems approximately plausible.

Since we’re in a Bayesian frame, any semantic targets should be assignments of values (e.g. ) to random variables in an agent’s model. (Note that this includes functions of random variables in the agent’s model, and data structures of random variables in the agent’s model.)
… but the set of possible semantic targets for words must be exponentially smaller than the full set of possible assignments of values to random variables.

I commented about why I disagree with the first bullet point here, and the second bullet point here.

A shorter-term intermediate question is: what even is the input set (i.e. domain) and output set (i.e. range) of the semantics box? Its inputs are natural language, but what about the outputs?

I'm not claiming that our mental representations of semantic content "are" linguistic, or that they "come from" language. I'm just saying that we can use language to represent them.

The central challenge of ML interpretability is to faithfully and robustly translate the internal concepts of neural nets into human concepts (or vice versa). But today, we don’t have a precise understanding of what “human concepts” are. Semantics gives us an angle on that question: it’s centrally about what kind of mental content (i.e. concepts) can be interoperable (i.e. translatable) across minds.

It seems to me like there's an important omission here: we also don't understand what we really want to point at whet we say "the internal concepts of neural nets".

One might say that understanding "human concepts" is more the central difficulty here, because the human concepts are what we're trying to translate into.

In this post, we’ll ignore Gricean implicature; our agents just take everything literally. Justification for ignoring it: first, the cluster-based model in this post is nowhere near the level of sophistication where lack of Gricean implicature is the biggest problem. Second, when it does come time to handle Gricean implicature, we do not expect that the high-level framework used here - i.e. Bayesian agents, isomorphism between latents - will have any fundamental trouble with it.

A naive reader may think that "ignoring Gricean implicature" means pretending that it doesn't exist; to be more precise: pretending that the semantics and pragmatics of an utterance are equal.

I suppose I didn't draw out the critical implication I'm trying to point to:

If you buy my argument that, far from ignoring semantics vs pragmatics, your way of framing the problem relies critically on the distinction...

...then you should be more curious about what is going on with the distinction, rather than writing it off as a less important detail to be figured out later.

What seems hard is, how do we point out only the semantic content, when in conversation we always need to think about the full pragmatics?

It seems like it has something to do with always-valid inferences vs context-sensitive inferences, for one thing.

46

Towards a Less Bullshit Model of Semantics

46

But Why Though?

Overview

What’s The Problem?

Central Problem 1: How To Model The Magic Box?

Subproblem: What’s The Range Of Box Outputs?

SubSubProblem: What Can Children Attach Words To?

Summary So Far

Central Problem 2: How Do Alice and Bob “Agree” On Semantics?

Summary: Interoperable Semantics

First (Toy) Model: Clustering + Naturality

Equivalence Via Naturality

A Quick Empirical Check

Strengths and Shortcomings of This Toy Model

Aside: What Does “Grounding In Spacetime Locality” Mean?

Second (Toy) Model Sketch: Rigid Body Objects

The Teacup

Geometry and Trajectory Clusters

Strengths and Shortcomings of This Toy Model

Summary and Call To Action

Call To Action