For ELK truth is mostly a distraction

Epistemic Status: Pretty confident in the central conclusions, and very confident in the supporting claims from meta-logic. Any low confidence conclusions are presented as such. NB: I give an intentionally revisionary reading of what ELK is (or should be) about. Accordingly, I assume familiarity with the ELK report. Summary here.

Executive Summary

Eliciting Latent Knowledge (ELK) collapses into either the automation of science or the automation of mechanistic interpretability. I promote the latter.

Abstract

After reframing ELK from the perspective of a logician, I highlight the problem of cheap model-theoretic truth: by default reporters will simply learn (or search for) interpretations of the predictor’s net that make the teacher’s answers “true” in the model-theoretic sense, whether or not they are True (correspond with reality)! This will be a problem, even if we manage to avoid human simulators and are guaranteed an honest translator.

The problem boils down to finding a way to force the base optimizer (e.g. gradient descent) to pay attention to the structure of the predictor’s net, instead of simply treating it like putty. I argue that trying to get the base optimizer to care about the True state of affairs in the vault is not a solution to this problem, but instead the expression of a completely different problem – something like automating science. Arguably, this is not the problem we should be focused on, especially if we’re just trying to solve intent alignment. Instead I tentatively propose the following solution: train the reporter on mechanistic interpretability experts, in the hope that it internalizes and generalizes their techniques. I expand this proposal by suggesting we interpret in parallel with training, availing ourselves of the history of a predictor’s net in order to identify and track the birth of each term in its ontology. The over-arching hope here is that if we manage to fully interpret the predictor at an earlier stage in its development, we can then maintain that transparency as it develops into something much more complex.

To finish off, I zoom out, outlining three different levels of ELK-like projects, each building on the last, each more difficult than the last.

Terminology

I need to address some namespace conflicts. I use the noun "model" in three different senses: for logical models, for ML models, and in the colloquial English sense of “a smaller, compressed representation of something.” Context should be enough to keep matters clear, and I try to flag when I switch meanings. It’s a similar story with “truth”: several different senses are in play, so please watch for cues. I apologize in advance for any confusion.

I should also note that, because I’m not very familiar with Bayes nets, I’ll instead be (sloppily) talking as if the predictor is a Neural Net (NN) that somehow “thinks” in neat propositions. Basically, I know NNs work on fuzzy logic, and I’m then idealizing them as working on classical bivalent logic, for simplicity. I don’t think any of my arguments turn on this idealization, so it should be permissible.

Finally, some shorthand:

Translator ≝ ML model that operates roughly as follows:

Take a question from humans as input
Generate candidate answers using NL processing or something
Using only a mapping which takes terms of these candidate answers to referents in the predictor’s net, generate a probability distribution over the candidate answers. This probability distribution is understood as a distribution of truth values in a fuzzy logic. (Again though, for simplicity I mostly pretend we’re working with bivalent logic).
Output the answer(s) with the highest truth-value.
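As a minimal sketch of the last two steps, under assumptions made purely for illustration: every candidate answer is an atomic sentence, the predictor's state is a table of node activations in [0, 1], and the mapping is a plain lookup from sentences to nodes. The names `translate`, `dictionary` and `state`, and the node labels, are hypothetical.

```python
from typing import Dict, List


def translate(candidates: List[str],
              dictionary: Dict[str, str],
              state: Dict[str, float]) -> List[str]:
    """Return the candidate answer(s) with the highest truth value.

    The truth value of a candidate is read off the predictor node that the
    dictionary assigns to it; nothing else is consulted.
    """
    scores = {c: state[dictionary[c]] for c in candidates}
    best = max(scores.values())
    return [c for c in candidates if scores[c] == best]


# Hypothetical predictor state and dictionary.
state = {"n17": 0.9, "n42": 0.1}
dictionary = {
    "The diamond is in the vault": "n17",
    "The diamond is gone": "n42",
}
print(translate(list(dictionary), dictionary, state))
```

Note that nothing in this procedure checks whether `dictionary` is the intended mapping; it only reads values through it.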

Honest translator ≝ translator that doesn’t know better: it never manipulates truth values of sentences “given” to it by the predictor. To sharpen this in your mind, consider the following dishonest translator: a translator which correctly translates sentences from the predictor’s net into some “objective” third-person perspective and then mistranslates back into NL, manipulating truth values to match the truth values the teacher would assign.

Human simulator ≝ receives questions and generates candidate answers the same as a translator, but generates its probability distribution over candidate answers by relying on a model of the teacher’s inference procedure.

1. The problem: interpretation, not truth

1.1 Naïve translators

Suppose we have a magically absolute guarantee that our training algorithm will produce honest translators. That’s not enough to solve ELK: by default, our training algorithm will prefer honest translators which interpret the predictor’s net in ways that always make the teacher’s answers true! Since the only source of feedback is the human teacher, it will effectively place complete trust in the human and interpret the predictor accordingly. Unlike in NL translation, where feedback comes from both sides (in the form of pairs of pre-translated phrases), there is no pushback from the predictor saying “no, that’s not what I was thinking” or “yep, that pretty much sums it up”. This is a problem. In what follows, I elaborate on this in various ways. Assume throughout that rewarding honest translation is perfectly encoded into the loss function we use for training: we are always dealing with honest translators.

1.2 ELK, from a logical point of view

Because it’s the best framework I am familiar enough with to make my specific critiques, I’m going to reframe ELK from a logician’s perspective. This involves some idealization but, as far as I can tell, nothing that impacts the bottom line. A more complicated story could be told that better respects the details of real-world NNs but it would follow the same broad strokes to the same conclusions.

We can, for all intents and purposes, think of the base optimizer (e.g. Gradient Descent (GD)) as searching for a logical model of a specific set of sentences $Γ$ (the teacher’s answers) in a language $L$ (e.g. a formalized English), where the predictor’s net can be thought of as the domain[1] and the translator produced is a collection of functions which map the constants, predicates and functions of $L$ onto the domain (a dictionary of sorts). In simpler terms: GD is looking for an interpretation of the predictor’s net which makes the teacher’s answers true. But for a given predictor there will likely exist many models of $Γ$! Only one of these models is the intended one (the one which maps “diamond” onto the predictor’s representation of diamonds and not onto something else entirely). The problem of ontological identification – figuring out which model is the intended one – is immediately in view.

(Notice: under this framing, it becomes clear that you will not in general solve ontological identification by chasing after truth – or at least, not model-theoretic truth. More on this later).

Back to models though: besides the intended one, there will likely be a great number of “scrambled” models (e.g. a model which maps “diamond” onto the predictor’s representation of eggs). These can be thought of as wild permutations of the intended model, exploiting whatever isomorphisms[2] and spurious correlations are present in the domain. When one exists, GD will ditch the intended model for a scrambled model that makes more of the teacher’s sentences true. And intuitively, the existence of such models becomes more likely as the predictor gets bigger relative to $Γ$ .
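A toy illustration of how cheap such models are, under a deliberately crude propositional simplification: the teacher's answers are atomic sentences with truth values, the "domain" is a handful of boolean predictor nodes, and an interpretation is an injective assignment of sentences to nodes. The function name `models` and all node labels are hypothetical.

```python
from itertools import permutations
from typing import Dict, Iterator


def models(teacher_answers: Dict[str, bool],
           predictor_state: Dict[str, bool]) -> Iterator[Dict[str, str]]:
    """Yield every injective sentence-to-node interpretation under which
    all of the teacher's answers come out true (in the model-theoretic sense)."""
    sentences = list(teacher_answers)
    for nodes in permutations(predictor_state, len(sentences)):
        interpretation = dict(zip(sentences, nodes))
        if all(predictor_state[interpretation[s]] == teacher_answers[s]
               for s in sentences):
            yield interpretation


teacher_answers = {"diamond in vault": True, "alarm triggered": False}
predictor_state = {
    "n_diamond": True,       # intended referent of "diamond in vault"
    "n_alarm": False,        # intended referent of "alarm triggered"
    "n_egg": True,           # unrelated node that happens to be on
    "n_window_open": False,  # unrelated node that happens to be off
}
for m in models(teacher_answers, predictor_state):
    print(m)
```

Four interpretations satisfy the teacher's two answers here, and only one of them is the intended one; the training signal alone cannot tell them apart.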

OK, but this is what more training (adding sentences into $Γ$ ) and regularizers are for, right? Penalize complexity to incentivize it to generalize[3]. Spurious correlations eventually collapse after all, and though isomorphisms may not, extra data (sentences in $Γ$ ) will probe just how exact the isomorphism is, likely cracking it under enough scrutiny. (Few things in physical reality are perfectly isomorphic).

Unfortunately, honest translation which respects truth under intended interpretation is not the only generalizing solution. Not only do others exist, but many will be preferred by default, again, for according more tightly with the teacher’s responses.

1.3 Second-hand simulators and Gödelizers

To begin, if the predictor contains a model of human cognition, the translator could just latch on to that, relying on the predictor’s model of the teacher’s inference. Let’s call this hybrid of honest translator and human simulator (it meets both my definitions) a second-hand simulator.

OK, but this doesn’t seem impossible to get around. Not all predictors will model humans. (Though it would be nice if our solution to ELK could handle such predictors as well). And maybe penalizing translators which rely heavily on small subsets of the predictor’s net would be enough. (Counterexample 1: the predictor’s model of humans is non-local. Counterexample 2: the translator uselessly processes a bunch of the predictor’s net in such a way that has no effect on the final answer it spits out.) I’ll admit, the problem of second-hand simulators is one we should probably make a note of, but shouldn’t consider fundamental. Let’s put it aside.

My over-arching concern is best highlighted by another generalizing solution though. For clarity, I’ll break the solution into two parts: first, like a human simulator, model the teacher’s inference procedure to determine what they would say; second, unlike a simulator, procedurally generate (or search[4] for) an interpretation of the predictor’s net which makes that answer true (again, in the model-theoretic sense of “true”). We know this second step is possible: Gödel showed us how to do it in his completeness theorem for first-order logic. Since then, logicians have regularly cooked up algorithms that generate models for a given set of sentences. The base optimizer is essentially playing the role of the logician here, searching for an algorithm that suits its needs – only instead of availing itself of the natural numbers (or whatever mathematical objects logicians like), the base optimizer avails itself of the predictor’s parameters. This solution drives home my point: by default, the base optimizer will treat the predictor’s net like a soup of empty symbols, ready to be reinterpreted and cooked up into whatever is wanted! The predictor’s net won’t get any better respect just because the base optimizer has its gaze trained on truth! At least, not if it’s mere model-theoretic truth: that type of truth is cheap. Which is exactly why the base optimizer will go for it by default.

Call this class of procedural translators Gödelizers. Gödelizers sound a lot like human simulators[5]. Indeed, the way I described them makes them sound like a contrived problem: why would they even bother to generate a dictionary after simulating the teacher’s inference? But as a matter of fact, I’m not sure simulating the teacher’s inference would be necessary: it might be possible to learn how to procedurally generate models for sentences[6] from a specific set of sentences (parts of which are hidden) without explicitly learning how to generate that set of sentences. Something worth researching.
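The two-part recipe (simulate the teacher, then find an interpretation that makes the simulated answer true) can be sketched in the same toy propositional setting. This is only an illustration under stated assumptions: `godelize`, `simulate_teacher` and the node labels are hypothetical, and a real Gödelizer would be a learned procedure rather than a hand-written loop.

```python
from typing import Callable, Dict, Optional, Tuple


def godelize(question: str,
             simulate_teacher: Callable[[str], bool],
             predictor_state: Dict[str, bool]) -> Tuple[bool, Optional[str]]:
    """Answer as the teacher would, then pick any predictor node whose value
    makes that answer 'true'. Returns (answer, chosen node or None)."""
    answer = simulate_teacher(question)            # part 1: model the teacher
    for node, value in predictor_state.items():    # part 2: search for a witness
        if value == answer:
            return answer, node
    return answer, None


def teacher(question: str) -> bool:
    # Hypothetical teacher who always believes the diamond is safe.
    return True


question = "Is the diamond in the vault?"
print(godelize(question, teacher, {"n_diamond": True, "n_egg": False}))
print(godelize(question, teacher, {"n_diamond": False, "n_egg": True}))
```

In the second call the predictor's diamond node has flipped, yet the reported answer is unchanged: the interpretation simply moves to whichever node fits.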

Anyway, that kind of misses the point, which is just this: honest translators, which merely preserve model-theoretic truth, just ain’t what we want! They won’t solve ontology identification, and by default they won’t be any better than simulators. Again, what is wanted is an honest translator that respects model-theoretic truth under the intended interpretation. (Logicians love this phrase, but go quiet when asked how intended interpretations are determined. I suppose that’s for psychologists, linguists and philosophers of language to explain! More on them soon).

2. Some ways forward?

2.1 Use an enriched understanding of truth

Could we solve ontology identification and avoid human simulators if we somehow get the base optimizer to care about Truth capital T – the “ground-truth”? This hope is what seems to underlie the string of strategies from the ELK report. It is very much in line with Christiano’s “build a better teacher” approach to the alignment problem. But while it may be better than model-theoretic truth, I don’t think Truth is the most useful target either. Here are three reasons why.

First, “Truth” still eludes rigorous formalization (something “model-theoretic truth” has going for it) and despite their efforts I don’t see philosophers making a breakthrough any time soon. Targeting something we have a better handle on might increase our odds of success: so long as we chase this ideal, it seems we’ll be stuck throwing a bunch of proxies at the problem and hoping something sticks.

Second, chasing Truth dramatically increases the alignment tax. Let “Truth reporter” stand for reporters which try to report what the True state of affairs will be in the vault (the system we care about). Getting the base optimizer to build a Truth reporter essentially involves doing a lot of science (see ELK report). That seems like a lot of work! Didn’t we just want a translator?

Which brings me to the third and most important reason. Truth reporters are, in a sense, as far off the mark as human simulators. I think the most sensible, tractable version of ELK aims to build translators that try to answer our questions according to what the predictor believes true – whether those beliefs are in fact True or not! Call such translators “interpreters”. Interpreters seem necessary and possibly sufficient for de dicto intent alignment; Truth reporters just seem like overkill [7]. Therefore, we neither want a reporter that says whatever the average teacher would say about the vault state, nor what the greatest team of amplified scientists would say. Insofar as a Truth reporter and the predictor disagree about the Truth, a Truth reporter will not help us discover what the predictor is thinking: that makes it a bad interpreter.

Of course a Truth reporter would be very useful – but if that’s what we wanted, why all the fuss with the predictor’s net and translation? I have a feeling this is why the authors of the ELK report find themselves back at “square one” after exploring their string of strategies: because they essentially ditched trying to translate the predictor’s net and end up trying to build a Truth reporter.

Furthermore, as the authors point out, reaping the benefits of a Truth reporter will eventually require an interpreter, as much as the predictor did: in either case, we will need to solve ontology identification. Again, to reframe things as a logician, we can think of ontology identification as the challenge of:

Blind Translation (BT): finding the intended semantics (a.k.a. structure) for a language $L$ , where the domain and all terms of $L$ are given. (Additionally, the syntax of $L$ is given).

Note how, prima facie, this does not involve truth in any shape or form – satisfaction[8], model-theoretic truth, or Truth, the ground-truth state of affairs in the vault[9]. This is why I say truth is mostly a distraction, and why going straight for a Truth reporter is putting the cart before the horse. This is also why logicians will tell you BT is none of their business!

2.2 Don’t treat the predictor like a black box

In a way, we shouldn’t be surprised that, by default, the base optimizer (along with the reporter it produces) treats the predictor’s parameters like a soup of empty symbols – like a black box. After all, that’s effectively what the human teacher does in nearly every ELK strategy proposed so far[10]! GD see, GD do.

This suggests a very different sort of proposal: train the reporter on mechanistic interpretability experts and hope it generalizes their techniques and abilities. Interpretability experts can be understood as experts at voicing the intended interpretation of any arbitrary chunk of the predictor’s net: exactly what we want our reporter to be good at. This tackles BT head on.

Details of the training process: the reporter is asked questions that are always about specified chunks of the predictor’s net, and its answers are evaluated against what the mechanistic interpretability expert would say after applying all their techniques and tools to that specified chunk. By training the reporter on chunks large and small (and applying some regularizers) we might reasonably hope the reporter will generalize, internalizing the techniques and tools the experts use to interpret any arbitrary chunk of the predictor’s net. The questions we really want answered are about the “conclusions” the predictor has come to regarding the things we care about, but the idea is to work our way up to these questions, Olah style.

Obvious limitation 1: mechanistic interpretability probably has a long way to go before this proposal becomes viable. I say: delay agentic HLMI and bring on the Microscope AI revolution!

Obvious limitation 2: the reporter will mimic systematic human error (i.e. error not modelled by a little noise), replicating misinterpretations along with correct interpretations of the predictor’s parameters and associated neurons. But now at least we’re closer to what we want: the reporter is at least paying attention to the structure of the predictor, where before it had no reason not to treat it like playdough.

2.3 Track the history of the predictor’s terms[11]

What follows is meant to build on the proposal above by helping reduce (possibly to zero[12]) that replicated error. In that sense, these suggestions can be read as an extension of mechanistic interpretability techniques (if they aren’t already used). I’m not sure how useful these techniques would be for humans, since they seem time- and cost-prohibitive to execute fully by hand. Teaching them to a reporter which then carries them much further seems doable, however.

You may have noticed mechanistic interpretability researchers are not alone: anthropologists and archaeologists also tackle BT. (Think trying to interpret Egyptian hieroglyphs before the discovery of the Rosetta stone.) Those interested in solving ontology identification might learn something from them. No doubt linguists and psychologists who study language development also have insights. Unfortunately I am none of these! I am familiar with philosophers of language however. While they don’t engage with BT directly, they do engage with adjacent issues.

There is a common view in philosophy that the meaning of an expression is given simply by its truth-conditions (read “Truth-conditions” capital T). Formal semantics studies how exactly those truth-conditions should be put together for given expressions, but assumes we already know, by and large, the intended meaning of those expressions. Formalization is the name of the game – translating natural language into a formal system we can prove things with and about. For that reason, formal semantics is largely unhelpful here.

More helpful are “foundational theories of meaning, which are attempts to specify the facts in virtue of which expressions of natural languages come to have the semantic properties that they have.” They ask questions such as: what causal, historical, sociological or psychological facts determine and sustain the fact that “Neptune” refers to Neptune?

There are a variety of such theories, each with different insights.[13] I’ll focus on one insight that I think has been overlooked here: sometimes the history of a term is critical. Take the example of names (just one type of term). The name “Neptune” refers to Neptune at least in part because it was so “baptized” in 1846, and the label has been successfully transmitted to us through an unbroken causal chain of speakers teaching it to other speakers. In general, the reference of a name looks underdetermined by its functional role in our model of the world (see footnote 2). Additionally, the manner and context of those baptisms seems to shape the meaning of the names in question, beyond just fixing their reference. Consider how much more informative “The Evening Star is identical to the Morning Star” is than “Venus is identical to Venus.” And this, despite all of these names referring to the same planet! Arguably, it’s the difference between how “Evening Star” and “Morning Star” were introduced that explains their difference in meaning, and thus explains why it is informative to be told these two names name the same thing.

In short, solving BT is undoubtedly made much harder if we limit ourselves to looking only at the current state of the language in question: we should avail ourselves of the history of the language. If we can identify and track the birth and baptism of every new term in the predictor, in parallel with the predictor’s training, we’ll have a much better chance at solving BT. One way to do this might be to structure the training of the predictor, and factor that structure into our interpretations. If, for example, we know the predictor is next going to be trained to recognize cats, we can expect to catch the birth of a cat recognition circuit. Catching the birth of specific Gabor filters will of course be a bit harder, but I hope the principle is at least clear. Once a birth has been noticed, hopefully we can track it: for all subsequent interpretations of the parameter/circuit in question, past interpretations should be factored in.

Details of the training process: train a reporter on interpretability experts as before but now its inputs include, for a given chunk C of the predictor (here presumed to be a neural net):

(as before) the current state (values of all parameters) of C
(as before) the matrices of neuron activations in C paired with the inputs to C that yield each matrix.
the previous state(s) of C (along with activation information?) paired with the reporter’s previous interpretations of C.
the current, previous and upcoming data in the training dataset (where the dataset has been structured)

As before, the reporter is asked questions and its answers are evaluated against those of interpretability experts.
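To make the shape of these inputs concrete, here is one possible way to bundle them, as a sketch only. All names (`ChunkState`, `ReporterInput`, `advance`, `curriculum`, `phase`) are hypothetical, and representing the structured dataset as an ordered tuple of phase labels is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[float]]


@dataclass(frozen=True)
class ChunkState:
    """Snapshot of chunk C at one training step."""
    step: int
    parameters: Sequence[float]
    # (input to C, resulting activation matrix) pairs
    activations: Sequence[Tuple[Sequence[float], Matrix]]


@dataclass(frozen=True)
class ReporterInput:
    question: str
    current: ChunkState
    # earlier states of C, each paired with the reporter's interpretation of it
    history: Tuple[Tuple[ChunkState, str], ...]
    curriculum: Tuple[str, ...]  # structured dataset, as ordered phase labels
    phase: int                   # index of the phase now being trained on

    @property
    def previous_data(self) -> Tuple[str, ...]:
        return self.curriculum[:self.phase]

    @property
    def current_data(self) -> str:
        return self.curriculum[self.phase]

    @property
    def upcoming_data(self) -> Tuple[str, ...]:
        return self.curriculum[self.phase + 1:]


def advance(prev: ReporterInput, interpretation: str, new_state: ChunkState,
            question: str, phase: int) -> ReporterInput:
    """Build the next input, carrying the last interpretation into the history."""
    return ReporterInput(
        question=question,
        current=new_state,
        history=prev.history + ((prev.current, interpretation),),
        curriculum=prev.curriculum,
        phase=phase,
    )
```

The point of `advance` is that each interpretation becomes part of the evidence for the next one, so the chain of "baptisms" is never dropped.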

Finally, interpreting in parallel with training might come with another benefit: we might minimize our error rate by making the predictor fully transparent when it’s easier to do so,[14] and then by tracking the development of the predictor well enough to maintain that transparency as the predictor grows into something much more complex – something which we would struggle to make fully transparent if we only approached it in its final state. It’s possible that the predictor’s ontology will go through Kuhnian paradigm shifts,[15] but hopefully no one shift is so total or so discontinuous that the chain of interpretability completely breaks. Special techniques might also be developed to handle such shifts?

3. Three Grades of ELK

Let’s zoom out. I’ve been arguing that mere model-theoretic truth is too cheap, and Truth is too expensive (and overkill): if we want to talk about truth at all, we should be talking about model-theoretic truth under intended interpretation. For now, the goal should be to develop techniques for finding the intended model(s) of the predictor’s “beliefs,” where the domain of the model is a natural language. This requires solving BT, i.e. a form of ontology identification. From the intended semantics we get out of BT, we can build the intended models. If we manage all that, we’ll have solved what I’m calling eliciting latent beliefs (ELB).

3.1 ELB: Eliciting Latent Beliefs

As I see it, ELB is the first of three grades of ELK. Solving ELB means successfully building an interpreter (a reporter that is trying to tell us what the predictor believes to be true). ELB is the easiest grade in that:

It has no Truth guarantee: the answers to our questions about the vault will simply reflect what the predictor believes. Where the predictor is Wrong, we’ll get False answers.
It has no “completeness” guarantee (for lack of a better term): we will only get answers to the questions we currently know how to ask – i.e. answers that can currently be formulated in a language we understand. That means we will likely miss some of the predictor’s beliefs or gloss over their details – the beliefs and details that are inaccessible to us, that we can’t grasp for a lack of conceptual resources. To put it another way: mere ELB interpreters are lossy. The bigger and more alien the predictor becomes, the bigger a problem this becomes.

3.2 ELI: Eliciting Latent Information

ELI is the next grade. Solving ELI means successfully building what I’ll call a “complete interpreter:” a reporter that is trying to tell us all of what the predictor believes, in all its detail. ELI still has no Truth guarantee, but it does come with a completeness guarantee. A complete interpreter is lossless, or at the very least, flags answers that are lossy. This bears elaborating.

Suppose instead of a diamond in the vault, it’s a lump of jade. Furthermore, suppose it’s 1763, a century before we discovered “jade” actually refers to two different minerals, jadeite and nephrite: these words were not in our vocabulary, and we could not distinguish these minerals. Unbeknownst to us, our lump of jade is in fact a lump of jadeite. Finally, suppose lumps of jadeite are what we really care about – unbeknownst to us, that’s what is critical for human flourishing, not nephrite. (You could understand this as, “jadeite will be valued over nephrite after the long reflection” if you prefer).

Imagine now that the operations under consideration by the predictor involve the chemical transformation of our jadeite into nephrite: the predictor’s ontology distinguishes the two minerals, and is aware of the transformation. When we ask our simple ELB interpreter “Is the lump of jade still in the vault?” we get the lossy answer “Yes.” Note how, to our 1763 selves, the question will not be ambiguous. But we clearly have a problem.

What would the complete interpreter do? First, unlike the ELB interpreter, it would somehow surpass human ontology, and grasp the extra nuance in the predictor’s understanding of what is going on. But once it has hold of this extra-human information, what is it supposed to do with it? Ideally it would have the ability to teach us how to grasp it – help us expand our vocabulary and our understanding of the world – and then give us the information. But barring that, I think the next best thing would be the following feature:

Beyond Your Ken Identification (BYKI): the ability to identify and alert humans when something has happened which is beyond their current best science, beyond their current conceptual resources. In such scenarios, the only human-understandable answers to their questions will be lossy, requiring the extra-human information about the predictor’s beliefs be dropped.

An interpreter equipped with BYKI would answer the question “Is the lump of jade still in the vault?” with “Yes, but this answer glosses over details beyond your ken.”

The process followed by a BYKI equipped complete interpreter would look something like this:

1. For every candidate answer to the input question, evaluate its truth value under all possible refinements of the human ontology (maybe weighted by the likelihood we would make such a refinement). These refinements are carried as far as required to capture all the relevant[16] information about the predictor’s beliefs (this is what makes it a complete interpreter). Create a list of these truth-values for each candidate.
2. For every candidate, check the variance of its truth-value list.
   a. If the variance is low, all is fine: no BYK alert.
   b. If the variance is high, that means the truth value of the candidate depends strongly on how the human ontology gets refined. Break and set Errorlevel to “BYK.”
3. If Errorlevel = “BYK,” output the BYK alert + the answer(s) with the highest truth value under the unrefined human ontology.
4. If Errorlevel = 0, average each truth value list and assign the result to the corresponding candidate. Output the answer(s) with the highest truth values.
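The process above can be sketched as follows. This is a minimal illustration, not a proposal for how a reporter would actually compute it: the refinements are taken as a given finite list (enumerating "all possible refinements" is the hard part and is not addressed), the likelihood weighting is omitted, and the variance threshold is an arbitrary placeholder. The names `byki_report`, `truth_value` and `threshold` are hypothetical.

```python
from statistics import mean, pvariance
from typing import Callable, List, Sequence, Tuple


def byki_report(candidates: Sequence[str],
                refinements: Sequence[str],
                unrefined: str,
                truth_value: Callable[[str, str], float],
                threshold: float = 0.05) -> Tuple[bool, List[str]]:
    """Return (byk_alert, best answers).

    truth_value(candidate, ontology) gives the candidate's truth value in
    [0, 1] when the question is read under the given ontology.
    """
    # One truth-value list per candidate, across refinements.
    lists = {c: [truth_value(c, r) for r in refinements] for c in candidates}
    # High variance for any candidate raises the alert.
    byk = any(pvariance(values) > threshold for values in lists.values())
    if byk:
        # Fall back to the unrefined human ontology.
        scores = {c: truth_value(c, unrefined) for c in candidates}
    else:
        # Average each list and rank candidates by the result.
        scores = {c: mean(values) for c, values in lists.items()}
    best = max(scores.values())
    return byk, [c for c in candidates if scores[c] == best]


# The jade example: "Is the lump of jade still in the vault?"
# after the jadeite has been transformed into nephrite.
table = {
    ("Yes", "jade"): 1.0, ("No", "jade"): 0.0,
    ("Yes", "jade=jadeite"): 0.0, ("No", "jade=jadeite"): 1.0,
    ("Yes", "jade=nephrite"): 1.0, ("No", "jade=nephrite"): 0.0,
}
print(byki_report(["Yes", "No"], ["jade=jadeite", "jade=nephrite"], "jade",
                  lambda c, o: table[(c, o)]))
```

In the jade example the truth value of “Yes” swings between 0 and 1 depending on the refinement, so the alert fires and the answer under the unrefined ontology (“Yes”) is returned alongside it.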

Perhaps this behaviour could be taught by training reporters on mechanistic interpretability experts who are pretending to answer questions for people with a less fine-grained ontology, such as people from the past (e.g. from 1763).

A deeper concern about BYKI equipped interpreters: how much will they bring down the alignment tax relative to IRL on expert humans? As the predictor gets more and more powerful, the interpreter will answer more and more questions with “beyond your ken” – precisely where the predictor goes beyond human expertise. What are we supposed to do with the knowledge of our ignorance? Blindly trusting the predictor amounts to throwing out oversight: not smart. If instead we constrain the predictor to only operate the vault in ways we understand, won’t the once superhuman AI operator just collapse into the group of human expert operators? I fear so.

Something to beware! We cannot simply implement automated ontology identification – let the complete interpreter extrapolate our own scientific trajectory or otherwise project how we would refine our ontology – and leave it at that. Such an interpreter answers questions under our projected refinement, which means we, our 1763 selves, would lose hold of the intended interpretation of the interpreter’s answers. The interpreter may continue to answer with words orthographically identical to our 1760s English words, but it won’t quite be speaking 1760s English anymore. Instead, it will be speaking some “evolved” English whose intended interpretation will escape us. The problem is not just (as Miles et al. point out) that the refinement of our ontology is itself a value-laden activity, but that we would no longer be able to understand the interpreter meant to help us interpret the predictor! (Worse: it might sound understandable despite actually talking past us!)

For a somewhat contrived example, suppose instead of a diamond it’s a bottle of moonshine we’re trying to protect, but we don’t distinguish it from water in this era: we call it a “bottle of water.” Imagine the predictor foresees a robber swapping it for a bottle of H₂O. When we ask this auto-refining interpreter “Is a bottle of water still in the vault?” it will misleadingly answer “Yes.” Despite the answer being True, we have a problem: the answer will be misunderstood.

3.3 ELK, strictly understood

The third and most difficult grade is ELK proper, supplementing ELB with a completeness and a Truth guarantee[17]. Solving this means building a complete interpreter as well as a fact checker which verifies that the predictor’s beliefs are True. A Truth reporter could play this fact checker role.

How we’re meant to pull this off is beyond me: we’re talking about automating science as far as I can tell. As I argued earlier, I think we’re getting ahead of ourselves if we go straight for this grade of ELK. Besides, it seems to add a lot of undue complexity, especially if the goal is intent alignment. I see intent alignment as tractable and cost-effective precisely because (as far as I can tell) it doesn’t require us to fact check the predictor’s predictions. Finally, this strict ELK seems to lead down a slippery slope: if the worry is “Who is fact checking the predictor’s predictions?” then the next worry is of course “Who is fact checking the fact checker?” Infinite regress looms. At some point we have to stop.

All in all, I do lament the K in “ELK”: I think it distracts us from the more immediately useful and tractable problem of making the predictor’s beliefs transparent to us. I know that knowledge is sometimes just treated as a proficiency at prediction in certain domains, but that ends up glossing over the distinction between model-theoretic truth, Truth, and the happy medium, truth under intended interpretation. Taking the perspective of an epistemologist, things are a bit clearer but only further demote knowledge. The standard formula for knowledge is: justified True belief, plus some extra conditions tacked on to handle Gettier cases. But for our purposes, what do we really care about in that formula? We’re not investigating how exactly the predictor’s “beliefs” are justified, we don’t really care whether the predictor strictly uses beliefs or some other representational device, we don’t care whether its “beliefs” are True, and we don’t care whether the predictor’s “beliefs” are resistant to Gettier cases. Wasn’t the prize of this project just to learn what the predictor thinks is going to happen, in all its detail?

[1] For clarity’s sake, I will continue to treat the predictor’s net as the domain, and the words used by humans as comprising the language, but in reality this is a somewhat arbitrary distinction that doesn’t make a deep functional difference. We could just as well (if less naturally) talk of GD searching for sets of sentences inside the predictor’s net for which the teacher’s answers form a model.

I should also note that the relationship here between the domain and $Γ$ is a little more complicated than usual: certain subsets of $Γ$ are restricted to certain subsets of the domain. This corresponds to certain sets of the teacher’s answers only applying to certain states of the predictor’s net: each subset of answers corresponds to answers regarding one run of the predictor, one proposed vault operation and its accompanying video feed. So in reality, GD is looking for many models, one for each of these subsets, each relying on a slightly different domain. However, I don’t think this complication affects my overall points.

[2] I think isomorphisms are actually another reason why Demski’s proposals 1.1 and 2.1 would have trouble (see here for clarifications on 2.1). The causal structure of a model alone cannot tell you what it is a model of, since the model of one set of phenomena might be structurally identical to the model of something else entirely. Sometimes mere labels are all that tell them apart. Take the labels off a model of our solar system, and you might confuse it for a model of the atom! Underdetermination will plague translation methods that rely too heavily on merely correlating causal structures in models (without, e.g., paying attention to the history of a model’s development; see §2.3).

[3] I’d like to point out that these scrambled dictionaries would already be somewhat general, and thus resistant to new training data (though probably not regularizers). To see why, I need to explain something about logical models. For any consistent set of sentences $Γ$, a model of $Γ$ is also a model of $Γ'$, where $Γ'$ includes $Γ$ as well as any sentences logically entailed by $Γ$ (entailed in virtue of nothing other than logical operators, such as “and,” “or,” “not,” etc.). Therefore, these scrambled dictionaries would generalize to all hidden sentences that are logically entailed by the known sentences. With enough NL processing, you also might get scrambled dictionaries that generalize to all hidden sentences which are analytically entailed by the known sentences (roughly, entailed in virtue of the meaning of the terms in the sentences).
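The closure fact in footnote 3 can be checked by brute force in a toy propositional setting. The sketch below is purely illustrative and makes several assumptions not found in the post: three hypothetical atoms `p`, `q`, `r`, a two-sentence $Γ$, and helper names (`models_of`, `entails`) invented for the example. It shows that adding an entailed "hidden" sentence to $Γ$ leaves the set of models unchanged, while a logically independent hidden sentence is not guaranteed to hold.

```python
from itertools import product

# Hypothetical atoms for a toy propositional language (an assumption for illustration).
ATOMS = ("p", "q", "r")

def valuations():
    """Yield every truth assignment over ATOMS as a dict."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

def is_model(valuation, sentences):
    """A valuation is a model of a set of sentences if it satisfies all of them."""
    return all(sentence(valuation) for sentence in sentences)

def models_of(sentences):
    """List every valuation that is a model of the given sentences."""
    return [v for v in valuations() if is_model(v, sentences)]

def entails(gamma, sentence):
    """Gamma entails a sentence iff every model of Gamma satisfies that sentence."""
    return all(sentence(v) for v in models_of(gamma))

# Known sentences (Gamma): "p" and "p implies q".
gamma = [lambda v: v["p"], lambda v: (not v["p"]) or v["q"]]

# Hidden sentences, never part of Gamma.
entailed_hidden = lambda v: v["q"]     # logically entailed by Gamma
independent_hidden = lambda v: v["r"]  # logically independent of Gamma

# Gamma' = Gamma plus an entailed sentence has exactly the same models as Gamma.
gamma_prime = gamma + [entailed_hidden]
same_models = models_of(gamma) == models_of(gamma_prime)

print("models of Gamma:", len(models_of(gamma)))
print("Gamma and Gamma' share all models:", same_models)
print("entailed hidden sentence holds in every model:", entails(gamma, entailed_hidden))
print("independent hidden sentence holds in every model:", entails(gamma, independent_hidden))
```

In this toy case, any interpretation that makes the known sentences true automatically makes `q` true as well, which is the sense in which a scrambled dictionary would "generalize" for free; nothing comparable holds for `r`.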

[4] Making it a mesa-optimizer.

[5] Like simulators, they would be initially susceptible to strategies which penalize reporters that work with many different predictors. The counterexample is the same too: GD will just sabotage the generality of the dictionary generating algorithm, leaving it intact only for the one predictor humans choose to have it work on.

[6] The task here is even easier than what logicians tackle: by default, these Gödelizers won’t even care about using the same model for every sentence in the set. Perhaps penalizing computation time would make it care about generating “high-value” models that can be reused many times (maybe for all questions regarding one state of the predictor’s net), but it will be difficult to make it care about generating only one model for all sentences.

[7] As far as I can tell, knowing what an agent is thinking is all we need to ensure the agent is trying to do what it thinks we want it to do. One caveat: I think all the agent’s thoughts need to be transparent for this to work, which means only some types of interpreters will be sufficient to solve intent alignment. See §3.2 for more.

[8] This is meta-logic’s name for what you might call sub-sentential truth.

[9] Obviously, BT does involve Truth in the broad sense: only one semantics will Truly be the intended one. But every question trivially involves Truth in this sense.

[10] Demski’s proposal 1.2 is an interesting exception. It tries to get the base optimizer to care about the structure of the predictor’s net by making the reporter also give plausible answers to questions about counterfactual scenarios, scenarios resulting from interventions made to the predictor’s net that are known to the teacher but hidden from the reporter. Sure enough, the teacher here is treating the predictor’s net less like a black box, making targeted interventions on it. See here for more discussion.

[11] My apologies for any confusion: from here on I treat the predictor’s net as the language, and the totality of English words as the domain, since I think that’s more perspicuous here. See footnote 1.

[12] I think it’s possible all error (besides user error, which we can model with a little noise) is induced by the limitations of our interpretability techniques. That is, it might be possible to break the process down into small enough steps that each has a guarantee of success when implemented by a competent human. If so, then the challenge is finding a way to compound those steps while preserving interpretability.

[13] For the curious, current mechanistic interpretability seems to operate on a theory of meaning based on behavioral regularities. That’s the “mechanistic” part in mechanistic interpretability.

[14] Maybe at an early-ish stage and maybe before all neurons have been implemented, assuming we purposefully stagger the implementation of layers and neurons during training.

[15] This concern is related to what Christiano et al. call “discrete modes of prediction.”

[16] This is not an innocuous term: relevancy will be a tricky question. To idealize things here, we can imagine the complete interpreter treats the entirety of the predictor’s model as being relevant to every question. Excessive no doubt, but undoubtedly complete.

[17] One could also combine ELB with simply a Truth guarantee, but this doesn’t seem to describe what the authors of the ELK report end up going in for.
