Epistemic status: written fast instead of not at all, probably partially deeply confused and/or unoriginal. Thanks to Collin Burns, Nora Belrose, and Garett Baker for conversations.
As setup, let's consider an ELK predictor (the thing that predicts future camera frames). There are facts about the world that we don't understand that are in some way useful for predicting the future observations. This is why we can expect the predictor to learn facts that are superhuman (in that if you tried to supervised-train a model to predict those facts, you would be unable to generate the ground truth data yourself).
Now let's imagine the environment we're predicting contains a human who can (to take a concrete example) look at things and try to determine whether they're trees. This human implements some algorithm for taking various sensory inputs and outputting a tree/not-tree classification. If the human does this a lot, it will probably become useful for the predictor to have an abstraction that corresponds to the output of this algorithm. Crucially, this algorithm can be fooled by, e.g., a fake tree that the human can't distinguish from a real tree because (say) they don't understand biology well enough.
However, the human can also be said to, in some sense, be "trying" to point to the "actual" tree. Let's try to pin this down. The human has some process they endorse for refining their understanding of what is a tree ("doing science" in ELK parlance); for example, spending time studying a biology textbook. We can think about the limit of this process. There are a few problems: it may not converge, it may converge to something that doesn't correspond to what is "actually" a tree, or it may take a really long time (due to irrationalities, inherent limitations of human intelligence, etc). This suggests that the concept is not necessarily even well defined. But even if it is, it is far less naturally useful for predicting future human behaviour than the algorithm the human actually implements! Implementing the actual human algorithm directly lets you predict things like how humans will behave when they look at things that look like trees to them.
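As a toy illustration of the last point, here is a minimal sketch in which a predictor that models the human's foolable algorithm predicts human reports better than one that tracks what is "actually" a tree. Everything in it (the `Thing` class, the two labeling functions, and the four-item `WORLD`) is a hypothetical stand-in invented for illustration, not part of any ELK formalism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    looks_like_tree: bool  # what the human is able to observe
    is_tree: bool          # hidden ground truth the human cannot check


def human_label(thing: Thing) -> bool:
    """Hypothetical stand-in for the algorithm the human actually implements."""
    return thing.looks_like_tree


def actual_label(thing: Thing) -> bool:
    """Hypothetical stand-in for the 'actually a tree' concept."""
    return thing.is_tree


def agreement(predict, target, things) -> float:
    """Fraction of things on which `predict` matches `target`."""
    hits = sum(predict(t) == target(t) for t in things)
    return hits / len(things)


# A tiny made-up world containing one convincing fake tree.
WORLD = [
    Thing(looks_like_tree=True, is_tree=True),    # real tree
    Thing(looks_like_tree=True, is_tree=True),    # real tree
    Thing(looks_like_tree=True, is_tree=False),   # fake tree that fools the human
    Thing(looks_like_tree=False, is_tree=False),  # rock
]

# The prediction target is what the human will *say*, not what is true.
human_algo_score = agreement(human_label, human_label, WORLD)
actual_tree_score = agreement(actual_label, human_label, WORLD)

print(human_algo_score, actual_tree_score)
```

In this toy world the "actual tree" feature loses exactly on the case we care about most: the one where the human is fooled.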
More generally, one possible superhuman AI configuration I can imagine is one where the bulk of the circuits are used to predict its best-guess for what will happen in the world. There may also be a set of circuits that operate in a more humanlike ontology used specifically for predicting humans, or it may be that the best-guess circuits are capable enough that this is not necessary (and if we scale up our reporter we eventually get a human simulator inside the reporter).
The optimistic case here is if the "actually a tree" abstraction happens to be a thing that is useful for (or is very easily mapped from) the weird alien ontology, possibly because some abstractions are more universal. In this case, for sufficiently powerful supervision it may just become cheaper to map from the best-guess ontology. For what it's worth, a sizeable chunk of my remaining optimism lies here. However, one thing that makes this hard is that the more capable the model (especially in the superhuman regime), the further I expect the weird alien ontology to diverge from the human ontology.
A closely related claim: a long time ago I used to be optimistic that because LMs are trained on lots of human data, they would learn to understand human values, and then we would just have to point to that ("retargeting the search"). The problem here is just like in the tree example: the actual algorithm that people implement (and therefore the thing that is most useful for predicting the training data) is not actually the thing we want to maximize, because that algorithm is foolable. This maps directly onto supervision failures (think of the hand-in-front-of-the-ball example), which are one of the two major classes of RLHF failures in my mental model (the other being deceptive alignment failures).
I think there's nonetheless a frame in which the direct translator feels very simple to specify. It feels very intuitive that "the actual thing" should in some sense be a very simple thing to point at. I think it's plausible that in some Kolmogorov complexity sense the direct translator might actually not be that complex. An excellent argument made by Collin Burns, in the specific case of identifying the "actual truth" as opposed to "what the human thinks is true" inside a given model, argues for the existence of a specification using a relatively small number of bits, constructed from answers to superhuman questions (like whether the Riemann hypothesis is true, etc). The core idea is that even though we don't actually know the answers, if we accept that we could identify the direct translator in theory with a list of such answers, then that bounds the complexity.
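One way to write down the shape of this bound, as a rough sketch rather than a faithful statement of Collin's argument: all notation here is my own assumption. $M$ is the trained predictor, $q_1, \dots, q_n$ are the superhuman yes/no questions, $a_1, \dots, a_n$ are their true answers, and $R^{*}$ is the direct translator. The sketch also takes as given the premise above, that these answers suffice to single out $R^{*}$ among candidate reporters.

```latex
% Assumed notation (not from the post): M = predictor, q_i = questions,
% a_i = true answers (one bit each), R^* = direct translator.
% A fixed search procedure "return the reporter for M whose answers to
% q_1..q_n equal a_1..a_n" only needs the n answer bits as input, so
K\left(R^{*} \mid M, q_1, \dots, q_n\right) \;\le\; n + O(\log n)
```

The $O(\log n)$ term covers encoding $n$ and the constant-size search procedure. Note that the bound says nothing about how much computation that search takes.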
I think the solution to ELK, if it exists at all, probably has low Kolmogorov complexity, but only in a trivial sense that stems from how counterintuitive Kolmogorov complexity is. The solution to an extremely difficult but ultimately solvable problem is only as complex as the problem statement, but this can conceal tremendous amounts of computation, and the solution could be very complicated if expressed as, say, a neural network. Thus, in practice, under the neural network prior it is possibly extremely hard to identify the direct translator.
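To make the "short statement, lots of hidden computation" point concrete, here is a minimal sketch using an unrelated toy problem (brute-force search over SHA-256 digests). It is only an analogy and has nothing to do with ELK itself; `PREFIX`, `digest`, and `solve` are names made up for this example.

```python
import hashlib

# The whole problem statement: "the smallest non-negative integer n such that
# SHA-256 of its decimal string starts with PREFIX". PREFIX is an arbitrary
# difficulty knob chosen for this sketch; longer prefixes mean more work.
PREFIX = "000"


def digest(n: int) -> str:
    """Hex SHA-256 digest of the decimal representation of n."""
    return hashlib.sha256(str(n).encode()).hexdigest()


def solve(prefix: str) -> tuple[int, int]:
    """Return (smallest n whose digest starts with prefix, hashes evaluated)."""
    n = 0
    while not digest(n).startswith(prefix):
        n += 1
    return n, n + 1  # candidates 0..n were each hashed once


solution, evaluations = solve(PREFIX)
print(f"solution={solution}, found after {evaluations} hash evaluations")
```

The description length of the answer is bounded by the length of this short program, yet each extra hex character in `PREFIX` multiplies the expected search cost by about 16 without making the program meaningfully longer.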
Another angle on this: suppose you could get a model to try to tell you "like, the actual thing" (somehow avoiding the problem that il n'y a pas de hors-texte (see this comment and the section in the post it responds to), which I think you'd run into long before any of this). If the model has a very different ontology, it could just be that it literally does not have the ability to give you the "actual" information. In a more grounded ML case, if the direct translator is quite complex and training never encountered a case where the direct translator was needed, this circuitry would just never get learned. At that point, whether the model realizes that you're asking for the direct translator is no longer the reason it can't tell you. Maybe once your NN is sufficiently powerful to solve ELK on the fly this problem will no longer exist, but we're probably long dead at that point.
A way to intuitively think about this is if you have a room with a human and a computer running a molecular level simulation to determine future camera observations, it doesn’t help much if the human understands that you’re asking for the “actual” thing; that human still has to do the difficult work of comprehending the molecular simulation and pulling out the “actual” thing going on.
Footnote (on LMs and human data): or images, or video, etc; none of my arguments are text-specific.
Nice post! I need to think about this more, but:

(1) Maybe if what we are aiming for is honesty & corrigibility to help us build a successor system, it's OK that the NN will learn concepts like the actual algorithm humans implement rather than some idealized version of that algorithm after much reflection and science. If we aren't optimizing super hard, maybe that works well enough?

(2) Suppose we do just build an agentic AGI that's trying to maximize 'human values' (not the ideal thing, the actual algorithm thing) and initially it is about human level intelligence. Insofar as it's going to inevitably go off the rails as it learns and grows and self-improves, and end up with something very far from the ideal thing, couldn't you say the same about humans--over time a human society would also drift into something very far from ideal? If not, why? Is the idea that it's kinda like a random walk in both cases, but we define the ideal as whatever place the humans would end up at?
re:1, yeah that seems plausible. I'm thinking in the limit of really superhuman systems here, and specifically pushing back against the claim that these human abstractions being somehow inside a superhuman AI is sufficient for things to go well.
re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn't endorse. More broadly, the thing I'm focusing on in this post is not really about drift over time or self-improvement. In the setup I'm describing, the thing that goes wrong is the classical "fill the universe with pictures of smiling humans" kind of outer alignment failure (or worse yet, the more likely outcome of trying to build an agentic AGI: we fail to retarget the search and end up with one that actually cares about microscopic squiggles, and then it does deceptive alignment using those helpful human concepts it has lying around).