• Consider a human concept such as "tree." Humans implement some algorithm for determining whether given objects are trees. We expect our predictor/language model to develop a model of this algorithm because this is useful for predicting the behavior of humans. 
  • This is not the same thing as some kind of platonic ideal concept of what is “actually” a tree, which the model is not incentivized to develop by training on internet text,[1] and trying to retarget the search at it has the same supervision problems as RLHF against human scores on whether things look like trees.
  • Pointing at this “actually a tree” concept inside the network is really hard; the ability of LMs to comprehend natural language does not allow one to point using natural language, because it just passes the buck.

Epistemic status: written fast instead of not at all, probably partially deeply confused and/or unoriginal. Thanks to Collin Burns, Nora Belrose, and Garett Baker for conversations.

Will NNs learn human abstractions?

As setup, let's consider an ELK predictor (the thing that predicts future camera frames). There are facts about the world that we don't understand that are in some way useful for predicting the future observations. This is why we can expect the predictor to learn facts that are superhuman (in that if you tried to supervised-train a model to predict those facts, you would be unable to generate the ground truth data yourself).

Now let's imagine the environment we're predicting consists of a human who can (to take a concrete example) look at things and try to determine if they're trees or not. This human implements some algorithm for taking various sensory inputs and outputting a tree/not tree classification. If the human does this a lot, it will probably become useful to have an abstraction that corresponds to the output of this algorithm. Crucially, this algorithm can be fooled by, e.g., a fake tree that the human can't distinguish from a real tree because (say) they don't understand biology well enough or something.

However, the human can also be said to, in some sense, be "trying" to point to the "actual" tree. Let's try to pin this down. The human has some process they endorse for refining their understanding of what is a tree / "doing science" in ELK parlance; for example, spending time studying from a biology textbook. We can think about the limit of this process. There are a few problems: it may not converge, or may converge to something that doesn't correspond to what is "actually" a tree, or may take a really long time (due to irrationalities, or inherent limitations of human intelligence, etc). This suggests that this concept is not necessarily even well defined. But even if it is, this thing is far less naturally useful for predicting future human behaviour than the algorithm the human actually implements! Implementing the actual human algorithm directly lets you predict things like how humans will behave when they look at things that look like trees to them.

More generally, one possible superhuman AI configuration I can imagine is one where the bulk of the circuits are used to form its best guess about what will happen in the world. There may also be a set of circuits that operate in a more humanlike ontology used specifically for predicting humans, or it may be that the best-guess circuits are capable enough that this is not necessary (and if we scale up our reporter we eventually get a human simulator inside the reporter).

The optimistic case here is if the "actually a tree" abstraction happens to be a thing that is useful for (or is very easily mapped from) the weird alien ontology, possibly because some abstractions are more universal. In this case, for sufficiently powerful supervision it may just become cheaper to map from the best-guess ontology. For what it's worth, a sizeable chunk of my remaining optimism lies here. However, one thing that makes this hard is that the more capable the model (especially in the superhuman regime), the further I expect the weird alien ontology to diverge from the human ontology.

A closely related claim: a long time ago I was optimistic that, because LMs are trained on lots of human data, they would learn to understand human values, and then we would just have to point to that ("retargeting the search"). The problem here is just like in the tree example: the actual algorithm that people implement (and therefore the thing that is most useful for predicting the training data) is not actually the thing we want to maximize, because that algorithm is foolable. In my mental model of how RLHF fails, this maps directly onto supervision failures (think of the hand-in-front-of-the-ball example), one of the two major classes of RLHF failures (the other being deceptive alignment failures).

It feels intuitive that the "correct" thing should be simple

I think there's an intuitive frame in which the direct translator nonetheless feels simple to specify: it feels very intuitive that "the actual thing" should in some sense be a very simple thing to point at. I think it's plausible that in some Kolmogorov complexity sense the direct translator might actually not be that complex. An excellent argument made by Collin Burns, in the specific case of identifying the "actual truth" as opposed to "what the human thinks is true" inside a given model, argues for the existence of a specification using a relatively small number of bits, constructed from answers to superhuman questions (like whether the Riemann hypothesis is true, etc.). The core idea is that even though we don't actually know the answers, if we accept that we could in theory identify the direct translator with a list of such answers, then that bounds the complexity.
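A toy sketch of the shape of this argument, under loud assumptions: the two candidate reporters, the three questions, and every name below are hypothetical and are not taken from Burns's argument itself. It only illustrates that if a list of answers picks out one candidate, the length of that list bounds how many bits the selection needs.

```python
# Toy world (all values hypothetical): what is actually true versus what the
# human would believe, for three yes/no questions about one object.
TRUTH = {"is_real_tree": False, "has_leaves": True, "is_tall": True}
HUMAN_BELIEF = {"is_real_tree": True, "has_leaves": True, "is_tall": True}


def direct_translator(question: str) -> bool:
    """Reports what is actually the case in the toy world."""
    return TRUTH[question]


def human_simulator(question: str) -> bool:
    """Reports what the human would conclude in the toy world."""
    return HUMAN_BELIEF[question]


CANDIDATES = {
    "direct_translator": direct_translator,
    "human_simulator": human_simulator,
}


def select_reporters(questions, answers):
    """Return the names of candidates that agree with every listed answer."""
    return [
        name
        for name, reporter in CANDIDATES.items()
        if all(reporter(q) == a for q, a in zip(questions, answers))
    ]


# Questions the human answers correctly do not separate the candidates.
easy = ["has_leaves", "is_tall"]
both = select_reporters(easy, [TRUTH[q] for q in easy])

# A single answer (one bit) to a question the human gets wrong does.
hard = ["is_real_tree"]
only_direct = select_reporters(hard, [TRUTH[q] for q in hard])
```

The catch is the same as in the prose: nobody can actually write down the answer list for the hard questions, so this shows why such a list would bound the description length, not how to obtain it.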

I think the solution to ELK, if it exists at all, probably has low Kolmogorov complexity, but only in a trivial sense stemming from how counterintuitive Kolmogorov complexity is. An extremely difficult but ultimately solvable problem is only as complex as its problem statement, but this can conceal tremendous amounts of computation, and the solution could be very complicated if expressed as, say, a neural network. Thus, in practice, under the neural network prior it may be extremely hard to identify the direct translator.
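One way to make the "only as complex as the problem statement" point precise, as a sketch under assumptions that are not stated in the post: suppose a problem statement $p$ comes with a computable check $V(p, s)$ that accepts exactly the valid solutions, at least one solution exists, and $s^*$ is the first accepted candidate in some fixed enumeration order. A constant-size program that enumerates candidates and runs $V$ then recovers $s^*$ from $p$, so:

```latex
K(s^*) \le K(p) + O(1)
```

The bound says nothing about how long that enumeration runs, which is where the tremendous amounts of computation can hide. It also says nothing about the size of $s^*$ when it is written out in some other form, such as network weights.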

Another angle on this: suppose you could get a model with a very different ontology to try to tell you "like, the actual thing" (somehow avoiding the problem that il n'y a pas de hors-texte (see this comment and the section in the post it responds to), which I think you'd run into long before any of this). It could be that the model literally does not have the ability to give you the "actual" information. In a more grounded ML case, if the direct translator is quite complex and training never encountered a case where the direct translator was needed, this circuitry would just never get learned, and then the bottleneck is no longer whether the model realizes that you're asking for the direct translator. Maybe once your NN is sufficiently powerful to solve ELK on the fly this problem will no longer exist, but we're probably long dead at that point.

A way to intuitively think about this is if you have a room with a human and a computer running a molecular level simulation to determine future camera observations, it doesn’t help much if the human understands that you’re asking for the “actual” thing; that human still has to do the difficult work of comprehending the molecular simulation and pulling out the “actual” thing going on. 

Some other reasons to expect weird abstractions in LMs

  • Humans don’t actually think in words; we have some inscrutable process that generates words that we don’t have full introspective access to. This is extra true for people who don’t even have an inner monologue. This isn’t really cause for optimism, though, because the space of inscrutable complicated somewhat-incoherent things that output text is really big and weakly constrained, and two things from this space almost certainly don’t generalize in the same way (even different humans typically have very different cognition!).
  • Predicting tokenized internet text is actually a really weird non-human task, and correspondingly, humans are really bad at next-token prediction. Because you need to model the distribution of all text-generating processes, you have to model the exact proportion of all the possible semantically equivalent responses in various contexts. Lots of this data is log files or other text data of non-human origin. In the specific case of BPE-tokenized data, you have to deal with a bizarre distorted universe where “ tree”, “tree”, “ Tree”, and “Tree” are fully ontologically distinct objects (tokens 21048, 5509, 27660, and 12200 respectively), but the various distinct meanings of “tree” have to share the same tokens, not to mention the infamous “ SolidGoldMagikarp”. None of this prevents LMs from being insanely good, of course, because they don’t have to think like humans.
  • Anecdotally, most neurons in LMs seem like gibberish, and even the ones that initially look interpretable oftentimes become more mysterious again when you look more carefully at them. I think it's plausible that something like superposition explains most of this and that we'll get a handle on it eventually, so this isn't very strong evidence.
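To make the tokenization point from the list above concrete, here is a minimal sketch using a hand-written toy vocabulary, not a real BPE tokenizer. The IDs are the ones quoted in the post, paired with spellings in the order the post lists them. That pairing is an assumption and is not verified against any actual tokenizer, and the helper name is hypothetical.

```python
# Toy vocabulary. The IDs are copied from the post and paired in the post's
# listed order; they are NOT checked against a real tokenizer here.
TOY_VOCAB = {" tree": 21048, "tree": 5509, " Tree": 27660, "Tree": 12200}


def last_word_token_id(text: str) -> int:
    """Look up the final word of `text`, keeping its leading space if any."""
    surface = text[text.rfind(" "):] if " " in text else text
    return TOY_VOCAB[surface]


# Four spellings that a human reads as one word map to four unrelated IDs.
spelling_ids = [TOY_VOCAB[s] for s in (" tree", "tree", " Tree", "Tree")]

# Two different meanings share one ID, because the surface form is identical.
plant_id = last_word_token_id("an oak tree")
data_structure_id = last_word_token_id("a binary tree")
```

In this toy setup the model sees four unrelated symbols where a human sees one word, and one symbol where a human sees two meanings, so any human-like notion of "tree" has to be reconstructed from context instead of being read off the input.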
[1] or images, or video, etc.; none of my arguments are text-specific.

2 comments

Nice post! I need to think about this more, but:

(1) Maybe if what we are aiming for is honesty & corrigibility to help us build a successor system, it's OK that the NN will learn concepts like the actual algorithm humans implement rather than some idealized version of that algorithm after much reflection and science. If we aren't optimizing super hard, maybe that works well enough? 

(2) Suppose we do just build an agentic AGI that's trying to maximize 'human values' (not the ideal thing, the actual algorithm thing) and initially it is about human level intelligence. Insofar as it's going to inevitably go off the rails as it learns and grows and self-improves, and end up with something very far from the ideal thing, couldn't you say the same about humans--over time a human society would also drift into something very far from ideal? If not, why? Is the idea that it's kinda like a random walk in both cases, but we define the ideal as whatever place the humans would end up at?

re:1, yeah that seems plausible. I'm thinking in the limit of really superhuman systems here, and specifically pushing back against a claim that these human abstractions being somehow inside a superhuman AI is sufficient for things to go well.

re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn't endorse. More broadly, the thing I'm focusing on in this post is not really about drift over time or self improvement; in the setup I'm describing, the thing that goes wrong is it does the classical "fill the universe with pictures of smiling humans" kind of outer alignment failure case (or worse yet, the more likely outcome of trying to build an agentic AGI is we fail to retarget the search and end up with one that actually cares about microscopic squiggles, and then it does the deceptive alignment using those helpful human concepts it has lying around).