A rough and incomplete review of some of John Wentworth's research

27Oliver Habryka

34Nate Soares

24johnswentworth


Perhaps I've simply been misreading John, and he's been intending to say "I have some beliefs, and separately I have some suggestive technical results, and they feel kinda related to me! Which is not to say that any onlooker is supposed to be able to read the technical results and then be persuaded of any of my claims; but it feels promising and exciting to me!".

For what it's worth, I ask John about once every month or two about his research progress and his answer has so far been (paraphrased) "I think I am making progress. I don't think I have anything to show you that would definitely convince you of my progress, which is fine because this is a preparadigmatic field. I could give you some high-level summaries or we could try to dive into the math, though I don't think I have anything super robust in the math so far, though I do think I have interesting approaches."

You might have had a totally different experience, but I've definitely had the epistemic state so far that John's math was in the "trying to find remotely reasonable definitions with tenuous connection of formalism to reality" stage, and not the "I have actually demonstrated robust connection of math to reality" stage, so I feel very non-misled by John. A good chunk of this impression comes from random short social interactions I've had with John, so someone who engaged more with just his online writing might come away with a different impression (though I've also done that a lot and don't super feel like John has ever tried to sell me in his writing on having super robust math to back things up).

John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results *even so.*

Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem are a relatively-shallow consequence of the overly-strong assumption of .

My impression was that I had to go digging into the theorems to see what they said, only to be disappointed by how little resemblance they bore to what I'd heard John imply. (And it sounds to me like Lawrence, Leon, and Erik had a similar experience, although I might be misreading them on account of confirmation bias or w/e.)

I acknowledge that it's tricky to draw a line between "someone has math that they think teaches them something, and is inarticulate about exactly what it teaches" and "someone has math that they don't understand and are overselling". The sort of observation that would push me towards the former end in John's case is stuff like: John being able to gesture more convincingly at ways concepts like "tree" or "window" are related to his conserved-property math even in messy finite cases. I acknowledge that this isn't a super legible distinction and that that's annoying.

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

Note that I continue to think John's cool for pursuing this particular research direction, and I'd enjoy seeing his math further fleshed out (and with more awareness on John's part of its current limitations). I think there might be interesting results down this path.

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

In hindsight, I do think the period when our discussions took place was a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.

That said, I doubt that fully accounts for the difference in perception.

This is going to be a half-assed review of John Wentworth's research. I studied his work last year, and was kinda hoping to write up a better review, but am lowering my standards on account of how that wasn't happening.

Short version: I've been unimpressed by John's technical ideas related to the natural abstractions hypothesis. He seems to me to have some fine intuitions, and to possess various laudable properties such as a vision for solving the whole dang problem and the ability to consider that everybody else is missing something obvious. That said, I've found his technical ideas to be oversold and underwhelming whenever I look closely.

(By my lights, Lawrence Chan, Leon Lang, and Erik Jenner’s recent post on natural abstractions is overall better than this post, being more thorough and putting a finger more precisely on various fishy parts of John's math. I'm publishing this draft anyway because my post adds a few points that I think are also useful (especially in the section “The Dream”).)

To cite a specific example of a technical claim of John's that does not seem to me to hold up under scrutiny:

John has previously claimed that markets are a better model of intelligence than agents, because while collective agents don't have preference cycles, they're willing to pass up certain gains.

For example, if an alien tries to sell a basket "Alice loses $1, Bob gains $3", then the market will refuse (because Alice will refuse); and if the alien then switches to selling "Alice gains $3, Bob loses $1" then the market will refuse (because Bob will refuse); but now a certain gain has been passed over.

This argument seems straightforwardly wrong to me, as summarized in a stylized dialogue I wrote (that includes more details about the point). If Alice and Bob are sufficiently capable reasoners then they take both trades and even things out using a side channel. (And even if they don't have a side channel, there are positive-EV contracts they can enter into in advance before they know who will be favored. And if they reason using LDT, they ofc don't need to sign contracts in advance.)

(Aside: A bunch of the difficult labor in evaluating technical claims is in the part where you take a high-falutin' abstract thing like "markets are a better model of intelligence than agents" and pound on it until you get a specific minimal example like "neither of the alien's baskets is accepted by a market consisting of two people named Alice and Bob", at which point the error becomes clear. I haven't seen anybody else do that sort of distillation with John's claims. It seems to me that our community has a dearth of this kind of distillation work. If you're eager to do alignment work, don't know how to help, and think you can do some of this sort of distillation, I recommend trying. MATS might be able to help out.)
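The Alice-and-Bob example is small enough to check mechanically. The following is a minimal sketch, not anything taken from John's posts or from the dialogue: the `Basket` type, the function names, and the even split of the surplus are all hypothetical choices made for illustration. It contrasts a "veto" market (each participant can block a basket that costs them anything) with participants who can compensate each other through a side channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Basket:
    """A trade offered by the alien (hypothetical type for this sketch)."""
    alice: int  # change in Alice's wealth, in dollars
    bob: int    # change in Bob's wealth, in dollars


def veto_market_accepts(basket: Basket) -> bool:
    """The market as described in the argument: anyone who loses refuses."""
    return basket.alice >= 0 and basket.bob >= 0


def settle_with_side_payment(basket: Basket):
    """Accept any basket with positive total surplus, then share the surplus.

    Returns (alice_net, bob_net) after the side payment, or None if the
    basket is declined. Splitting the surplus evenly is an assumption;
    any split that leaves both parties ahead would do.
    """
    surplus = basket.alice + basket.bob
    if surplus <= 0:
        return None
    alice_net = surplus // 2
    return (alice_net, surplus - alice_net)


first = Basket(alice=-1, bob=3)
second = Basket(alice=3, bob=-1)

for basket in (first, second):
    print(basket, veto_market_accepts(basket), settle_with_side_payment(basket))
```

Under these assumptions the veto market declines both baskets, while the side-payment version accepts each one and leaves both Alice and Bob strictly ahead every time, which is the gap the argument above points at.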

I pointed this out to John, and (to John's credit) he seemed to update (in realtime, which is rare) ((albeit with a caveat that communicating the point took a while, and didn't transmit the first few times that I tried to say it abstractly before having done the distillation labor)). The dialogue I wrote recounting that convo is probably not an entirely unfair summary (John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment).

My impression of John's other technical claims about natural abstractions is that they have similar issues. That said, I don't have nearly so crisp a distillation of John's views on natural abstractions, nor nearly so short a refutation. I spent a significant amount of time looking into John’s relevant views (we had overlapping travel plans and conspired to share a cross-country flight together, and a comedy of airline mishaps extended that journey to ~36 hours, during which I just would not stop pestering him) ((we also spent a decent amount of time going back-and-forth online)), and got a few whiffs of things that didn't sit right with me, although I was not quite able to get a sufficiently comprehensive understanding of the technical details to complete the distillation process and pin down my concerns.

This review has languished in my backlog long enough that I have now given up on completing that comprehend-digest-distill process, and so what follows is some poorly-organized and undigested reasons why I'm unpersuaded and unimpressed by John's technical work on natural abstractions. (John: sorry for not having something more distilled. Or more timely.)

## The Dream

(wherein Nate attempts to motivate natural abstractions in his own words)

Suppose we have a box containing an ideal gas (that evolves according to classical mechanics, to keep things simple). Suppose we've numbered the particles, and we're tasked with predicting the velocity of particle #57 after some long time interval. Suppose we understand the initial conditions up to some finite degree of precision. In theory, if our knowledge of the initial conditions is precise enough and if we have enough computing power, then we can predict the velocity of particle #57 quite precisely. However, if we lack precision or computing power, the best we can do is probably a Maxwell-Boltzmann distribution, subject to the constraint that the expected energy (of any given particle) is the average energy (per-particle in the initial conditions).
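For concreteness, here is roughly what that fallback prediction looks like in the standard textbook setting. This is a sketch assuming a monatomic ideal gas in three dimensions; the symbols $m$ (particle mass), $k_B$ (Boltzmann's constant), $T$ (temperature), $E$ (total energy), and $N$ (number of particles) are notation introduced here, not taken from the post.

```latex
f(\mathbf{v}) = \left(\frac{m}{2\pi k_B T}\right)^{3/2}
  \exp\!\left(-\frac{m\,\lVert \mathbf{v} \rVert^{2}}{2 k_B T}\right),
\qquad
\frac{3}{2}\, k_B T = \frac{E}{N}.
```

The initial conditions enter only through the single number $E/N$, which fixes $T$; every other detail of the starting configuration has dropped out of the prediction for particle #57.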

This is interesting, because it suggests a sharp divide between the sorts of predictions that are accessible to a superintelligence, and the predictions that are accessible to an omniscience. Someone with *enough* knowledge of the initial conditions and with *enough* compute to simulate the entire history exactly can get the right answer, whereas *everyone* with substantially less power than that, be they superintelligences or humans, is relegated to a Maxwell-Boltzmann distribution subject to an energy constraint.

(And in real life, not even the gods can know the initial conditions in sufficient detail, because there's no such thing as "the" initial conditions; we're in a quantum multiverse.)

So, in some sense, "energy" (or something equivalent) is an abstraction that even superintelligences must use in some form, if they want to predict the velocity of particle #57 after a really long time. It's not like you get a little smarter and then notice the existence of "double-energy" which lets you predict the velocity even better than we can. There's a gradient of how well you can predict the velocity as you fumble around understanding how physics works, and then there's the Maxwell-Boltzmann prediction that you make once you understand what the heck is going on, and then there's a vast barren plateau from here to "perfect simulation", in which even the superintelligences can do no better.

In the simple case of an ideal gas evolving classically, we can probably prove some theorems corresponding to this claim. I haven't seen theorems written from exactly this point of view, but if you're technically inclined you can probably prove something like "time-evolution is ergodic within regions of phase-space of constant energy", or *cough* *cough* chaos *cough* so the only facts that are practically predictable in thermodynamic equilibrium correspond directly to conservation laws. Or something.

This is relevant to our interests, because we sure would like a better understanding of when and where the abstractions of humans and the abstractions of superintelligences overlap. "Convergently useful" low-level abstractions could help us with ontology identification; mastery of convergently-useful abstractions could help us manufacture circumstances that make the AI converge on humane abstractions; etc.

The wild hope, here, is that all human concepts have some nature kinda like "energy" has in the simplest toy model of statistical mechanics. Like, obviously "trees" and "windows" are much more complicated concepts than "energy"; obviously there isn't going to be quite so crisp a notion of "the best concepts you can use short of simulating the whole system". But, like, various distributions from statistical mechanics turn out to be empirically useful *even though* our universe isn't in thermodynamic equilibrium yet, and so there's some hope that these "idealized" or "convergently instrumentally useful" concepts degrade cleanly into practical real-world concepts like "trees" and "windows". Which are hopefully so convergently instrumentally useful that the AIs also use them.

And, by understanding the circumstances that give rise to convergently-useful abstractions, and by understanding the extent of their reach, we might gain the ability to recognize them inside an AI's mind (rendering it much less alien to us), and/or to distinguish the concepts we care about from nominally nearby ones, and/or to shape the AI's learning such that its abstractions are particularly human-recognizable.

That's the dream of natural abstractions, as I understand it.^{[1]}

I think this is a fine dream. It’s a dream I developed independently at MIRI a number of years ago, in interaction with others. A big reason why I slogged through a review of John's work is because he seemed to be attempting to pursue a pathway that appeals to me personally, and I had some hope that he would be able to go farther than I could have.

John's research plans are broader than just this one dream, as I understand it; I'm going to focus on this one anyway because I think it's a dream that John and I share, and it's the one that I poked at in the past.

When I spent time thinking about this topic, my main hope was that an improved understanding might allow us to shape abstractions in alien minds. I credit John with the additional observation that we might use an improved understanding to *recognize* abstractions in alien minds.

## Natural Abstractions

The following is an idea in pursuit of the above dream that I attribute to John (with my distillation and word choice):

A followup idea that I attribute to John (with my partial distillation):

To which I say: That sounds like an interesting idea! I don't understand it, and I'm not sure that it makes sense: for instance, trying to define the abstraction "window" as "that which you can deduce from other windows if you forget everything about one window" seems circular, and I'm not sure whether the circularity is fatal. And I have only a vague understanding of how this is supposed to generalize the case with energy. But if I squint, I can see how maybe it could be fleshed out into something interesting.

At which point John says something like "well today's your lucky day; I have math!". But, unfortunately, I wasn't able to make sense of the math, despite trying.

### The Math

John claims a proof-sketch of the following:

(He had another few versions, allegedly with fuller proofs, though I was not able to understand them and focused on this one.)

And... I'm not really up for rehashing the whole discussion we had here. But the short version is that I found a counterexample where (σ1,σ2,...) is constant, and John was like "oh, no, the (σ)s have to be non-repeating", and I was like "wait so this theorem only works if we have a literal infinitude of variables?" and he was like "yes".

And then I constructed another (infinite) example where the mutual information (MI) in the limit was not the limit of the mutual information in the finite approximations, and I was like “???”. And John was like "you got the sign wrong, but yes, the MI in the limit is not the limit of the MIs in each finite approximation." And I was like "Then how is this supposed to tell me anything about windows?? There are only finitely many windows!"^{[4]}

### My Concerns

I don't particularly doubt that John's theorem is true. My issue is that, as far as I've been able to figure out, it works *only* in the case where we have infinitely many independent random variables. I do not know a form that behaves nicely in finite approximations, and I was not able to extract one from John (despite trying).

(This despite John handwaving past a variety of technical subtleties, on the grounds that he's aiming for the virtues of a physicist rather than a mathematician. I'm all for using frameworks that are making sharp empirical predictions regardless of whether we've sorted out the theoretical technicalities. But this isn't an issue where the mathematician is saying "hey wait, do those integrals actually commute?" and the physicist is saying "probably!". This is a case where the math *only works in the infinite case* and *is not shedding light* here in a world with only finitely many windows.)

(Like, physicists sometimes write down equations that work well in finite approximations, and ignore the mathematicians as they complain that their series blows up when taken to infinity. If you're doing that, I have no objection. But John's doing the opposite! His theorem works in the infinite case, and doesn't say anything interesting about any finite case! John claiming the virtues of a physicist in his technical work, and then having stuff like this come up when I look closer, feels to me like a microcosm of my overall impression. I found it quite frustrating.)

This shortcoming strikes me as fatal: I already know how to identify abstractions like "energy" in extremely well-behaved cases; I'm trying to understand how we weaken our demands while keeping a working notion of "natural abstraction". Understanding this theorem *might* teach me something about how to shift from demanding perfect conservation and ergodicity/chaos/whatever outside of that one conservation law, which would be nice. But it doesn't even *have anything to say* in the finite case. Which means it has nothing to say about windows, of which there are finitely many.

Worse, my ability to understand the theorem is hindered by my inability to construct finite examples.

The math could perhaps be repaired to yield reasonable finite approximations, and those might well teach us something interesting. I didn't manage to get that far. I tried to understand the theorem, and tried a little to repair it to work in finite approximations. John and I disagreed about what sort of methods were likely to lead to repair. I gave up.

To be clear, it seems pretty plausible to me that something like this theorem can be repaired. As I said at the beginning, I think that John has some pretty good intuitions, and is trying to go in some reasonable directions. I'm just disappointed with his actual results, and think that he's often doing shoddy technical work, drawing wrong conclusions from it, and then building off of them enthusiastically.

And, to be clear: I'm interested in seeing a repaired theorem! It's plausible to me that a repaired version has something to teach me about how to identify convergently-useful abstractions. I don't have quite enough hope/vision about it to have done it myself in the past 11 months, but it's *on* my list of things to do, and I'd love a solution.^{[5]}

## The Generalized Koopman-Pitman-Darmois theorem

One reason that I didn't have enough hope/vision myself to attempt to repair John's mutual-information theorem is because I didn't really see the connection from that theorem back to the dream (even if I imagined it making sense in worlds with only finitely many windows, which I don't yet see how to do). Like, OK, sure, perhaps we can take any physical system and ask which properties are kinda-sorta "conserved" in the sense that that information is repeated many times in the larger world. Suppose I grant that. What then? Where are we going?

I tried probing John on this point, and largely wasn't able to make sense of his English-language utterances. But he seemed to be pursuing a dream that I also have, and claimed to have even more math! "You're asking about the generalized Koopman-Pitman-Darmois theorem!", he said.

And, sure, I'm always up for looking at your allegedly-related math. Especially when your English descriptions aren't making sense to me. So I took a look!

I was previously planning to process this section more, and give an intuitive description of the gKPD theorem and a summary of my take, but since this post has languished for a year, I’ll just post a section of chat logs (with some typos fixed) in which I talk John's ear off about KPD. (It contains some context and backreferences that will probably be confusing. Sorry.)

[Chat log between Nate and John; message contents not preserved.]

My overall take from looking into some of John's stuff is a mix of hope, disappointment, and exasperation.

Hope because I think he is barking up some interesting trees, and he's trying to wrap his whole head around big questions that have a chance of shedding lots of light on AI alignment ("are there convergently useful abstractions, and can we learn to recognize and/or manipulate them?").

Disappointment because when I look closer, he seems to regularly get some technical result that doesn't seem very insightful to me, read things off of it that don't seem particularly correct to me, and then barge off in some direction that doesn't seem particularly promising to me. I was prepared for John's lines of approach to seem unpromising to me—that's par for the course—but the thing where he seems to me to put undue weight on his flimsy technical results was a negative update for me.

Exasperation because of how John treats his technical results. My impression has been that he makes lots of high-falutin' nice-sounding English claims, and claims he has technical results to support them, and slings some math around, but when you look closely the math is... suggestive? Kinda? But not really doing what he seemed to be advertising?

Perhaps I've simply been misreading John, and he's been intending to say "I have some beliefs, and separately I have some suggestive technical results, and they feel kinda related to me! Which is not to say that any onlooker is supposed to be able to read the technical results and then be persuaded of any of my claims; but it feels promising and exciting to me!".^{[6]}

I wouldn't be exasperated if it were apparent to me that John’s doing that. But that's not the impression I got from how John billed his technical results, and I spent time trying to understand them (and think I did an OK job) only to find that the technical results *weren't* the sort of thing that support his claims; they're the sort of thing that're maybe possibly suggestive if you already have his intuitions and are squinting at them the way that he squints. In particular, they didn't seem very suggestive to me.

I think there's some sort of social institution I want to protect here. There's an art to knowing exactly what your technical results say, and knowing when you can carefully and precisely trace your English-language claims all the way back to their technical support. In the telephone theorem post, when John says stuff like "The theorems in this post show that those summaries are estimates/distributions of deterministic (in the limit) constraints in the systems around us.", I read that as John implying that he knows how to cash out some of his English-language claims into the math of the telephone theorem. And that's not what I found when I looked, at least not in a manner that's legible to me.

I think I'm exasperated in part because this seems to me like it erodes an important type of technical social trust we have around these parts (at least among people at John's level of cohesive pursuit of an alignment agenda; I hold most others to a lower standard). I hereby explicitly request that he be more careful about those sorts of implications in the future.

(While also noting that it's entirely plausible here that I'm misreading things; communication is hard; maybe I'm the only idiot in the room who's not able to understand how John's theorems relate to his words.)

Stepping back even further: John's approach here is not how I would approach achieving the dream that we share, and that I sketched out at the top. (We know this, because I've tried before, and I tried differently.) Which doesn't mean his directions are unpromising! My directions didn't pan out, after all. I think he's pursuing some interesting routes, and I'm interested to see where they go.

While I have some qualms about John thinking that merely-suggestive technical results say more than they do, I am sympathetic to a variety of his intuitions. The art of not reading too much into your technical results seems easier to acquire than the art of having good research intuitions, so on net I'm enthusiastic about John's research directions.

(Less enthusiastic than John, probably, on account of how my guess as to how all this plays out is that there are lots and lots of "natural abstractions" when you weaken the notion enough to allow for windows, and which ones a mind pays attention to winds up being highly contingent on specifics of its architecture, training, and objectives. Mastery in this domain surely would have its uses, but I think I'm much less optimistic than John about using the naturalness of abstractions to give a workable descriptive account of human values, which IIUC is part of John's plan.^{[7]})

Also, thanks again to John for putting up with my pestering.

^{^}Insofar as the above feels like a more concise description of why there might be any hope at all in studying natural abstractions, and what those studies might entail, I reiterate that it seems to me like this community has a dearth of distillations. Alternatively, it's plausible to me that John's motivations make more sense to everyone else than they do to me, and/or that my attempts at explanation make no more sense to anybody else than John's.

^{^}Analogy: if you know that the sum of two dice is 5, then you know that the first die definitely didn't come up six. This is some "extra" information above and beyond the fact that the average dice-value is 2.5. If instead you know that the sum of two thousand dice is 5000, then you can basically just ignore that "extra" information, and focus only on the average value. And somewhere around here, there's a theorem saying that the extra information goes to zero in the limit.

^{^}Or, well, when we know all the conserved properties, and the rest of the laws of physics are sufficiently ergodic or chaotic or something; I'm not sure exactly what theorem we'd want here; I'm just trying to summarize my understanding of John's position. I'd welcome further formalization.

^{^}If you want those examples, then… sorry. I'm going to go ahead and say that they're an exercise for the reader. If nobody else can reconstruct them, and you really want them, I might go delve through the chat logs. (My apologies for the inconvenience. Skipping that delve-and-cleanup process is part of the cost of getting this dang thing out at all, rather than never.)

^{^}I also note that I was *super annoying* in my attempts to extract a working version of this theorem from John. I started out by trying to probe all his verbal intuitions about the "natural abstractions are like conserved quantities" stuff, and then when I couldn't make any sense of that we went to the math. And, because none of his English phrases were making sense to me, I just meticulously tried to understand the details of the math, which involved a whole lot of not knowing what the heck his notation meant, and a whole lot of inability to fill out partial definitions in "the obvious way", which I suspect was frustrating. Sorry John; thanks for putting up with me.

^{^}But John, commenting on a draft of this post, was like "Nope!" and helpfully provided a quote.

^{^}John noted in a draft of this document that this post of his was largely intended as a response to me on this point.
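The dice analogy in the footnotes above can be checked numerically. The following is a rough sketch, not the theorem the footnote gestures at: it computes the exact distribution of the first die given that n fair dice sum to 2.5n, and compares it with the distribution on 1..6 proportional to exp(-λk) whose mean is 2.5. Using that exponentially tilted distribution as the "knows only the average" reference is an assumption of this sketch, and all function names are made up for illustration.

```python
import math


def sum_counts(n_dice):
    """counts[t] = number of ways n_dice fair six-sided dice can sum to t."""
    counts = [1]  # zero dice: exactly one way to reach a sum of 0
    for _ in range(n_dice):
        new = [0] * (len(counts) + 6)
        for t, ways in enumerate(counts):
            if ways:
                for face in range(1, 7):
                    new[t + face] += ways
        counts = new
    return counts


def first_die_given_sum(n_dice, total):
    """Exact P(first die = k | n_dice dice sum to total), for k = 1..6."""
    rest = sum_counts(n_dice - 1)
    ways = []
    for face in range(1, 7):
        remainder = total - face
        ways.append(rest[remainder] if 0 <= remainder < len(rest) else 0)
    norm = sum(ways)
    return [w / norm for w in ways]


def mean_of(dist):
    return sum(k * p for k, p in zip(range(1, 7), dist))


def tilted_die(target_mean, tol=1e-12):
    """p_k proportional to exp(-lam * k) on 1..6, with lam found by bisection.

    Assumes 1 < target_mean <= 3.5, so that some lam >= 0 achieves the mean.
    """
    def dist(lam):
        weights = [math.exp(-lam * k) for k in range(1, 7)]
        z = sum(weights)
        return [w / z for w in weights]

    lo, hi = 0.0, 50.0  # the mean decreases from 3.5 towards 1 as lam grows
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_of(dist(mid)) > target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


reference = tilted_die(2.5)
for n in (2, 20, 200):
    exact = first_die_given_sum(n, 5 * n // 2)
    print(n, [round(x, 4) for x in exact], total_variation(exact, reference))
```

With two dice the conditional distribution puts zero weight on a six, which is the "extra" information in the analogy. As n grows, the printed total-variation distance to the average-only reference shrinks, which is the finite-n behaviour the footnote describes informally.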