Nate Soares

Wiki Contributions


I was recently part of a group-chat where some people I largely respect were musing about this paper and this post and some of Scott Aaronson's recent "maybe intelligence makes things more good"-type reasoning.

Here are my replies, which seemed worth putting somewhere public:

The claims in the paper seem wrong to me as stated, and in particular seem to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; one could simply understand that one can't fetch the coffee when one is dead.

See also instrumental convergence.

And then in reply to someone pointing out that the paper was perhaps trying to argue that most minds tend to wind up with similar values because of the fact that all minds are (in some sense) rewarded in training for developing similar drives:

So one hypothesis is that in practice, all practically-trainable minds manage to survive by dint of a human-esque survival instinct (while admitting that manually-engineered minds could survive some other way, e.g. by simply correctly modeling the consequences).

This mostly seems to me to be like people writing sci-fi in which the aliens are all humanoid; it is a hypothesis about tight clustering of cognitive drives even across very disparate paradigms (optimizing genomes is very different from optimizing every neuron directly).

But a deeper objection I have here is that I'd be much more comfortable with people slinging this sort of hypothesis around if they were owning the fact that it's a hypothesis about tight clustering and non-alienness of all minds, while stating plainly that they think we should bet the universe on this intuition (despite how many times the universe has slapped us for believing anthropocentrism in the past).

FWIW, some reasons that I don't myself buy this hypothesis include:

(a) the specifics of various human drives seem to me to be very sensitive to the particulars of our ancestry (ex: empathy seems likely a shortcut for modeling others by repurposing machinery for modeling the self (or vice versa), that is likely not found by hillclimbing when the architecture of the self is very different from the architecture of the other);

(b) my guess is that the pressures are just very different for different search processes (genetic recombination of DNA vs SGD on all weights); and

(c) it looks to me like value is fragile, such that even if the drives were kinda close, I don't expect the obtainable optimum to be good according to our lights

(esp. given that the question is not just what drives the AI gets, but the reflective equilibrium of those drives: small changes to initial drives are allowed to have large changes to the reflective equilibrium, and I suspect this is so).

Someone recently privately asked me for my current state on my 'Dark Arts of Rationality' post. Here's some of my reply (lightly edited for punctuation and conversation flow), which seemed worth reproducing publicly:

FWIW, that post has been on my list of things to retract for a while.

(The retraction is pending a pair of blog posts that describe some of my thoughts on related matters, which have been in the editing queue for over a year and the draft queue for years before that.)

I wrote that post before reading much of the sequences, and updated away from the position pretty soon after. My current stance is that you can basically get all the nice things, and never need to compromise your epistemics.

For the record, the Minding Our Way post where I was like "people have a hard time separating 'certainty'-the-motivational-stance from 'certainty'-the-epistemic-state" was the logs of me figuring out my mistake (and updating away from the dark arts post).

On my current accounting, the mistake I was making at the time of the dark arts post was something like: lots of stuff comes culturally bundled, in ways that can confuse you into thinking you can't get good thing X without also swallowing bad thing Y.

And there's a skill of just, like, taking the good stuff and discarding the bad stuff, even if you don't yet know how to articulate a justification (which I lacked in full generality at the time of the dark arts post, and was developing at the time of the 'certainty' post.)

And it's a little tricky to write about, because you've got to balance it against "care about consistency" / "notice when you're pingponging between mutually-inconsistent beliefs as is convenient", which is... not actually hard, I think, but I haven't found a way to write about the one without the words having an interpretation of "just drop your consistency drive". ...which is how these sorts of things end up languishing in my editing queue for years, when I have other priorities.

(And for the record, another receipt here is that in some twitter thread somewhere--maybe the jargon thread?--I noted the insight about unbundling things, using "you can't be sad and happy at the same time" as an example of a bundled thing. That isn't the whole concept, but it's another instance of the resolution intruding in a visible way.)

(More generally, a bunch of my early MoW posts are me, like, digesting parts of the sequences and correcting a bunch of my errors from before I encountered this community. And for the record, I'm grateful to the memes in this community--and to Eliezer in particular, who I count as originating many of them--for helping me stop being an idiot in that particular way.)

I've also gone ahead and added a short retraction-ish paragraph to the top of the dark arts post, and might edit it later to link it to the aforementioned update-posts, if they ever make it out of the editing queue.

Below is a sketch of an argument that might imply that the answer to Q5 is (classically) 'yes'. (I thought about a question that's probably the same a little while back, and am reciting from cache, without checking in detail that my axioms lined up with your A1-4.)

Pick a lottery $H$ with the property that forall $A$, $B$ with $A \preceq H$ and $B \prec H$, forall $p \in [0, 1)$, we have $pA + (1-p)B \prec H$. We will say that $H$ is "extreme(ly high)".

Pick a lottery $L$ with $L \prec H$.

Now, for any lottery $X$ with $L \preceq X \preceq H$, define $p_X$ to be the $p$ guaranteed by continuity (A3), such that $p_X H + (1 - p_X) L \sim X$.

Lemma: forall $p$, $q$ with $p > q$, $pH + (1-p)L \succ qH + (1-q)L$.


  1. $H \succ qH + (1-q)L$, by $H \preceq H$ and $L \prec H$ and the extremeness of $H$.
  2. $rH + (1-r)\left(qH + (1-q)L\right) \succ qH + (1-q)L$ where $r = \frac{p-q}{1-q}$, by A4.
  3. $pH + (1-p)L \succ qH + (1-q)L$, by some reduction.
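For concreteness (writing $H$ for the extreme lottery, $L$ for the lower one, and assuming lotteries mix affinely, as in the usual vNM setup), the "reduction" in the last step above is just mixture algebra: for $p > q$, set $r = \frac{p-q}{1-q}$, and then

```latex
rH + (1-r)\bigl(qH + (1-q)L\bigr)
  = \bigl(r + (1-r)q\bigr)H + (1-r)(1-q)\,L
  = pH + (1-p)L ,
```

since $r + (1-r)q = \frac{(p-q) + (1-p)q}{1-q} = p$ and $(1-r)(1-q) = 1-p$. So the comparison in the middle step is literally a comparison between the $p$-mixture and the $q$-mixture.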

We can use this lemma to get that $p_X > p_Y$ implies $X \succ Y$, because $X \sim p_X H + (1-p_X)L$, and $Y \sim p_Y H + (1-p_Y)L$, so invoke the above lemma with $p = p_X$ and $q = p_Y$.

Next we want to show that $X \succ Y$ implies $p_X > p_Y$. I think this probably works, but it appears to require either the axiom of choice (!) or a strengthening of one of A3 or A4. (Either strengthen A3 to guarantee that if $X \sim Y$ then it gives the same $p$ in both cases, or strengthen A4 to add that if $X \sim Y$ then $pZ + (1-p)X \sim pZ + (1-p)Y$, or define $p_X$ not from A3 directly, but by using choice to pick out a $p$ for each $\sim$-equivalence-class of lotteries.) Once you've picked one of those branches, the proof basically proceeds by contradiction. (And so it's not terribly constructive, unless you can do that choice constructively.)

The rough idea is: if $X \succ Y$ but $p_X < p_Y$ then you can use the above lemma to get a contradiction, and so you basically only need to consider the case where $p_X = p_Y$, in which case you want $X \sim Y$ (for the contradiction), which you can get by definition (if you use the axiom of choice), or directly by the strengthening of A3. And... my cache says that you can also get it by the strengthening of A4, albeit less directly, but I haven't reloaded that part of my cache, so \shrug I dunno.

Next we argue that this function $X \mapsto p_X$ is unique up to postcomposition by... any strictly isotone endofunction on the reals? I think? (Perhaps unique only among quasiconvex functions?) I haven't checked the details.

Now we have a class of utility-function-ish-things, defined only on lotteries $X$ with $L \preceq X \preceq H$, and we want to extend it to all lotteries.

I'm not sure if this step works, but the handwavy idea is that for any lottery $Z$ that you want to extend $p$ to include, you should be able to find a lower $L' \preceq Z$ and an extreme higher $H' \succeq Z$ that bracket it, at which point you can find the corresponding $p'$ (using the above machinery), at which point you can (probably?) pick some canonical strictly-isotone real endofunction to compose with it that makes it agree with the parts of the function you've defined so far, and through this process you can extend your definition of $p$ to include any lottery. handwave handwave.

Note that the exact function you get depends on how you find the lower $L'$ and higher $H'$, and which isotone function you use to get all the pieces to line up, but when you're done you can probably argue that the whole result is unique up to postcomposition by a strictly isotone real endofunction, of which your construction is a fine representative.

This gets you C1. My cache says it should be easy to get C2 from there, and the first paragraph of "Edit 3" to the OP suggests the same, so I haven't checked this again.

I'm awarding another $3,000 distillation prize for this piece, with compliments to the authors.

A few people recently have asked me for my take on ARC evals, and so I've aggregated some of my responses here:

- I don't have strong takes on ARC Evals, mostly on account of not thinking about it deeply.
- Part of my read is that they're trying to, like, get a small dumb minimal version of a thing up so they can scale it to something real. This seems good to me.
- I am wary of people in our community inventing metrics that Really Should Not Be Optimized and handing them to a field that loves optimizing metrics.
- I expect there are all sorts of issues that would slip past them, and I'm skeptical that the orgs-considering-deployments would actually address those issues meaningfully if issues were detected.
- Nevertheless, I think that some issues can be caught, and attempting to catch them (and to integrate with leading labs, and make "do some basic checks for danger" part of their deployment process) is a step up from doing nothing.
- I have not tried to come up with better ideas myself.

Overall, I'm generally enthusiastic about the project of getting people who understand some of the dangers into the deployment-decision loop, looking for advance warning signs.

John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results even so.

Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem are a relatively-shallow consequence of an overly-strong assumption.

My impression was that I had to go digging into the theorems to see what they said, only to be disappointed by how little resemblance they bore to what I'd heard John imply. (And it sounds to me like Lawrence, Leon, and Erik had a similar experience, although I might be misreading them on account of confirmation bias or w/e.)

I acknowledge that it's tricky to draw a line between "someone has math that they think teaches them something, and is inarticulate about exactly what it teaches" and "someone has math that they don't understand and are overselling". The sort of observation that would push me towards the former end in John's case is stuff like: John being able to gesture more convincingly at ways concepts like "tree" or "window" are related to his conserved-property math even in messy finite cases. I acknowledge that this isn't a super legible distinction and that that's annoying.

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

Note that I continue to think John's cool for pursuing this particular research direction, and I'd enjoy seeing his math further fleshed out (and with more awareness on John's part of its current limitations). I think there might be interesting results down this path.

John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment. (I added this note to the top of the post as a parenthetical; thanks.)

Here's a recent attempt of mine at a distillation of a fragment of this plan, copied over from a discussion elsewhere:

goal: make there be a logical statement such that a proof of that statement solves the strawberries-on-a-plate problem (or w/e).

summary of plan:

  • the humans put in a herculean effort to build a multi-level world-model that is interpretable to them (ranging from quantum chemistry at the lowest level, to strawberries and plates at the top)
  • we interpret this in a very conservative way, as a convex set of models that hopefully contains something pretty close to reality.
  • we can then form the logical statement "this policy puts two cellularly-but-not-molecularly identical strawberries on a plate (and has no other high-level effect) with pretty-high probability across all of those models simultaneously"

background context: there's some fancy tools with very nice mathematical properties for combining probabilistic reasoning and worst-case reasoning.

key hope: these tools will let us interpret this "manual multi-level world-model" in a way that keeps enough of the probabilistic nature for tractable policies to exist, and enough adversarial nature for this constraint to be pretty tight.

in more concrete terms: for any given UFAI, somewhere in the convex hull of all the interpretations of the manual multi-level world model, there's a model that sees (in its high-level) the shady shit that the UFAI was hoping to slip past us. So such "shady" policies fail in the worst-case, and fail to satisfy the theorem. But also enough of the probabilistic nature is retained that your policies don't need to handle the literal worst-cases of thermodynamic heat, and so there are some "reasonable" policies that could satisfy the theorem.
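To make the "pretty-high probability across all of those models simultaneously" shape concrete, here's a toy sketch. Every model, policy, and number below is my own invention for illustration, not part of Davidad's plan: a small finite set of models stands in for the convex set, and a policy is accepted only if its success probability clears a threshold under every model at once.

```python
# Toy sketch: accept a policy only if its success probability clears a
# threshold under EVERY model in a (finite stand-in for a convex) set.
# All models, policies, and numbers here are invented for illustration.

def worst_case_success(policy, models):
    """Worst-case probability of success across all models."""
    return min(model(policy) for model in models)

def accept(policy, models, threshold=0.9):
    """Accept iff the policy succeeds with high probability under every model."""
    return worst_case_success(policy, models) >= threshold

# Three hypothetical world-models: each maps a policy to a success probability.
models = [
    lambda p: p["care"],                           # success tracks how careful the policy is
    lambda p: p["care"] * (1 - p["shady"]),        # this model sees shady behavior and fails it
    lambda p: 0.95 if p["shady"] < 0.01 else 0.0,  # this model hard-vetoes shady policies
]

honest = {"care": 0.97, "shady": 0.0}
shady  = {"care": 0.99, "shady": 0.5}  # higher raw "care", but cuts corners

print(accept(honest, models))  # the honest policy clears every model
print(accept(shady, models))   # the shady policy fails the worst case
```

The point of the worst-case `min` is that a single model that "sees the shady shit" is enough to veto a policy, while honest policies only need to be robustly good under every model, not adversarially perfect.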

capabilities requirements: the humans need to be able to construct the world model; something untrusted and quite capable needs to search for proofs of the theorem; the policy extracted from said theorem is then probably an AGI with high capabilities, but you've (putatively) proven that all it does is put strawberries on a plate and shut down, so \shrug :crossed_fingers: hopefully that proof binds to reality.

(note: I'm simply attempting to regurgitate the idea here; not defend it. obvious difficulties are obvious, like "the task of finding such a policy is essentially the task of building and aligning an AGI" and "something that can find that policy is putting adversarial pressure on your theorem". even if proving the theorem requires finding a sufficiently-corrigible AGI, it would still be rad to have a logical statement of this form (and perhaps there's even some use to it if it winds up not quite rated for withstanding superintelligent adversaries?).)

Anticipating an obvious question: yes, I observed to Davidad that the part where we imagine convex sets of distributions that contain enough of the probabilistic nature to admit tractable policies and enough of the worst-case nature to prevent UFAI funny business is where a bunch of the work is being done, and that if it works then there should be a much smaller example of it working, and probably some minimal toy example where it's easy to see that the only policies that satisfy the analogous theorem are doing some new breed of optimization, that is neither meliorization nor satisficing and that is somehow more mild. And (either I'm under the illusion of transparency or) Davidad agreed that this should be possible, and claims it is on his list of things to demonstrate.

(note: this is the fragment of Davidad's plan that I was able to distill out into something that made sense to me; i suspect he thinks of this as just one piece among many. I welcome corrections :-))

I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).

In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.

AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of pointing is hard" or "yep, my highest-credence models have pretty high probability on this thing failing to optimize X once it's sorted out".

For instance, the latter response obtains if the "pointing" is done by naive training.

(Though I also have some sense that I see the situation as more fragile than you--there's lots of ways for reflection to ruin your day, if the wrong kludge is pointed the wrong way. So maybe we have a broader disagreement about that, too.)

Also, as a reminder, my high credence in doom doesn't come from high confidence in a claim like this. You can maybe get one nine here; I doubt you can get three. My high credence in doom comes from its disjunctive nature.

It would still help me to have a "short version" section at the top :-)
