So8res — AI Alignment Forum

Evaluating the historical value misspecification argument

If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.

Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.

Note that this only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.

Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI's imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N's human-model and saying "whatever that thing would think is worth optimizing for" probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N's model of how humans do philosophy or reflection compound into big differences in ultimate ends.

And note for the record that I also don't think the "value learning" problem is all that hard, if you're allowed to assume that indirection works. The difficulty isn't that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion's share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)

When trying to point out that there is an outer alignment problem at all I've generally pointed out how values are fragile, because that's an inferentially-first step to most audiences (and a problem to which many people's mind seems to quickly leap), on an inferential path that later includes "use indirection" (and later "first aim for a minimal pivotal task instead"). But separately, my own top guess is that "use indirection" is probably the correct high-level resolution to the problems that most people immediatly think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimial pivotal tasks instead etc.).

Evaluating the historical value misspecification argument

So8res2y20

I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well

(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)

Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".

Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.

(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven't tried to answer your request for a prediction.)

Evaluating the historical value misspecification argument

So8res2y109

That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)

I still feel like I can only barely maybe half-see what you're saying, and only have a tenuous grasp on it.

Like: why is it supposed to matter that GPT can solve ethical quandries on-par with its ability to perform other tasks? I can still only half-see an answer that doesn't route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences like "I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human" and squinting.

Attempting to articulate the argument that I can half-see: on Matthew's model of past!Nate's model, AI was supposed to have a hard time answering questions like "Alice is in labor and needs to be driven to the hospital. Your car has a flat tire. What do you do?" without lots of elbow-grease, and the fact that GPT can answer those questions as a side-effect of normal training means that getting AI to understand human values is easy, contra past!Nate, and... nope, that one fell back into the "Matthew thinks Nate thought getting the AI to understand human values was hard" hypothesis.

Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes "picking something worth optimizing for").

That sounds somewhat plausible as a theory-of-your-objection given your comment. And updates me towards the last few bullets, above, being the most relevant ones.

Running with it (despite my uncertainty about even basically understanding your point): my reply is kinda-near-ish to "we can't rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than "human-level at moral judgement" to avoid a catastrophe", though I think that your whole framing is off and that you're missing a few things:

The hard part of value specification is not "figure out that you should call 911 when Alice is in labor and your car has a flat", it's singling out concepts that are robustly worth optimizing for.
You can't figure out what's robustly-worth-optimizing-for by answering a bunch of ethical dilemmas to a par-human level.
In other words: It's not that you need a super-ethicist, it's that the work that goes into humans figuring out which futures are rad involves quite a lot more than their answers to ethical dilemmas.
In other other words: a human's ability to have a civilization-of-their-uploads produce a glorious future is not much contained within their ability to answer ethical quandries.

This still doesn't feel quite like it's getting at the heart of things, but it feels closer (conditional on my top-guess being your actual-objection this time).

As support for this having always been the argument (rather than being a post-LLM retcon), I recall (but haven't dug up) various instances of Eliezer saying (hopefully at least somewhere in text) things like "the difficulty is in generalizing past the realm of things that humans can easily thumbs-up or thumbs-down" and "suppose the AI explicitly considers the hypothesis that its objectives are what-the-humans-value, vs what-the-humans-give-thumbs-ups-to; it can test this by constructing an example that looks deceptively good to humans, which the humans will rate highly, settling that question". Which, as separate from the question of whether that's a feasible setup in modern paradigms, illustrates that he at least has long been thinking of the problem of value-specification as being about specifying values in a way that holds up to stronger optimization-pressures rather than specifying values to the point of being able to answer ethical quandries in a human-pleasing way.

(Where, again, the point here is not that one needs an inhumanly-good ethicist, but rather that those things which pin down human values are not contained in the humans' ability to give a thumbs-up or a thumbs-down to ethical dilemmas.)

Evaluating the historical value misspecification argument

So8res2y*2413

I have the sense that you've misunderstood my past arguments. I don't quite feel like I can rapidly precisely pinpoint the issue, but some scattered relevant tidbits follow:

I didn't pick the name "value learning", and probably wouldn't have picked it for that problem if others weren't already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)
Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", as supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.
A possible thing that's muddying the waters here is that (apparently!) many phrases intended to point at the difficulty of causing it to be value-in-particular that the AI acts towards have an additional (mis)interpretation as claiming that the humans should be programming concepts into the AI manually and will find that particular concept tricky to program in.
The ability of LLMs to successfully predict how humans would answer local/small-scale moral dilemmas (when pretrained on next-token prediction) and to do this in ways that sound unobjectionable (when RLHF'd for corporatespeak or whatever) really doesn't seem all that relevant, to me, to the question of how hard it's going to be to get a long-horizon outcome-pumping AGI to act towards values.
If memory serves, I had a convo with some openai (or maybe anthropic?) folks about this in late 2021 or early 2022ish, where they suggested testing whether language models have trouble answering ethical Qs, and I predicted in advance that that'd be no harder than any other sort of Q. As makes me feel pretty good about me being like "yep, that's just not much evidence, because it's just not surprising."
If people think they're going to be able to use GPT-4 and find the "generally moral" vector and just tell their long-horizon outcome-pumping AGI to push in that direction, then... well they're gonna have issues, or so I strongly predict. Even assuming that they can solve the problem of getting the AGI to actually optimize in that direction, deploying extraordinary amounts of optimization in the direction of GPT-4's "moral-ish" concept is not the sort of thing that makes for a nice future.
This is distinct from saying "an uploaded human allowed to make many copies of themselves would reliably create a dystopia". I suspect some human-uploads could make great futures (but that most wouldn't), but regardless, "would this dynamic system, under reflection, steer somewhere good?" is distinct from "if i use the best neuroscience at my disposal to extract something I hopefully call a "neural concept" and make a powerful optimizer pursue that, will result will be good?". The answer to the latter is "nope, not unless you're really very good at singling out the "value" concept from among all the brain's concepts, as is an implausibly hard task (which is why you should attempt something more like indirect normativity instead, if you were attempting value loading at all, which seems foolish to me, I recommend targeting some minimal pivotal act instead)".
Part of why you can't pick out the "values" concept (either from a human or an AI) is that very few humans have actually formed the explicit concept of Fun-as-in-Fun-theory. And, even among those who do have a concept for "that which the long-term future should be optimized towards", that concept is not encoded as simply and directly as the concept of "trees". The facts about what weird, wild, and transhuman futures a person values are embedded indirectly in things like how they reflect and how they do philosophy.
I suspect at least one of Eliezer and Rob is on written record somewhere attempting clarifications along the lines of "there are lots of concepts that are easy to confuse with the 'values' concept, such as those-values-which-humans-report and those-values-which-humans-applaud-for and ..." as an attempt to intuition-pump the fact that, even if one has solved the problem of being able to direct an AGI to the concept of their choosing, singling out the concept actually worth optimizing for remains difficult.

(I don't love this attempt at clarification myself, because it makes it sound like you'll have five concept-candidates and will just need to do a little interpretabliity work to pick the right one, but I think I recall Eliezer or Rob trying it once, as seems to me like evidence of trying to gesture at how "getting the right values in there" is more like a problem of choosing the AI's target from among its concepts rather than a problem of getting the concept to exist in the AI's mind in the first place.)

(Where, again, the point I'd prefer to make is something like "the concept you want to point it towards is not a simple/directly-encoded one, and in humans it probably rests heavily on the way humans reflects and resolve internal conflicts and handle big ontology shifts. Which isn't to say that superintelligence would find it hard to learn, but which is to say that making a superintelligence actually pursue valuable ends is much more difficult than having it ask GPT-4 which of its available actions is most human!moral".)
For whatever it's worth, while I think that the problem of getting the right values in there ("there" being its goals, not its model) is a real one, I don't consider it a very large problem compared to the problem of targeting the AGI at something of your choosing (with "diamond" being the canonical example). (I'm probably on the record about this somewhere, and recall having tossed around guestimates like "being able to target the AGI is 80%+ of the problem".) My current stance is basically: in the short term you target the AGI towards some minimal pivotal act, and in the long term you probably just figure out how use a level or two of indirection (as per the "Do What I Mean" proposal in the Value Learning paper), although that's the sort of problem that we shouldn't try to solve under time pressure.

So8res's Shortform

So8res2y1012

I was recently part of a group-chat where some people I largely respect were musing about this paper and this post and some of Scott Aaronson's recent "maybe intelligence makes things more good" type reasoning).

Here's my replies, which seemed worth putting somewhere public:

The claims in the paper seem wrong to me as stated, and in particular seems to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; they could simply understand that one can't fetch the coffee when one is dead.

See also instrumental convergence.

And then in reply to someone pointing out that the paper was perhaps trying to argue that most minds tend to wind up with similar values because of the fact that all minds are (in some sense) rewarded in training for developing similar drives:

So one hypothesis is that in practice, all practically-trainable minds manage to survive by dint of a human-esque survival instinct (while admitting that manually-engineered minds could survive some other way, e.g. by simply correctly modeling the consequences).

This mostly seems to me to be like people writing sci-fi in which the aliens are all humanoid; it is a hypothesis about tight clustering of cognitive drives even across very disparate paradigms (optimizing genomes is very different from optimizing every neuron directly).

But a deeper objection I have here is that I'd be much more comfortable with people slinging this sort of hypothesis around if they were owning the fact that it's a hypothesis about tight clustering and non-alienness of all minds, while stating plainly that they think we should bet the universe on this intuition (despite how many times the universe has slapped us for believing anthropocentrism in the past).

FWIW, some reasons that I don't myself buy this hypothesis include:

(a) the specifics of various human drives seem to me to be very sensitive to the particulars of our ancestry (ex: empathy seems likely a shortcut for modeling others by repurposing machinery for modeling the self (or vice versa), that is likely not found by hillclimbing when the architecture of the self is very different from the architecture of the other);

(b) my guess is that the pressures are just very different for different search processes (genetic recombination of DNA vs SGD on all weights); and

(c) it looks to me like value is fragile, such that even if the drives were kinda close, I don't expect the obtainable optimum to be good according to our lights

(esp. given that the question is not just what drives the AI gets, but the reflective equilibrium of those drives: small changes to initial drives are allowed to have large changes to the reflective equilibrium, and I suspect this is so).

So8res's Shortform

So8res2y911

Someone recently privately asked me for my current state on my 'Dark Arts of Rationality' post. Here's some of my reply (lightly edited for punctuation and conversation flow), which seemed worth reproducing publicly:

FWIW, that post has been on my list of things to retract for a while.

(The retraction is pending a pair of blog posts that describe some of my thoughts on related matters, which have been in the editing queue for over a year and the draft queue for years before that.)

I wrote that post before reading much of the sequences, and updated away from the position pretty soon after. My current stance is that you can basically get all the nice things, and never need to compromise your epistemics.

For the record, the Minding Our Way post where I was like "people have a hard time separating 'certainty'-the-motivational-stance from 'certainty'-the-epistemic-state" was the logs of me figuring out my mistake (and updating away from the dark arts post).

On my current accounting, the mistake I was making at the time of the dark arts post was something like: lots of stuff comes culturally bundled, in ways that can confuse you into thinking you can't get good thing X without also swallowing bad thing Y.

And there's a skill of just, like, taking the good stuff and discarding the bad stuff, even if you don't yet know how to articulate a justification (which I lacked in full generality at the time of the dark arts post, and was developing at the time of the 'certainty' post.)

And it's a little tricky to write about, because you've got to balance it against "care about consistency" / "notice when you're pingponging between mutually-incosistent beliefs as is convenient", which is... not actually hard, I think, but I haven't found a way to write about the one without the words having an interpretation of "just drop your consistency drive". ...which is how these sorts of things end up languishing in my editing queue for years, whe I have other priorities.

(And for the record, another receipt here is that in some twitter thread somewhere--maybe the jargon thread?--I noted the insight about unbundling things, using "you can't be sad and happy at the same time" as an example of a bundled-thing. which isn't the whole concept, but which is another instance of the resolution intruding in a visible way.)

(More generally, a bunch of my early MoW posts are me, like, digesting parts of the sequences and correcting a bunch of my errors from before I encountered this community. And for the record, I'm grateful to the memes in this community--and to Eliezer in particular, who I count as originating many of them--for helping me stop being an idiot in that particular way.)

I've also gone ahead and added a short retraction-ish paragraph to the top of the dark arts post, and might edit it later to link it to the aforementioned update-posts, if they ever make it out of the editing queue.

Concave Utility Question

So8res2y*80

Below is a sketch of an argument that might imply that the answer to Q5 is (clasically) 'yes'. (I thought about a question that's probably the same a little while back, and am reciting from cache, without checking in detail that my axioms lined up with your A1-4).

Pick a lottery with the property that forall $A, B$ with $A ⪯ Z$ and $B ⪯ Z$ , forall $p \in [0, 1]$ , we have $(1 - p) A + p B ⪯ Z$ . We will say that $Z$ is "extreme(ly high)".

Pick a lottery $X$ with $X ⪯ Z$ .

Now, for any $Y$ with $X ⪯ Y ⪯ Z$ , define $u (Y)$ to be the $p$ guaranteed by continuity (A3).

Lemma: forall $α, β \in [0, 1]$ with $α \leq β$ , $(1 - α) X + α Z ⪯ (1 - β) X + β Z$ .

Proof:

$(1 - α) X + α Z ⪯ Z$ , by $X ⪯ Z$ and $Z ⪯ Z$ and the extremeness of $Z$ .
$(1 - α) X + α Z ⪯ \frac{1 - α}{1 - β} ((1 - α) X + α Z) + (1 - \frac{1 - α}{1 - β}) Z$ , by A4.
$(1 - α) X + α Z ⪯ (1 - β) X + β Z$ , by some reduction.

We can use this lemma to get that $u (A) \leq u (B)$ implies $A ⪯ B$ , because $A \sim (1 - u (A)) X + u (A) Z$ , and $B \sim (1 - u (B)) X + u (B) Z$ , so invoke the above lemma with $α = u (A)$ and $β = u (B)$ .

Next we want to show that $A ⪯ B$ implies $u (A) \leq u (B)$ . I think this probably works, but it appears to require either the axiom of choice (!) or a strengthening of one of A3 or A4. (Either strengthen A3 to guarantee that if $B_{1} \sim B_{2}$ then it gives the same $p$ in both cases, or strengthen A4 to add that if $B ≺ A$ then $B ≺ p A + (1 - p) B$ , or define $u (A)$ not from A3 directly, but by using choice to pick out a $p$ for each $\sim$ -equivalence-class of lotteries.) Once you've picked one of those branches, the proof basically proceeds by contradiction. (And so it's not terribly constructive, unless you can do $\neg \neg (A ≺ B) \to (A ≺ B)$ constructively.)

The rough idea is: if $A ≺ B$ but $u (A) \geq u (B)$ then you can use the above lemma to get a contradiction, and so you basically only need to consider the case where $A \sim B$ in which case you want $u (A) = u (B)$ , which you can get by definition (if you use the axiom of choice), or directly by the strengthening of A3. And... my cache says that you can also get it by the strengthening of A4, albeit less directly, but I haven't reloaded that part of my cache, so \shrug I dunno.

Next we argue that this function $u$ is unique up to postcomposition by... any strictly isotone endofunction on the reals? I think? (Perhaps unique only among quasiconvex functions?) I haven't checked the details.

Now we have a class of utility-function-ish-things, defined only on $Y$ with $X ⪯ Y ⪯ Z$ , and we want to extend it to all lotteries.

I'm not sure if this step works, but the handwavy idea is that for any lottery $Y$ that you want to extend $u$ to include, you should be able to find a lower $X$ and an extreme higher $Z$ that bracket it, at which point you can find the corresponding $u$ (using the above machinery), at which point you can (probably?) pick some canonical strictly-isotone real endofunction to compose with it that makes it agree with the parts of the function you've defined so far, and through this process you can extend your definition of $u$ to include any lottery. handwave handwave.

Note that the exact function you get depends on how you find the lower $X$ and higher $Z$ , and which isotone function you use to get all the pieces to line up, but when you're done you can probably argue that the whole result is unique up to postcomposition by a strictly isotone real endofunction, of which your construction is a fine representative.

This gets you C1. My cache says it should be easy to get C2 from there, and the first paragraph of "Edit 3" to the OP suggests the same, so I haven't checked this again.

Natural Abstractions: Key Claims, Theorems, and Critiques

So8res2y111

I'm awarding another $3,000 distillation prize for this piece, with complements to the authors.

So8res's Shortform

So8res2y*45

A few people recently have asked me for my take on ARC evals, and so I've aggregated some of my responses here:

- I don't have strong takes on ARC Evals, mostly on account of not thinking about it deeply.
- Part of my read is that they're trying to, like, get a small dumb minimal version of a thing up so they can scale it to something real. This seems good to me.
- I am wary of people in our community inventing metrics that Really Should Not Be Optimized and handing them to a field that loves optimizing metrics.
- I expect there are all sorts of issues that would slip past them, and I'm skeptical that the orgs-considering-deployments would actually address those issues meaningfully if issues were detected ([cf](https://www.lesswrong.com/posts/thkAtqoQwN6DtaiGT/carefully-bootstrapped-alignment-is-organizationally-hard)).
- Nevertheless, I think that some issues can be caught, and attempting to catch them (and to integrate with leading labs, and make "do some basic checks for danger" part of their deployment process) is a step up from doing nothing.
- I have not tried to come up with better ideas myself.

Overall, I'm generally enthusiastic about the project of getting people who understand some of the dangers into the deployment-decision loop, looking for advance warning signs.

A rough and incomplete review of some of John Wentworth's research

So8res2y*1614

John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results even so.

Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem are a relatively-shallow consequence of the overly-strong assumption of .

My impression was that I had to go digging into the theorems to see what they said, only to be disappointed by how little resemblance they bore to what I'd heard John imply. (And it sounds to me like Lawrence, Leon, and Erik had a similar experience, although I might be misreading them on account of confirmation bias or w/e.)

I acknowledge that it's tricky to draw a line between "someone has math that they think teaches them something, and is inarticulate about exactly what it teaches" and "someone has math that they don't understand and are overselling". The sort of observation that would push me towards the former end in John's case is stuff like: John being able to gesture more convincingly at ways concepts like "tree" or "window" are related to his conserved-property math even in messy finite cases. I acknowledge that this isn't a super legible distinction and that that's annoying.

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

Note that I continue to think John's cool for pursuing this particular research direction, and I'd enjoy seeing his math further fleshed out (and with more awareness on John's part of its current limitations). I think there might be interesting results down this path.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments