I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.)
Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstandi...
That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)
I still feel like I can only barely maybe half-see what you're saying, and only have a tenuous grasp on it.
Like: why is it supposed to matter that GPT can solve ethical quandaries on-par with its ability to perform other tasks? I can still only half-see an answer that doesn't route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences ...
I have the sense that you've misunderstood my past arguments. I don't quite feel like I can rapidly precisely pinpoint the issue, but some scattered relevant tidbits follow:
I didn't pick the name "value learning", and probably wouldn't have picked it for that problem if others weren't already using it. (Perhaps I tried to apply it to a different problem than Bostrom-or-whoever intended it for, thereby doing some injury to the term and to my argument?)
Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", and supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.
For what it's worth, I didn't claim that you argue...
I was recently part of a group-chat where some people I largely respect were musing about this paper and this post (and some of Scott Aaronson's recent "maybe intelligence makes things more good" type reasoning).
Here are my replies, which seemed worth putting somewhere public:
...The claims in the paper seem wrong to me as stated, and in particular seem to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; they could simply understand that one can't fetch the coffee
Someone recently privately asked me for my current state on my 'Dark Arts of Rationality' post. Here's some of my reply (lightly edited for punctuation and conversation flow), which seemed worth reproducing publicly:
...FWIW, that post has been on my list of things to retract for a while.
(The retraction is pending a pair of blog posts that describe some of my thoughts on related matters, which have been in the editing queue for over a year and the draft queue for years before that.)
I wrote that post before reading much of the sequences, and updated away from
Below is a sketch of an argument that might imply that the answer to Q5 is (classically) 'yes'. (I thought about a question that's probably the same a little while back, and am reciting from cache, without checking in detail that my axioms lined up with your A1-4).
Pick a lottery with the property that forall with and , forall , we have . We will say that is "extreme(ly high)".
Pick a lottery with .
Now, for any with , define to be the guaranteed by continuity (A3).
Lemma: forall with , ...
A few people recently have asked me for my take on ARC evals, and so I've aggregated some of my responses here:
- I don't have strong takes on ARC Evals, mostly on account of not thinking about it deeply.
- Part of my read is that they're trying to, like, get a small dumb minimal version of a thing up so they can scale it to something real. This seems good to me.
- I am wary of people in our community inventing metrics that Really Should Not Be Optimized and handing them to a field that loves optimizing metrics.
- I expect there are all sorts of issues that wo...
John has also made various caveats to me, of the form "this field is pre-paradigmatic and the math is merely suggestive at this point". I feel like he oversold his results even so.
Part of it is that I get the sense that John didn't understand the limitations of his own results--like the fact that the telephone theorem only says anything in the infinite case, and the thing it says then does not (in its current form) arise as a limit of sensible things that can be said in finite cases. Or like the fact that the alleged interesting results of the gKPD theorem...
(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)
In hindsight, I do think the period when our discussions took place was a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.
That said, I doubt that fully accounts for the difference in perception.
John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment. (I added this note to the top of the post as a parenthetical; thanks.)
More details:
Here's a recent attempt of mine at a distillation of a fragment of this plan, copied over from a discussion elsewhere:
goal: make there be a logical statement such that a proof of that statement solves the strawberries-on-a-plate problem (or w/e).
summary of plan:
I don't see this as worst-case thinking. I do see it as speaking from a model that many locals don't share (without any particular attempt made to argue that model).
In particular, if the AGI has some pile of kludges disproportionately pointed towards accomplishing X, and the AGI does self-reflection and “irons itself out”, my prediction is “maybe this AGI will wind up pursuing X, or maybe not, I dunno”.
AFAICT, our degree of disagreement here turns on what you mean by "pointed". Depending on that, I expect I'd either say "yeah maybe, but that kind of po...
I think that distillations of research agendas such as this one are quite valuable, and hereby offer LawrenceC a $3,000 prize for writing it. (I'll follow up via email.) Thanks, LawrenceC!
Going forward, I plan to keep an eye out for distillations such as this one that seem particularly skilled or insightful to me, and offer them a prize in the $1-10k range, depending on how much I like them.
Insofar as I do this, I'm going to be completely arbitrary about it, and I'm only going to notice attempts haphazardly, so please don't rely on the assumption that I...
well, in your search for that positive result, i recommend spending some time searching for a critch!simplified alternative to the Y combinator :-p.
not every method of attaining self-reference in the λ-calculus will port over to logic (b/c in the logical setting lots of things need to be quoted), but the quotation sure isn't making the problem any easier. a solution to the OP would yield a novel self-reference combinator in the λ-calculus, and the latter might be easier to find (b/c you don't need to juggle quotes).
if you can lay bare the self-referential ...
which self-referential sentence are you trying to avoid?
it keeps sounding to me like you're saying "i want a λ-calculus combinator that produces the fixpoint of a given function f, but i don't want to use the Y combinator".
do you deny the alleged analogy between the normal proof of löb and the Y combinator? (hypothesis: maybe you see that the diagonal lemma is just the type-level Y combinator, but have not yet noticed that löb's theorem is the corresponding term-level Y combinator?)
if you follow the analogy, can you tell me what λ-term should come out when...
various attempts to distill my objection:
the details of the omitted "zip" operation are going to be no simpler than the standard proof of löb's theorem, and will probably turn out to be just a variation on the standard proof of löb's theorem (unless you can find a way of building self-reference that's shorter than the Y combinator (b/c the standard proof of löb's theorem is already just the Y combinator plus the minimal modifications to interplay with gödel-codes))
even more compressed: the normal proof of löb is contained in the thing labeled "zip". th...
an attempt to rescue what seems to me like the intuition in the OP:
(note that the result is underwhelming, but perhaps informative.)
in the lambda calculus we might say "given (f : A → A), we can get an element (a : A) by the equation (a := f a)".
recursive definitions such as this one work perfectly well, modulo the caveat that you need to have developed the Y combinator first so that you can ground out this new recursive syntax (assuming you don't want to add any new basic combinators).
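As a rough illustration of "grounding out" the recursive equation (a := f a), here is a minimal Python sketch. It assumes A is a function type and uses the call-by-value variant of the Y combinator (often called the Z combinator), since the plain Y combinator loops forever under Python's eager evaluation. The names `fix` and `factorial_step` are hypothetical, chosen only for this example.

```python
# Z combinator: the call-by-value form of the Y combinator.
# The inner "lambda v: x(x)(v)" delays the self-application so that
# eager evaluation does not recurse without bound.
fix = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# f : A -> A, where A is "functions from int to int" (an assumption of this sketch).
# It receives the thing being defined and returns one more unrolling of it.
factorial_step = lambda self: lambda n: 1 if n == 0 else n * self(n - 1)

# a := f a, obtained without any built-in recursive definition syntax.
factorial = fix(factorial_step)

print(factorial(5))
# The fixpoint property: applying f to a behaves the same as a itself.
print(factorial_step(factorial)(5) == factorial(5))
```

Nothing here uses `def` recursion or a named self-reference; the only source of self-reference is the self-application inside `fix`.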
by a directly analogous argument, we might wish to define (löb f := f "löb...
(Epistemic status: quickly-recounted lightly-edited cached state that I sent in response to an email thread on this topic, that I now notice had an associated public post. Sorry for the length; it was easier to just do a big unfiltered brain-dump than to cull, with footnotes added.)
here's a few quick thoughts i have cached about the proof of löb's theorem (and which i think imply that the suggested technique won't work, and will be tricky to repair):
#1. löb's theorem is essentially just the Y combinator, but with an extra level of quotation mixed in.[1]
my guess is it's not worth it on account of transaction-costs. what're they gonna do, trade half a universe of paperclips for half a universe of Fun? they can already get half a universe of Fun, by spending on Fun what they would have traded away to paperclips!
and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)
there's also an issue where it's n...
and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)
This sounds astronomically wrong to me. I think that my personal utility function gets close to saturation with a tiny fraction of the resources in a universe-shard. Two people in one room is better than two people in separate rooms, yes. But, two rooms with a trillion people each is virtually t...
For sure. It's tricky to wipe out humanity entirely without optimizing for that in particular -- nuclear war, climate change, and extremely bad natural pandemics look to me like they're at most global catastrophes, rather than existential threats. It might in fact be easier to wipe out humanity by engineering a pandemic that's specifically optimized for this task (than it is to develop AGI), but we don't see vast resources flowing into humanity-killing-virus projects, the way that we see vast resources flowing into AGI projects. By my accounting, most other...
Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forestalled.)
I wrote this doc a couple of years ago (while I was at CHAI). It's got many rough edges (I think I wrote it in one sitting and never bothered to rewrite it to make it better), but I still endorse the general gist, if we're talking about what systems are being deployed to do and what happens amongst organizations. It doesn't totally answer your question (it's more focused on what happens before we get systems that could kill everyone), but it seems pretty related.
(I haven't brought it up before because it seems to me like the disagreement is much more in th...
In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is th...
Relevant Feynman quote:
I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.
For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.
Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.
("near-zero" is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing "reduce x-risk to near-zero" with "reduce x-risk to sub-50%".)
My take on the exercise:
Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?
Short version: Nah. For example, if you were wrong by dint of failing to consider the right hypothesis, you can correct for it by considering predictable properties of the hypotheses you missed (even if you don't think you can correctly imagine the true research pathway or w/e in advance). And if you were wrong
I have discussed with MIRI their decision to make their research non-disclosed-by-default and we agreed that my research agenda is a reasonable exception.
Small note: my view of MIRI's nondisclosed-by-default policy is that if all researchers involved with a research program think it should obviously be public then it should obviously be public, and that doesn't require a bunch of bureaucracy. I think this while simultaneously predicting that when researchers have a part of themselves that feels uncertain or uneasy about whether their research sho...
The second statement seems pretty plausible (when we consider human-accessible AGI designs, at least), but I'm not super confident of it, and I'm not resting my argument on it.
The weaker statement you provide doesn't seem like it's addressing my concern. I expect there are ways to get highly capable reasoning (sufficient for, e.g., gaining decisive strategic advantage) without understanding low-K "good reasoning"; the concern is that said systems are much more difficult to align.
As I noted when we chatted about this in person, my intuition is less "there is some small core of good consequentialist reasoning (it has “low Kolmogorov complexity” in some sense), and this small core will be quite important for AI capabilities" and more "good consequentialist reasoning is low-K and those who understand it will be better equipped to design AGI systems where the relevant consequentialist reasoning happens in transparent boxes rather than black boxes."
Indeed, if I thought one had to understand good consequentialist reasoning in order to design a highly capable AI system, I'd be less worried by a decent margin.
Weighing in late here, I'll briefly note that my current stance on the difficulty of philosophical issues is (in colloquial terms) "for the love of all that is good, please don't attempt to implement CEV with your first transhuman intelligence". My strategy at this point is very much "build the minimum AI system that is capable of stabilizing the overall strategic situation, and then buy a whole lot of time, and then use that time to figure out what to do with the future." I might be more optimistic than you about how easy it will turn out to be to find a
...Nice work!
Minor note: in equation 1, I think the should be an .
I'm not all that familiar with paraconsistent logic, so many of the details are still opaque to me. However, I do have some intuitions about where there might be gremlins:
Solution 4.1 reads, "The agent could, upon realizing the contradiction, ..." You've got to be a bit careful here: the formalism you're using doesn't contain a reasoner that does something like "realize the contradiction." As stated, the agent is simply constructed to execute an action if it can prove ; it i
...Thanks for the link! I appreciate your write-ups. A few points:
1. As you've already noticed, your anti-newcomb problem is an instance of Dr. Nick Bone's "problematic problems". Benja actually gave a formalism of the general class of problems in the context of provability logic in a recent forum post. We dub these problems "evil problems," and I'm not convinced that your XDT is a sane way to deal with evil problems.
For one thing, every decision theory has an evil problem. As shown in the links above, even if we consider "fair" games, there is always a probl
...Also, FYI, I tossed together reflective implementations of Solomonoff Induction and AIXI using Haskell, which you can find on the MIRI GitHub. It's not very polished, but it typechecks.
We might be talking about different things when we talk about counterfactuals. Let me be more explicit:
Say an agent is playing against a copy of itself on the prisoner's dilemma. It must evaluate what happens if it cooperates, and what happens if it defects. To do so, it needs to be able to predict what the world would look like "if it took action A". That prediction is what I call a "counterfactual", and it's not always obvious how to construct one. (In the counterfactual world where the agent defects, is the action of its copy also set to 'defect', or is
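To make the ambiguity concrete, here is a small Python sketch of two ways one might construct the counterfactual. The payoff numbers and the names `PAYOFF`, `mirrored`, `held_fixed`, and `best_action` are assumptions made up for this illustration, not part of any formalism referenced above.

```python
# Hypothetical prisoner's dilemma payoffs for the agent:
# (agent_action, copy_action) -> agent's utility. "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def best_action(counterfactual):
    """Pick the action that looks best under a given rule for what the copy does."""
    return max(("C", "D"), key=lambda a: PAYOFF[(a, counterfactual(a))])

# Construction 1: the copy's action is set along with the agent's.
mirrored = lambda agent_action: agent_action

# Construction 2: the copy's action is held fixed (here, assumed to be "C").
held_fixed = lambda agent_action: "C"

print(best_action(mirrored))    # evaluates (C,C)=3 against (D,D)=1
print(best_action(held_fixed))  # evaluates (C,C)=3 against (D,C)=5
```

The two constructions recommend different actions from the same payoff table, which is why the choice of how to build the counterfactual matters.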
...Patrick and I discussed something like this at a previous MIRIx. I think the big problem is that (if I understand what you're suggesting) it basically just implements CDT.
For example, in Newcomb's problem, if X=1 implies Omega is correct and X=0 implies the agent won't necessarily act as predicted, and it acts conditioned on X=0, then it will twobox.
Yeah, causation in logical uncertainty land would be nice. It wouldn't necessarily solve the whole problem, though. Consider the scenario
outcomes = [3, 2, 1, None]
strategies = {Hi, Med, Low}
A = lambda: Low
h = lambda: Hi
m = lambda: Med
l = lambda: Low
payoffs = {}
payoffs[h()] = 3
payoffs[m()] = 2
payoffs[l()] = 1
E = lambda: payoffs.get(A())
Now it's pretty unclear that (lambda: Low)()==Hi should logically cause E()=3.
When considering (lambda: Low)()==Hi, do we want to change l without A, A without l, or both? These correspond to answers None, 3, and
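The three candidate surgeries can be checked mechanically. The following is a runnable Python sketch of the scenario above, assuming the strategies are plain enum members and wrapping the payoff construction in a hypothetical helper `expected_outcome` so that A and l can be rewired independently.

```python
from enum import Enum

class Strategy(Enum):
    HI = "Hi"
    MED = "Med"
    LOW = "Low"

def expected_outcome(agent_returns, l_returns):
    """Rebuild the payoff table from scratch and evaluate E under one surgery."""
    A = lambda: agent_returns
    h = lambda: Strategy.HI
    m = lambda: Strategy.MED
    l = lambda: l_returns
    payoffs = {}
    payoffs[h()] = 3
    payoffs[m()] = 2
    payoffs[l()] = 1  # overwrites the Hi entry when l is rewired to return Hi
    return payoffs.get(A())

Hi, Low = Strategy.HI, Strategy.LOW
baseline = expected_outcome(Low, Low)       # nothing changed
change_l_only = expected_outcome(Low, Hi)   # change l without A
change_A_only = expected_outcome(Hi, Low)   # change A without l
change_both = expected_outcome(Hi, Hi)      # change both

print(baseline, change_l_only, change_A_only, change_both)
```

Only the surgery that changes A while leaving l alone produces the payoff of 3, so the answer depends entirely on which syntactically identical terms get treated as "the agent".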
Typos & syntax complaints:
and let’s consider an oracle to be a function , such that specifies the probability that the oracle will return "true" when invoked on the pair
Confusing notation. In the paragraph above, had the type .
We want to find an oracle machine that will output if , output if , and output either or if the expectations are equal.
Should be "And output either or if the expectations are equal", presumably.
[Will edit this post as I fi
...
If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (...