Adam Shimi

Full time independent alignment researcher. Although instead of having a concrete research agenda, I mostly study epistemic strategies (https://www.alignmentforum.org/s/LLEJJoaYpCoS5JYSY) and help other alignment researchers by extracting their processes, modelling them, and explaining with less inferential distances their positions and disagreements. (Also PhD in the theory of distributed computing).

Sequences

Epistemic Cookbook for Alignment
Reviews for the Alignment Forum
AI Alignment Unwrapped
Deconfusing Goal-Directedness
Toying With Goal-Directedness

Wiki Contributions

Comments

The Natural Abstraction Hypothesis: Implications and Evidence

Thanks for the post! Two general points I want to make before going into more general comments:

  • I liked the section on concepts difference across, and hadn't thought much about it before, so thanks!
  • One big aspect of the natural abstraction hypothesis that you missed IMO is "how do you draw the boundaries around abstractions?" — more formally how do you draw the markov blanket. This to me is the most important question to answer for settling the NAH, and John's recent work on sequences of markov blanket is IMO him trying to settle this.

In general, we should expect that systems will employ natural abstractions in order to make good predictions, because they allow you to make good predictions without needing to keep track of a huge number of low-level variables .

Don't you mean "abstractions" instead of "natural abstractions"?

John Maxwell proposes differentiating between the unique abstraction hypothesis (there is one definitive natural abstraction which we expect humans and AIs to converge on), and the useful abstraction hypothesis (there is a finite space of natural abstractions which humans use for making predictions and inference, and in general this space is small enough that an AGI will have a good chance of finding abstractions that correspond to ours, if we have it "cast a wide enough net").

Note that both can be reconciled by saying that the abstraction in John's sense (the high-level summary statistics that capture everything that isn't whipped away by noise) is unique up to isomorphism (because it's a summary statistics), but different predictors will learn different parts of this abstraction depending on the information that they don't care about (things they don't have sensor for, for example). Hence you have a unique natural abstraction, which generates a constrained space of subabstractions which are the ones actually learned by real world predictors.

This might be relevant for your follow-up arguments.

To put it another way — the complexity of your abstractions depends on the depth of your prior knowledge. The NAH only says that AIs will develop abstractions similar to humans when they have similar priors, which may not always be the case.

Or another interpretation, following my model above, is that with more knowledge and more data and more time taken, you get closer and closer to the natural abstraction, but you don't generally start up there. Although that would mean that the refinement of abstractions towards the natural abstraction often breaks the abstraction, which sounds fitting with the course of science and modelling in general.

Different strengths of the NAH can be thought of as corresponding to different behaviours of this graph. If there is a bump at all, this would suggest some truth to the NAH, because it means (up to a certain level) models can become more powerful as their concepts more closely resemble human ones. A very strong form of the NAH would suggest this graph doesn't tail off at all in some cases, because human abstractions are the most natural, and anything more complicated won't provide much improvement. This seems quite unlikely—especially for tasks where humans have poor prior knowledge—but the tailing off problem could be addressed by using an amplified overseer, and incentivising interpretability during the training process. The extent to which NAH is true has implications for how easy this process is (for more on this point, see next section).

Following my model above, one interpretation is that the bump means that when reaching human-level of competence, the most powerful abstractions available as approximations of the natural abstractions are the ones humans are using. Which is not completely ridiculous, if you expect that AIs for "human-level competence at human tasks" would have similar knowledge, inputs and data than humans.

The later fall towards alien concepts might come from the approximation shifting to very different abstractions as the best approximation, just like the shift between classical physics and quantum mechanics.

However, it might make deceptive alignment more likely. There are arguments for why deceptive models might be favoured in training processes like SGD—in particular, learning and modelling the training process (and thereby becoming deceptive) may be a more natural modification than internal or corrigible alignment. So if  is a natural abstraction, this would make it easier to point to, and (since having a good model of the base objective is a sufficient condition for deception) the probability of deception is subsequently higher.

Here is my understanding of your argument: because X is a natural abstraction and NAH is true, models will end up actually understanding X, which is a condition for the apparition of mesa-optimization. Thus NAH might make deception more likely by making one of its necessary condition more likely. Is that a good description of what you propose?

I would assume that NAH actually makes deception less likely, because of the reasons you gave above about the better proxy, which might entail that the mesa-objective is actually the base-objective.

This abstractions-framing of instrumental convergence implies that getting better understanding of which abstractions are learned by which agents in which environments might help us better understand how an agent with instrumental goals might behave (since we might expect an agent will try to gain control over some variable  only to the extent that it is controlling the features described by the abstraction  which it has learned, which summarizes the information about the current state relevant to the far-future action space).

Here too I feel that the model of "any abstraction is an approximation of the natural abstraction" might be relevant. Especially because if it's correct, then in addition to knowing the natural abstraction, you have to understand which part of it could the model approximate, and which parts of that approximations are relevant for its goal.

If NAH is strongly true, then maybe incentivising interpretability during training is quite easy, just akin to a "nudge in the right direction". If NAH is not true, this could make incentivising interpretability really hard, and applying pressure away from natural abstractions and towards human-understandable ones will result in models Goodharting interpretability metrics—where a model is be emphasised to trick the metrics into thinking it is forming human-interpretable concepts when it actually isn't.

The less true NAH is, the harder this problem becomes. For instance, maybe human-interpretability and "naturalness" of abstractions are actually negatively correlated for some highly cognitively-demanding tasks, in which case the trade-off between these two will mean that our model will be pushed further towards Goodharting.

The model I keep describing could actually lead to both hard to interpret  and yet potentially interpretable abstractions. Like, if someone from 200 years ago had to be taught Quantum Mechanics. It might be really hard, depends on a lot of mental moves that we don't know how to convey, but that would be the sort of problem equivalent to interpreting better approximations than ours of the natural abstractions. It sounds at least possible to me.

A danger of this process is that the supervised learner would model the data-collection process instead of using the unsupervised model - this could lead to misalignment. But suppose we supplied data about human values which is noisy enough to make sure the supervised learner never switched to directly modelling the data-collection process, while still being a good enough proxy for human values that the supervised learner will actually use the unsupervised model in the first place.

Note that this amount to solving ELK, which probably takes more than "adequately noisy data".

We can argue that human values are properties of the abstract object "humans", in an analogous way to branching patterns being properties of the abstract object "trees". However there are complications to this analogy: for instance, human values seem especially hard to infer from behaviour without using the "inside view" that only humans have.

I mean, humans learning approximations of the natural abstractions through evolutionary processes makes a lot of sense to me, as it's just evolution training predictors, right?

For another, even if we accept that "which abstractions are natural?" is a function of factors like culture, this argues against the "unique abstraction hypothesis" but not the "useful abstraction hypothesis". We could argue that navigators throughout history were choosing from a discrete set of abstractions, with their choices determined by factors like available tools, objectives, knowledge, or cultural beliefs, but the set of abstractions itself being a function of the environment, not the navigators.

Hence with the model I'm discussing, the culture determined their choice of subabstractions but they were all approximating the same natural abstractions.

Robustness to Scale

Rereading this post while thinking about the approximations that we make in alignment, two points jump at me:

  • I'm not convinced that robustness to relative scale is as fundamental as the other two, because there is no reason to expect that in general the subcomponents will be significantly different in power, especially in settings like adversarial training where both parts are trained according to the same approach. That being said, I still agree that this is an interesting question to ask, and some proposal might indeed depend on a version of this.
  • Robustness to scaling up and robustness to scaling down sounds like they can be summarized by: "does it break in the limit of optimality? and "does it only work in the limit of optimality?". Where the first gives us an approximation for studying and designing alignment proposals, and the second points out a potential issue in this approximation. (Not saying that this is capturing all of your meaning, though)
Reply to Eliezer on Biological Anchors

Thanks for pushing back on my interpretation.

I feel like you're using "strongest" and "weakest" to design "more concrete" and "more abstract", with maybe the value judgement (implicit in your focus on specific testable claims) that concreteness is better. My interpretation doesn't disagree with your point about Bio Anchors, it simply says that this is a concrete instantiation of a general pattern, and that the whole point of the original post as I understand it is to share this pattern. Hence the title who talks about all biology-inspired timelines, the three examples in the post, and the seven times that Yudkowsky repeats his abstract arguments in differents ways.

It's hardly surprising there are 'two paths through a space' - if you reran either (biological or cultural/technological) evolution with slightly different initial conditions you'd get a different path. However technological evolution is aware of biological evolution and thus strongly correlated to and influenced by it. IE deep learning is in part brain reverse engineering (explicitly in the case of DeepMind, but there are many other examples). The burden proof is thus arguably more opposite of what you claim (EY claims).

Maybe a better way of framing my point here is that the optimization processes are fundamentally different (something about which Yudkowsky has written a lot, see for example this post from 13 years ago), and that the burden of proof is on showing that they have enough similarity to extract a lot of info from the evolutionary optimization to the human research optimization.

I also don't think your point about DeepMind works, because DM is working in a way extremely different from evolution. They are in part reverse engineering the brain, but that's a very different (and very human and insight heavy) paths towards AGI than the one evolution took.

Lastly for this point, I don't think the interpretation that "Yudkowsky says that the burden of proof is on showing that the optimization of evolution and human research are non correlated" survives the contact with a text where Yudkowsky constantly berates his interlocutors for assuming such correlation, and keeps drawing again and again the differences.

To the extent EY makes specific testable claims about the inefficiency of biology, those claims are in err - or at least easily contestable.

Hum, I find myself feeling like this comment: Yudkowsky's main point about biology IMO is that brains are not at all the most efficient computational way of implementing AGI. Another way of phrasing it is that Yudkowsky says (according to me) that you could use significantly less hardware and ops/sec to make an AGI.

My Overview of the AI Alignment Landscape: Threat Models

Thanks so much for the effort your putting in this work! It looks particularly relevant to my current interest of understanding the different approximations and questions used in alignment, and what forbids us the Grail of paradigmaticity.

Here is my more concrete feedback

A common approach when setting research agendas in AI Alignment is to be specific, and focus on a threat model. That is, to extrapolate from current work in AI and our theoretical understanding of what to expect, to come up with specific stories for how AGI could cause an existential catastrophe. And then to identify specific problems in current or future AI systems that make these failure modes more likely to happen, and try to solve them now.

Given that AFAIK it’s Rohin who introduced the term in alignment, linking to his corresponding talk might be a good idea. I also like this drawing from his slides, which might clarify the explanation for more visual readers.

While I’m at threat models, you confused me at first because “threat model” always makes me think of “development model”, and so I expected a discussion of seed AI vs Prosaic AI vs Brain-based-AGI vs CAIS vs alternatives. What you do instead is more a discussion of “risk models”, with a mention in passing that the first one traditionally came from the more seed AI development model.

Of course that’s your choice, but neglecting a bunch of development models with a lot of recent work, notably the brain-based AGI model  of Steve Byrnes, feels incoherent with the stated aim of the sequence — “mapping out the AI Alignment research landscape”.

And having a specific story to guide what you do can be a valuable source of direction, even if ultimately you know it will be flawed in many ways. Nate Soares makes the case for having a specific but flawed story in general well.

My first reaction when reading this part was “Hum, that doesn’t seem to be exactly what Nate is justifying here”. After rereading the post, I think what disturbed me was my initial reading that you were saying something like “the correctness of a threat model doesn’t matter, you just choose one and do stuff”. Which is not what either you or Nate are saying; instead, it’s that spending all the time waiting for a perfect plan/threat model is less productive than taking the best option available, getting your hands dirty and trying things.

Note that I think there is very much a spectrum between this category and robustly good approaches (a forthcoming post in this sequence). Most robustly good ways to help also address specific threat models, and many ways to address specific threat models feel useful even if that specific threat model is wrong. But I find this a helpful distinction to keep in mind.

This sounds to me like a better defense of threat model thinking, and I would like to read more about your ideas (especially the last two sentences).

When naively considered, this framework often implicitly thinks of intelligence as a mysterious black box that caches out as 'better able to achieve plans than us', without much concrete detail. Further, it assumes that all goals would lead to these issues.

I agree with the gist of the paragraph, but “all goals”  is an overstatement: both Nick Bostrom and Steve Omohundro note that some goals obviously don’t have power-seeking incentives, like the goal of dying as fast as possible. They say that most goals would have instrumental subgoals, which is the part that Richard Ngo criticizes and Alex Turner formalizes.

See Tom Adamczewski’s discussion of how arguments have shifted

Oh, awesome resource! Thanks for the link!

Understanding the incentives and goals of the agent, and how the training process can affect these in subtle ways

I feel like you should definitely mention Alex Turner’s work here, where he formalizes Bostrom’s instrumental convergence thesis.

Limited optimization: Many of these problems inherently stem from having a goal-directed utility-maximiser, which will find creative ways to achieve these goals. Can we shift away from this paradigm?

Shouldn’t you include work on impact measures here? For example this survey post and Alex Turner’s sequence.

A particularly concerning special case of the power-seeking concern is inner misalignment. This was an idea that had been floating around MIRI for a while, but was first properly clarified by Evan Hubinger in Risks from Learned Optimization.

Evan is adamant that the paper was done equally by all coauthors, and so should be cited as done by “Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant”.

Sub-Threat model: Inner Alignment

I feel like you’re sticking a bit close to the paper’s case, when there are more compact statements of the problem. Especially with your previous case, you could just say that inner alignment is about justifying power-seeking behavior and treacherous turns in the case where the AI is found by search instead of programmed by hand.

Plausibility of misaligned cognition: It is likely that, in practice, we will end up with networks with misaligned cognition

There’s also an argument that deception is robust once it has been found: making a deceptive model less deceptive would make it do more what it really wants to do, and so have a worse loss, which means it’s not pushed out of deception by SGD.

Better understanding how and when mesa-optimization arises (if it does at all).

One cool topic here is gradient hacking — see for example this recent survey.

Anecdotally, some researchers I respect take this very seriously - it was narrowly rated the most plausible threat model in a recent survey.

I want to note that this scenario looks more normal, which makes me think that by default, anyone would find this more plausible than the Bostrom/Yudkowsky scenario due to normalcy bias. So I tend to cancel this advantage when looking at what scenario people favor.

But this error-correction mechanism may break-down for AI. There are three key factors to analyse here: pace, comprehensibility and lock-in.

I like this decomposition!

So, why would AI make cooperation worse/harder?

At least for Critch’s RAAPs, my understanding is that it’s mostly Pace that makes a difference: the process already exists, but it’s not moving as fast as it could because of the fallibility of humans, because of legislation and restrictions. Replacing humans with AIs in most tasks removes the slow down, and so the process moves faster, towards loss of control.

Reply to Eliezer on Biological Anchors

Thanks for this post!

That being said, my model of Yudkowsky, which I built by spending time interpreting and reverse engineering the post you're responding to, feels like you're not addressing his points (obviously, I might have missed the real Yudkowsky's point)

My interpretation is that he is saying that Evolution (as the generator of most biological anchors) explores the solution space in a fundamentally different path than human research.  So what you have is two paths through a space. The burden of proof for biological anchors thus lies in arguing that there are enough connections/correlations between the two paths to use one in order to predict the other.

Here it sounds like you're taking as an assumption that human research follows the same or a faster path towards the same point in search space. But that's actually the assumption that IMO Yudkowsky is criticizing!

In his piece, Yudkowsky is giving arguments that the human research path should lead to more efficient AGIs than evolution, in part due to the ability of humans to have and leverage insights, which the naive optimization process of evolution can't do. He also points to the inefficiency of biology in implementing new (in geological-time) complex solutions. On the other hand, he doesn't seem to see a way of linking the amount of resources needed by evolution to the amount of resources needed by human research, because they are so different.

If the two paths are very different and don't even aim at the same parts of the search space, there's nothing telling you that computing the optimization power of the first path helps in understanding the second one.

I think Yudkowsky would agree that if you could estimate the amount of resources needed to simulate all evolution  until humans at the level of details that you know is enough to capture all relevant aspects, that amount of resources would be an upper bound on the time taken by human research because that's a way to get AGI if you have the resources. But the number is so vastly large (and actually unknown due to the "level of details" problem) that it's not really relevant for timelines calculations

(Also, I already had this discussion with Daniel Kokotajlo in this thread, but I really think that Platt's law is one of the least cruxy aspects of the original post. So I don't think discussing it further or pointing attention to it is a good idea)

Biology-Inspired AGI Timelines: The Trick That Never Works

First, I want to clarify that I feel we're going into a more interesting place, where there's a better chance that you might find a point that invalidates Yudkowsky's argument, and can thus convince him of the value of the model.

But it's also important to realize that IMO, Yudkowsky is not just saying that biological anchors are bad. The more general problem (which is also developed in this post) is that predicting the Future is really hard. In his own model of AGI timelines, the factor that is basically impossible to predict until you can make AGI is the "how much resources are needed to build AGI".

So saying "let's just throw away the biological anchors" doesn't evade the general counterargument that to predict timelines at all, you need to find information on "how much resources are needed to build AGI", and that is incredibly hard. If you or Ajeya can argue for actual evidence in that last question, then yeah, I expect Yudkowsky would possibly update on the validity of the timeline estimates.

But at the moment, in this thread, I see no argument like that.

Biology-Inspired AGI Timelines: The Trick That Never Works

I do think you are misconstruing Yudkowsky's argument. I'm going to give evidence (all of which are relatively strong IMO) in order of "ease of checkability". So I'll start with something anyone can check in a couple of minutes, and close by the more general interpretation that requires rereading the post in details.

Evidence 1: Yudkowsky flags Simulated-Eliezer as talking smack in the part you're mentioning

If I follow you correctly, your interpretation mostly comes from this part:

OpenPhil:  We did already consider that and try to take it into account: our model already includes a parameter for how algorithmic progress reduces hardware requirements.  It's not easy to graph as exactly as Moore's Law, as you say, but our best-guess estimate is that compute costs halve every 2-3 years.

Eliezer:  Oh, nice.  I was wondering what sort of tunable underdetermined parameters enabled your model to nail the psychologically overdetermined final figure of '30 years' so exactly.

OpenPhil:  Eliezer.

Note that this is one of the two times in this dialogue where Simulated-OpenPhil calls out Simulated-Eliezer. But remember that this whole dialogue was written by Yudkowsky! So he is flagging himself that this particular answer is a quip. Simulated-Eliezer doesn't reexplain it as he does most of his insulting points to Humbali; instead Simulated-Eliezer goes for a completely different explanation in the next answer.

Evidence 2: Platt's law is barely mentioned in the whole dialogue

"Platt" is used 6-times in the 20k words piece. "30 years" is used 8 times (basically at the same place where "Platt" is used").

Evidence 3: Humbali spends far more time discussing and justifying the "30 years" time than Simulated-OpenPhil. And Humbali is the strawman character, whereas Simulated-OpenPhil actually tries to discuss and to understand what Simulated Eliezer is saying.

Evidence 4: There is an alternative interpretation that takes into account the full text and doesn't use Platt's law at all: see this comment on your other thread for my current best version of that explanation.

Evidence 5: Yudkowsky's whole criticism relying on a purely empirical and superficial similarity goes contrary to everything that I extracted from his writing in my recent post, and also to all the time he spends here discussing deep knowledge and the need for an underlying model.

 

So my opinion is that Platt's law is completely superfluous here, and is present here only because it gives a way of pointing to the ridiculousness of some estimates, and because to Yudkowsky it probably means that people are not even making interesting new mistakes but just the same mistakes over and over again. I think discussing it in this post doesn't add much, and weakens the post significantly as it allows reading like yours Daniel, missing the actual point.

Biology-Inspired AGI Timelines: The Trick That Never Works

Here I think I share your interpretation of Yudkowsky; I just disagree with Yudkowsky. I agree on the second part; the model overestimates median TAI arrival time. But I disagree on the first part -- I think that having a probability distribution over when to expect TAI / AGI / AI-PONR etc. is pretty important/decision-relevant, e.g. for advising people on whether to go to grad school, or for deciding what sort of research project to undertake. (Perhaps Yudkowsky agrees with this much.) 

Hum, I would say Yudkowsky seems to agree with the value of a probability distribution for timelines.

(Quoting The Weak Inside View (2008) from the AI FOOM Debate)

So to me it seems “obvious” that my view of optimization is only strong enough to produce loose, qualitative conclusions, and that it can only be matched to its retrodiction of history, or wielded to pro-
duce future predictions, on the level of qualitative physics.

“Things should speed up here,” I could maybe say. But not “The doubling time of this exponential should be cut in half.”

I aspire to a deeper understanding of intelligence than this, mind you. But I’m not sure that even perfect Bayesian enlightenment would let me predict quantitatively how long it will take an AI to solve various problems in advance of it solving them. That might just rest on features of an unexplored solution space which I can’t guess in advance, even though I understand the process that searches.

On the other hand, my interpretation of Yudkowsky strongly disagree with the second part of your paragraph:

And I think that Ajeya's framework is the best framework I know of for getting that distribution. I think any reasonable distribution should be formed by Ajeya's framework, or some more complicated model that builds off of it (adding more bells and whistles such as e.g. a data-availability constraint or a probability-of-paradigm-shift mechanic.). Insofar as Yudkowsky was arguing against this, and saying that we need to throw out the whole model and start from scratch with a different model, I was not convinced. (Though maybe I need to reread the post and/or your steelman summary)

So my interpretation of the text is that Yudkowsky says that you need to know how compute will be transformed into AGI to estimate the timelines (then you can plug your estimates for the compute), and that the default of any approach which relies on biological analogies for that part will be sprouting nonsense, because evolution and biology optimize in fundamentally different ways than human researchers do.

For each of the three examples, he goes into more detail about the way this is instantiated. My understanding of his criticism of Ajeya's model is that he disagrees that just current deep learning algorithms are actually a recipe for turning compute into AGI, and so saying "we keep to current deep learning and estimated the required compute" doesn't make sense and doesn't solve the question of how to turn compute into AGI. (Note that his might be the place where you or someone defending Ajeya's model want to disagree with Yudkowsky. I'm just pointing that this is a more productive place to debate him because that might actually make him change his mind — or change your mind if he convinces you)

The more general argument (the reason why "the trick" doesn't work) is that if you actually have a way of transforming compute into AGI, that means you know how to build AGI. And if you do, you're very, very close to the end of the timeline.

Biology-Inspired AGI Timelines: The Trick That Never Works

Strongly disagree with this, to the extent that I think this is probably the least cruxy topic discussed in this post, and thus the comment is as wrong as is physically possible.

Remove Platt's law, and none of the actual arguments and meta-discussions changes. It's clearly a case of Yudkowsky going for the snappy "hey, see like even your new-and-smarter report makes exactly the same estimation predicted by a random psychological law" + his own frustration with the law still applying despite expected progress.

But once again, if Platt's law was so wrong that there was never in the history of the universe a single instance of people predicting strong AI and/or AGI in 30 years, this would have no influence whatsoever on the arguments in this post IMO.

Biology-Inspired AGI Timelines: The Trick That Never Works

I do agree that the halving-of-compute-costs-every-2.5-years estimate seems too slow to me; it seems like that's the rate of "normal incremental progress" but that when you account for the sort of really important ideas (or accumulations of ideas, or shifts in research direction towards more fruitful paths) that happen about once a decade, the rate should be faster than that.

I don't think this is what Yudkowsky is saying at all in the post. Actually, I think he is saying the exact opposite: that 2.5 years estimate is too fast as an estimate that is supposed to always work. If I understand correctly, his point is that you have significantly less than that most of the time, except during the initial growth after paradigms shifts where you're pushing as much compute as you can on your new paradigm. (That being said, Yudkowsky seems to agree with you that this should make us directionally update towards AGI arriving in less time)

My interpretation seems backed by this quote (and the fact that he's presenting these points as if they're clearly wrong):

Eliezer:  Backtesting this viewpoint on the previous history of computer science, it seems to me to assert that it should be possible to:

  • Train a pre-Transformer RNN/CNN-based model, not using any other techniques invented after 2017, to GPT-2 levels of performance, using only around 2x as much compute as GPT-2;
  • Play pro-level Go using 8-16 times as much computing power as AlphaGo, but only 2006 levels of technology.

[...]

Your model apparently suggests that we have gotten around 50 times more efficient at turning computation into intelligence since that time; so, we should be able to replicate any modern feat of deep learning performed in 2021, using techniques from before deep learning and around fifty times as much computing power.

 

 

This seems true but changing the subject. Insofar as the subject is "what should our probability distribution over date-of-AGI-creation look like" then Ajeya's framework (broadly construed) is the right way to think about it IMO. Separately, we should worry that this will never let us predict with confidence that it is happening in X years, and thus we should be trying to have a general policy that lets us react quickly to e.g. two years of warning.

I don't understand how Yudkowsky can be changing the subject when his subject has never been about "probability distribution over date-of-AGI-creation"? His point IMO is that this is a bad question to ask, not because you wouldn't want the true answer if you could magically get it, but because we don't have and won't have even close to the amount of evidence needed to do this non-trivially until 2 years before AGI (and maybe not even then, because you need to know the Thielian secrets). As such, to reach an answer that fit that type, you must contort the evidence and extract more bits of information that the analogies actually contain, which means that this is a recipe for saying nonsense.

(Note that I'm not arguing Yudkowsky is right, just that I think this is his point, and that your comment is missing it — might be wrong about all of those ^^)

I think OpenPhil is totally right here. My own stance is that the 2050-centered distribution is a directional overestimate because e.g. the long-horizon anchor is a soft upper bound (in fact I think the medium-horizon anchor is a soft upper bound too, see Fun with +12 OOMs.)

Here too this sounds like missing Yudkowsky's point, which is made in the paragraph just after your original quote:

Eliezer:  Mmm... there's some justice to that, now that I've come to write out this part of the dialogue.  Okay, let me revise my earlier stated opinion:  I think that your biological estimate is a trick that never works and, on its own terms, would tell us very little about AGI arrival times at all.  Separately, I think from my own model that your timeline distributions happen to be too long.

My interpretation is that he's saying that:

  • The model, and the whole approach, is a fundamentally bad and misguided way of thinking about these questions, which falls in the many ways he's arguing for before in the dialogue
  • If he stops talking about whether the model is bad, and just looks at its output, then he thinks that's an overestimate compared to the output of his own model.
Load More