Samuel Dylan Martin

MTAIR project and Center on Long-term Risk, PhD candidate working on cooperative AI, Philosophy and Physics BSc, AI MSc at Edinburgh. Interested in philosophy, longtermism and AI Alignment. I write science fiction at

I think this is a good description of what agent foundations is and why it might be needed. But the binary of 'either we get alignment by default or we need to find the True Name' isn't how I think about it.

Rather, there's some unknown parameter, something like: 'how sharply does the pressure towards incorrigibility ramp up, at what capability level does it start, and how strong is it?'

Setting this at 0 means alignment by default. Setting this higher and higher means we need various kinds of prosaic alignment strategies which are better at keeping systems corrigible and detecting bad behaviour. And setting it at 'infinity' means we need to find the True Names/foundational insights.


My rough model is that there's an unknown quantity about reality which is roughly "how strong does the oversight process have to be before the trained model does what the oversight process intended for it to do". p(doom) mainly depends on whether the actors training the powerful systems have sufficiently powerful oversight processes.

Maybe one way of getting at this is to look at ELK - if you think the simplest dumbest ELK proposals probably work, that's Alignment by Default. The harder you think prosaic alignment is, the more complex an ELK solution you expect to need. And if you think we need agent foundations, you think we need a worst-case ELK solution.

Like I said in my first comment, the in-practice difficulty of alignment is obviously connected to timeline and takeoff speed.

But you're right that in this post you're talking about the intrinsic difficulty of alignment vs takeoff speed, not the in-practice difficulty.

But those are also still correlated, for the reasons I gave - mainly that a discontinuity is an essential step in Eliezer-style pessimism and fast takeoff views. I'm not sure how close this correlation is.

Do these views come apart in other possible worlds? I.e. could you believe in a discontinuity to a core of general intelligence but still think prosaic alignment can work?

I think that potentially you can - if you think that pre-HLMI AI (pre-discontinuity) still has enough capabilities to help you do alignment research before dangerous HLMI shows up. But prosaic alignment seems to require more assumptions to be feasible given a discontinuity - for example, that the discontinuity doesn't occur before we have all the important capabilities needed to do good alignment research.

three possibilities about AI alignment which are orthogonal to takeoff speed and timing

I think "AI Alignment difficulty is orthogonal to takeoff speed/timing" is quite conceptually tricky to think through, but I still don't think it's true. It's conceptually tricky because the real truth about 'alignment difficulty' and takeoff speed, whatever it is, is probably logically or physically necessary: there aren't really alternative outcomes there. But we have a lot of logical uncertainty and conceptual confusion, so it still looks like there are different possibilities. Still, I think they're correlated.

First off, takeoff speed and timing are correlated: if you think HLMI is sooner, you must think progress towards HLMI will be faster, which implies takeoff will also be faster.

The faster we expect takeoff to go, the more likely it is that alignment is also difficult. There are two reasons for this. One is practical: the faster takeoff is, the less time you have to solve the problem before unaligned competitors become a problem. But the second is about the intrinsic difficulty of alignment (which I think is what you're talking about here).

Much of the reason that alignment pessimists like Eliezer think prosaic alignment can't work is that they expect that when we reach a capability discontinuity/find the core of general intelligence/enter the regime where AI capabilities start generalizing much further than they did before, whatever we were using to ensure corrigibility will suddenly break on us and probably trigger deceptive alignment immediately, with no intermediate phase.

The more gradual and continuous you expect this scaling up to be, the more confident you should be in prosaic alignment, or alignment by default. There are other variables at play, so the two aren't in direct correlation, but they aren't orthogonal either.

(Also, the whole idea of getting assistance from AI tools on alignment research is in the mix here as well. If there's a big capability discontinuity when we find the core of generality, that causes systems to generalize really far, and also breaks corrigibility, then plausibly but not necessarily, all the capabilities we need to do useful alignment research in time to avoid unaligned AI disasters are on the other side of that discontinuity, creating a chicken-and-egg problem.)

Another way of picking up on this fact is that many of the analogy arguments used for fast takeoff (for example, that human evolution gives us evidence for giant qualitative jumps in capability) are also used, in very similar form, to argue for difficult alignment (e.g. that when humans suddenly started ramping up in intelligence, we also started ignoring the goals of our 'outer optimiser').

catastrophists: when evolution was gradually improving hominid brains, suddenly something clicked - it stumbled upon the core of general reasoning - and hominids went from banana classifiers to spaceship builders. hence we should expect a similar (but much sharper, given the process speeds) discontinuity with AI.

gradualists: no, there was no discontinuity with hominids per se; human brains merely reached a threshold that enabled cultural accumulation (and in a meaningful sense it was culture that built those spaceships). similarly, we should not expect sudden discontinuities with AI per se, just accelerating (and possibly unfavorable to humans) cultural change as human contributions are automated away.


I found the extended Fire/Nuclear Weapons analogy to be quite helpful. Here's how I think it goes:

In 1870 the gradualist and the catastrophist physicist wonder about whether there will ever be a discontinuity in explosive power

  • Gradualist: we've already had our zero-to-one discontinuity - we've invented black powder, dynamite and fuses, from now on there'll be incremental changes and inventions that increase explosive power but probably not anything qualitatively new, because that's our default expectation with a technology like explosives where there are lots of paths to improvement and lots of effort exerted
  • Catastrophist: that's all fine and good, but those priors don't mean anything if we have already seen an existence proof for qualitatively new energy sources. What about the sun? The energy the sun outputs is overwhelming, enough to warm the entire earth. One day, we'll discover how to release those energies ourselves, and that will give us qualitatively better explosives.
  • Gradualist: But we don't know anything about how the sun works! It's probably just a giant ball of gas heated by gravitational collapse! One day, in some crazy distant future, we might be able to pile on enough gas that gravitational collapse heats it, but that'll require us to be able to literally build stars - it's not going to occur suddenly. We'll pile up a small amount of gas, then a larger amount, and so on, long after we've given up on assembling bigger and bigger piles of explosives. There's no secret physics there, just a lot of conventional gravitational and chemical energy in one place
  • Catastrophist: ah, but don't you know Lord Kelvin calculated the Sun could only shine for a few million years under the gravitational mechanism, and we know the Earth is far older than that? So there has to be some other, incredibly powerful energy source within the sun that we've not yet discovered. And when we do discover it, we'll know it can, under the right circumstances, release enough energy to power the Sun, so it seems foolhardy to assume it'll just happen to be as powerful as our best normal explosive technologies are whenever we make the discovery. Imagine the coincidence if that were true! So I can't say when this will happen or even exactly how powerful it'll be, but when we discover the Sun's power it will probably represent a qualitatively more powerful new energy source. Even if there are many ways to tweak our best chemical explosives to be more powerful and/or the potential new sun-power explosives to be weaker, we'd still not hit the narrow target of the two being roughly on the same level.
  • Gradualist: Your logic works, but I doubt Lord Kelvin's calculation.


It seems like the AGI Gradualist sees the example of humans like my imagined Nukes Gradualist sees the sun, i.e. just a scale up of what we have now. While the AGI Catastrophist sees Humans as my imagined Nukes Catastrophist sees the sun.

The key disanalogy is that for the Sun case, there's a very clear 'impossibility proof' given by the Nukes Catastrophist that the sun couldn't just be a scale up of existing chemical and gravitational energy sources.

Compare this,


We're in the Eliezerverse with huge kinks in loss graphs on automated programming/Putnam problems.

Not from scaling up inputs but from a local discovery that is much bigger in impact than the sorts of jumps we observe from things like Transformers.


but, sure, "huge kinks in loss graphs on automated programming / Putnam problems" sounds like something that is, if not mandated on my model, much more likely than it is in the Paulverse. though I am a bit surprised because I would not have expected Paul to be okay betting on that.

to this,

[Rohin] To the extent this is accurate, it doesn't seem like you really get to make a bet that resolves before the end times, since you agree on basically everything until the point at which Eliezer predicts that you get the zero-to-one transition on the underlying driver of impact. 

Eliezer does (at least weakly) expect more trend breaks before The End (even on metrics that aren't qualitative measures of impressiveness/intelligence, but just things like model loss functions), despite the fact that Rohin's summary of his view is (I think) roughly accurate.

What explains this? I think it's something roughly like: part of the reason Eliezer expects a sudden transition when we reach the core of generality in the first place is that he thinks that's how things usually go in the history of tech/AI progress - there are also specific reasons to think it will happen in the case of finding the core of generality, but there are also general reasons. See e.g. this from Eliezer:

well, the Eliezerverse has more weird novel profitable things, because it has more weirdness

I take 'more weirdness' to mean something like more discoveries that induce sudden improvements out there in general.

So I think that's why his view does make (weaker) differential predictions about earlier events that we can test, not because the zero-to-one core of generality hypothesis predicts anything about narrow AI progress, but because some of the beliefs that led to that hypothesis do.



We can see there are two (connected) lines of argument, and that Eliezer and Paul/Carl/Richard have different things to say on each - 1 is more localized and about seeing what we can learn about AGI specifically, and 2 is about reference class reasoning and what tech progress in general tells us about AGI:

  1. Specific to AGI: What can we infer from human evolution and interrogating our understanding of general intelligence (?) about whether AGI will arrive suddenly?
  2. Reference Class: What can we infer about AGI progress from the general record of technological progress, especially how common big impacts are when there's lots of effort and investment?

My sense is that Eliezer answers

  1. Big update for Eliezer's view: This tells us a lot, in particular we learn evolution got to the core of generality quickly, so AI progress will probably get there quickly as well. Plus, Humans are an existence proof for the core of generality, which suggests our default expectation should be sudden progress when we hit the core.
  2. Smaller update for Eliezer's view: This isn't that important - there's no necessary connection between AGI and e.g. bridges or nukes. But you can at least see that there's not a strong consistent track record of continuous improvement once you understand the historical record the right way (plus the underlying assumption in 2 that there will be a lot of effort and investment is probably wrong as well). Nonetheless, if you avoid retrospective trend-fitting and look at progress in the most natural (qualitative?) way, you'll see that early discoveries that go from 0 to 1 are all over the place: Bitcoin, the Wright flyer, and nuclear weapons are at least not crazy exceptions, and quite possibly the default.

While Paul and Carl(?) answer,

  1. Smaller update for Paul's view: The disanalogies between AI progress and evolution all point in the direction of AI progress being smoother than evolution (we're intelligently trying to find the capabilities we want). We get a weak update in favour of the smooth-progress view from understanding this disanalogy, but really we don't learn much, except that there aren't any good reasons to think there are only a few paths to a large set of powerful world-affecting capabilities. Also, the core of generality idea is wrong, so the idea that humans are an existence proof for it, or that evolution tells us something about how to find it, is wrong.
  2. Big update for Paul's view: reasoning from the reference class of 'technologies where there are many opportunities for improvement and many people trying different things at once' lets us see why expecting smooth progress should be the default. It's because as long as there are lots of paths to improvement in the underlying capability landscape (which is the default because that's how the world works by default), and there are lots of people trying to make improvements in different ways, the incremental changes add up to smooth outputs.

So Eliezer's claim that Paul et al's trend-fitting must include

doing something sophisticated but wordless, where they fit a sophisticated but wordless universal model of technological permittivity to bridge lengths, then have a wordless model of cognitive scaling in the back of their minds

is sort of correct, but the model isn't really sophisticated or wordless.

The model is: as long as the underlying 'capability landscape' offers many paths to improvements, not just a few really narrow ones that swamp everything else, lots of people intelligently trying lots of different approaches will lead to lots of small discoveries that add up. Additionally, most examples of tech progress look like 'multiple ways of doing something that add up', and this is confirmed by the historical record.

And then the model of cognitive scaling consists of (among other things) specific counterarguments to the claim that AGI progress is one of those cases with a few big paths to improvement (e.g. Evolution doesn't give us evidence that AGI progress will be sudden).
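The 'many paths add up to smooth progress' model can be illustrated with a toy simulation (my own sketch, not something from the dialogues; all the numbers are arbitrary): many independent sources of small improvement produce a smooth aggregate curve, while a landscape dominated by one huge 'secret' produces a visible jump.

```python
# Toy illustration of the gradualist model: progress on a metric when
# improvements come from many small independent discoveries, vs a landscape
# where one big "secret" discovery dominates. All parameters are arbitrary.
import random

random.seed(0)

def progress(steps, n_paths, secret_step=None, secret_size=0.0):
    """Cumulative progress curve: each step, n_paths researchers each
    contribute a small random improvement; optionally one huge discovery
    lands at secret_step."""
    total, curve = 0.0, []
    for t in range(steps):
        total += sum(random.uniform(0, 0.01) for _ in range(n_paths))
        if t == secret_step:
            total += secret_size  # the one big discovery swamps everything else
        curve.append(total)
    return curve

smooth = progress(steps=100, n_paths=50)
jumpy = progress(steps=100, n_paths=2, secret_step=60, secret_size=5.0)

def max_jump_frac(curve):
    """Largest single-step jump as a fraction of final progress."""
    jumps = [b - a for a, b in zip(curve, curve[1:])]
    return max(jumps) / curve[-1]

print(f"many-paths max jump:      {max_jump_frac(smooth):.1%}")  # small: looks smooth
print(f"few-paths + secret jump:  {max_jump_frac(jumpy):.1%}")   # large: a discontinuity
```

The point of the sketch is just that smoothness falls out of the landscape assumption: with many independent paths, no single discovery is a large fraction of total progress, so trend extrapolation works; with few paths and one big secret, it doesn't.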




As I work through sectors and the rollout of past automation I see opportunities for large-scale rollout that is not heavily blocked by regulation...[Long list of examples]


so... when I imagine trying to deploy this style of thought myself to predict the recent past without benefit of hindsight, it returns a lot of errors. perhaps this is because I do not know how to use this style of thought...

..."There are many possible regulatory regimes in the world, some of which would permit rapid construction of mRNA-vaccine factories well in advance of FDA approval. Given the overall urgency of the pandemic some of those extra-USA vaccines would be sold to individuals or a few countries like Israel willing to pay high prices for them, which would provide evidence of efficacy and break the usual impulse towards regulatory uniformity among developed countries..."

On Carl's view, it sure seems like you'd just say something like "Healthcare is very overregulated, there will be an unusually strong effort anyway in lots of countries because Covid is an emergency, so it'll be faster by some hard-to-predict amount but still bottlenecked by regulatory pressures." And indeed the fastest countries got there in ~10 months instead of the multiple years predicted by superforecasters, or the ~3 months it would have taken with immediate approval.

The obvious object-level difference between Eliezer 'applying' Carl's view to retrodict covid vaccine rollout and Carl's prediction about AI is that Carl is saying there's an enormous number of potential applications of intermediately general AI tech, many of which aren't blocked by regulation, while Eliezer's attempt at operating Carl's view for covid vaccines is saying "There are many chances for countries with lots of regulatory barriers to do the smart thing".

The vaccine example is a different argument from the AI predictions, because what Carl is saying is that there are many completely open goals for improvement, like automating factories and call centres etc., not that there are many opportunities to avoid the regulatory barriers that will block everything by default.

But it seems like Eliezer is making a more outside view appeal, i.e. approach stories where big innovations are used wisely with a lot of scepticism because of our past record, even if you can tell a story about why it will be quite different this time.

Summary of why I think the post's estimates are too low as estimates of what's required for a system capable of seizing a decisive strategic advantage:

To be an APS-like system, OmegaStar needs to be able to control robots or model real-world stuff, and also plan over billions, not hundreds, of action steps.

Each of those problems adds on a few extra OOMs that aren't accounted for in e.g. the setup for Omegastar (which can transfer learn across tens of thousands of games, each requiring thousands of action steps to win in a much less complicated environment than the real world).

You'd need something that can transfer learn across tens of thousands of 'games' each requiring billions of action steps, each one of which has way more sensory input to parse than StarCraft per time step.

When you correct OmegaStar's requirements by adding on (1) a factor for the number of action steps needed to win a war vs win a game of StarCraft, and (2) a factor for the real world's complexity of sensory input vs StarCraft's, the total requirement looks more like the estimates in Ajeya's report.
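As a minimal sketch of how these correction factors combine (all the numbers here are hypothetical placeholders of my own, not figures from the post or from Ajeya's report - the point is only that each factor adds its OOMs in log space):

```python
# Illustrative only: every number below is a made-up placeholder.
import math

omegastar_flop_oom = 27  # hypothetical baseline training-compute estimate (log10 FLOP)

# Correction (1): action steps to win a war vs a game of StarCraft
action_steps_war = 1e9
action_steps_game = 1e3
horizon_correction = math.log10(action_steps_war / action_steps_game)  # +6 OOMs

# Correction (2): sensory input complexity per time step, real world vs StarCraft
real_world_input = 1e8   # hypothetical "bits to parse" per step in the real world
starcraft_input = 1e5    # hypothetical "bits to parse" per step in StarCraft
input_correction = math.log10(real_world_input / starcraft_input)      # +3 OOMs

corrected_oom = omegastar_flop_oom + horizon_correction + input_correction
print(f"Corrected requirement: ~10^{corrected_oom:.0f} FLOP")  # ~10^36 FLOP
```

Under these (invented) inputs the two corrections alone add nine orders of magnitude, which is the shape of the claim: plausible-sounding per-factor ratios multiply into a much larger total requirement.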

I still get the intuition that OmegaStar would not just be a fancy game player! I find it hard to think about what it would be like - maybe good at gaming quite constrained systems or manipulating people?


Therefore, I think the arguments provide a strong case (unless scaling laws break - which I also think is fairly likely for technical reasons) for 'something crazy happening by 2030' but less strong a case for 'AI takeover by 2030'


Summary of my and Daniel's disagreements:

(1) Horizon Length: Daniel thinks we'll get a long way towards planning over a billion action steps 'for free' if we transfer learn over lots of games that take a thousand action steps each - so the first correction factor I gave is a lot smaller than it seems just from comparing the raw complexity of StarCraft vs fighting a war

(2) No Robots: the complexity in sensory input difference doesn't matter since the system won't need to control robots [Or, as I should have also said, build robot-level models of the external world even if you're not running the actuators yourself] - so the second correction factor isn't an issue, because-

(3) Lower capability threshold: taking a DSA doesn't require as many action steps as it seems, or as many capabilities as often assumed. You can do it just by talking to people, over a smaller number of action steps than it would take to conquer the world yourself.

To me, it seems like Daniel's view on horizon length reducing one of the upward corrections (1) is doing less total work than (2) and (3) in terms of shortening the timeline - hence this view looks to me like a case of plausible DSA from narrow AI with specialized abilities. Although point taken that it won't look that narrow to most people today.

(Re scaling laws - there's a whole debate to be had about how scaling laws are just a very crude observable for what's really going on, so we shouldn't be confident in extrapolation. This is also all conditional on the underlying assumptions of these forecasting models being correct.)

Updates on this after reflection and discussion (thanks to Rohin):

Human Evolution tells us very little about the 'cognitive landscape of all minds' (if that's even a coherent idea) - it's simply a loosely analogous individual historical example

Saying Paul's view is that the cognitive landscape of minds might be simply incoherent isn't quite right - at the very least you can talk about the distribution over programs implied by the random initialization of a neural network.

I could have just said 'Paul doesn't see this strong generality attractor in the cognitive landscape' but it seems to me that it's not just a disagreement about the abstraction, but that he trusts claims made on the basis of these sorts of abstractions less than Eliezer.

Also, on Paul's view, it's not that evolution is irrelevant as a counterexample. Rather, the specific fact of 'evolution gave us general intelligence suddenly by evolutionary timescales' is an unimportant surface fact, and the real truth about evolution is consistent with the continuous view.

No core of generality and extrapolation of quantitative metrics for things we care about and lack of common huge secrets in relevant tech progress reference class

These two initial claims are connected in a way I didn't make explicit - No core of generality and lack of common secrets in the reference class together imply that there are lots of paths to improving on practical metrics (not just those that give us generality), that we are putting in lots of effort into improving such metrics and that we tend to take the best ones first, so the metric improves continuously, and trend extrapolation will be especially correct.

Core of generality and very common presence of huge secrets in relevant tech progress reference class

The first clause already implies the second clause (since "how to get the core of generality" is itself a huge secret), but Eliezer seems to use non-intelligence related examples of sudden tech progress as evidence that huge secrets are common in tech progress in general, independent of the specific reason to think generality is one such secret.


Nate's Summary

... Eliezer was saying something like "the fact that humans go around doing something vaguely like weighting outcomes by possibility and also by attractiveness, which they then roughly multiply, is quite sufficient evidence for my purposes, as one who does not pay tribute to the gods of modesty", while Richard protested something more like "but aren't you trying to use your concept to carry a whole lot more weight than that amount of evidence supports?"...

And, ofc, at this point, my Eliezer-model is again saying "This is why we should be discussing things concretely! It is quite telling that all the plans we can concretely visualize for saving our skins, are scary-adjacent; and all the non-scary plans, can't save our skins!"

Nate's summary brings up two points I more or less ignored in my summary because I wasn't sure what I thought - one is, just what role do the considerations about expected incompetent response/regulatory barriers/mistakes in choosing alignment strategies play? Are they necessary for a high likelihood of doom, or just peripheral assumptions? Clearly, you have to posit some level of "civilization fails to do the x-risk-minimizing thing" if you want to argue doom, but how extreme are the scenarios Eliezer is imagining where success is likely?

The other is the role that the modesty worldview plays in Eliezer's objections.

I feel confused/suspect we might have all lost track of what Modesty epistemology is supposed to consist of - I thought it was something like "overuse of the outside view, especially in a social cognition context".

Which of the following is:

a) probably the product of a Modesty world-view?

b) no good reason to think comes from a Modesty world-view but still bad epistemology?

c) good epistemology?

  1. Not believing theories which don’t make new testable predictions just because they retrodict lots of things in a way that the theory’s proponents claim is more natural, but that you don’t understand, because that seems generally suspicious
  2. Not believing theories which don’t make new testable predictions just because they retrodict lots of things in the world naturally (in a way you sort of get intuitively), because you don’t trust your own assessments of naturalness that much in the absence of discriminating evidence
  3. Not believing theories which don’t make new testable predictions just because they retrodict lots of things in the world naturally (in a way you sort of get intuitively), because most powerful theories which cause conceptual revolutions also make new testable predictions, so it’s a bad sign if the newly proposed theory doesn’t.
  4. As a general matter, accepting that there are lots of cases of theories which are knowably true independent of any new testable predictions they make because of features of the theory. Things like the implication of general relativity from the equivalence principle, or energy conservation from Noether’s theorem, or many-worlds from QM are real, but you’ll only believe you’ve found a case like this if you’re walked through to the conclusion, so you're sure that the underlying concepts are clear and applicable, or there’s already a scientific consensus behind it.

Holden also mentions something a bit like Eliezer's criticism in his own write-up,

In particular, I think it's hard to rule out the possibility of ingenuity leading to transformative AI in some far more efficient way than the "brute-force" method contemplated here.

When Holden talks about 'ingenuity' methods that seems consistent with Eliezer's 

They're not going to be taking your default-imagined approach algorithmically faster, they're going to be taking an algorithmically different approach that eats computing power in a different way than you imagine it being consumed.

I.e. if you wanted to fold this consideration into OpenAI's estimate you'd have to do it by having a giant incredibly uncertain free-floating variable for 'speedup factor' because you'd be nonsensically trying to estimate the 'speed-up' to brain processing applied from using some completely non-Deep Learning or non-brainlike algorithm for intelligence. All your uncertainty just gets moved into that one factor, and you're back where you started.


It's possible that Eliezer is confident in this objection partly because of his 'core of generality' model of intelligence - i.e. he's implicitly imagining enormous numbers of varied paths to improvement that end up practically in the same place, while 'stack more layers in a brainlike DL model' is just one of those paths (and one that probably won't even work), so he naturally thinks estimating the difficulty of this one path we definitely won't take (and which probably wouldn't work even if we did try it) out of the huge numbers of varied paths to generality is useless.

However, if you don't have this model, then perhaps you can be more confident that what we're likely to build will look at least somewhat like a compute-limited DL system, and that these other paths will have to share some properties of this path. Relatedly, it's an implication of the model that there's some imaginable (and not e.g. galaxy-sized) model we could build right now that would be an AGI, which I think Eliezer disputes?

isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.

I think this specific scenario sketch is from a mainstream AI safety perspective a case where we've already failed - i.e. we've invented a useless corrigibility intervention that we confidently but wrongly think is scalable.

And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.

If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.

Most AI safety researchers just don't agree with Eliezer that there are no (likely-to-be-found) corrigibility interventions that won't suddenly and invisibly fail when you increase intelligence, no matter how well you've validated them in low-capability regimes and how carefully you try to scale up. This is because they don't agree with (or haven't heard) Eliezer's arguments about consequentialism being a super-strong attractor.

So they'd think the 'die with the most dignity' interventions would just work, while the 'die with no dignity' interventions are risky, and quite reasonably push for the former (since it's far from clear we'll take the 'dignified' option by default): trying corrigibility interventions at low levels of intelligence, testing the AI on validation sets to see if it plots to kill them, while scaling up.

They might be wrong about this working, but if so, the wrongness isn't in lacking enough security mindset to see that an AI trying to kill you would just alter its own cognition to cheat its way past the tests. Rather, their mistake is not expecting the corrigibility interventions they presumably trust to suddenly break in a way that means you get no useful safety guarantees from any amount of testing at lower capability levels.

I think it's a shame Eliezer didn't pose the 'validation set' question first before answering it himself, because I think if you got rid of the difference in underlying assumptions - i.e. asked an alignment researcher "Assume there's a strong chance your corrigibility intervention won't work upon scaling up and the AGI might start plotting against you, so you're going to try these transparency/validation schemes on the AGI to check if it's safe, how could they go wrong and is this a good idea?" they'd give basically the same answer - i.e. if you try this you're probably going to die.


You could still reasonably say, "even if the AI safety community thinks it's not the best use of resources because ensuring knowably stable corrigibility looks a lot easier to us, shouldn't we still be working on some strongly deception-proof method of verifying if an agent is safe, so we can avoid killing ourselves if plan A fails?"

My answer would be yes.
