Alex Lintz's take: https://forum.effectivealtruism.org/posts/eggdG27y75ot8dNn7/three-pillars-for-avoiding-agi-catastrophe-technical
Thanks for the update, Ajeya! I found the details here super interesting.
I already thought that timelines disagreements within EA weren't very cruxy, and this is another small update in that direction: I see you and various MIRI people and Metaculans give very different arguments about how to think about timelines, and then the actual median year I tend to hear is quite similar.
(And also, all of the stated arguments on all sides continue to seem weak/inconclusive to me! So IMO there's not much disagreement, and it would be very easy for all of us to be wro...
Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.
By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that…
One caveat to the claim that we should prioritize serial alignment work over parallelizable work is that this assumes an omniscient and optimal allocator of researcher-hours to problems.
Why do you think it assumes that?
This isn't a coincidence; the state of alignment knowledge is currently "we have no idea what would be involved in doing it even in principle, given realistic research paths and constraints", very far from being a well-specified engineering problem. Cf. https://intelligence.org/2013/11/04/from-philosophy-to-math-to-engineering/.
If you succeed at the framework-inventing "how does one even do this?" stage, then you can probably deploy an enormous amount of engineering talent in parallel to help with implementation, small iterative improvements, building-upon-foundations, targeting-established-metrics, etc. tasks.
From A central AI alignment problem: capabilities generalization, and the sharp left turn:
Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they've put the AI in lots of different video-game and simulated environments, and they've had good luck training it to pursue an objective that the operators described in English. "I don't know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices", they say. At the same time, they apply
(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I'm extra excited about people coming up with ideas for 3.)
Nate's follow-up post is now up: On how various plans miss the hard bits of the alignment challenge.
When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there's no guarantee there's even a reasonable solution.
Why would there not be a solution?
To clarify, I said there might not be a reasonable solution (i.e. such that solving the strawberry problem isn't significantly harder than solving pivotal-act alignment).
Not directly answering your Q, but here's why it seems unnatural and maybe misleading-of-attention. Copied from a Slack message I sent:
First, I suspect that even an aligned AI would fail the "duplicate a strawberry and do nothing else" challenge, because such an AI would care about human life and/or about cooperating with humans, and would be asked to stand by while 1.8 humans
On my model, the point of ass numbers isn't to demand perfection of your gut (e.g., of the sort that would be needed to avoid multiple-stage fallacies when trying to conditionalize a lot), but to:
Collecting all of the quantitative AI predictions I know of MIRI leadership making on Arbital (let me know if I missed any):
Note that I advocate for considering much weirder solutions, and also for thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.
Can I get us all to agree to push for including pivotal acts and pivotal processes in the Overton window, then? :) I'm happy to publicly talk about pivotal processes and encourage people to take them seriously as options to evaluate, while flagging that I'm ~2-5% on them be...
With pretty high confidence, you expect sharp left turn to happen (in almost all trajectories)
This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training" which is roughly the same as "discovering a core of generality" and few other formulations. These systems will be in some meaningful sense fundamentally different from eg Gato
That's right, though the phrasing "discovering a core of generality" here sounds sort of mystical and mysterious to me, which ma...
In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes".
My objection to Critch's post wasn't 'you shouldn't talk about pivotal processes, just pivotal acts'. On the contrary, I think bringing in pivotal processes is awesome.
My objection (more so to "Pivotal Act" Intentions, but also to the new one) is specifically to the idea that we should socially shun the concept of "pivotal acts", and socia...
An example of a possible "pivotal act" I like that isn't "melt all GPUs" is:
Use AGI to build fast-running high-fidelity human whole-brain emulations. Then run thousands of very-fast-thinking copies of your best thinkers. Seems to me this plausibly makes it realistic to keep tabs on the world's AGI progress, and locally intervene before anything dangerous happens, in a more surgical way rather than via mass property destruction of any sort.
Looking for pivotal acts that are less destructive (and, more importantly for humanity's sake, less difficult to align)...
Some hopefully-unnecessary background info for people attempting this task:
A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".
An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."
Ronny Fernandez on Twitter:
I think I don’t like AI safety analogies with human evolution except as illustrations. I don’t think they’re what convinced the people who use those analogies, and they’re not what convinced me. You can convince yourself of the same things just by knowing some stuff about agency.
Corrigibility, human values, and figure-out-while-aiming-for-human-values, are not short description length. I know because I’ve practiced finding the shortest description lengths of things a lot, and they just don’t seem like the right sort of thing.
From an Eliezer comment:
Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options. [...]
If Iceland did this, it would plausibly need...
I kind of like the analogous idea of an alignment target as a repeller cone / dome.
Corrigibility is a repeller. Human values aren't a repeller, but they're a very narrow target to hit.
A lot of models of what can or can't work in AI alignment depend on intuitions about whether to expect "true discontinuities" or just "steep bits".
Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that show continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).
Quoting Nate in 2018:
On my model, the key point
I'm not Eliezer, but my high-level attempt at this:
[...] The things I'd mainly recommend are interventions that:
Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/honest communication norms, and trying to build better mental models of crucial parts of the world.)
Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.
Help us understand and resolve major disagreements. (Especially current disagreements
I think most worlds that successfully navigate AGI risk have properties like:
I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?
I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
True! Though everyone already agreed (e.g., EY asserted this in the OP) that it's possible in principle. The updatey thing would be if the case of the human genome / brain development sugg...
Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?
Maybe I'm not understanding your proposal, but on the face of it this seems like a change of topic. I don't see Eliezer claiming 'there's no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head'. Maybe he does think that, but mostly I'd guess he doesn't care, because the important thing is whether you can point the AGI at very, very specifi...
For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.
In particular, I don't think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It's just not modeling the real world at all. It's not even trying to rack up a bunch of wins over time. It's just playing the si
Here's my answer: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=LowEED2iDkhco3a5d
We have to actually figure out how to build aligned AGI, and the details are crucial. If you're modeling this as a random blog post aimed at persuading people to care about this cause area, a "voice of AI safety" type task, then sure, the details are less important and it's not so clear that Yet Another Marginal Blog Post Arguing For "Care About AI Stuff" matters much.
But humanity also has to do the task of actually figuring o...
On Twitter, Eric Rogstad wrote:
"the thing where it keeps being literally him doing this stuff is quite a bad sign"
I'm a bit confused by this part. Some thoughts on why it seems odd for him (or others) to express that sentiment...
1. I parse the original as, "a collection of EY's thoughts on why safe AI is hard". It's EY's thoughts, why would someone else (other than @robbensinger) write a collection of EY's thoughts?
(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in co
The conclusion we should take from the concept of mesa-optimisation isn't "oh no alignment is impossible", that's equivalent to "oh no learning is impossible".
The OP isn't claiming that alignment is impossible.
If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.
I don't understand the point you're making here.
The point I'm making is that the human example tells us that:
First we realize that we can't code up our values, and therefore that alignment is hard. Then, when we realize that mesa-optimisation is a thing, we shouldn't update towards "alignment is even harder". We should update in the opposite direction.
Because the human example tells us that a mesa-optimiser can reliably point to a complex thing even if the optimiser points to only a few crude things.
But I only ever see these three points (the human example, the inability to code up values, and mesa-optimisation) used separately to argue for "alignment is even harder than previously thought". Taken together, that is just not the picture.
this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.
I don't think these statements all need to be true in order for p(doom) to be high, and I also don't think they're independent. Indeed, they seem more disjunctive than conjunctive to me; there are many cases where any one of the claims being true increases risk substantially, even if many others are false.
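To make the conjunctive-versus-disjunctive distinction concrete, here is a toy calculation. It is only a sketch: the count of 42 comes from the quoted comment, while the per-claim probabilities (0.9 and 0.1) and the independence assumption are invented purely for illustration, and independence is exactly what the reply above doubts.

```python
from math import prod

def p_all_true(probs):
    """Conjunctive reading: the outcome needs every claim to hold (assumes independence)."""
    return prod(probs)

def p_any_true(probs):
    """Disjunctive reading: any single claim holding is enough (assumes independence)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical numbers, chosen only for illustration.
conjunctive = p_all_true([0.9] * 42)  # each claim individually very likely
disjunctive = p_any_true([0.1] * 42)  # each claim individually unlikely

print(f"all 42 true (p=0.9 each): {conjunctive:.3f}")
print(f"at least one true (p=0.1 each): {disjunctive:.3f}")
```

With these made-up inputs, the conjunctive reading yields a small probability even though each claim is individually likely, and the disjunctive reading yields a large one even though each claim is individually unlikely. Which structure the claims actually have matters more than how many of them there are.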
a mistake to leave you as the main "public advocate / person who writes stuff down" person for the cause.
It sort of sounds like you're treating him as the sole "person who writes stuff down", not just the "main" one. Noam Chomsky might have been the "main linguistics guy" in the late 20th century, but people didn't expect him to write more than a trivial fraction of the field's output, either in terms of high-level overviews or in-the-trenches research.
I think EY was pretty clear in the OP that this is not how things go on earths that survive. Even if there aren't many who can write high-level alignment overviews today, more people should make the attempt and try to build skill.
The counter-concern is that if humanity can't talk about things that sound like sci-fi, then we just die. We're inventing AGI, whose big core characteristic is 'a technology that enables future technologies'. We need to somehow become able to start actually talking about AGI.
One strategy would be 'open with the normal-sounding stuff, then introduce increasingly weird stuff only when people are super bought into the normal stuff'. Some problems with this:
I mean, all of this feels very speculative and un-cruxy to me; I wouldn't be surprised if the ASI indeed is able to conclude that humanity is no threat at all, in which case it kills us just to harvest the resources.
I do think that normal predators are a little misleading in this context, though, because they haven't crossed the generality ('can do science and tech') threshold. Tigers won't invent new machines, so it's easier to upper-bound their capabilities. General intelligences are at least somewhat qualitatively trickier, because your enemy is 'the space of all reachable technologies' (including tech that may be surprisingly reachable). Tigers can surprise you, but not in very many ways and not to a large degree.
But once you invent cheap tech that can control them you don't need to kill them anymore.
A paperclipper mainly cares about humans because we might have some way to threaten the paperclipper (e.g., by pushing a button that deploys a rival superintelligence); and secondarily, we're made of atoms that can be used to build paperclips.
It's harder to monitor the actions of every single human on Earth, than it is to kill all humans; and there's a risk that monitoring people visibly will cause someone to push the 'deploy a rival superintelligence' button, if such ...
Yes, where killing all humans is an example of "controlling the people", from the perspective of an Unfriendly AI.
If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like "this (intuitively compelling) assumption is false" unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum.
Fair enough! I don't think I agree in general, but I think 'OK, but what's your alternative to agency?' is an es...
Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.
I'm not sure this is true; or if it's true, I'm not sure it's relevant. But assuming it is true...
Therefore, if X is not entirely coherent then X's preferences are only approximately defined, and hence we only need to infer them approximately.
... this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and ...
There is a big chunk of what you're trying to teach which is not weird and complicated, namely: "find this other agent, and what their values are". Because, "agents" and "values" are natural concepts, for reasons strongly related to "there's a relatively simple core structure that explains why complicated cognitive machines work".
This seems like it must be true to some degree, but "there is a big chunk" feels a bit too strong to me.
Possibly we don't disagree, and just have different notions of what a "big chunk" is. But some things that make the chunk feel sm...
Humans are at least a little coherent, or we would never get anything done; but we aren't very coherent, so the project of piecing together 'what does the human brain as a whole "want"' can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.
This is a point where I feel like I do have a substantial disagreement with the "conventional wisdom" of LessWrong.
First, LessWrong began with a discussion of cognitive biases in human irrationality, so this naturally became a staple of the local narrative. On the other hand, I ...
I don't think I personally could have written it; if others think they could have, I'd genuinely be interested to hear them brag, even if they can't prove it.
Maybe the ideal would be 'I generated the core ideas of [a,b,c] with little or no argument from others; I had to be convinced of [d,e,f] but I now agree with them; I disagree with [g,h,i]; I think you left out important considerations [x,y,z].' Just knowing people's self-model is interesting to me, I don't demand that everything you believe be immediately provable to me.
It's very clear to me I could have written this if I had wanted to—and at the very least I'm sure Paul could have as well. As evidence: it took me ~1 hour to list off all the existing sources that cover every one of these points in my comment.
I have a couple object-level disagreements including relevance of evolution / nature of inner alignment problem and difficulty of attaining corrigibility. But leaving those aside, I wouldn’t have exactly written this kind of document myself, because I’m not quite sure what the purpose is. It seems to be trying to do a lot of different things for different audiences, where I think more narrowly-tailored documents would be better.
So, here are four useful things to do, and whether I’m personally doing them:
First, there is a mass of people who think AGI risk i...
I think as of early this year (like, January/February, before I saw a version of this doc) I could have produced a pretty similar list to this one. I definitely would not derive it from the empty string in the closest world-without-Eliezer; I'm unsure how much I'd pay attention to AI alignment at all in that world. I'd very likely be working on agent foundations in that world, but possibly in the context of biology or AI capabilities rather than alignment. Arguments about AI foom and doom were obviously-to-me correct once I paid attention to them at all, b...
Yes, please do rewrite the post, or make your own version of a post like this!! :) I don't suggest trying to persuade arbitrary policymakers of AGI risk, but I'd be very keen on posts like this optimized to be clear and informative to different audiences. Especially groups like 'lucid ML researchers who might go into alignment research', 'lucid mathematicians, physicists, etc. who might go into alignment research', etc.
Suggestion: make it a CYOA-style interactive piece, where the reader is tasked with aligning AI, and could choose from a variety of approaches which branch out into sub-approaches and so on. All of the paths, of course, bottom out in everyone dying, with detailed explanations of why. This project might then evolve based on feedback, adding new branches that counter counter-arguments made by people who played it and weren't convinced. Might also make several "modes", targeted at ML specialists, general public, etc., where the text makes different tradeoffs ...
I'm thinking of people like Paul Christiano, Nate Soares, John Wentworth, Ajeya Cotra... [...] I do agree with you that they seem to on average be way way too optimistic, but I don't think it's because they are ignorant of the considerations and arguments you've made here.
I don't think Nate is that much more optimistic than Eliezer, but I believe Eliezer thinks Nate couldn't have generated enough of the list in the OP, or couldn't have generated enough of it independently ("using the null string as input").
I agree that this would be scary if the system is, for example, as smart as physically possible. What I'm imagining is:
Conversely, it doesn't seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information - like e.g. what sort of changes-to-the-world we do/don't care about, what thing-in-the-environment the system is supposed to be corrigible with, etc.
I suspect you could do this in a less value-loaded way if you're somehow intervening on 'what the AGI wants to pay attention to', as opposed to just intervening on 'what sorts of directions it wants to steer the world in'.
'Only spend your cognition thinking about in...
I'm not sure whether you mean "95% correct CEV has a lot of S-risk" or "95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying"?
The latter, as I was imagining "95%".
Nate is writing that post. :)
You're basically saying, your aim is not to design ethical/friendly/aligned AI [...]
My goal is an awesome, eudaimonistic long-run future. To get there, I strongly predict that you need to build AGI that is fully aligned with human values. To get there, I strongly predict that you need to have decades of experience actually working with AGI, since early generations of systems will inevitably have bugs and limitations and it would be catastrophic to lock in the wrong future because we did a rush job.
(I'd also expect us to need the equivalent of subjective ce...
Yeah, I'm very interested in hearing counter-arguments to claims like this. I'll say that although I think task AGI is easier, it's not necessarily strictly easier, for the reason you mentioned.
Maybe a cruxier way of putting my claim is: Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn't seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.
And I do think you need to get CEV up and running withi...
I think there are multiple viable options, like the toy example EY uses:
I think that after AGI becomes possible at all and then possible to scale to dangerously superhuman levels, there will be, in the best-case scenario where a lot of other social difficulties got resolved, a 3-month to 2-year period where only a very few actors have AGI, meaning that it was socially possible for those few actors to decide to not just scale it to where it automatically destroys the world.
During this step, if humanity is to survive, somebody has to perform some feat that c
The 2017 document postulates an "acute risk period" in which people don't know how to align, and then a "stable period" once alignment theory is mature.
"Align" is a vague term. Let's distinguish "strawberry alignment" (where we can safely and reliably use an AGI to execute a task like "Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level.") from "CEV alignment" (where we can safely and reliably use an AGI to carry out a CEV-like procedure.)
Strawberry alignment seems vastly easier than CEV ali...
Quoting a thing I said in March:
The two big things we feel bottlenecked on are:
(1) people who can generate promising new alignment ideas. (By far the top priority, but seems empirically rare.)
(2) competent executives who are unusually good at understanding the kinds of things MIRI is trying to do, and who can run their own large alignment projects mostly-independently.
For 2, I think the best way to get hired by MIRI is to prove your abilities via the Visible Thoughts Project. The post there says a bit more about the kind of skills we're looking for:
Eliezer