All of Rob Bensinger's Comments + Replies

Thanks for the update, Ajeya! I found the details here super interesting.

I already thought that timelines disagreements within EA weren't very cruxy, and this is another small update in that direction: I see you and various MIRI people and Metaculans give very different arguments about how to think about timelines, and then the actual median year I tend to hear is quite similar.

(And also, all of the stated arguments on all sides continue to seem weak/inconclusive to me! So IMO there's not much disagreement, and it would be very easy for all of us to be wro... (read more)

3 Ajeya Cotra 2mo
Yeah I agree more of the value of this kind of exercise (at least within the community) is in revealing more granular disagreements about various things. But I do think there's value in establishing to more external people something high level like "It really could be soon and it's not crazy or sci fi to think so."

Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.

By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that

  • is highly safety-conscious.
  • has a la
... (read more)

One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems.

Why do you think it assumes that?

This isn't a coincidence; the state of alignment knowledge is currently "we have no idea what would be involved in doing it even in principle, given realistic research paths and constraints", very far from being a well-specified engineering problem. Cf. https://intelligence.org/2013/11/04/from-philosophy-to-math-to-engineering/.

If you succeed at the framework-inventing "how does one even do this?" stage, then you can probably deploy an enormous amount of engineering talent in parallel to help with implementation, small iterative improvements, building-upon-foundations, targeting-established-metrics, etc. tasks.

From A central AI alignment problem: capabilities generalization, and the sharp left turn:

Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they've put the AI in lots of different video-game and simulated environments, and they've had good luck training it to pursue an objective that the operators described in English. "I don't know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices", they say. At the same time, they apply

... (read more)
1 Lauro Langosco 2mo
Thanks!

(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I'm extra excited about people coming up with ideas for 3.)

When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there's no guarantee there's even a reasonable solution.

Why would there not be a solution?

To clarify, I said there might not be a reasonable solution (i.e. such that solving the strawberry problem isn't significantly harder than solving pivotal-act alignment). 

Not directly answering your Q, but here's why it seems unnatural and maybe misleading-of-attention. Copied from a Slack message I sent: 

First, I suspect that even an aligned AI would fail the "duplicate a strawberry and do nothing else" challenge, because such an AI would care about human life and/or about cooperating with humans, and would be asked to stand by while 1.8 humans

... (read more)

On my model, the point of ass numbers isn't to demand perfection of your gut (e.g., of the sort that would be needed to avoid multiple-stage fallacies when trying to conditionalize a lot), but to:

  1. Communicate with more precision than English-language words like 'likely' or 'unlikely' allow. Even very vague or uncertain numbers will, at least some of the time, be a better guide than natural-language terms that weren't designed to cover the space of probabilities (and that can vary somewhat in meaning from person to person).
  2. At least very vaguely and roughly b
... (read more)

Note that I advocate for considering much weirder solutions, and also thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.

Can I get us all to agree to push for including pivotal acts and pivotal processes in the Overton window, then? :) I'm happy to publicly talk about pivotal processes and encourage people to take them seriously as options to evaluate, while flagging that I'm ~2-5% on them be... (read more)

  • With pretty high confidence, you expect sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training" which is roughly the same as "discovering a core of generality" and few other formulations. These systems will be in some meaningful sense fundamentally different from eg Gato

That's right, though the phrasing "discovering a core of generality" here sounds sort of mystical and mysterious to me, which ma... (read more)

In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes". 

My objection to Critch's post wasn't 'you shouldn't talk about pivotal processes, just pivotal acts'. On the contrary, I think bringing in pivotal processes is awesome.

My objection (more so to "Pivotal Act" Intentions, but also to the new one) is specifically to the idea that we should socially shun the concept of "pivotal acts", and socia... (read more)

3 Jan_Kulveit 3mo
With the last point: I think I can roughly pass your ITT - we can try that, if you are interested.

So, here is what I believe are your beliefs:

  • With pretty high confidence, you expect sharp left turn [https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization] to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training" which is roughly the same as "discovering a core of generality" and few other formulations. These systems will be in some meaningful sense fundamentally different from eg Gato
  • From your perspective, this is based on thinking deeply about the nature of such systems (note that this is mostly based on hypothetical systems, and an analogy with evolution)
  • My claim roughly is this is only part of what's going on, where the actual thing is: people start with a deep prior on "continuity in the space of intelligent systems". Looking into a specific question about hypothetical systems, their search in argument space is guided by this prior, and they end up mostly sampling arguments supporting their prior. (This is not to say the arguments are wrong.)
  • You probably don't agree with the above point, but notice the correlations:
    • You expect sharp left turn due to discontinuity in "architectures" dimensions (which is the crux according to you)
    • But you also expect jumps in capabilities of individual systems (at least I think so)
    • Also, you expect majority of hope in a "sharp right turn" histories (in contrast to smooth right turn histories)
    • And more
  • In my view yours (or rather MIRI-esque) views on the above dimensions are correlated more than expected, which suggests the existence of hidden variable/hidden model explaining the correlation. Can'

An example of a possible "pivotal act" I like that isn't "melt all GPUs" is:

Use AGI to build fast-running high-fidelity human whole-brain emulations. Then run thousands of very-fast-thinking copies of your best thinkers. Seems to me this plausibly makes it realistic to keep tabs on the world's AGI progress, and locally intervene before anything dangerous happens, in a more surgical way rather than via mass property destruction of any sort.

Looking for pivotal acts that are less destructive (and, more importantly for humanity's sake, less difficult to align)... (read more)

2 Jan_Kulveit 3mo
In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes". As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Where things in this category get straw-manned as "Rube-Goldberg-machine-like".)

As often, one of the actual cruxes is in continuity assumptions [https://www.lesswrong.com/posts/cHJxSJ4jBmBRGtbaE/continuity-assumptions], where basically you have a low prior on "smooth trajectory changes by many acts" and high prior on "sharp turns left or right".

Second crux, as you note, is doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts, where people who are a few bits more optimistic may be much less excited about them, in particular if all plans for such acts have very unclear shapes of impact distributions.

Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I've a draft with a more explicit argument why.)

Some hopefully-unnecessary background info for people attempting this task:

A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".

An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."

Ronny Fernandez on Twitter:

I think I don’t like AI safety analogies with human evolution except as illustrations. I don’t think they’re what convinced the people who use those analogies, and they’re not what convinced me. You can convince yourself of the same things just by knowing some stuff about agency.

Corrigibility, human values, and figure-out-while-aiming-for-human-values, are not short description length. I know because I’ve practiced finding the shortest description lengths of things a lot, and they just don’t seem like the right sort of thing.

Also

... (read more)

From an Eliezer comment:

Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options. [...]

If Iceland did this, it would plausibly need... (read more)

I kind of like the analogous idea of an alignment target as a repeller cone / dome.

Corrigibility is a repeller. Human values aren't a repeller, but they're a very narrow target to hit.

3 Vladimir Nesov 4mo
In the sense of moving a system towards many possible goals? But I think in a more appropriate space (where the aiming should take place) it's again an attractor. Corrigibility is not a goal, a corrigible system doesn't necessarily have any well-defined goals, traditional goal-directed agents can't be corrigible in a robust way, and it should be possible to use it for corrigibility towards corrigibility, making this aspect stronger if that's what the operators work towards happening. More generally, non-agentic aspects of behavior can systematically reinforce non-agentic character of each other, preventing any opposing convergent drives (including the drive towards agency [https://www.lesswrong.com/posts/oiftkZnFBqyHGALwv/agents-as-p-b-chain-reactions]) from manifesting if they've been set up to do so. Sufficient intelligence/planning advantage pushes this past exploitability hazards, repelling selection theorems [https://www.lesswrong.com/posts/G2Lne2Fi7Qra5Lbuf/selection-theorems-a-program-for-understanding-agents], even as some of the non-agentic behaviors might be about maintaining specific forms of exploitability.

A lot of models of what can or can't work in AI alignment depend on intuitions about whether to expect "true discontinuities" or just "steep bits".

Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that show continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).

Quoting Nate in 2018:

On my model, the key point

... (read more)
1 Jan_Kulveit 4mo
Yes, but conversely, I could say I'd expect some curves to show discontinuous jumps, mostly in dimensions which no one really cares about. Clearly the cruxes are about discontinuities in dimensions which matter. As I tried to explain in the post, I think continuity assumptions mostly get you different things than "strong predictions about AGI timing".

I would paraphrase this as "assuming discontinuities at every level" - both one-system training, and the more macroscopic exploration in the "space of learning systems" - but stating the key disagreement is about the discontinuities in the space of model architectures, rather than in jumpiness of single model training. Personally, I don't think the distinction between 'movement by learning of a single model' and 'movement by scaling' and 'movement by architectural changes' will necessarily be big.

This seems to more or less support what I wrote? Expecting a Big Discontinuity, and this being a pretty deep difference?

My overall impression is Eliezer likes to argue against "Hansonian views", but something like "continuity assumptions" seems a much broader category than Robin's views. In my view continuity assumptions are not just about takeoff speeds. E.g., IDA makes much more sense in a continuous world - if you reach a cliff, working IDA should slow down, and warn you. In the Truly Discontinuous world, you just jump off the cliff at some unknown step. I would guess probably a majority of all debates and disagreements between Paul and Eliezer have some "continuity" component: e.g. the question whether we can learn a lot of important alignment stuff on non-AGI systems is a typical continuity problem, but only tangentially relevant to takeoff speeds.

I'm not Eliezer, but my high-level attempt at this:

[...] The things I'd mainly recommend are interventions that:

  • Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/honest communication norms, and trying to build better mental models of crucial parts of the world.)
  • Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.
  • Help us understand and resolve major disagreements. (Especially current disagreements
... (read more)

I think most worlds that successfully navigate AGI risk have properties like:

  • AI results aren't published publicly, going back to more or less the field's origin.
  • The research community deliberately steers toward relatively alignable approaches to AI, which includes steering away from approaches that look like 'giant opaque deep nets'.
    • This means that you need to figure out what makes an approach 'alignable' earlier, which suggests much more research on getting de-confused regarding alignable cognition.
      • Many such de-confusions will require a lot of software ex
... (read more)

I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary? 

Yes!

I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day! 

True! Though everyone already agreed (e.g., EY asserted this in the OP) that it's possible in principle. The updatey thing would be if the case of the human genome / brain development sugg... (read more)

4 Alex Turner 4mo
Feat #2 is: Design a mind which cares about anything at all in reality which isn't a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don't know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I'm damn sure the AI will care about something besides its own sensory signals. Such a procedure would accomplish feat #2, but not #3.

Feat #3 is: Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.

I actually think that the dog- and diamond-maximization problems are about equally hard, and, to be totally honest, neither seems that bad[1] in the shard theory paradigm.

Surprisingly, I weakly suspect the harder part is getting the agent to maximize real-world dogs in expectation, not getting the agent to maximize real-world dogs in expectation. I think "figure out how to build a mind which cares about the number of real-world dogs, such that the mind intelligently selects plans which lead to a lot of dogs" is significantly easier than building a dog-maximizer.

[1] I appreciate that this claim is hard to swallow. In any case, I want to focus on inferentially-closer questions first, like how human values form.

Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world? 

Maybe I'm not understanding your proposal, but on the face of it this seems like a change of topic. I don't see Eliezer claiming 'there's no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head'. Maybe he does think that, but mostly I'd guess he doesn't care, because the important thing is whether you can point the AGI at very, very specifi... (read more)

6 Alex Turner 4mo
Hm, I'll give this another stab. I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?

Let me distinguish three alignment feats:

  1. Producing a mind which terminally values sensory entities.
  2. Producing a mind which reliably terminally values some kind of non-sensory entity in the world, like dogs or bananas.
    1. AFAIK we have no idea how to ensure this happens reliably -- to produce an AGI which terminally values some element of {diamonds, dogs, cats, tree branches, other real-world objects}, such that there's a low probability that the AGI actually just cares about high-reward sensory observations.
    2. In other words: Design a mind which cares about anything at all in reality which isn't a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don't know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I'm damn sure the AI will care about something besides its own sensory signals.
    3. I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
  3. Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds [https://arbital.com/p/ontology_identification].
    1. Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is stri

For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.

In particular, I don't think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It's just not modeling the real world at all. It's not even trying to rack up a bunch of wins over time. It's just playing the si

... (read more)

Here's my answer: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=LowEED2iDkhco3a5d 

We have to actually figure out how to build aligned AGI, and the details are crucial. If you're modeling this as a random blog post aimed at persuading people to care about this cause area, a "voice of AI safety" type task, then sure, the details are less important and it's not so clear that Yet Another Marginal Blog Post Arguing For "Care About AI Stuff" matters much.

But humanity also has to do the task of actually figuring o... (read more)

On Twitter, Eric Rogstad wrote:

"the thing where it keeps being literally him doing this stuff is quite a bad sign"

I'm a bit confused by this part. Some thoughts on why it seems odd for him (or others) to express that sentiment...

1. I parse the original as, "a collection of EY's thoughts on why safe AI is hard". It's EY's thoughts, why would someone else (other than @robbensinger) write a collection of EY's thoughts?

(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in co

... (read more)
0 handoflixue 4mo
I don't think making this list in 1980 would have been meaningful. How do you offer any sort of coherent, detailed plan for dealing with something when all you have is toy examples like Eliza? We didn't even have the concept of machine learning back then - everything computers did in 1980 was relatively easily understood by humans, in a very basic step-by-step way. Making a 1980s computer "safe" is a trivial task, because we hadn't yet developed any technology that could do something "unsafe" (i.e. beyond our understanding). A computer in the 1980s couldn't lie to you, because you could just inspect the code and memory and find out the actual reality. What makes you think this would have been useful? Do we have any historical examples to guide us in what this might look like?

The conclusion we should take from the concept of mesa-optimisation isn't "oh no alignment is impossible", that's equivalent to "oh no learning is impossible".

The OP isn't claiming that alignment is impossible.

If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.

I don't understand the point you're making here.

The point I'm making is that the human example tells us that: 

First we realize that we can't code up our values, and conclude that alignment is hard. Then, when we realize that mesa-optimisation is a thing, we shouldn't update towards "alignment is even harder". We should update in the opposite direction.

Because the human example tells us that a mesa-optimiser can reliably point to a complex thing even if the optimiser points to only a few crude things. 

But I only ever see these three points (the human example, the inability to code up values, and mesa-optimisation) used separately to argue for "alignment is even harder than previously thought". Taken together, that is just not the picture.

this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.

I don't think these statements all need to be true in order for p(doom) to be high, and I also don't think they're independent. Indeed, they seem more disjunctive than conjunctive to me; there are many cases where any one of the claims being true increases risk substantially, even if many others are false.
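To make the conjunctive-versus-disjunctive distinction concrete, here is a minimal arithmetic sketch. The per-claim probabilities (0.9 and 0.1) are made-up illustrative numbers, not estimates from the post or this thread, and the functions assume the claims are independent, which is exactly the assumption questioned above. The helper names `p_all` and `p_any` are hypothetical.

```python
from math import prod


def p_all(probs):
    """Probability that every claim holds, ASSUMING independence."""
    return prod(probs)


def p_any(probs):
    """Probability that at least one claim holds, ASSUMING independence."""
    return 1.0 - prod(1.0 - p for p in probs)


# Conjunctive reading: doom requires all 42 claims to hold.
# Illustrative assumption: each claim is 90% likely on its own.
conjunctive = p_all([0.9] * 42)

# Disjunctive reading: any single claim holding is enough.
# Illustrative assumption: each claim is only 10% likely on its own.
disjunctive = p_any([0.1] * 42)

print(f"all 42 must hold (p=0.9 each): {conjunctive:.3f}")
print(f"any 1 of 42 suffices (p=0.1 each): {disjunctive:.3f}")
```

Under these toy numbers the conjunctive reading gives a small probability even with individually likely claims, while the disjunctive reading gives a large one even with individually unlikely claims. Correlation between claims would move both figures, so the sketch only shows why the choice of reading matters, not what p(doom) is.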

1 David Scott Krueger 4mo
I basically agree. I am arguing against extreme levels of pessimism (~>99% doom).

a mistake to leave you as the main "public advocate / person who writes stuff down" person for the cause.

It sort of sounds like you're treating him as the sole "person who writes stuff down", not just the "main" one. Noam Chomsky might have been the "main linguistics guy" in the late 20th century, but people didn't expect him to write more than a trivial fraction of the field's output, either in terms of high-level overviews or in-the-trenches research.

I think EY was pretty clear in the OP that this is not how things go on earths that survive. Even if there aren't many who can write high-level alignment overviews today, more people should make the attempt and try to build skill.

0 handoflixue 4mo
In the counterfactual world where Eliezer was totally happy continuing to write articles like this and being seen as the "voice of AI Safety", would you still agree that it's important to have a dozen other people also writing similar articles? I'm genuinely lost on the value of having a dozen similar papers - I don't know of a dozen different versions of fivethirtyeight.com or GiveWell, and it never occurred to me to think that the world is worse for only having one of those.

The counter-concern is that if humanity can't talk about things that sound like sci-fi, then we just die. We're inventing AGI, whose big core characteristic is 'a technology that enables future technologies'. We need to somehow become able to start actually talking about AGI.

One strategy would be 'open with the normal-sounding stuff, then introduce increasingly weird stuff only when people are super bought into the normal stuff'. Some problems with this:

  • A large chunk of current discussion and research happens in public; if it had to happen in private becau
... (read more)

I mean, all of this feels very speculative and un-cruxy to me; I wouldn't be surprised if the ASI indeed is able to conclude that humanity is no threat at all, in which case it kills us just to harvest the resources.

I do think that normal predators are a little misleading in this context, though, because they haven't crossed the generality ('can do science and tech') threshold. Tigers won't invent new machines, so it's easier to upper-bound their capabilities. General intelligences are at least somewhat qualitatively trickier, because your enemy is 'the space of all reachable technologies' (including tech that may be surprisingly reachable). Tigers can surprise you, but not in very many ways and not to a large degree.

But once you invent cheap tech that can control them you don't need to kill them anymore.

A paperclipper mainly cares about humans because we might have some way to threaten the paperclipper (e.g., by pushing a button that deploys a rival superintelligence); and secondarily, we're made of atoms that can be used to build paperclips.

It's harder to monitor the actions of every single human on Earth, than it is to kill all humans; and there's a risk that monitoring people visibly will cause someone to push the 'deploy a rival superintelligence' button, if such ... (read more)

1 romeostevensit 4mo
This is exactly what I was thinking about though, this idea of monitoring every human on earth seems like a failure of imagination on our part. I'm not safe from predators because I monitor the location of every predator on earth. I admit that many (overwhelming majority probably) of scenarios in this vein are probably pretty bad and involve things like putting only a few humans on ice while getting rid of the rest.

Yes, where killing all humans is an example of "controlling the people", from the perspective of an Unfriendly AI.

If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like "this (intuitively compelling) assumption is false" unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum.

Fair enough! I don't think I agree in general, but I think 'OK, but what's your alternative to agency?' is an es... (read more)

1 Vanessa Kosoy 4mo
My 0th approximation answer is: you're describing something logically incoherent, like a p-zombie.

My 1st approximation answer is more nuanced. Words that, in the pre-Turing era, referred exclusively to humans (and sometimes animals, and fictional beings), such as "wants", "experiences" et cetera, might have two different referents. One referent is a natural concept, something tied into deep truths about how the universe (or multiverse) works. In particular, deep truths about the "relatively simple core structure that explains why complicated cognitive machines work". The other referent is something in our specifically-human "ontological model" of the world (technically, I imagine that to be an infra-POMDP that all our hypotheses are refinements of). Since the latter is a "shard" of the former produced by evolution, the two referents are related, but might not be the same. (For example, I suspect that cats lack natural!consciousness but have human!consciousness.)

The creature you describe does not natural!want anything. You postulated that it is "experiencing more pleasurable and less pleasurable states", but there is no natural method that would label its states as such, or that would interpret them as any sort of "experience". On the other hand, maybe if this creature is designed as a derivative of the human brain, then it does human!want something, because our shard of the concept of "wanting" mislabels (relative to natural!want) weird states that wouldn't occur in the ancestral environment.

You can then ask, why should we design the AI to follow what we natural!want rather than what we human!want? To answer this, notice that, under ideal conditions, you converge to actions that maximize your natural!want, (more or less) according to the definition of natural!want. In particular, under ideal conditions, you would build an AI that follows your natural!want. Hence, it makes sense to take a shortcut and "update now to the view you will predictably update to later":

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.

I'm not sure this is true; or if it's true, I'm not sure it's relevant. But assuming it is true...

Therefore, if X is not entirely coherent then X's preferences are only approximately defined, and hence we only need to infer them approximately.

... this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and ... (read more)

3Vanessa Kosoy4mo
If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like "this (intuitively compelling) assumption is false" unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum. Which is to say, I find it self-evident that "agents" are exactly the sort of beings that can "want" things, because agency is about pursuing objectives and wanting is about the objectives that you pursue. If you don't believe this then I don't know what these words even mean for you. Maybe, and maybe this means we need to treat "composite agents" explicitly in our models. But, there is also a case to be made that groups of (super)rational agents effectively converge into a single utility function, and if this is true, then the resulting system can just as well be interpreted as a single agent having this effective utility function, which is a solution that should satisfy the system of agents according to their existing bargaining equilibrium. If your agent converges to optimal behavior asymptotically, then I suspect it's still going to have infinite g and therefore an asymptotically-crisply-defined utility function. Of course it doesn't help on its own. What I mean is, we are going to find a precise mathematical formalization of this concept and then hard-code this formalization into our AGI design.

There is a big chunk of what you're trying to teach which is not weird and complicated, namely: "find this other agent, and what their values are". Because, "agents" and "values" are natural concepts, for reasons strongly related to "there's a relatively simple core structure that explains why complicated cognitive machines work".

This seems like it must be true to some degree, but "there is a big chunk" feels a bit too strong to me.

Possibly we don't disagree, and just have different notions of what a "big chunk" is. But some things that make the chunk feel sm... (read more)

Humans are at least a little coherent, or we would never get anything done; but we aren't very coherent, so the project of piecing together 'what does the human brain as a whole "want"' can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.

This is a point where I feel like I do have a substantial disagreement with the "conventional wisdom" of LessWrong.

First, LessWrong began with a discussion of cognitive biases in human irrationality, so this naturally became a staple of the local narrative. On the other hand, I ... (read more)

I don't think I personally could have written it; if others think they could have, I'd genuinely be interested to hear them brag, even if they can't prove it.

Maybe the ideal would be 'I generated the core ideas of [a,b,c] with little or no argument from others; I had to be convinced of [d,e,f] but I now agree with them; I disagree with [g,h,i]; I think you left out important considerations [x,y,z].' Just knowing people's self-model is interesting to me, I don't demand that everything you believe be immediately provable to me.

It's very clear to me I could have written this if I had wanted to—and at the very least I'm sure Paul could have as well. As evidence: it took me ~1 hour to list off all the existing sources that cover every one of these points in my comment.

I have a couple object-level disagreements including relevance of evolution / nature of inner alignment problem and difficulty of attaining corrigibility. But leaving those aside, I wouldn’t have exactly written this kind of document myself, because I’m not quite sure what the purpose is. It seems to be trying to do a lot of different things for different audiences, where I think more narrowly-tailored documents would be better.

So, here are four useful things to do, and whether I’m personally doing them:

First, there is a mass of people who think AGI risk i... (read more)

I think as of early this year (like, January/February, before I saw a version of this doc) I could have produced a pretty similar list to this one. I definitely would not derive it from the empty string in the closest world-without-Eliezer; I'm unsure how much I'd pay attention to AI alignment at all in that world. I'd very likely be working on agent foundations in that world, but possibly in the context of biology or AI capabilities rather than alignment. Arguments about AI foom and doom were obviously-to-me correct once I paid attention to them at all, b... (read more)

Yes, please do rewrite the post, or make your own version of a post like this!! :) I don't suggest trying to persuade arbitrary policymakers of AGI risk, but I'd be very keen on posts like this optimized to be clear and informative to different audiences. Especially groups like 'lucid ML researchers who might go into alignment research', 'lucid mathematicians, physicists, etc. who might go into alignment research', etc.

0Michael Große4mo
I wonder if we could be much more effective in outreach to these groups? Like making sure that Robert Miles is sufficiently funded to have a professional team +20% (if that is not already the case). Maybe reaching out to Sabine Hossenfelder and sponsoring a video, or maybe collaborating with her on a video about this. Though I guess given her attitude towards the physics community, the work with her might be a gamble and a two-edged sword. Can we get market research on which influencers have a high number of followers among ML researchers/physicists/mathematicians and then work with them / sponsor them? Or maybe micro-target this demographic with facebook/google/github/stackexchange ads and point them to something? I don't know, I'm not a marketing person, but I feel like I would have seen much more of these things if we were doing enough of them. Not saying that this should be MIRI's job, rather stating that I'm confused because I feel like we as a community are not taking an action that would seem obvious to me. Especially given how recent advances in published AI capabilities seem to make the problem much more legible. Is the reason for not doing it really just that we're all a bunch of nerds who are bad at this kind of thing, or is there more to it that I'm missing? While I see that there is a lot of risk associated with such outreach increasing the amount of noise, I wonder if that tradeoff might be shifting the shorter the timelines are getting and given that we don't seem to have better plans than "having a diverse set of smart people come up with novel ideas of their own in the hope that one of those works out". So taking steps to entice a somewhat more diverse group of people into the conversation might be worth it?

Suggestion: make it a CYOA-style interactive piece, where the reader is tasked with aligning AI, and could choose from a variety of approaches which branch out into sub-approaches and so on. All of the paths, of course, bottom out in everyone dying, with detailed explanations of why. This project might then evolve based on feedback, adding new branches that counter counter-arguments made by people who played it and weren't convinced. Might also make several "modes", targeted at ML specialists, general public, etc., where the text makes different tradeoffs ... (read more)

I'm thinking of people like Paul Christiano, Nate Soares, John Wentworth, Ajeya Cotra...  [...] I do agree with you that they seem to on average be way way too optimistic, but I don't think it's because they are ignorant of the considerations and arguments you've made here.

I don't think Nate is that much more optimistic than Eliezer, but I believe Eliezer thinks Nate couldn't have generated enough of the list in the OP, or couldn't have generated enough of it independently ("using the null string as input").

I agree that this would be scary if the system is, for example, as smart as physically possible. What I'm imagining is:

  • (1) if you find a way to ensure that the system is only weakly superhuman (e.g., it performs vast amounts of low-level-Google-engineer-quality reasoning, only rare short controlled bursts of von-Neumann-quality reasoning, and nothing dramatically above the von-Neumann level), and
  • (2) if you get the system to only care about thinking about this cube of space, and
  • (3) if you also somehow get the system to want to build the particular machine y
... (read more)

Conversely, it doesn't seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information - like e.g. what sort of changes-to-the-world we do/don't care about, what thing-in-the-environment the system is supposed to be corrigible with, etc.

I suspect you could do this in a less value-loaded way if you're somehow intervening on 'what the AGI wants to pay attention to', as opposed to just intervening on 'what sorts of directions it wants to steer the world in'.

'Only spend your cognition thinking about in... (read more)

4johnswentworth4mo
I do not think that would do what you seem to think it would do. If something optimizes one little chunk of the world really hard, ignoring everything else, that doesn't mean the rest of the world is unchanged; by default there are lots of side effects. E.g. if something is building nanotech in a 1m cube, ignoring everything outside the cube, at the very least I'd expect it to dump nuke levels of waste heat into its immediate surroundings.

I'm not sure whether you mean "95% correct CEV has a lot of S-risk" or "95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying"?

The latter, as I was imagining "95%".

You're basically saying, your aim is not to design ethical/friendly/aligned AI [...]

My goal is an awesome, eudaimonistic long-run future. To get there, I strongly predict that you need to build AGI that is fully aligned with human values. To get there, I strongly predict that you need to have decades of experience actually working with AGI, since early generations of systems will inevitably have bugs and limitations and it would be catastrophic to lock in the wrong future because we did a rush job.

(I'd also expect us to need the equivalent of subjective ce... (read more)

Yeah, I'm very interested in hearing counter-arguments to claims like this. I'll say that although I think task AGI is easier, it's not necessarily strictly easier, for the reason you mentioned.

Maybe a cruxier way of putting my claim is: Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn't seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.

And I do think you need to get CEV up and running withi... (read more)

3johnswentworth4mo
Insofar as humans care about their AI being corrigible, we should expect some degree of corrigibility even from a CEV-maximizer. That, in turn, suggests at least some basin-of-attraction for values (at least along some dimensions), in the same way that corrigibility yields a basin-of-attraction. (Though obviously that's not an argument we'd want to make load-bearing without both theoretical and empirical evidence about how big the basin-of-attraction is along which dimensions.) Conversely, it doesn't seem realistic to define limited impact or corrigibility or whatever without relying on an awful lot of values information - like e.g. what sort of changes-to-the-world we do/don't care about, what thing-in-the-environment the system is supposed to be corrigible with, etc. Values seem like a necessary-and-sufficient component. Corrigibility/task architecture/etc doesn't. Small but important point here: an estimate of CEV which is within 5% error everywhere does reasonably well; that gets us within 5% of our best possible outcome. The problem is when our estimate is waaayyy off in 5% of scenarios, especially if it's off in the overestimate direction; then we're in trouble.
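To make the last distinction concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption rather than anything from the comment: "outcomes" are just 1000 options with random positive true utilities, "5% error everywhere" is modeled as a bounded ±5% relative error on each option, and "waaayyy off in 5% of scenarios" is modeled as an exact estimate except on a random 5% of options, which get a huge overestimate. Under the relative-error model, the option picked by maximizing the estimate is guaranteed to be worth at least 0.95/1.05 of the true optimum (roughly the same ballpark as the 5% figure, though not exactly 5%). Under the sparse-error model, the optimizer always lands on one of the overestimated options, whose true value can be anywhere.

```python
import random


def pick_best(estimates):
    """Return the index of the option with the highest estimated utility."""
    return max(range(len(estimates)), key=lambda i: estimates[i])


def uniform_error_estimate(true_utils, rel_err, rng):
    """Estimate that is off by at most rel_err (relative) on every option."""
    return [u * (1 + rng.uniform(-rel_err, rel_err)) for u in true_utils]


def sparse_error_estimate(true_utils, frac_bad, inflated, rng):
    """Estimate that is exact except on a frac_bad fraction of options,
    where it is replaced by a wildly inflated value (hypothetical model)."""
    n_bad = max(1, int(len(true_utils) * frac_bad))
    bad = set(rng.sample(range(len(true_utils)), n_bad))
    est = [inflated if i in bad else u for i, u in enumerate(true_utils)]
    return est, bad


rng = random.Random(0)  # fixed seed so the toy run is repeatable
true_utils = [rng.uniform(1.0, 100.0) for _ in range(1000)]
best = max(true_utils)

# Case A: small bounded error on every option.
est_a = uniform_error_estimate(true_utils, 0.05, rng)
choice_a = pick_best(est_a)

# Case B: exact on 95% of options, hugely overestimated on the other 5%.
est_b, bad = sparse_error_estimate(true_utils, 0.05, 1e6, rng)
choice_b = pick_best(est_b)

print("true optimum:              ", best)
print("case A, true value of pick:", true_utils[choice_a])
print("case B, true value of pick:", true_utils[choice_b])
print("case B pick is a corrupted option:", choice_b in bad)
```

The asymmetry comes from the argmax: optimization pressure seeks out whichever options the estimate rates highest, so errors in the overestimate direction get selected for, while small errors spread evenly across all options mostly wash out.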
4Vanessa Kosoy4mo
The way I imagine the win scenario is, we're going to make a lot of progress in understanding alignment before we know how to build AGI. And, we're going to do it by prioritizing understanding alignment modulo capability (the two are not really possible to cleanly separate, but it might be possible to separate them enough for this purpose). For example, we can assume the existence of algorithms with certain properties, s.t. these properties arguably imply the algorithms can be used as building-blocks for AGI, and then ask: given such algorithms, how would we build aligned AGI? Or, we can come up with some toy setting where we already know how to build "AGI" in some sense, and ask, how to make it aligned in that setting? And then, once we know how to build AGI in the real world, it would hopefully not be too difficult to translate the alignment method. One caveat in all this is, if AGI is going to use deep learning, we might not know how to apply the lesson from the "oracle"/toy setting, because we don't understand what deep learning is actually doing, and because of that, we wouldn't be sure where to "slot" it in the correspondence/analogy s.t. the alignment method remains sound. But, mainstream researchers have been making progress on understanding what deep learning is actually doing, and IMO it's plausible we will have a good mathematical handle on it before AGI. I'm not sure whether you mean "95% correct CEV has a lot of S-risk" or "95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying"? I think I agree with the latter but not with the former. (How specifically does 95% CEV produce S-risk? I can imagine something like "AI realizes we want a non-zero amount of pain/suffering to exist, somehow miscalibrates the amount and creates a lot of pain/suffering" or "AI realizes we don't want to die, and focuses on this goal at the expense of everything else, preserving us forever in a state of complete sensory deprivation". But these scenario

I think there are multiple viable options, like the toy example EY uses:

I think that after AGI becomes possible at all and then possible to scale to dangerously superhuman levels, there will be, in the best-case scenario where a lot of other social difficulties got resolved, a 3-month to 2-year period where only a very few actors have AGI, meaning that it was socially possible for those few actors to decide to not just scale it to where it automatically destroys the world.

During this step, if humanity is to survive, somebody has to perform some feat that c

... (read more)
0Mitchell_Porter4mo
OK, I disagree very much with that strategy. You're basically saying, your aim is not to design ethical/friendly/aligned AI, you're saying your aim is to design AI that can take over the world without killing anyone. Then once that is accomplished, you'll settle down to figure out how that unlimited power would best be used. To put it another way: Your optimistic scenario is one in which the organization that first achieves AGI uses it to take over the world, install a benevolent interim regime that monopolizes access to AGI without itself making a deadly mistake, and which then eventually figures out how to implement CEV (for example); and then it's finally safe to have autonomous AGI. I have a different optimistic scenario: We definitively figure out the theory of how to implement CEV before AGI even arises, and then spread that knowledge widely, so that whoever it is in the world that first achieves AGI, they will already know what they should do with it. Both these scenarios are utopian in different ways. The first one says that flawed humans can directly wield superintelligence for a protracted period without screwing things up. The second one says that flawed humans can fully figure out how to safely wield superintelligence before it even arrives. Meanwhile, in reality, we've already proceeded an unknown distance up the curve towards superintelligence, but none of the organizations leading the way has much of a plan for what happens, if their creations escape their control. In this situation, I say that people whose aim is to create ethical/friendly/aligned superintelligence, should focus on solving that problem. Leave the techno-military strategizing to the national security elites of the world. It's not a topic that you can avoid completely, but in the end it's not your job to figure out how mere humans can safely and humanely wield superhuman power. It's your job to design an autonomous superhuman power that is intrinsically safe and humane. To that

The 2017 document postulates an "acute risk period" in which people don't know how to align, and then a "stable period" once alignment theory is mature. 

"Align" is a vague term. Let's distinguish "strawberry alignment" (where we can safely and reliably use an AGI to execute a task like "Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level.") from "CEV alignment" (where we can safely and reliably use an AGI to carry out a CEV-like procedure.)

Strawberry alignment seems vastly easier than CEV ali... (read more)

0Mitchell_Porter4mo
The "stable period" is supposed to be a period in which AGI already exists, but nothing like CEV has yet been implemented, and yet "no one can destroy the world with AGI". How would that work? How do you prevent everyone in the whole wide world from developing unsafe AGI during the stable period?

Quoting a thing I said in March:

The two big things we feel bottlenecked on are:

  • (1) people who can generate promising new alignment ideas. (By far the top priority, but seems empirically rare.)
  • (2) competent executives who are unusually good at understanding the kinds of things MIRI is trying to do, and who can run their own large alignment projects mostly-independently.

For 2, I think the best way to get hired by MIRI is to prove your abilities via the Visible Thoughts Project. The post there says a bit more about the kind of skills we're looking for:

Eliezer

... (read more)