All of Rob Bensinger's Comments + Replies

How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?

I think the misuse vs. accident dichotomy is clearer when you don't focus exclusively on "AGI kills every human" risks. (E.g., global totalitarianism risks strike me as small but non-negligible if we solve the alignment problem. Larger are risks that fall short of totalitarianism but still involve non-morally-humble developers dam... (read more)

1 · David Scott Krueger · 3d
By "intend" do you mean that they sought that outcome / selected for it? Or merely that it was a known or predictable outcome of their behavior? I think "unintentional" would already probably be a better term in most cases.

FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this:

[2:21 PM]

It's about the size of the information bottleneck. The human genome is 3 billion base pairs drawn from 4 possibilities, so 750 megabytes. Let's say 90% of that is junk DNA, and 10% of what's left is neural wiring algorithms. So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this. Your spinal cord is about 70 million neurons

... (read more)
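The quoted message's arithmetic can be sanity-checked in a few lines. This is only a sketch of the quote's own stated assumptions (2 bits per base pair, 90% junk DNA, 10% of the remainder being neural wiring algorithms), not a claim about the actual biology:

```python
# Back-of-the-envelope check of the numbers in the quoted Discord message.
# Assumptions, all taken from the quote: 3 billion base pairs, 4 possibilities
# per base pair (= 2 bits), 90% junk DNA, and 10% of the remainder being
# neural wiring algorithms.

base_pairs = 3_000_000_000
bits_per_base = 2                       # log2(4 possibilities)
genome_bytes = base_pairs * bits_per_base // 8
print(genome_bytes)                     # 750000000  -> 750 megabytes

non_junk = genome_bytes // 10           # 10% survives the junk-DNA cut
wiring = non_junk // 10                 # 10% of that is wiring algorithms
print(wiring)                           # 7500000    -> 7.5 megabytes
```

So the quote's headline figures (750 megabytes for the genome, ~7.5 megabytes for brain-wiring code) follow directly from its assumptions.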
3 · Ben Pace · 9d
That makes more sense.

I don't know Nate's response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.

[...]

I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent th

... (read more)

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:

In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" sa

... (read more)

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period.

FYI, I think there's a huge difference between "I think humanity needs to aim for a pivotal act" and "I recommend to groups pushing the capabilities frontier forward to aim for pivotal act". I think pivotal acts require massive amounts of good judgement to do right, and, like, I think capabilities researchers have... (read more)

The definitions given in the post are:

  • ASI-boosted humans — We solve all of the problems involved in aiming artificial superintelligence at the things we’d ideally want.

[...]

  • misaligned AI — Humans build and deploy superintelligent AI that isn’t aligned with what we’d ideally want.

I'd expect most people to agree that "We solve all of the problems involved in aiming artificial superintelligence at the things we'd ideally want" yields outcomes that are about as good as possible, and I'd expect most of the disagreement to turn (either overtly or in some su... (read more)

0 · Lone Pine · 2mo
Isn't "misaligned AI" by definition a bad thing and "ASI-boosted humans" by definition a good thing? You're basically asking "How likely is <good outcome> given that we have <a machine that creates good outcomes>"

My example with the 100 million referred to question 1.

Yeah, I'm also talking about question 1.

I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences.

Seems obviously false as a description of my values (and, I'd guess, just about every human's).

Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.

If I... (read more)

2 · Vanessa Kosoy · 4mo
P.S. I think that in your example, if a person is given a button that can save a person on a different planet from being tortured, they will have a direct incentive to press the button, because the button is a causal connection in itself, and consciously reasoning about the person on the other planet is a causal[1] connection in the other direction. That said, a person still has a limited budget of such causal connections (you cannot reason about a group of arbitrarily many people, with fixed non-zero amount of paying attention to the individual details of every person, in a fixed time-frame). Therefore, while the incentive is positive, its magnitude saturates as the number of saved people grows s.t. e.g. a button that saves a million people is virtually the same as a button that saves a billion people.

1. I'm modeling this via Turing RL, where conscious reasoning can be regarded as a form of observation. Ofc this means we are talking about "logical" rather than "physical" causality. ↩︎

I'm curious what evidence you see that this is false as a description of the values of just about every human, given that:

  • I, a human [citation needed], tell you that this seems to be a description of my values.
  • Almost every culture that ever existed had norms that prioritized helping family, friends and neighbors over helping random strangers, not to mention strangers that you never met.
  • Most people don't do much to help random strangers they never met, with the notable exception of effective altruists, but even most effective altruists only go that
... (read more)

But, two rooms with trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if... (read more)

First, you can consider preferences that are impartial but sublinear in the number of people. So, you can disagree with Nate's room analogy without the premise "stuff only matters if it adds to my own life and experiences".

Second, my preferences are indeed partial. But even that doesn't mean "stuff only matters if it adds to my own life and experiences". I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences. More details here.

Third, I don't know what you mean by "good". The questions that I unders... (read more)

(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)

Also from Ronny: 

There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.

That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.

2 · Rob Bensinger · 4mo
(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)

Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:

briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load

(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a twee

... (read more)
3 · Alex Turner · 2mo
Nate's B) currently seems confused. I read a connotation: "we need the AGI's learned concepts to be safe under extreme optimization pressure, such that, when extremized, they yield reasonable results (e.g. human faces from maximizing the AI-faceishness-concept-activation of an image)."

But I think agents will not maximize their own concept activations, when choosing plans. An agent's values will optimize the world; the values don't optimize themselves. For example, I think that I am not looking for a romantic relationship which maximally activates my "awesome relationship" concept, if that's a thing I have. It's true that conditional on such a plan being considered, my relationship-shard might bid for that plan with strength monotonically increasing on "predicted activation of awesome-relationship". And conditional on such a plan getting considered, where that concept activation is maximized, I would therefore be very inclined to pursue that plan.

But I think it's not true that my relationship-shard is optimizing its own future activations by extremizing future concept activations. I think that this plan won't get found, and the agent won't want to find this plan. Values are not the optimization target. (This point explained in more detail: Alignment allows "nonrobust" decision-influences and doesn't require robust grading [https://www.lesswrong.com/posts/rauMEna2ddf26BqiE/alignment-allows-nonrobust-decision-influences-and-doesn-t])

Could someone clarify the relevance of ribosomes?

Note: "ask them for the faciest possible thing" seems confused.

How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
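To make the "point of highest probability density" reading concrete: for a standard Gaussian latent prior, the density is maximized at the all-zeros vector, which is why zeroing the latents samples the mode. A toy sketch of just the density claim (no real GAN or diffusion model involved; the generator call itself is omitted, and the latent dimensionality is an illustrative assumption):

```python
import math
import random

# For a latent prior z ~ N(0, I) in d dimensions, the log-density
#   log p(z) = -0.5 * (d * log(2*pi) + ||z||^2)
# is maximized at z = 0, so the "most probable latent" to feed a
# generator is simply the zero vector.
def gaussian_logpdf(z):
    d = len(z)
    return -0.5 * (d * math.log(2 * math.pi) + sum(x * x for x in z))

random.seed(0)
d = 512  # a typical GAN latent dimensionality (assumption for illustration)
z_mode = [0.0] * d
z_sample = [random.gauss(0, 1) for _ in range(d)]

# The zero latent has (weakly) higher prior density than any other point.
print(gaussian_logpdf(z_mode) >= gaussian_logpdf(z_sample))  # True
```

Whether the image a generator produces at the latent mode is also the "faciest" image is a separate empirical claim; the observation in the text is that in practice it looks like a perfectly normal face.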

I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier &... (read more)

e.g. by trying to apply standards of epistemic uncertainty to the state of this essence? 

I would say that there's a logical object that a large chunk of human moral discourse is trying to point at — something like "the rules of the logical game Morality", analogous to "the rules of the logical game Chess". Two people can both be discussing the same logical object "the rules of Chess", but have different beliefs about what that logical object's properties are. And just as someone can be mistaken or uncertain about the rules of chess — or about their int... (read more)

4 · Charlie Steiner · 4mo
When I think about the rules of chess, I basically treat them as having some external essence that I have epistemic uncertainty about. What this means mechanistically is:

  • When I'm unsure about the rules of chess, this raises the value of certain information-gathering actions, like checking the FIDE website, asking a friend, reading a book.
  • If I knew the outcomes of all those actions, that would resolve my uncertainty.
  • I have probabilities associated with my uncertainty, and updates to those probabilities based on evidence should follow Bayesian logic.
  • Decision-making under uncertainty should linearly aggregate the different possibilities that I'm uncertain over, weighted by their probability.

So the rules of chess are basically just a pattern out in the world that I can go look at. When I say I'm uncertain about the rules of chess, this is epistemic uncertainty that I manage the same as if I'm uncertain about anything else out there in the world. The "rules of Morality" are not like this.

  • When I'm unsure about whether I care about fish suffering, this does raise the value of certain information-gathering actions like learning more about fish.
  • But if I knew the outcomes of all those actions, this wouldn't resolve all my uncertainty.
  • I can put probabilities to various possibilities, and can update them on evidence using Bayesian logic - that part still works.
  • Decision-making under the remaining-after-evidence part of the uncertainty doesn't have to look like linear aggregation. In fact it shouldn't - I have meta-preferences like "conservatism," which says that I should trust models differently depending on whether they seem to be inside their domain of validity or not.

So there's a lot of my uncertainty about morality that doesn't stem from being unaware about facts. Where does it come from? One source is self-modeling uncertainty - how do I take the empirical facts about me and the world, and use tha

I also contend that it's the more epistemically humble position, because you're not saying that it's for sure that a pivotal act should be performed, but just that it's quite plausible given the specifics of the current world situation

The thing I'd say in favor of this position is that I think it better fits the evidence. I think the problem with the opposing view is that it's wrong, not that it's more confident. E.g., if I learned that Nate assigns probability .9 to "a pivotal act is necessary" (for some operationalization of "necessary") while Critch ass... (read more)

"The goal should be to cause the future to be great on its own terms"

What the heck is this supposed to mean? Great according to the Inherent Essence Of Goodness that lives inside futures, rather than as part of human evaluations?

The rest of the quote explains what this means:

The goal should be to cause the future to be great on its own terms, without locking in the particular moral opinions of humanity today — and without locking in the moral opinions of any subset of humans, whether that’s a corporation, a government, or a nation.

(If you can't s

... (read more)

The present is "good on its own terms", rather than "good on Ancient Romans' terms", because the Ancient Romans weren't able to lock in their values. If you think this makes sense (and is a good thing) in the absence of an Inherent Essence Of Goodness, then there's no reason to posit an Inherent Essence Of Goodness when we switch from discussing "moral progress after Ancient Rome" to "moral progress after circa-2022 civilization".

The present is certainly good on my terms (relative to ancient Rome). But the present itself doesn't care. It's not the type of ... (read more)

The wisest moves we've made as a species to date (ending slavery? ending smallpox? landing on the moon?) didn't particularly look like "worldwide collaborations".

I think Nate might've been thinking of things like:

  • Having all AGI research occur in one place is good (ceteris paribus), because then the AGI project can take as much time as it needs to figure out alignment, without worrying that some competitor will destroy the world with AGI if they go too slowly.
  • This is even truer if the global coordination is strong enough to prevent other x-risks (e.g., bio-
... (read more)
4 · Alex Flint · 4mo
Yeah I also have the sense that we mostly agree here. I have the sense that CEV stands for, very roughly, "what such-and-such a person would do if they became extremely wise", and the hope (which I think is a reasonable hope) is that there is a direction called "wisdom" such that if you move a person far enough in that direction then they become both intelligent and benevolent, and that this eventually doesn't depend super much on where you started.

The tricky part is that we are in this time where we have the option of making some moves that might be quite disruptive, and we don't yet have direct access to the wisdom that we would ideally use to guide our most significant decisions. And the key question is really: what do you do if you come into a position of really significant influence, at a time when you don't yet have the tools to access the CEV-level wisdom that you might later get?

And some people say it's flat-out antisocial to even contemplate taking any disruptive actions, while others say that given the particular configuration of the world right now and the particular problems we face, it actually seems plausible that a person in such a position of influence ought to seriously consider disruptive actions. I really agree with the latter, and I also contend that it's the more epistemically humble position, because you're not saying that it's for sure that a pivotal act should be performed, but just that it's quite plausible given the specifics of the current world situation. The other side of the argument seems to be saying that no no no it's definitely better not to do anything like that in anything like the current world situation.

I'd guess Nate might say one of:

  • Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)
  • Much more generally: we don't have an alignment approach that could realistically work fast (say, within ten months of inventing AGI rather than ten years), in the face of a sharp left turn, given inevitable p
... (read more)
3 · Noosphere89 · 4mo
Yeah, it does seem like interpretability is a bottleneck for a lot of alignment proposals, and in particular, as long as neural networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.

Thanks for the update, Ajeya! I found the details here super interesting.

I already thought that timelines disagreements within EA weren't very cruxy, and this is another small update in that direction: I see you and various MIRI people and Metaculans give very different arguments about how to think about timelines, and then the actual median year I tend to hear is quite similar.

(And also, all of the stated arguments on all sides continue to seem weak/inconclusive to me! So IMO there's not much disagreement, and it would be very easy for all of us to be wro... (read more)

3 · Ajeya Cotra · 6mo
Yeah I agree more of the value of this kind of exercise (at least within the community) is in revealing more granular disagreements about various things. But I do think there's value in establishing to more external people something high level like "It really could be soon and it's not crazy or sci fi to think so."

Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.

By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that

  • is highly safety-conscious.
  • has a la
... (read more)

One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems.

Why do you think it assumes that?

This isn't a coincidence; the state of alignment knowledge is currently "we have no idea what would be involved in doing it even in principle, given realistic research paths and constraints", very far from being a well-specified engineering problem. Cf. https://intelligence.org/2013/11/04/from-philosophy-to-math-to-engineering/.

If you succeed at the framework-inventing "how does one even do this?" stage, then you can probably deploy an enormous amount of engineering talent in parallel to help with implementation, small iterative improvements, building-upon-foundations, targeting-established-metrics, etc. tasks.

From A central AI alignment problem: capabilities generalization, and the sharp left turn:

Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they've put the AI in lots of different video-game and simulated environments, and they've had good luck training it to pursue an objective that the operators described in English. "I don't know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices", they say. At the same time, they apply

... (read more)
1 · Lauro Langosco · 6mo
Thanks!

(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I'm extra excited about people coming up with ideas for 3.)

When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there's no guarantee there's even a reasonable solution.

Why would there not be a solution?

To clarify, I said there might not be a reasonable solution (i.e. such that solving the strawberry problem isn't significantly harder than solving pivotal-act alignment). 

Not directly answering your Q, but here's why it seems unnatural and maybe misleading-of-attention. Copied from a Slack message I sent: 

First, I suspect that even an aligned AI would fail the "duplicate a strawberry and do nothing else" challenge, because such an AI would care about human life and/or about cooperating with humans, and would be asked to stand by while 1.8 humans

... (read more)

On my model, the point of ass numbers isn't to demand perfection of your gut (e.g., of the sort that would be needed to avoid multiple-stage fallacies when trying to conditionalize a lot), but to:

  1. Communicate with more precision than English-language words like 'likely' or 'unlikely' allow. Even very vague or uncertain numbers will, at least some of the time, be a better guide than natural-language terms that weren't designed to cover the space of probabilities (and that can vary somewhat in meaning from person to person).
  2. At least very vaguely and roughly b
... (read more)

Note that I advocate for considering much more weird solutions, and also thinking about much more weird world states, when talking with the "general world". While in contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.

Can I get us all to agree to push for including pivotal acts and pivotal processes in the Overton window, then? :) I'm happy to publicly talk about pivotal processes and encourage people to take them seriously as options to evaluate, while flagging that I'm ~2-5% on them be... (read more)

  • With pretty high confidence, you expect sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally different from e.g. Gato

That's right, though the phrasing "discovering a core of generality" here sounds sort of mystical and mysterious to me, which ma... (read more)

In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes". 

My objection to Critch's post wasn't 'you shouldn't talk about pivotal processes, just pivotal acts'. On the contrary, I think bringing in pivotal processes is awesome.

My objection (more so to "Pivotal Act" Intentions, but also to the new one) is specifically to the idea that we should socially shun the concept of "pivotal acts", and socia... (read more)

3 · Jan_Kulveit · 7mo
With the last point: I think I can roughly pass your ITT - we can try that, if you are interested. So, here is what I believe are your beliefs:

  • With pretty high confidence, you expect sharp left turn [https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization] to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally different from e.g. Gato
  • From your perspective, this is based on thinking deeply about the nature of such systems (note that this is mostly based on hypothetical systems, and an analogy with evolution)
  • My claim roughly is this is only part of what's going on, where the actual thing is: people start with a deep prior on "continuity in the space of intelligent systems". Looking into a specific question about hypothetical systems, their search in argument space is guided by this prior, and they end up mostly sampling arguments supporting their prior. (This is not to say the arguments are wrong.)
  • You probably don't agree with the above point, but notice the correlations:
    • You expect sharp left turn due to discontinuity in "architectures" dimensions (which is the crux according to you)
    • But you also expect jumps in capabilities of individual systems (at least I think so)
    • Also, you expect majority of hope in "sharp right turn" histories (in contrast to smooth right turn histories)
    • And more
  • In my view yours (or rather MIRI-esque) views on the above dimensions are correlated more than expected, which suggests the existence of a hidden variable/hidden model explaining the correlation.

Can't speak for Critch, but my view is p

An example of a possible "pivotal act" I like that isn't "melt all GPUs" is:

Use AGI to build fast-running high-fidelity human whole-brain emulations. Then run thousands of very-fast-thinking copies of your best thinkers. Seems to me this plausibly makes it realistic to keep tabs on the world's AGI progress, and locally intervene before anything dangerous happens, in a more surgical way rather than via mass property destruction of any sort.

Looking for pivotal acts that are less destructive (and, more importantly for humanity's sake, less difficult to align)... (read more)

2 · Jan_Kulveit · 7mo
In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes". As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Where things in this category get straw-manned as "Rube-Goldberg-machine-like".)

As often, one of the actual cruxes is in continuity assumptions [https://www.lesswrong.com/posts/cHJxSJ4jBmBRGtbaE/continuity-assumptions], where basically you have a low prior on "smooth trajectory changes by many acts" and a high prior on "sharp turns left or right". A second crux, as you note, is doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts, whereas people who are a few bits more optimistic may be much less excited about them, in particular if all plans for such acts have very unclear shapes of impact distributions.

Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I've a draft with a more explicit argument why.)

Some hopefully-unnecessary background info for people attempting this task:

A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".

An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."

Ronny Fernandez on Twitter:

I think I don’t like AI safety analogies with human evolution except as illustrations. I don’t think they’re what convinced the people who use those analogies, and they’re not what convinced me. You can convince yourself of the same things just by knowing some stuff about agency.

Corrigibility, human values, and figure-out-while-aiming-for-human-values, are not short description length. I know because I’ve practiced finding the shortest description lengths of things a lot, and they just don’t seem like the right sort of thing.

Also

... (read more)

From an Eliezer comment:

Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options. [...]

If Iceland did this, it would plausibly need... (read more)

I kind of like the analogous idea of an alignment target as a repeller cone / dome.

Corrigibility is a repeller. Human values aren't a repeller, but they're a very narrow target to hit.

3 · Vladimir Nesov · 8mo
In the sense of moving a system towards many possible goals? But I think in a more appropriate space (where the aiming should take place) it's again an attractor. Corrigibility is not a goal, a corrigible system doesn't necessarily have any well-defined goals, traditional goal-directed agents can't be corrigible in a robust way, and it should be possible to use it for corrigibility towards corrigibility, making this aspect stronger if that's what the operators work towards happening.

More generally, non-agentic aspects of behavior can systematically reinforce the non-agentic character of each other, preventing any opposing convergent drives (including the drive towards agency [https://www.lesswrong.com/posts/oiftkZnFBqyHGALwv/agents-as-p-b-chain-reactions]) from manifesting if they've been set up to do so. Sufficient intelligence/planning advantage pushes this past exploitability hazards, repelling selection theorems [https://www.lesswrong.com/posts/G2Lne2Fi7Qra5Lbuf/selection-theorems-a-program-for-understanding-agents], even as some of the non-agentic behaviors might be about maintaining specific forms of exploitability.

A lot of models of what can or can't work in AI alignment depends on intuitions about whether to expect "true discontinuities" or just "steep bits".

Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that shows continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).

Quoting Nate in 2018:

On my model, the key point

... (read more)
1Jan_Kulveit7mo
  Yes, but conversely, I could say I'd expect some curves to show discontinuous jumps, mostly in dimensions which no one really cares about. Clearly the cruxes are about discontinuities in dimensions which matter. As I tried to explain in the post, I think continuity assumptions mostly get you different things than "strong predictions about AGI timing".

I would paraphrase this as "assuming discontinuities at every level" -- both one-system training, and the more macroscopic exploration in the "space of learning systems" -- but stating the key disagreement is about the discontinuities in the space of model architectures, rather than in jumpiness of single model training. Personally, I don't think the distinction between 'movement by learning of a single model' and 'movement by scaling' and 'movement by architectural changes' will necessarily be big.

This seems to more or less support what I wrote: expecting a Big Discontinuity, and this being a pretty deep difference. My overall impression is Eliezer likes to argue against "Hansonian views", but something like "continuity assumptions" seems a much broader category than Robin's views.

In my view, continuity assumptions are not just about takeoff speeds. E.g., IDA makes much more sense in a continuous world: if you reach a cliff, working IDA should slow down and warn you. In the Truly Discontinuous world, you just jump off the cliff at some unknown step. I would guess probably a majority of all debates and disagreements between Paul and Eliezer have some "continuity" component: e.g., the question whether we can learn a lot of important alignment stuff on non-AGI systems is a typical continuity problem, but only tangentially relevant to takeoff speeds.

I'm not Eliezer, but my high-level attempt at this:

[...] The things I'd mainly recommend are interventions that:

  • Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/honest communication norms, and trying to build better mental models of crucial parts of the world.)
  • Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.
  • Help us understand and resolve major disagreements. (Especially current disagreements
... (read more)

I think most worlds that successfully navigate AGI risk have properties like:

  • AI results aren't published publicly, going back to more or less the field's origin.
  • The research community deliberately steers toward relatively alignable approaches to AI, which includes steering away from approaches that look like 'giant opaque deep nets'.
    • This means that you need to figure out what makes an approach 'alignable' earlier, which suggests much more research on getting de-confused regarding alignable cognition.
      • Many such de-confusions will require a lot of software ex
... (read more)

I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary? 

Yes!

I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day! 

True! Though everyone already agreed (e.g., EY asserted this in the OP) that it's possible in principle. The updatey thing would be if the case of the human genome / brain development sugg... (read more)

4Alex Turner8mo
Feat #2 is: Design a mind which cares about anything at all in reality which isn't a shallow sensory phenomenon directly observable by the agent. Like, maybe I have a mind-training procedure, where I don't know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I'm damn sure the AI will care about something besides its own sensory signals. Such a procedure would accomplish feat #2, but not #3.

Feat #3 is: Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.

I actually think that the dog- and diamond-maximization problems are about equally hard, and, to be totally honest, neither seems that bad[1] in the shard theory paradigm. Surprisingly, I weakly suspect the harder part is getting the agent to maximize real-world dogs in expectation, not getting the agent to care about real-world dogs at all. I think "figure out how to build a mind which cares about the number of real-world dogs, such that the mind intelligently selects plans which lead to a lot of dogs" is significantly easier than building a dog-maximizer.

1. ^ I appreciate that this claim is hard to swallow. In any case, I want to focus on inferentially-closer questions first, like how human values form.

Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world? 

Maybe I'm not understanding your proposal, but on the face of it this seems like a change of topic. I don't see Eliezer claiming 'there's no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head'. Maybe he does think that, but mostly I'd guess he doesn't care, because the important thing is whether you can point the AGI at very, very specifi... (read more)

6Alex Turner8mo
Hm, I'll give this another stab. I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?

Let me distinguish three alignment feats:

1. Producing a mind which terminally values sensory entities.
2. Producing a mind which reliably terminally values some kind of non-sensory entity in the world, like dogs or bananas.
   1. AFAIK we have no idea how to ensure this happens reliably -- to produce an AGI which terminally values some element of {diamonds, dogs, cats, tree branches, other real-world objects}, such that there's a low probability that the AGI actually just cares about high-reward sensory observations.
   2. In other words: Design a mind which cares about anything at all in reality which isn't a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don't know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I'm damn sure the AI will care about something besides its own sensory signals.
   3. I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
3. Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds [https://arbital.com/p/ontology_identification].
   1. Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.

For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.

In particular, I don't think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It's just not modeling the real world at all. It's not even trying to rack up a bunch of wins over time. It's just playing the si

... (read more)

Here's my answer: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=LowEED2iDkhco3a5d 

We have to actually figure out how to build aligned AGI, and the details are crucial. If you're modeling this as a random blog post aimed at persuading people to care about this cause area, a "voice of AI safety" type task, then sure, the details are less important and it's not so clear that Yet Another Marginal Blog Post Arguing For "Care About AI Stuff" matters much.

But humanity also has to do the task of actually figuring o... (read more)

On Twitter, Eric Rogstad wrote:

"the thing where it keeps being literally him doing this stuff is quite a bad sign"

I'm a bit confused by this part. Some thoughts on why it seems odd for him (or others) to express that sentiment...

1. I parse the original as, "a collection of EY's thoughts on why safe AI is hard". It's EY's thoughts, why would someone else (other than @robbensinger) write a collection of EY's thoughts?

(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in co

... (read more)
0handoflixue8mo
  I don't think making this list in 1980 would have been meaningful. How do you offer any sort of coherent, detailed plan for dealing with something when all you have is toy examples like Eliza?

We didn't even have the concept of machine learning back then - everything computers did in 1980 was relatively easily understood by humans, in a very basic step-by-step way. Making a 1980s computer "safe" is a trivial task, because we hadn't yet developed any technology that could do something "unsafe" (i.e. beyond our understanding). A computer in the 1980s couldn't lie to you, because you could just inspect the code and memory and find out the actual reality.

What makes you think this would have been useful? Do we have any historical examples to guide us in what this might look like?

The conclusion we should take from the concept of mesa-optimisation isn't "oh no alignment is impossible", that's equivalent to "oh no learning is impossible".

The OP isn't claiming that alignment is impossible.

If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.

I don't understand the point you're making here.

The point I'm making is that the human example tells us that: 

First we realize that we can't code up our values, so alignment is hard. Then, when we realize that mesa-optimisation is a thing, we shouldn't update towards "alignment is even harder". We should update in the opposite direction. 

Because the human example tells us that a mesa-optimiser can reliably point to a complex thing even if the optimiser points to only a few crude things. 

But I only ever see these three points (the human example, our inability to code up values, mesa-optimisation) each used separately to argue that "alignment is even harder than previously thought". Taken together, that is just not the picture. 

this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.

I don't think these statements all need to be true in order for p(doom) to be high, and I also don't think they're independent. Indeed, they seem more disjunctive than conjunctive to me; there are many cases where any one of the claims being true increases risk substantially, even if many others are false.
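The conjunctive-vs-disjunctive distinction is easy to see with made-up numbers (the 0.5 per-claim probability below is purely illustrative, and the independence assumption is exactly what the comment disputes): requiring all 42 claims to hold gives a vanishing probability, while letting any one claim suffice gives a probability near 1.

```python
# Toy illustration with made-up probabilities: suppose each of the
# 42 claims independently holds with probability 0.5.
p = 0.5
n = 42

# Conjunctive reading: doom requires ALL claims to be true.
p_conjunctive = p ** n

# Disjunctive reading: ANY single claim being true suffices for doom.
p_disjunctive = 1 - (1 - p) ** n

print(p_conjunctive)  # astronomically small (~2.3e-13)
print(p_disjunctive)  # essentially 1
```

The same list of claims thus supports a high p(doom) under a disjunctive reading and a negligible one under a conjunctive reading, which is why the two readings are the actual crux here.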

1David Scott Krueger8mo
I basically agree. I am arguing against extreme levels of pessimism (~>99% doom).

a mistake to leave you as the main "public advocate / person who writes stuff down" person for the cause.

It sort of sounds like you're treating him as the sole "person who writes stuff down", not just the "main" one. Noam Chomsky might have been the "main linguistics guy" in the late 20th century, but people didn't expect him to write more than a trivial fraction of the field's output, either in terms of high-level overviews or in-the-trenches research.

I think EY was pretty clear in the OP that this is not how things go on earths that survive. Even if there aren't many who can write high-level alignment overviews today, more people should make the attempt and try to build skill.

0handoflixue8mo
In the counterfactual world where Eliezer was totally happy continuing to write articles like this and being seen as the "voice of AI Safety", would you still agree that it's important to have a dozen other people also writing similar articles?

I'm genuinely lost on the value of having a dozen similar papers - I don't know of a dozen different versions of fivethirtyeight.com or GiveWell, and it never occurred to me to think that the world is worse for only having one of those.

The counter-concern is that if humanity can't talk about things that sound like sci-fi, then we just die. We're inventing AGI, whose big core characteristic is 'a technology that enables future technologies'. We need to somehow become able to start actually talking about AGI.

One strategy would be 'open with the normal-sounding stuff, then introduce increasingly weird stuff only when people are super bought into the normal stuff'. Some problems with this:

  • A large chunk of current discussion and research happens in public; if it had to happen in private becau
... (read more)

I mean, all of this feels very speculative and un-cruxy to me; I wouldn't be surprised if the ASI indeed is able to conclude that humanity is no threat at all, in which case it kills us just to harvest the resources.

I do think that normal predators are a little misleading in this context, though, because they haven't crossed the generality ('can do science and tech') threshold. Tigers won't invent new machines, so it's easier to upper-bound their capabilities. General intelligences are at least somewhat qualitatively trickier, because your enemy is 'the space of all reachable technologies' (including tech that may be surprisingly reachable). Tigers can surprise you, but not in very many ways and not to a large degree.
