Where I agree and disagree with Eliezer

I doubt that’s the primary component that makes the difference. Other countries which did mostly sensible things early are eg Australia, Czechia, Vietnam, New Zealand, Iceland.

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.

My main claim isn’t about what a median response would be, but something like “difference between median early covid governmental response and actually good early covid response was something between 1 and 2 sigma; this suggests bad response isn’t over-determined, and sensible responses are within human reach”.

This seems to depend on response to AI risk being of similar difficulty as response to COVID. I think people who updated towards "bad response to AI risk is overdetermined" did so partly on the basis that the AI risk challenge is much harder. (In other words, if the median government has done this badly against COVID, what chance does it have against something much harder?) I wrote down a list of things that make COVID an easier challenge, which I now realize may be a bit of a tangent if that's not the main thing you want to argue about, but I'll put it down here anyway so as to not waste it.

  1. it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures
  2. previous human experiences with pandemics, including very similar ones like SARS
  3. there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
  4. COVID isn't agenty and can't fight back intelligently
  5. potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)
Where I agree and disagree with Eliezer

For people who doubt this, I’d point to variance in initial governmental-level response to COVID19, which ranged from “highly incompetent” (eg. early US) to “quite competent” (eg Taiwan).

Seems worth noting that Taiwan is an outlier in terms of average IQ of its population. Given this, I find it pretty unlikely that typical governmental response to AI would be more akin to Taiwan than the US.

AGI Ruin: A List of Lethalities

I think until recently, I've been consistently more pessimistic than Eliezer about AI existential safety. Here's a 2004 SL4 post for example where I tried to argue against MIRI (SIAI at the time) trying to build a safe AI (and again in 2011). I've made my own list of sources of AI risk that's somewhat similar to this list. But it seems to me that there are still various "outs" from certain doom, such that my probability of a good outcome is closer to 20% (maybe a range of 10-30% depending on my mood) than 1%.

  1. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

One of the biggest "outs" I see is that it turns out to be not that hard "to train a powerful system entirely on imitation of human words or other human-legible contents", we (e.g., a relatively responsible AI lab) train such a system and then use it to differentially accelerate AI safety research. I definitely think that it's very risky to rely on such black-box human imitations for existential safety, and that a competent civilization would be pursuing other plans where they can end up with greater certainty of success, but it seems there's something like a 20% chance that it just works out anyway.

To explain my thinking a bit more, human children have to learn how to think human thoughts through "imitation of human words or other human-legible contents". It's possible that they can only do this successfully because their genes contain certain key ingredients that enable human thinking, but it also seems possible that children are just implementations of some generic imitation learning algorithm, so our artificial learning algorithms (once they become advanced/powerful enough) won't be worse at learning to think like humans. I don't know how to rule out the latter possibility with very high confidence. Eliezer, if you do, can you please explain this more?

Godzilla Strategies

I was going to make a comment to the effect that humans are already a species of Godzilla (humans aren't safe, human morality is scary, yada yada), only to find you making the same analogy, but with an optimistic slant. :)

[Link] A minimal viable product for alignment

The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement:

It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them.

Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifically generating AES with the background knowledge in mind? Maybe the latter is on the order of 10% of the former?

Then consider that we don't actually know that AES is secure, because we don't know all the possible attacks and we don't know how to prove it secure; i.e., we don't know how to recognize a good cryptosystem. Suppose one day we figure that out; wouldn't finding an actually good cryptosystem be trivial at that point, compared to all the previous effort?

Some of your other points are valid, I think, but cryptography is just easier than alignment (don't have time to say more as my flight is about to take off), and philosophy is perhaps a better analogy for the more general point.

[Link] A minimal viable product for alignment

If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.

But this is pretty likely the case, isn't it? Actually, I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and formalizing the arguments is both very difficult and doesn't help much, because you're still depending on informal arguments for why the formalized theoretical concepts correspond well enough to the pre-theoretical concepts that we actually care about). Look at the history of philosophy, or cryptography, if you doubt this.

But suppose we're able to convince people to distrust their intuitive sense of how good an argument is, and to keep looking for hidden flaws and counterarguments (which might have their own hidden flaws, and so on). Well, how do we know when it's safe to end this process and actually hit the run button?

A broad basin of attraction around human values?

I think Paul’s argument amounts to saying that a corrigibility approach focuses directly on mitigating the “lock-in” of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking-in its best guess.

What's the actual content of the argument that this is true? From my current perspective, corrigible AI still has a very high risk of lock-in of wrong preferences, due to bad metapreferences of the overseer. And ambitious value learning, or some ways of doing it, could turn out to be less risky with respect to lock-in: for example, you could potentially examine the metapreferences that a value-learning AI has learned, which might make it more obvious that they're not safe enough as is, triggering attempts to do something about that.

A broad basin of attraction around human values?

My inclination is to guess that there is a broad basin of attraction if we’re appropriately careful in some sense (and the same seems true for corrigibility).

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

What do you think the chances are of humanity being collectively careful enough, given that (in addition to the bad metapreferences I cited in the OP) it's devoting approximately 0.0000001% of its resources (3 FTEs, to give a generous overestimate) to studying either metaphilosophy or metapreferences in relation to AI risk, just years or decades before transformative AI will plausibly arrive?

One reason some people cited ~10 years ago for being optimistic about AI risk was that they expected that as AI got closer, human civilization would start paying more attention to AI risk and quickly ramp up its efforts on that front. That seems to be happening on some technical aspects of AI safety/alignment, but not on metaphilosophy/metapreferences. I am puzzled why almost no one is as (visibly) worried about this as I am, as my update (in response to the lack of ramp-up) is that (unless something changes soon) we're screwed unless we're (logically) lucky and the attractor basin just happens to be thick along all dimensions.

General alignment plus human values, or alignment via human values?

In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.

I see, but I think at least part of the problem with threats is that I'm not sure what I care about, which greatly increases my "attack surface". For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn't be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).

Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between probabilities of existential catastrophe by the given problem assuming no intervention from longtermists).

This seems really extreme, if I'm not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?

Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.

Given that humans are liable to be persuaded by bad counterarguments too, I'd be concerned that the AI will always "know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments." Since it's not safe to actually look at the counterarguments found by your own AI, it's not really helping at all. (Or it makes things worse if the user isn't very cautious and does look at their AI's counterarguments and gets persuaded by them.)

I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.

I think most people don't think very long term and aren't very rational. They'll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling "purity spirals" of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)

I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.

This seems plausible to me, but I don't see how one can have enough confidence in this view that one isn't very worried about the opposite being true and constituting a significant x-risk.

General alignment plus human values, or alignment via human values?

To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but still a failure occurs.

I'm not sure I understand the distinction that you're drawing here. (It seems like my scenarios could also be interpreted as failures where AIs don't know enough about human values, or maybe where humans themselves don't know enough about human values.) What are some examples of what your claim was about?

I do still think they are not as important as intent alignment.

As in, the total expected value lost through such scenarios isn't as large as the expected value lost through the risk of failing to solve intent alignment? Can you give some ballpark figures of how you see each side of this inequality?

Mostly I’d hope that AI can tell what philosophy is optimized for persuasion

How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?

or at least is capable of presenting counterarguments persuasively as well.

You mean every time you hear a philosophical argument, you ask your AI to produce some counterarguments optimized for persuasion? If so, won't your friends be afraid to send you any arguments they think of, for fear of your AI superhumanly persuading you to the opposite conclusion?

And I don’t expect a large number of people to explicitly try to lock in their values.

A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn't they ask their AI for help with this? Or do you imagine them asking for something like "more faith", but AIs understand human values well enough to not interpret that as "lock in values"?

It seems odd to me that it’s sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don’t particularly expect this to happen.

The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality. For example, how much should I care about my copies in the simulations or my subjective future experience, versus the value that would be lost in the base reality if I were to give in to the simulators' demands? Should I make a counterthreat? Are there any thoughts I or my AI should avoid having, or computations we should avoid doing?

I don’t expect AIs to have clean crisp utility functions of the form “maximize paperclips” (at least initially).

I expect that AIs (or humans) who are less cautious, or who think their values can be easily expressed as utility functions, will do this first, thereby gaining an advantage over everyone else and maybe forcing them to follow suit.

I expect this to be way less work than the complicated plans that the AI is enacting, so it isn’t a huge competitiveness hit.

I don't think it's so much that the coordination involving humans is a lot of work, but rather that we don't know how to do it in a way that doesn't cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I'd need to see more details before I update.)