Wei Dai's Comments

Curiosity Killed the Cat and the Asymptotically Optimal Agent

From your paper:

It is interesting to note that AIXI, a Bayes-optimal reinforcement learner in general environments, is not asymptotically optimal [Orseau, 2010], and indeed, may cease to explore [Leike et al., 2015]. Depending on its prior and its past observations, AIXI may decide at some point that further exploration is not worth the risk. Given our result, this seems like reasonable behavior.

Given this, why is your main conclusion "Perhaps our results suggest we are in need of more theory regarding the 'parenting' of artificial agents" instead of "We should use Bayesian optimality instead of asymptotic optimality"?
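
To make concrete the kind of calculation the quoted passage is pointing at, here is a toy two-armed bandit (entirely my own illustration, not the paper's setup or the AIXI formalism), in which a Bayes-optimal agent with a pessimistic enough prior never explores, and so is not asymptotically optimal in worlds where the unexplored arm is actually better:

```python
# Toy illustration (not AIXI): a Bayes-optimal agent that rationally never explores.
# Safe arm: known reward 0.5 every step. Risky arm: with prior probability p it is a
# "treasure" (reward 1.0 per step once identified), otherwise a "trap" costing c once,
# after which the agent returns to the safe arm. Discount factor gamma. All numbers
# are made up for illustration.

gamma = 0.99   # discount factor
p = 0.1        # prior probability the risky arm is the treasure
c = 100.0      # one-time loss if the risky arm is the trap

v_safe = 0.5 / (1 - gamma)                       # value of always pulling the safe arm
v_treasure = 1.0 / (1 - gamma)                   # value if the risky arm turns out good
v_trap = -c + gamma * v_safe                     # value if it turns out bad, then switch back
v_explore = p * v_treasure + (1 - p) * v_trap    # expected value of trying the risky arm once

print(f"value of never exploring:   {v_safe:.1f}")
print(f"expected value of exploring: {v_explore:.1f}")
# With these numbers v_explore < v_safe, so the Bayes-optimal policy never tries the
# risky arm -- and therefore never converges to the optimal policy in worlds where the
# risky arm really is the treasure, i.e., it is not asymptotically optimal.
```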

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I agree that this is troubling, though I think it’s similar to how I wouldn’t want the term biorisk to be expanded ...

Well, as I said, natural language doesn't have to be perfectly logical, and I think "biorisk" is somewhat in that category, but there's an explanation that makes it a bit more reasonable than it might first appear, which is that the "bio" refers not to "biological" but to "bioweapon". This is actually one of the definitions that Google gives when you search for "bio": "relating to or involving the use of toxic biological or biochemical substances as weapons of war. 'bioterrorism'"

I guess the analogous thing would be if we start using "AI" to mean "technical AI accidents" in a bunch of phrases, which feels worse to me than the "bio" case, maybe because "AI" is a standalone word/acronym instead of a prefix? Does this make sense to you?

Not to say that’s what you are doing with AI risk. I’m worried about what others will do with it if the term gets expanded.

But the term was expanded from the beginning. Have you actually observed it being used in ways that you fear (and which would be prevented if we were to redefine it more narrowly)?

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Also, isn't defining "AI risk" as "technical accidental AI risk" analogous to defining "apple" as "red apple" (in terms of being circular/illogical)? I realize natural language doesn't have to be perfectly logical, but this still seems a bit too egregious.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

But I am optimistic about the actual risks that you and others argue for.

Why? I actually wrote a reply that was more questioning in tone, and then changed it because I found some comments you made where you seemed to be concerned about the additional AI risks. Good thing I saved a copy of the original reply, so I'll just paste it below:

I wonder if you would consider writing an overview of your perspective on AI risk strategy. (You do have a sequence but I'm looking for something that's more comprehensive, that includes e.g. human safety and philosophical problems. Or let me know if there's an existing post that I've missed.) I ask because you're one of the most prolific participants here but don't fall into one of the existing "camps" on AI risk for whom I already have good models. It's happened several times that I see a comment from you that seems wrong or unclear, but I'm afraid to risk being annoying or repetitive with my questions/objections. (I sometimes worry that I've already brought up some issue with you and then forgot your answer.) It would help a lot to have a better model of you in my head and in writing so I can refer to that to help me interpret what the most likely intended meaning of a comment is, or to predict how you would likely answer if I were to ask certain questions.

It’s notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines) and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that’s what the four people argued against).

Maybe that's because the question was asked in a way that indicated the questioner was mostly interested in technical accidental AI risk? And some of them may be fine with defining "AI risk" as "AI-caused x-risk" but just didn't have the other risks on the top of their minds, because their personal focus is on the technical/accidental side. In other words I don't think this is strong evidence that all 4 people would endorse defining "AI risk" as "technical accidental AI risk". It also seems notable that I've been using "AI risk" in a broad sense for a while and no one has objected to that usage until now.

I would certainly support having clearer definitions and terminology if we could all agree on them.

The current situation seems to be that we have two good (relatively clear) terms "technical accidental AI risk" and "AI-caused x-risk" and the dispute is over what plain "AI risk" should be shorthand for. Does that seem fair?

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

AI risk is just a shorthand for “accidental technical AI risk.”

I don't think "AI risk" was originally meant to be a shorthand for "accidental technical AI risk". The earliest considered (i.e., not off-hand) usage I can find is in the title of Luke Muehlhauser's AI Risk and Opportunity: A Strategic Analysis where he defined it as "the risk of AI-caused extinction".

(He used "extinction" but nowadays we tend think in terms of "existential risk" which also includes "permanent large negative consequences", which seems like an reasonable expansion of "AI risk".)

However, I disagree with the idea that we should expand the word AI risk to include philosophical failures and intentional risks.

I want to include philosophical failures, as long as the consequences of the failures flow through AI, because (aside from historical usage) technical problems and philosophical problems blend into each other, and I don't see a point in drawing an arbitrary and potentially contentious border between them. (Is UDT a technical advance or a philosophical advance? Is defining the right utility function for a Sovereign Singleton a technical problem or a philosophical problem? Why force ourselves to answer these questions?)

As for "intentional risks" it's already common practice to include that in "AI risk":

Dividing AI risks into misuse risks and accident risks has become a prevailing approach in the field.

Besides that, I think there's also a large grey area between "accident risk" and "misuse" where the risk partly comes from technical/philosophical problems and partly from human nature. For example humans might be easily persuaded by wrong but psychologically convincing moral/philosophical arguments that AIs can come up with and then order their AIs to do terrible things. Even pure intentional risks might have technical solutions. Again I don't really see the point of trying to figure out which of these problems should be excluded from "AI risk".

It becomes unclear in conversation what people mean when they say AI risk

It seems perfectly fine to me to use that as shorthand for "AI-caused x-risk" and use more specific terms when we mean more specific risks.

Like The Singularity, it becomes a buzzword

What do you mean? Like people will use "AI risk" when their project has nothing to do with "AI-caused x-risk"? Couldn't they do that even if we define "AI risk" to be "accidental technical AI risk"?

Journalists start projecting Terminator scenarios onto the words, and now have justification because even the researchers say that AI risk can mean a lot of different things.

Terminator scenarios seem to be scenarios of "accidental technical AI risk" (they're just not very realistic scenarios) so I don't see how defining "AI risk" to mean that would prevent journalists from using Terminator scenarios to illustrate "AI risk".

It puts a whole bunch of types of risk into one basket, suggesting to outsiders that all attempts to reduce “AI risk” might be equally worthwhile.

I don't think this is a good argument, because even within "accidental technical AI risk" there are different problems that aren't equally worthwhile to solve, so why aren't you already worried about outsiders thinking all those problems are equally worthwhile?

ML researchers start to distrust AI risk researchers, because people who are worried about the Terminator are using the same words as the AI risk researchers and therefore get associated with them.

See my response above regarding "Terminator scenarios".

This can all be avoided by having a community norm to clarify that we mean technical accidental risk when we say AI risk, and when we’re talking about other types of risks we use more precise terminology.

I propose that we instead stick with historical precedent and keep "AI risk" to mean "AI-caused x-risk" and use more precise terminology to refer to more specific types of AI-caused x-risk that we might want to talk about. Aside from what I wrote above, it's just more intuitive/commonsensical that "AI risk" means "AI-caused x-risk" in general instead of a specific kind of AI-caused x-risk.

However I appreciate that someone who works mostly on the less philosophical / less human-related problems might find it tiresome to say or type "technical accidental AI risk" all the time to describe what they do or to discuss the importance of their work, and can find it very tempting to just use "AI risk". It would probably be good to create a (different) shorthand or acronym for it to remove this temptation and to make their lives easier.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Ok, I wasn't sure that you'd agree, but given that you do, it seems that when you wrote the title of this newsletter "Why AI risk might be solved without additional intervention from longtermists" you must have meant "Why some forms of AI risk ...", or perhaps certain forms of AI risk just didn't come to your mind at that time. In either case it seems worth clarifying somewhere that you don't currently endorse interpreting "AI risk" as "AI risk in its entirety" in that sentence.

Similarly, on the inside you wrote:

The main reason I am optimistic about AI safety is that we will see problems in advance, and we will solve them, because nobody wants to build unaligned AI. A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn’t scale. I don’t know why there are different underlying intuitions here.

It seems worth clarifying that you're only optimistic about certain types of AI safety problems.

(I'm basically making the same complaint/suggestion that I made to Matthew Barnett not too long ago. I don't want to be too repetitive or annoying, so let me know if I'm starting to sound that way.)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

But on the strong versions of warning shots, where there’s common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)

To the extent that we expect strong warning shots and ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researchers/longtermists to work on / advocate for safety problems beyond the standard of "AGI is not trying to deceive us or work against us" (because that standard will likely be reached anyway). Do you agree?

The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone.

Some types of AI x-risk don't affect everyone though (e.g., ones that reduce the long term value of the universe or multiverse without killing everyone in the near term).

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Faced with an actual example, I’m realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near and b) an example where the AI algorithm “deliberately” causes a problem (i.e. “with full knowledge” that the thing it was doing was not what we wanted).

What do you expect the ML community to do at that point? Coordinate to stop or slow down the race to AGI until AI safety/alignment is solved? Or do you think each company/lab will unilaterally invest more into safety/alignment without slowing down capability research much, and that will be sufficient? Or something else?

I worry about a parallel with the "energy community", a large part of which not just ignores but actively tries to obscure or downplay warning signs about future risks associated with certain forms of energy production. Given that the run-up to AGI will likely generate huge profits for AI companies as well as provide clear benefits for many people (compared to which, the disasters that will have occurred by then may well seem tolerable), and given probable disagreements between different experts about how serious the future risks are, it seems likely to me that AI risk will become politicized/controversial in a way similar to climate change, which will prevent effective coordination around it.

On the other hand... maybe AI will be more like nuclear power than fossil fuels, and a few big accidents will stall its deployment for quite a while. Is this why you're relatively optimistic about AI risk being taken seriously, and if so can you share why you think nuclear power is a closer analogy?

What can the principal-agent literature tell us about AI risk?

Thanks for making the changes, but even with "PAL confirms that due to diverging interests and imperfect monitoring, AI agents could get some rents." I'd still like to understand why imperfect monitoring could lead to rents, because I don't currently know a model that clearly shows this (i.e., where the rent isn't due to the agent having some other kind of advantage, like not having many competitors).

Also, I get that the PAL in its current form may not be directly relevant to AI, so I'm just trying to understand it on its own terms for now. Possibly I should just dig into the literature myself...

What can the principal-agent literature tell us about AI risk?

PAL confirms that due to diverging interests and imperfect monitoring, agents will get some rents.

Can you provide a source for this, or explain more? I'm asking because your note about competition between agents reducing agency rents made me think that such competition ought to eliminate all rents that the agent could (for example) gain by shirking, because agents will bid against each other to accept lower wages until they have no rent left. For example, in the model of the principal-agent problem presented in this lecture (which has diverging interests and imperfect monitoring) there is no agency rent. (ETA: This model does not have explicit competition between agents, but it models the principal as having all of the bargaining power, by letting it make a take-it-or-leave-it offer to the agent.)
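
For concreteness, here is a minimal sketch of the kind of textbook moral-hazard setup I have in mind (my own notation and framing, which may not match the lecture's):

```latex
% A minimal moral-hazard sketch (my notation; may not match the lecture's).
% The agent's action a is unobservable and output x ~ f(x | a) is only a noisy
% signal of it (imperfect monitoring); the agent dislikes effort (diverging interests).
% The principal makes a take-it-or-leave-it offer of a wage schedule w(x):
\begin{align*}
\max_{a,\,w(\cdot)}\quad & \mathbb{E}\!\left[\,x - w(x) \mid a\,\right] \\
\text{s.t.}\quad & \mathbb{E}\!\left[\,u(w(x)) \mid a\,\right] - c(a) \ge \bar{u}
    && \text{(participation)} \\
& a \in \arg\max_{a'}\ \mathbb{E}\!\left[\,u(w(x)) \mid a'\,\right] - c(a')
    && \text{(incentive compatibility)}
\end{align*}
% Because the principal has all the bargaining power, the participation constraint
% binds at the optimum, so the agent's expected utility equals the outside option
% \bar{u}: no agency rent, despite the imperfect monitoring.
```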

If agents only earn rents when there isn't enough competition, that seems more like "monopoly rent" than "agency rent", plus it seemingly wouldn't apply to AIs... Can you help me develop a better intuition of where agency rents come from, according to PAL?
