There are so many causes or sources of AI risk that it's getting hard to keep them all in mind. I propose we keep a list of the main sources (that we know about), such that we can say: if none of these things happen, then we've mostly eliminated AI risk (as an existential risk), at least as far as we can determine. Here's a list that I spent a couple of hours enumerating and writing down. Did I miss anything important?

  1. Insufficient time/resources for AI safety (e.g., caused by an intelligence explosion or an AI race)
  2. Insufficient global coordination, leading to the above
  3. Misspecified or incorrectly learned goals/values
  4. Inner optimizers
  5. ML differentially accelerating easy-to-measure goals
  6. Paul Christiano's "influence-seeking behavior" (a combination of 3 and 4 above?)
  7. AI generally accelerating intellectual progress in a wrong direction (e.g., accelerating unsafe/risky technologies more than knowledge/wisdom about how to safely use those technologies)
  8. Metaethical error
  9. Metaphilosophical error
  10. Other kinds of philosophical errors in AI design (e.g., giving AI a wrong prior or decision theory)
  11. Other design/coding errors (e.g., accidentally putting a minus sign in front of a utility function, or a supposedly corrigible AI not actually being corrigible); a toy sketch of the sign-flip case follows below the list
  12. Doing acausal reasoning in a wrong way (e.g., failing to make good acausal trades, being acausally extorted, failing to acausally influence others who can be so influenced)
  13. Human-controlled AIs ending up with wrong values due to insufficient "metaphilosophical paternalism"
  14. Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can't be "balanced out" later) prior to reaching moral/philosophical maturity
  15. Intentional corruption of human values
  16. Unintentional corruption of human values
  17. Mind crime (disvalue unintentionally incurred through morally relevant simulations in AIs' minds)
  18. Premature value lock-in (i.e., freezing one's current conception of what's good into a utility function)
  19. Extortion between AIs leading to vast disvalue
  20. Distributional shifts causing apparently safe/aligned AIs to stop being safe/aligned
  21. Value drift and other kinds of error as AIs self-modify, or AIs failing to solve value alignment for more advanced AIs
  22. Treacherous turn / loss of property rights due to insufficient competitiveness of humans & human-aligned AIs
  23. Gradual loss of influence due to insufficient competitiveness of humans & human-aligned AIs
  24. Utility maximizers / goal-directed AIs having an economic and/or military competitive advantage due to relative ease of cooperation/coordination and of defense against value corruption and other forms of manipulation and attack, leading to one or more of the above
  25. In general, the most competitive type of AI being too hard to align or to safely use
  26. Computational resources being too cheap, leading to one or more of the above

(With this post I mean, among other things, to re-emphasize the disjunctive nature of AI risk. The list isn't fully disjunctive (some of the items are subcategories or causes of others); I mostly gave a source of AI risk its own number if it seemed important to make that source more salient. Maybe once we have a list of everything that is important, it would make sense to create a graph out of it.)
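To make #11's "coding errors" category concrete, here is a toy, purely illustrative sketch (all names and numbers are made up, not taken from any real system) of how a single flipped sign can silently invert an objective: the training loop runs without any errors, but the optimizer pushes the parameter away from the intended optimum instead of toward it.

```python
# Toy illustration of failure mode 11: one stray minus sign turns gradient
# ascent on the intended reward into ascent on its negation, and nothing
# crashes -- the bug is only visible in the behavior that gets optimized.

def reward(x: float) -> float:
    """Intended objective: maximize -(x - 3)^2, which peaks at x = 3."""
    return -(x - 3.0) ** 2

def reward_gradient(x: float) -> float:
    """Analytic gradient of the intended reward with respect to x."""
    return -2.0 * (x - 3.0)

x = 0.0
learning_rate = 0.05
for _ in range(50):
    # BUG: the stray minus sign makes each step go *against* the intended
    # reward. The intended update was: x += learning_rate * reward_gradient(x)
    x += learning_rate * (-reward_gradient(x))

print(f"x after optimization: {x:.1f} (intended optimum: 3.0)")
# The parameter diverges away from 3 instead of converging to it, while every
# line of code still "works".
```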

Added on 6/13/19:

  27. Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (suggested by William Saunders)
  28. Economics of AGI causing concentration of power amongst human overseers
  29. Inability to specify any ‘real-world’ goal for an artificial agent (suggested by Michael Cohen)
  30. AI systems end up controlled by a group of humans representing a small range of human values (i.e., an ideological or religious group that imposes values on everyone else) (suggested by William Saunders)

Added on 2/3/2020:

  31. Failing to solve the commitment races problem, i.e., building AI in such a way that some sort of disastrous outcome occurs due to unwise premature commitments (or unwise hesitation in making commitments!). This overlaps significantly with #27, #19, and #12.

Added on 3/11/2020:

  32. Demons in imperfect search (similar to, but distinct from, inner optimizers). See here for an illustration.

Added on 10/4/2020:

  33. Persuasion tools or some other form of narrow AI leads to a massive deterioration of collective epistemology, dooming humanity to stumble inexorably into some disastrous end or other.

Added on 8/31/2021:

  34. Vulnerable world type 1: narrow AI enables many people to destroy the world, e.g., R&D tools that dramatically lower the cost of building WMDs.
  35. Vulnerable world type 2a: we end up with many powerful actors able and incentivized to create civilization-devastating harms.

[Edit on 1/28/2020: This list was created by Wei Dai. Daniel Kokotajlo offered to keep it updated and prettify it over time, and so was added as a coauthor.]


Thank you for making this list. I think it is important enough to be worth continually updating and refining; if you don't do it then I will myself someday. Ideally there'd be a whole webpage or something, with the list refined so as to be disjunctive, and each element of the list catchily named, concisely explained, and accompanied by a memorable and plausible example. (As well as lots of links to literature.)

I think the commitment races problem is mostly but not entirely covered by #12 and #19, and at any rate might be worth including since you are OK with overlap.

Also, here's a good anecdote to link to for the "coding errors" section: https://openai.com/blog/fine-tuning-gpt-2/

Thank you for making this list. I think it is important enough to be worth continually updating and refining; if you don’t do it then I will myself someday.

Please do. I seem to get too easily distracted these days for this kind of long-term maintenance work. I'll ask the admins to give you edit permission on this post (if possible), and you can also copy the contents into a wiki page or your own post if you want to do that instead.

Ha! I woke up this morning to see my own name as author; that wasn't what I had in mind, but it sure does work to motivate me to walk the talk! Thanks!

Update: It has failed to motivate me. I made one or two edits to the list but haven't done anything like the thorough encyclopedic accounting I originally envisioned. :(

Done! Daniel should now be able to edit the post. 

  • AI systems end up controlled by a group of humans representing a small range of human values (i.e., an ideological or religious group that imposes values on everyone else). While this is not caused only by AI design, it is possible that design decisions could affect the likelihood of this scenario (e.g., at what point values are loaded into the system, and how many people's values are loaded), and it is relevant for overall strategy.
  • Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about). For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human's priors are most accurate (on potentially irrelevant issues), even if this isn't what humans actually want.
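(Roughly, and hedging because this is my paraphrase from memory rather than the paper's own statement: the result there is that a policy serving two principals with beliefs $P_1, P_2$ and utilities $U_1, U_2$ is Pareto optimal exactly when, for some nonnegative weights $w_1, w_2$, it maximizes

$$w_1\,\mathbb{E}^{\pi}_{P_1}[U_1] \;+\; w_2\,\mathbb{E}^{\pi}_{P_2}[U_2],$$

where each expectation is taken under that principal's own beliefs. After a history $h$ has been observed, the effective weight on principal $i$ scales with $w_i P_i(h)$, so control drifts toward whichever principal's prior better predicted the observations; that is the sense in which the future gets split according to whose priors are most accurate.)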

Failure to learn how to deal with alignment in the many-humans, many-AIs case even if single-human, single-AI alignment is solved (which I think Andrew Critch has talked about).

Good point, I'll add this to the list.

For example, AIs negotiating on behalf of humans take the stance described in https://arxiv.org/abs/1711.00363 of agreeing to split control of the future according to which human’s priors are most accurate (on potentially irrelevant issues), even if this isn’t what humans actually want.

Thanks, I hadn't noticed that paper until now. Under "Related Works" it cites Social Choice Theory but doesn't actually mention any recent research from that field. Here is one paper that criticizes the Pareto principle that Critch's paper is based on, in the context of preference aggregation of people with different priors: Spurious Unanimity and the Pareto Principle

3. Misspecified or incorrectly learned goals/values

I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.

Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.

The single technical problem that appears biggest to me is that we don't know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still don't know how to design an agent that maximizes that number (instead of taking over the world and tampering with the cameras that are aimed at the screen, or with the optical character recognition program used to decipher the image). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?

I'm not sure if I meant to include this when I wrote 3, but it does seem like a good idea to break it out into its own item. How would you suggest phrasing it? "Wireheading" or something more general or more descriptive?

Maybe something along the lines of "Inability to specify any 'real-world' goal for an artificial agent"?