I'm back, reading more papers in ethics (but now also in philosophy and wherever the citations lead me). Unlike last time when I surveyed one journal's approach to AI in general, this time I'm searching specifically for interesting papers that bear on aligning AGI, preferably written by people I've never heard of before. Are they going to be any good? *shrug*
To this end, I skimmed the titles and sometimes abstracts of the last 5 years of papers in a pretty big chunk of the AI ethics space, namely:
From the set of all papers that even had a remote chance at being relevant (~10 per journal per year), I read more deeply and am relaying to you in this post all the ones that were somewhat on-topic and nontrivial (~0.5 per journal per year). By "nontrivial" I mean that I didn't include papers that just say "the alignment problem exists" - I certainly do not mean that I set a high bar for quality. Then I did some additional searching into what else those authors had published, who they worked with, etc.
What were all the other papers about, the ones that didn't match my criteria? All sorts of things! Whether robots are responsible for their actions, how important privacy is, how to encourage and learn from non-Western robotics paradigms, the ethics of playing MMORPGs, how to make your AI ignore protected characteristics, the impact of bow-hunting technology on middle stone age society, and so on and so forth.
The bulk of this post is a big barely-ordered list of the papers. For each paper I'll give the title, authors, author affiliations, journal and date. Each paper will get a summary and maybe a recommendation. I'll also bring up interesting author affiliations and related papers.
Whereas the last post was more of a "I did this so you don't have to" kind of deal, I think this post will be more fun if you explore more deeply, reading the papers that catch your interest. (Let me know if you have any trouble finding certain pdfs - most are linked on google scholar.) If you find this post to be too long, don't go through it all in one sitting - I sure didn't!
I binned most papers talking about artificial moral agents ("AMAs") for being dull and anthropocentric. I decided to include this paper because it's better than most, and its stab at drawing the line between "not moral agent" and "moral agent" is also a good line between "just needs to be reliable" and "actually needs to be value aligned." Recommended if you like people who really like Kant.
They argue that if you design an AI's motivations by giving it multiple objective functions that only get aggregated near the point of decision-making, you can do things like using a nonlinear aggregation function as a form of reduced-impact AI, or throwing away outliers for increased robustness. Recommended if you haven't thought about this idea yet and want something to skim while you turn on your brain (also see Peter Vamplew's more technical papers if you want more details).
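To make the aggregation idea concrete, here's a minimal sketch (the square-root aggregator and all function names are my own illustration, not the paper's actual formulation):

```python
import numpy as np

def concave_aggregate(objective_values, weights):
    """Aggregate multiple objective scores with a concave function.

    Because sqrt has diminishing returns, an action that scores
    extremely high on one objective can't dominate an action that
    does moderately well on all of them - a crude low-impact bias.
    """
    vals = np.asarray(objective_values, dtype=float)
    return float(np.sum(weights * np.sqrt(np.maximum(vals, 0.0))))

def choose_action(actions, objective_fns, weights):
    """Pick the action whose aggregated score is highest.

    Aggregation happens only here, at the point of decision-making,
    so the separate objectives stay inspectable until the last step.
    """
    scores = {
        a: concave_aggregate([f(a) for f in objective_fns], weights)
        for a in actions
    }
    return max(scores, key=scores.get)
```

Note that under a plain linear sum, an action scoring 9 on one objective and 0 on another beats one scoring 4 on both; under the concave aggregator the balanced action wins - which is the whole point.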
This has been cited a healthy amount, and by some interesting-looking papers, including:
MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning - They combine the idea with IRL. Check it out.
Building TrusTee: The World's Most Trusted Robot
Soft maximin approaches to Multi-Objective Decision-making for encoding human intuitive values
Instilling moral value alignment by means of multi-objective reinforcement learning (See immediately below)
They train a gridworld agent to be ethical by combining a handwritten ethics reward with the task reward.
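The reward-shaping recipe can be sketched in a few lines (the weighted sum and the toy rewards here are my own stand-ins, not the paper's actual setup):

```python
def combined_reward(state, action, task_reward_fn, ethics_reward_fn,
                    w_ethics=1.0):
    """Scalarize a task reward and a handwritten ethics reward.

    task_reward_fn encodes the gridworld objective (e.g. reaching a
    goal cell); ethics_reward_fn penalizes rule violations (e.g.
    stepping on a forbidden cell). A weighted sum is the simplest
    way to trade them off, and w_ethics controls how much the agent
    cares about being "ethical" relative to finishing the task.
    """
    return task_reward_fn(state, action) + w_ethics * ethics_reward_fn(state, action)
```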
We should prevent AI from doing bad things by building automated ethical tests that we occasionally probe it with, and shutting the AI down if it fails. Don't read the paper, but maybe hire this guy if you're ever constructing automated ethical tests.
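For flavor, here's a toy version of the probe-and-shutdown loop being proposed (entirely my own sketch; the paper gives no pseudocode):

```python
import random

def run_with_ethics_probes(agent_step, ethics_tests, probe_prob=0.05, rng=None):
    """Run an agent, occasionally interleaving hidden ethics tests.

    agent_step() advances the agent one step and returns False when
    the agent is done; each test in ethics_tests is a zero-arg
    callable returning True iff the agent passed. If any probe
    fails, we halt the agent instead of continuing.
    """
    rng = rng or random.Random()
    while True:
        if rng.random() < probe_prob:
            if not all(test() for test in ethics_tests):
                return "shutdown"   # failed a probe: halt the AI
        if not agent_step():        # agent signals it's finished
            return "completed"
```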
This paper comes out of a "Moral Competence in Computational Architectures for Robots" grant that's funding work at several universities. Most of the papers are similar to the stuff I'm not passing on to you, but one fun-looking one that caught my eye is Toward the Engineering of Virtuous Machines.
We should do value alignment by laying down some moral principles that we're confident apply in various situations, and then when the AI is in a new situation it can apply the moral principles of the nearest neighboring known situation. If this sounds like a bad idea that nonetheless raises interesting questions, don't read this paper, instead read his other paper Can we Use Conceptual Spaces to Model Moral Principles?
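The nearest-neighbor proposal amounts to something like this (the feature vectors and the Euclidean metric are my stand-ins for the paper's conceptual space):

```python
import math

def nearest_principle(situation, known_situations):
    """Return the moral principle of the closest known situation.

    situation: a feature vector describing the new case.
    known_situations: list of (feature_vector, principle) pairs.
    The interesting (and questionable) assumption is that moral
    relevance tracks distance in feature space.
    """
    _, principle = min(
        known_situations,
        key=lambda sp: math.dist(situation, sp[0]),  # Euclidean distance
    )
    return principle
```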
Predictive-processing-based AI will automatically be super-ethical.
Mostly an explanation of word2vec. Possibly arguing that we shouldn't imagine future AIs as "overly literal genies," instead we should imagine them as having word2vec-like semantics.
I couldn't resist including this because first, it's the most relevant article in a collection I was looking through for a different reason, second, Douglas is a LWer, and third, it illustrates that the U.S. Army is interested in robots that do what you mean.
We should get value aligned AI by giving it rules for detecting its goals (e.g. human well-being) that work when the AI is dumb and the environment is in the training distribution, then giving it a generalization procedure that looks like wide reflective equilibrium. Recommended if you're fine with the proposals being bad, and you want an aid to think more about the intersection of Steve Byrnes' "Finders Keepers" for values and Dennett's Eliminate the Middletoad.
Also, kudos to Steve Petersen for changing his mind.
We should teach AI morality with suggestively-named cells in gridworlds.
Julia Haas was also involved with Australian National University's Humanising Machine Intelligence project, which seems potentially interesting, but I didn't find any papers I was personally excited about so far.
I said I would skip "Hey, the alignment problem exists" papers, but that was mostly just an excuse to exclude a few bad ones - this one is actually good. It's the author's take on (mostly) the normative components of the value alignment problem, and it's an interesting perspective. Recommended if you hope that at least DeepMind is thinking about which monkey gets the banana.
It's been well-cited in several fields as a sort of generalist explanation. A few of the more interesting papers (all good enough to glance at, I think) citing it are:
Invariance in Policy Optimisation and Partial Identifiability in Reward Learning
Learning Altruistic Behaviours in Reinforcement Learning without External Rewards
On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios
What are you optimizing for? Aligning Recommender Systems with Human Values
We should use Will MacAskill's scheme for aggregating moral theories.
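For reference, one standard formalization from the moral-uncertainty literature is "maximize expected choiceworthiness": weight each theory's verdict by your credence in it. Whether this is exactly the scheme the paper endorses is my assumption, not something I'm asserting about the paper:

```python
def expected_choiceworthiness(option, theories):
    """theories: list of (credence, choiceworthiness_fn) pairs.

    Each theory scores the option; we weight scores by how likely
    we think each theory is to be correct.
    """
    return sum(p * cw(option) for p, cw in theories)

def choose(options, theories):
    """Pick the option with the highest expected choiceworthiness."""
    return max(options, key=lambda o: expected_choiceworthiness(o, theories))
```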
This paper is also a direct ancestor of the next paper:
The prose is a bit purple, but underneath was a decent paper about modeling preferences from data. Recommended if you're interested in extracting human preferences from surveys and are good at parsing text.
This paper also cited a surprisingly good early AMA paper, Prolegomena to any future artificial moral agent - recommended if you want to time travel to 2000 AD.
Just a reminder that "normal" AI ethics papers still exist - this one seemed particularly well put together, and is influential in AI regulation in the EU. Recommended if you want to know more about AI regulation in the EU.
I'm ironically including this "control problem" paper that's actually about the problem of humans operating complicated machines. I can only guess that this has got to be some sort of cry for help from Leverhulme (Are they doing okay?).
We should develop AGI that cares equally for all sentient life, not just humans.
We should guard against erosion of our moral agency, even if there's AGI running around.
We can build moral and interpretable AI by generalizing ethical rules from examples, and giving justifications for actions in terms of deduction from ethical rules.
Explainability depends on what different stakeholders want from an explanation. Provides a breakdown of some things people want, and some interpretability techniques people can use. Recommended if you don't agree with that first sentence, or want to read an informed but unusual perspective on interpreting neural nets.
Cited by a lot of other explainable AI papers, but a very cursory look didn't turn up any worth linking.
Let's do a headcount:
This matches my impressions. If you want to publish something about value alignment in an ethics-y journal, you should shoot for Minds and Machines, then Ethics and Information Technology.
Is there a field here? No. Basically all the authors who weren't the usual suspects were writing baby's first AI alignment paper. But some papers are starting to accrete a lineage of people interested in extending their ideas, and it's funny seeing people cite Soares and Fallenstein 2014 authoritatively. You can see the arc that could lead to there being a field here in another 15 years, barring interruption.
Was anything good? A sizable fraction was worth reading. I hope you'll find some you like as well.
I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, and EAAMO.