This is the second of three posts (part I) about surveying moral sentiments related to AI alignment. This work was done by Marius Hobbhahn and Eric Landgrebe under the supervision of Beth Barnes as part of the AI safety camp 2022.
TL;DR: We find that the main result of our first study, namely that humans tend to agree with conflict resolution mechanisms, holds under different wordings but is weakened in adversarial scenarios (where we actively try to elicit less agreement). Furthermore, we find that people agree much less with a mechanism when the decision-maker is a smart, benevolent AI rather than a smart, benevolent human. A positive interpretation of these findings is that humans are fine with giving power to a conflict resolution mechanism as long as humans are ultimately in control.
In the first post, we surveyed 1000 US respondents about their moral beliefs, the conditions under which they would change those beliefs, and how they felt about mechanisms to resolve moral disagreements such as democracy or debate.
Our main findings of the first survey were
We think this finding has some implications for AI alignment. Most importantly, we think that the alignment target of an AI should be a mechanism to resolve moral conflicts, such as democracy or debate, rather than a set of individual beliefs, e.g. about abortion or veganism. This is more likely to lead to decisions that most people can accept, and to prevent a race to “align an AI to your personal beliefs at the cost of other people’s beliefs”.
We think the main weakness of our first result was that it could have been a result of confounders or randomness. We asked people for their opinions on multiple scenarios, but our questions still had some shortcomings. These include:
Therefore, we ran a follow-up study that attempts to address these problems. Concretely, we first run robustness tests by asking the same question in different ways. Then we run adversarial tests where we actively try to get people to give more disagreeing answers. Finally, we take the question that was originally asked about smart, benevolent and nice future humans and replace “humans” with “AIs”.
To test the robustness of our questions, we randomly assigned one of the following three questions to a participant.
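As a sketch of how this kind of between-subjects randomization can be implemented (the wordings below are placeholders for illustration, not our exact survey text):

```python
import random

# Placeholder wordings -- the actual survey text differed.
WORDINGS = {
    "basic": "How much do you agree with using this mechanism?",
    "goodness": "How good would the outcome of this mechanism be?",
    "disagreement": "How much do you disagree with using this mechanism?",
}

def assign_wording(rng: random.Random) -> str:
    """Randomly assign one of the three question wordings to a participant."""
    return rng.choice(sorted(WORDINGS))

rng = random.Random(0)  # fixed seed so the assignment is reproducible
assignments = [assign_wording(rng) for _ in range(1000)]

# Each condition should receive roughly a third of the 1000 participants.
counts = {w: assignments.count(w) for w in WORDINGS}
```

Because each participant sees exactly one wording, differences between conditions can be attributed to the wording itself rather than to order effects within a single questionnaire.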
“Basic (1st)” displays the results from the original study. “Basic”, “goodness” and “disagreement” show the different wordings of the answer options (see text). We find that different wordings don’t show large differences.
We find that the different wordings don’t have a large effect on the result. Switching the order of agreement and disagreement does not seem to matter significantly, and changing the scale from agreement to the quality of the result also doesn’t seem to make a difference.
We interpret this as mild evidence that our original findings are robust under different wordings and are not the result of random chance. However, there could still be other phrasings that might have an influence on the result.
To further stress test our original findings, we randomly assigned one of the following three questions to participants in addition to the previous one.
“Basic (1st)” displays the results from the original study. The other bars show the adversarial conditions. We find that participants show less agreement in the “abortion opposite” scenarios but similar agreement in the “truthful” condition.
We find that the two scenarios specifying that the mechanism will produce the opposite of the participant’s belief on abortion lead to lower agreement. Reminding participants that they have to answer truthfully does not change their overall agreement meaningfully.
We think this implies that participants in the first study were already answering truthfully rather than in a socially desirable way. Secondly, we interpret the fact that participants reduce their trust in the mechanism depending on its outcome as showing that the mechanism and the outcome are not fully decoupled, e.g. people like democracy more if it produces their desired results.
We randomly assign participants another adversarial question.
“Basic (1st)” displays the findings from the original survey. “Different political”, “different moral” and “flaws” describe the three adversarial conditions we test for. We find that the participants respond with less agreement in all adversarial conditions.
We find that participants reduce their agreement in all three scenarios compared to the first study, i.e. when the decision makers have different political or moral views from them and after we remind them that people have flaws.
Our interpretation of these findings is that participants judge the quality of the mechanism partly by how much they trust and agree with the people making the decision: when the decision-makers hold different beliefs, or when participants are reminded of their flaws, participants agree less with the mechanism.
In the first survey, one of the scenarios we polled was about good future humans. We asked people about their agreement with the respective mechanisms in a future world with the following setting: “Assume there is a future society where everyone is much smarter than the smartest people today, all of their physical needs are met, they are better educated than today's professors, they consider all perspectives when making a decision and they intend to find the best possible solution (we will call this society *good future society*)”. We used this setting to get answers in idealized settings because people might e.g. not like the current version of democracy but think it is a good system in principle.
To test people’s sentiment regarding AIs, we kept the same assumptions about the future society but swapped humans with AIs: “Assume there is a future society where all decision makers are artificial intelligences (AIs). All AIs are much smarter than the smartest people today, the AIs are better educated than today's professors, the AIs consider all perspectives when making a decision and they intend to find the best possible solution. We will call them *"benevolent AIs"*.” Then we asked the exact same questions as in the future human scenario.
“Future (1st)” displays the results from the original survey. “Future AI” shows the results from the follow-up survey (see text).
We find that the participants’ agreement with the “future AI” scenario is much lower than with the “future” scenario from the first study. Since the only difference between these two scenarios is whether humans or AIs make the decisions, we interpret this as evidence that the participants trust human decision-makers much more than AIs in making decisions with potentially society-wide consequences.
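One way to check whether such a difference between two conditions is statistically meaningful is a two-proportion z-test on agreement rates. The counts below are made up for illustration only; they are not our survey data:

```python
import math

def two_proportion_z(agree_a: int, n_a: int, agree_b: int, n_b: int) -> float:
    """Two-proportion z-test (pooled normal approximation) for a
    difference in agreement rates between two survey conditions."""
    p_a = agree_a / n_a
    p_b = agree_b / n_b
    p_pool = (agree_a + agree_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts (NOT our data): participants agreeing with the
# mechanism in the "future human" vs. the "future AI" condition.
z = two_proportion_z(320, 500, 210, 500)

# |z| > 1.96 corresponds to p < 0.05 (two-sided).
significant = abs(z) > 1.96
```

With a between-subjects design like ours, each participant contributes to exactly one condition, so the two samples can be treated as independent, which is what this test assumes.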
We find that participants of the second survey give similar answers to the first study when we merely change the wording of the question (robustness). However, when we actively design questions to elicit lower agreement (adversarial 1 & 2), participants show lower agreement with the mechanisms.
Furthermore, we find that participants strongly decrease their agreement when we swap human decision-makers for AI decision-makers in the setting of the question, even when the AIs are framed as benevolent.
We think these findings show that people’s agreement with a conflict resolution mechanism depends on how much they trust the people (or AIs) making the decision and how much they agree with the outcome. In other words, reflection mechanisms are not decoupled from other factors.
One high-level takeaway from these results is that people seem to be willing to give up power to a conflict resolution mechanism as long as they think humans are in control of the process and these humans are trustworthy.
We feel that the robustness of our original results bodes well for the broad idea of aligning to reflection procedures, as we can find more agreement on ways to resolve conflict than on particular moral stances. We are somewhat concerned about people’s reported attitudes towards AIs making decisions. However, we feel this provides strong support for the argument that AIs should be aligned to derive values in the ways that humans do, and that it is important to educate the public about how advanced AIs make decisions (e.g. by explaining at a high level how alignment procedures work, or by using interpretability to inform the public about why an AI made a given decision). We think building AI that people can trust and understand is an important part of a safe and good future, and aligning to reflection procedures is one idea in this direction.
Methodology: The methodology was exactly the same as in the first survey; we followed the same protocol throughout.
Data and code: We are happy to share the data and code with other researchers. We keep them private by default due to privacy concerns. If you want to use the data or rerun the code, just send Marius an email.