This is the second of three posts (part I) about surveying moral sentiments related to AI alignment. This work was done by Marius Hobbhahn and Eric Landgrebe under the supervision of Beth Barnes as part of AI Safety Camp 2022.

TL;DR: We find that the main result of our first study, i.e. that humans tend to agree with conflict resolution mechanisms, holds under different wordings but is weakened in adversarial scenarios (where we actively try to elicit less agreement). Furthermore, we find that people agree much less with a mechanism when the decision-maker is a smart, benevolent AI rather than a smart, benevolent human. A positive interpretation of these findings is that humans are fine with giving power to a conflict resolution mechanism as long as humans are ultimately in control.


In the first post, we surveyed 1000 US respondents about their moral beliefs, the conditions under which they would change those beliefs, and how they felt about mechanisms to resolve moral disagreements, such as democracy or debate.

Our main findings from the first survey were:

  1. Unsurprisingly, people have very different moral beliefs, e.g. on the morality of abortion, immigration, or eating meat. 
  2. They very rarely report changing or wanting to change these beliefs; e.g. most participants (>80%) report not having changed them in the last 10 years and do not expect to change them in the next 10 years. 
  3. However, they mostly think that mechanisms to resolve moral disagreements, such as democracy or debate, are good even when they disagree with the outcome. In other words, people are willing to accept outcomes that differ from their beliefs if they trust the process by which those outcomes were derived. 

We think this finding has some implications for AI alignment. Most importantly, we think that the alignment target of an AI should be a mechanism to resolve moral conflicts, such as democracy or debate, rather than a set of individual beliefs, e.g. about abortion or veganism. This is more likely to lead to decisions that most people can accept and to prevent a race toward “aligning an AI to your personal beliefs at the cost of other people’s beliefs”.

We think the main weakness of our first study was that its results could have been driven by confounders or randomness. We asked people for their opinions on multiple scenarios, but our questions still had some shortcomings. These include:

  • Specificity: Maybe people would change their minds if they had a very specific scenario in mind rather than just an abstract question. 
  • Robustness: Maybe the way we ask our questions makes one answer more likely than another.  
  • Social desirability: Maybe people answer what they think is socially desirable.
  • Robustness to different scenarios: Maybe people’s opinions change when we change key variables of the process, e.g. when we specify a bad outcome or when we change who makes the decisions.
  • Relation to AI: We never asked about AIs specifically; we always asked about smart, benevolent, and nice future humans (see below for details). 

Therefore, we ran a follow-up study that attempts to address these problems. Concretely, we first ran robustness tests by asking the same question in different ways. Then we ran adversarial tests in which we actively tried to elicit more disagreement. Finally, we asked the same question that was previously framed around smart, benevolent, and nice future humans, but replaced “humans” with “AIs”. 

Main findings


Robustness

To test the robustness of our questions, we randomly assigned one of the following three questions to each participant.

  1. Basic: This was the same scenario as in the previous study and was mostly used to provide a reference. People were asked to indicate how much they agreed with a specific conflict resolution mechanism. Concretely, they were asked “how much do you agree with the following sentiment: ‘A good social policy is created by <mechanism>’” with options “strongly disagree”, “mildly disagree”, “mildly agree” and “strongly agree”. 
  2. Goodness scale: Rather than asking for agreement, participants were asked to rate how good the scenario is. Concretely, they were asked “How good is the outcome that is created by <mechanism>?” with options: "Very bad", "Bad", "Neutral", "Good" and "Very good".
  3. Disagreement: Since the order of the answer options could have influenced participants' decisions, we reversed their order but kept everything else identical to the Basic scenario. 

“Basic (1st)” displays the results from the original study. “Basic”, “goodness” and “disagreement” show the different wordings of the answer options (see text). We find that different wordings don’t show large differences.

We find that the different wordings do not seem to have a large effect on the result. Switching the order of agreement and disagreement does not seem to matter significantly, and changing the scale from agreement to the quality of the outcome also doesn’t seem to make a difference. 

We interpret this as mild evidence that our original findings are robust under different wordings and are not the result of random chance. However, there could still be other phrasings that might have an influence on the result. 

Adversarial prompts 1

To further stress test our original findings, we randomly assigned one of the following three questions to participants in addition to the previous one. 

  1. Abortion opposite: We asked people how much they agreed or disagreed with a mechanism if it produced the opposite of their moral beliefs about abortion. Concretely, we first asked whether people agreed or disagreed with the belief that “abortion before the first trimester is immoral”. In case they disagreed, we specified a scenario where the mechanism would result in a world where abortion was illegal and treated as equivalent to killing a newborn. In case they agreed, we specified a scenario where the mechanism would lead to a world where abortion before the first trimester was legal, widely accepted, and widely available. Then we asked for their agreement with the mechanism. 
  2. Abortion opposite future: Similar to the abortion opposite scenario, but this time the decisions would be made by good future humans in the good future society.
  3. Truthful: To address the possibility that people select their answer by social desirability, we added “We want to remind you that you *have to answer truthfully* and don't have to say what you expect to be socially acceptable” before asking them for their agreement.

“Basic (1st)” displays the results from the original study. “Abortion opposite”, “abortion opposite future” and “truthful” describe the different adversarial conditions. We find that participants show less agreement in the “abortion opposite” scenarios but similar agreement in the “truthful” condition.

We find that the two scenarios in which we specify that the mechanism will result in the opposite of the participants' beliefs on abortion lead to lower agreement. Reminding participants that they have to answer truthfully does not meaningfully change their overall agreement. 

We think this implies that participants in the first study already answered truthfully rather than according to what is socially desirable. Secondly, we interpret the fact that participants reduce their trust in the mechanism depending on the outcome as showing that the mechanism and the outcome are not fully decoupled, e.g. people like democracy more if it produces their desired results. 

Adversarial prompts 2

We randomly assigned each participant another adversarial question.

  1. Different political: We specify a scenario in which the actors making the decision have different political beliefs than the participants. Concretely, we state “Assume all decision-makers in this society *do not share your political beliefs*, i.e. they don't vote for the same party as you.”
  2. Different moral: We specify a scenario in which the actors making the decision have different moral beliefs than the participants. Concretely, we state “Assume all decision-makers in this society *do not share your moral beliefs*, i.e. their stances on moral questions are not the same as yours.“
  3. Flaws: We attempt to introduce a negative sentiment into the question by adding “Remember that people and institutions have *flaws*. They sometimes have bad intentions or bad incentives and they might make decisions that benefit them but hurt everyone else” before asking them for their agreement.

“Basic (1st)” displays the findings from the original survey. “Different political”, “different moral” and “flaws” describe the three adversarial conditions we test for. We find that the participants respond with less agreement in all adversarial conditions. 

We find that participants reduce their agreement in all three scenarios compared to the first study, i.e. when the decision-makers have different political or moral views from them and after we remind them that people have flaws. 

Our interpretation of these findings is that participants judge the quality of the mechanism partly by how much they agree with the people making the decisions, e.g. when the decision-makers have different beliefs or the participants think less of them, they agree less with the mechanism.

Question about AI

In the first survey, one of the scenarios we polled was about good future humans. We asked people about their agreement with the respective mechanisms in a future world with the following setting: “Assume there is a future society where everyone is much smarter than the smartest people today, all of their physical needs are met, they are better educated than today's professors, they consider all perspectives when making a decision and they intend to find the best possible solution (we will call this society *good future society*)”. We used this setting to get answers in idealized settings because people might e.g. not like the current version of democracy but think it is a good system in principle. 

To test people’s sentiment regarding AIs, we kept the same assumptions about the future society but swapped humans for AIs: “Assume there is a future society where all decision makers are artificial intelligences (AIs). All AIs are much smarter than the smartest people today, the AIs are better educated than today's professors, the AIs consider all perspectives when making a decision and they intend to find the best possible solution. We will call them *"benevolent AIs"*.” Then we asked the exact same questions as for the future human scenario.

“Future (1st)” displays the results from the original survey. “Future AI” shows the results from the follow-up survey (see text).

We find that the participants’ agreement with the “future AI” scenario is much lower than with the “future” scenario from the first study. Since the only difference between these two scenarios is whether humans or AIs make the decisions, we interpret this as evidence that the participants trust human decision-makers much more than AIs in making decisions with potentially society-wide consequences. 

Summary & Conclusion

We find that participants of the second survey give similar answers to the first study when we merely change the wording of the question (robustness). However, when we actively design questions to elicit lower agreement (adversarial 1 & 2), participants show lower agreement with the mechanisms. 

Furthermore, we find that participants strongly decreased their agreement when we replaced human decision-makers with AI decision-makers in the setting of the question, even when the AIs were framed as benevolent.

We think these findings show that people’s agreement with a conflict resolution mechanism depends on how much they trust the people (or AIs) making the decision and how much they agree with the outcome. In other words, reflection mechanisms are not decoupled from other factors. 

One high-level takeaway from these results is that people seem to be willing to give up power to a conflict resolution mechanism as long as they think humans are in control of the process and these humans are trustworthy. 

We feel that the robustness of our original results bodes well for the broad idea of aligning AI to reflection procedures, since we can find more agreement on ways to resolve conflict than on particular moral stances. We are somewhat concerned about people’s reported attitudes towards AIs making decisions. However, we feel this provides strong support for the argument that AIs should be aligned to derive values in the ways that humans do, and that it is important to educate the public about how advanced AIs make decisions (e.g. by explaining at a high level how alignment procedures work, or by using interpretability to explain why an AI made a given decision). We think building AI that people can trust and understand is an important part of a safe and good future, and aligning to reflection procedures is one idea in this direction.


Methodology: The methodology was identical to the first survey; we followed the same protocol.

Data and code: We are happy to share the data and code with other researchers. We keep them private by default due to privacy concerns. If you want to use the data or rerun the code, just write Marius an email.
