Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.


Three AI Safety Related Ideas and Two Neglected Problems in Human-AI Safety (Wei Dai): If any particular human got a lot of power, or was able to think a lot faster, then they might do something that we would consider bad. Perhaps power corrupts them, or perhaps they get so excited about the potential technologies they can develop that they do so without thinking seriously about the consequences. We now have both an opportunity and an obligation to design AI systems that operate more cautiously, that aren't prone to the same biases of reasoning and heuristics that we are, such that the future actually goes better than it would if we magically made humans more intelligent.

If it's too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones. Human values don't extrapolate well -- just look at the myriad answers that people give to the various hypotheticals like the trolley problem. So, it's better to learn from humans that are kept in a safe, familiar environment with all their basic needs taken care of. These are our idealized humans. In practice the AI system would learn a lot from the preferences of real humans, since that should be a very good indicator of the preferences of idealized humans. But if the idealized humans begin to have different preferences from real humans, then the AI system should ignore the "corrupted" values of the real humans.

More generally, it seems important for our AI systems to help us figure out what we care about before we make drastic and irreversible changes to our environment, especially changes that prevent us from figuring out what we care about. For example, if we create a hedonic paradise where everyone is on side-effect-free recreational drugs all the time, it seems unlikely that we would check whether this is actually what we wanted. This suggests that we need to work on AI systems that differentially advance our philosophical capabilities relative to other capabilities, such as technological ones.

One particular way that "aligned" AI systems could make things worse is if they accidentally "corrupt" our values, as in the hedonic paradise example before. A nearer-term example would be making more addictive video games or social media. They might also make very persuasive but wrong moral arguments.

This could also happen in a multipolar setting, where different groups have their own AIs that try to manipulate other humans into having values similar to theirs. The attack is easy, since you have a clear objective (whether or not the humans start behaving according to your values), but it seems hard to defend against, because it is hard to determine the difference between manipulation and useful information.

Rohin's opinion: (A more detailed discussion is available on these threads.) I'm glad these posts were written: they outline real problems that I think are neglected in the AI safety community, and they suggest some angles of attack. The rest of this is going to be a bunch of disagreements I have, but these should be taken as disagreements on how to solve these problems, not a disagreement that the problems exist.

It seems quite difficult to me to build AI systems that are safe, without having them rely on humans making philosophical progress themselves. We've been trying to figure this out for thousands of years. I'm pessimistic about our chances at creating AI systems that can outperform this huge intellectual effort correctly on the first try without feedback from humans. Learning from idealized humans might address this to some extent, but in many circumstances I think I would trust the real humans with skin in the game more than the idealized humans who must reason about those circumstances from afar (in their safe, familiar environment).

I do think we want to have a general approach where we try to figure out how AIs and humans should reason, such that the resulting system behaves well. On the human side, this might mean that the human needs to be more cautious for longer timescales, or to have more epistemic and moral humility. Idealized humans can be thought of as an instance of this approach where, rather than changing the policy of real humans directly, we indirectly change their policy in a hypothetical by putting them in safer environments.

For the problem of intentionally corrupting values, this seems to me an instance of the general class of "Competing aligned superintelligent AI systems could do bad things", in the same way that we have the risk of nuclear war today. I'm not sure why we're focusing on value corruption in particular. In any case, my current preferred solution is not to get into this situation in the first place (though admittedly that seems very hard to do, and I'd love to see more thought put into this).

Overall, I'm hoping that we can solve "human safety problems" by training the humans supervising the AI to not have those problems, because it sure does make the technical problem of aligning AI seem a lot easier. I don't have a great answer to the problem of competing aligned superintelligent AI systems.

Legible Normativity for AI Alignment: The Value of Silly Rules (Dylan Hadfield-Menell et al): One issue we might have with value learning is that our AI system might look at "silly rules" and infer that we care about them deeply. For example, we often enforce dress codes through social punishments. Given that dress codes do not have much functional purpose and yet we enforce them, should an AI system infer that we care about dress codes as much as we care about (say) property rights? This paper claims that these "silly rules" should be interpreted as a coordination mechanism that allows group members to learn whether or not the group rules will be enforced by neutral third parties. For example, if I violate the dress code, no one is significantly harmed but I would be punished anyway -- and this can give everyone confidence that if I were to break an important rule, such as stealing someone's wallet, bystanders would punish me by reporting me to the police, even though they are not affected by my actions and it is a cost to them to report me.

They formalize this using a model with a pool of agents that can choose to be part of a group. Agents in the group play "important" games and "silly" games. In any game, there is a scofflaw, a victim, and a bystander. In an important game, if the bystander would punish any rule violations, then the scofflaw follows the rule and the victim gets +1 utility, but if the bystander would not punish the violation, the scofflaw breaks the rule and the victim gets -1 utility. Note that in order to signal that they would punish, bystanders must pay a cost of c. A silly game works the same way, except the victim always gets 0 utility. Given a set of important rules, the main quantity of interest is how many silly rules to add. The authors quantify this by considering the proportion of all games that are silly games, which they call the density. Since we are imagining adding silly rules, all outcomes are measured with respect to the number of important games. We can think of this as a proxy for time, and indeed the authors call the expected number of games till an important game a timestep.
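The payoff structure described above can be sketched in a few lines of Python. This is only an illustration of the summary, not the paper's code: the names `Outcome`, `play_game` and `expected_games_per_timestep` are made up here, and the timestep formula assumes that each game is silly independently with probability equal to the density, and that the count includes the important game itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    victim_utility: float
    bystander_utility: float


def play_game(is_important: bool, bystander_punishes: bool, c: float) -> Outcome:
    """Payoffs for one game between a scofflaw, a victim and a bystander."""
    # Signalling a willingness to punish costs the bystander c in either kind of game.
    bystander_utility = -c if bystander_punishes else 0.0
    if not is_important:
        # Silly game: the victim's payoff is always 0.
        return Outcome(0.0, bystander_utility)
    # Important game: the scofflaw follows the rule only if punishment is threatened.
    return Outcome(1.0 if bystander_punishes else -1.0, bystander_utility)


def expected_games_per_timestep(density: float) -> float:
    """Expected number of games up to and including the next important game.

    Assumption (not from the paper): games are silly independently with
    probability `density`, so the count is geometric with mean 1 / (1 - density).
    """
    if not 0.0 <= density < 1.0:
        raise ValueError("density must be in [0, 1)")
    return 1.0 / (1.0 - density)
```

Under these assumptions, a density of 0.75 means four games per timestep on average, three of which are silly.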

For important games, the expected utility to the victim is positive if the probability that the bystander is a punisher is greater than 0.5. So, each of the agents cares about estimating this probability in order to decide whether or not to stay in the group. If we only had important games, we would have a single game per timestep, and we would only learn whether one particular agent is a punisher. As we add more silly games, we get more games per timestep, and so we can learn the proportion of punishers much more quickly, which leads to more stable groups. However, the silly rules are not free. The authors prove that if they were free, then we would keep adding silly rules and the density would approach 1. (More precisely, they show that as density goes to 1, the value of being told the true probability of punishment goes to 0, meaning that the agent already knows everything.)
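To make the learning argument concrete, here is a rough simulation sketch. It is not the authors' model: it assumes that every game, silly or important, reveals whether its bystander is a punisher, that agents keep a Beta(1, 1) prior over the proportion of punishers, and that games are silly independently with probability equal to the density. The function names are hypothetical.

```python
import random


def expected_victim_utility(p_punish: float) -> float:
    # +1 with probability p and -1 otherwise, i.e. 2p - 1, which is positive iff p > 0.5.
    return p_punish * 1.0 + (1.0 - p_punish) * -1.0


def estimate_punisher_share(p_punish: float, density: float,
                            timesteps: int, rng: random.Random):
    """Return (posterior mean, posterior variance, games observed) after `timesteps`."""
    punishers = 0
    observed = 0
    for _ in range(timesteps):
        while True:
            # Each game reveals whether its bystander is a punisher (assumption).
            observed += 1
            punishers += rng.random() < p_punish
            # The timestep ends once an important game has been played.
            if rng.random() >= density:
                break
    a = 1 + punishers             # Beta posterior parameter: punishers seen, plus prior
    b = 1 + observed - punishers  # Beta posterior parameter: non-punishers seen, plus prior
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance, observed
```

With a density of 0 the agent sees exactly one game per timestep; with a higher density it sees more games in the same number of timesteps, so under these assumptions its posterior will typically be tighter.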

They then present experimental results showing a few things. First, when the agents are relatively certain of the probability of an agent being a punisher, silly rules are not very useful and the group is more likely to collapse (since the cost of enforcing the silly rules starts to be important). Second, as long as c is low (so it is easy to signal that you will enforce rules), groups with more silly rules will be more resilient to shocks in individuals' beliefs about the proportion of punishers, since they will very quickly converge to the right belief. If there aren't any silly rules, it can take more time, and your estimate might be incorrectly low enough that you decide to leave the group even though group membership is still net positive. Finally, if the proportion of punishers drops below 0.5, making group membership net negative, agents in groups with high density will learn this faster, and their groups will disband much sooner.

Rohin's opinion: I really like this paper. It's a great concrete example of how systems of agents can have very different behavior than any one individual agent, even if each of the agents has similar goals. The idea makes intuitive sense and I think the model captures its salient aspects. There are definitely many quibbles you could make with the model (though perhaps it is the standard model, I don't know this field), but I don't think they're important. My perspective is that the model is a particularly clear and precise way of communicating the effect that the authors are describing, as opposed to something that is supposed to track reality closely.

Technical AI alignment


Three AI Safety Related Ideas and Two Neglected Problems in Human-AI Safety (Wei Dai): Summarized in the highlights!

Technical agendas and prioritization

Multi-agent minds and AI alignment (Jan Kulveit): This post argues against the model of humans as optimizing some particular utility function, instead favoring a model based on predictive processing. This leads to several issues with the way standard value learning approaches like inverse reinforcement learning work. There are a few suggested areas for future research. First, we could understand how hierarchical models of the world work (presumably for better value learning). Second, we could try to invert game theory to learn objectives in multiagent settings. Third, we could learn preferences in multiagent settings, which might allow us to better infer norms that humans follow. Fourth, we could see what happens if we take a system of agents, infer a utility function, and then optimize it -- perhaps one of the agents' utility functions dominates? Finally, we can see what happens when we take a system of agents and give it more computation, to see how different parts scale. On the non-technical side, we can try to figure out how to get humans to be more self-aligned (i.e. there aren't "different parts pulling in different directions").

Rohin's opinion: I agree with the general point that figuring out a human utility function and then optimizing it is unlikely to work, but for different reasons (see the first chapter of the Value Learning sequence). I also agree that humans are complex and you can’t get away with modeling them as Boltzmann rational and optimizing some fixed utility function. I wouldn’t try to make the model more accurate (eg. a model of a bunch of interacting subagents, each with their own utility function), I would try to make the model less precise (eg. a single giant neural net), because that reduces the chance of model misspecification. However, given the impossibility result saying that you must make assumptions to make this work, we probably have to give up on having some nice formally specified meaning of “values”. I think this is probably fine -- for example, iterated amplification doesn’t have any explicit formal value function.
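For readers unfamiliar with the term, "Boltzmann rational" usually means that the human picks each action with probability proportional to the exponential of its utility, scaled by a rationality parameter. A minimal sketch follows; the function name and the parameter name `beta` are just illustrative choices.

```python
import math
from typing import List, Sequence


def boltzmann_policy(utilities: Sequence[float], beta: float) -> List[float]:
    """Action probabilities proportional to exp(beta * utility).

    beta = 0 gives a uniformly random human; a large beta approaches a perfect optimizer.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(utilities)
    weights = [math.exp(beta * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]
```

The model misspecification worry is that real human behaviour may not fit this form for any fixed utility function and any value of `beta`.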

Reward learning theory

Figuring out what Alice wants: non-human Alice (Stuart Armstrong): We know that if we have a potentially irrational agent, then inferring their preferences is impossible without further assumptions. However, in practice we can infer preferences of humans quite well. This is because we have very specific and narrow models of how humans work: we tend to agree on our judgments of whether someone is angry, and what anger implies about their preferences. This is exactly what the theorem is meant to prohibit, which means that humans are making some strong assumptions about other humans. As a result, we can hope to solve the value learning problem by figuring out what assumptions humans are already making and using those assumptions.
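A toy illustration of why behaviour alone underdetermines preferences: a fully rational agent with one reward function and a fully anti-rational agent with the negated reward function choose exactly the same action. The names and the tea/coffee example below are invented for illustration and are not taken from the post.

```python
from typing import Dict


def rational_choice(reward: Dict[str, float]) -> str:
    # A perfectly rational planner picks the highest-reward action.
    return max(reward, key=reward.get)


def antirational_choice(reward: Dict[str, float]) -> str:
    # A perfectly anti-rational planner picks the lowest-reward action.
    return min(reward, key=reward.get)


likes_tea = {"tea": 1.0, "coffee": 0.0}
hates_tea = {action: -value for action, value in likes_tea.items()}

# Both (planner, reward) pairs produce the same observable behaviour.
same_behaviour = rational_choice(likes_tea) == antirational_choice(hates_tea)
```

Observing that Alice picks tea cannot distinguish these two explanations; it is the extra assumptions humans make about each other (for example, what anger implies) that rule the second one out.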

Rohin's opinion: The fact that humans are quite good at inferring preferences should give us optimism about value learning. In the framework of rationality with a mistake model, we are trying to infer the mistake model from the way that humans infer preferences about other humans. This sidesteps the impossibility result by focusing on the structure of the algorithm that generates the policy. However, it still seems like we have to make some assumption about how the structure of the algorithm leads to a mistake model, or a model for what values are. Though perhaps we can get an answer that is principled enough or intuitive enough that we believe it.

Handling groups of agents

Legible Normativity for AI Alignment: The Value of Silly Rules (Dylan Hadfield-Menell et al): Summarized in the highlights!

Miscellaneous (Alignment)

Assuming we've solved X, could we do Y... (Stuart Armstrong): We often want to make assumptions that sound intuitive but that we can't easily formalize, eg. "assume we've solved the problem of determining human values". However, such assumptions can often be interpreted as being very weak or very strong, and depending on the interpretation we could be assuming away the entire problem, or the assumption doesn't buy us anything. So, we should be more precise in our assumptions, or focus only on some precise properties of an assumption.

Rohin's opinion: I think this argument applies well to the case where we are trying to communicate, but not so much to the case where I individually am thinking about a problem. (I'm making this claim about me specifically; I don't know if it generalizes to other people.) Communication is hard and if the speaker uses some intuitive assumption, chances are the listener will interpret it differently from what the speaker intended, and so being very precise seems quite helpful. However, when I'm thinking through a problem myself and I make an assumption, I usually have a fairly detailed intuitive model of what I mean, such that if you ask me whether I'm assuming that problem X is solved by the assumption, I could answer that, even though I don't have a precise formulation of the assumption. Making the assumption more precise would be quite a lot of work, and probably would not improve my thinking on the topic that much, so I tend not to do it until I think there's some insight and want to make the argument more rigorous. It seems to me that this is how most research progress happens: by individual researchers having intuitions that they then make rigorous and precise.

Near-term concerns

Fairness and bias

Providing Gender-Specific Translations in Google Translate (Melvin Johnson)

Machine ethics

Building Ethics into Artificial Intelligence (Han Yu et al)

Building Ethically Bounded AI (Francesca Rossi et al)

Malicious use of AI

FLI Signs Safe Face Pledge (Ariel Conn)

Other progress in AI

Reinforcement learning

Off-Policy Deep Reinforcement Learning without Exploration (Scott Fujimoto et al) (summarized by Richard): This paper discusses off-policy batch reinforcement learning, in which an agent is trying to learn a policy from data which is not based on its own policy, and without the opportunity to collect more data during training. The authors demonstrate that standard RL algorithms do badly in this setting because they give unseen state-action pairs unrealistically high values, and lack the opportunity to update them. They propose to address this problem by only selecting actions from previously seen state-action pairs; they prove various optimality results for this algorithm in the MDP setting. To adapt this approach to the continuous control case, the authors train a generative model to produce likely actions (conditional on the state and the data batch) and then only select from the top n actions. Their batch-constrained deep Q-learning algorithm (BCQ) consists of that generative model, a perturbation model to slightly alter the top actions, and a value network and critic to perform the selection. When n = 1, BCQ resembles behavioural cloning, and when n -> ∞, it resembles Q-learning. BCQ with n=10 handily outperformed DQN and DDPG on some MuJoCo experiments using batch data.
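The action-selection step described above might look roughly like the following. This is a sketch under simplifying assumptions, not the authors' implementation: actions are single floats, and `sample_action`, `perturb` and `q_value` are hypothetical stand-ins for the trained generative model, perturbation model and value network.

```python
from typing import Callable, Sequence

State = Sequence[float]


def bcq_select_action(
    state: State,
    sample_action: Callable[[State], float],   # stand-in for the generative model
    perturb: Callable[[State, float], float],  # stand-in for the perturbation model
    q_value: Callable[[State, float], float],  # stand-in for the learned value estimate
    n: int,
) -> float:
    """Pick the highest-value action among n batch-like candidates."""
    # Draw n candidate actions that resemble those seen in the batch for this state.
    candidates = [sample_action(state) for _ in range(n)]
    # Nudge each candidate slightly; the perturbation is meant to be small.
    adjusted = [a + perturb(state, a) for a in candidates]
    # Select among the adjusted candidates using the value estimate.
    return max(adjusted, key=lambda a: q_value(state, a))
```

With a single candidate and a zero perturbation, this simply returns a sample from the generative model, which is the imitation end of the spectrum. As the number of candidates grows, the value estimate does more of the work.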

Richard's opinion: This is an interesting paper, with a good balance of intuitive motivations, theoretical proofs, and empirical results. While it's not directly safety-related, the broad direction of combining imitation learning and reinforcement learning seems like it might have promise. Relatedly, I wish the authors had discussed in more depth what assumptions can or should be made about the source of batch data. For example, BCQ would presumably perform worse than DQN when data is collected from an expert trying to minimise reward, and (from the paper’s experiments) performs worse than behavioural cloning when data is collected from an expert trying to maximise reward. Most human data an advanced AI might learn from is presumably somewhere in between those two extremes, and so understanding how well algorithms like BCQ would work on it may be valuable.

Soft Actor Critic—Deep Reinforcement Learning with Real-World Robots (Tuomas Haarnoja et al)

Deep learning

How AI Training Scales (Sam McCandlish et al): OpenAI has done an empirical investigation into the performance of AI systems, and found that the maximum useful batch size for a particular task is strongly influenced by the noise in the gradient. (Here, the noise in the gradient comes from the fact that we are using stochastic gradient descent -- any difference in the gradients across batches counts as "noise".) They also found some preliminary results showing that more powerful ML techniques tend to have more gradient noise, and that even a single model tends to have increased gradient noise over time as it gets better at the task.

Rohin's opinion: While OpenAI doesn't speculate on why this relationship exists, it seems to me that as you get larger batch sizes, you are improving the gradient by reducing noise by averaging over a larger batch. This predicts the results well: as the task gets harder and the noise in the gradients gets larger, there's more noise to get rid of by averaging over data points, and so there's more opportunity to have even larger batch sizes.
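The averaging intuition can be checked with a toy simulation. Everything here is an assumption made for illustration: the gradient is one-dimensional, each example's gradient is the true gradient plus independent Gaussian noise, and the function names are invented. It is not the measurement procedure from the OpenAI post.

```python
import random
import statistics


def minibatch_gradient(true_grad: float, noise_std: float,
                       batch_size: int, rng: random.Random) -> float:
    """Average of noisy per-example gradients (toy one-dimensional model)."""
    samples = [rng.gauss(true_grad, noise_std) for _ in range(batch_size)]
    return sum(samples) / batch_size


def gradient_estimate_variance(true_grad: float, noise_std: float, batch_size: int,
                               trials: int, rng: random.Random) -> float:
    """Empirical variance of the minibatch gradient across many sampled batches."""
    estimates = [minibatch_gradient(true_grad, noise_std, batch_size, rng)
                 for _ in range(trials)]
    return statistics.pvariance(estimates)
```

In this model the variance of the batch gradient falls roughly as the per-example variance divided by the batch size. The noisier the per-example gradients, the larger the batch needed before extra averaging stops helping.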

2 comments

One particular way that “aligned” AI systems could make things work is if they accidentally “corrupt” our values

Did you mean "worse" instead of "work" here?

these should be taken as disagreements on how to solve these problems, not a disagreement that the problems exist.

I'm definitely not too attached to my proposed angles of attack either, and mainly wanted to give some ideas as existence proof that there are things that can be done from a technical perspective.

I’m not sure why we’re focusing on value corruption in particular. [...] I don’t have a great answer to the problem of competing aligned superintelligent AI systems.

I thought I gave pretty reasonable answers at Do you disagree with what I said?

Did you mean "worse" instead of "work" here?

Yes, fixed.

I thought I gave pretty reasonable answers at Do you disagree with what I said?

No, I just didn't get to that comment by the time the newsletter was sent out. (I've been a bit busy, and iirc that comment was either really close to or after the newsletter release time.)