Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.blogspot.com
Just skimmed the course. One suggestion (will make more later): adding the Goal Misgeneralization paper from Langosco et al. as a core reading in the week on Detecting and Forecasting Emergent Behavior.
Hmm, perhaps clearer to say "reward does not automatically reinforce reward-focused thoughts into terminal values", given that we both agree that agents will have thoughts about reward either way.
But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this feels pretty fuzzy - e.g. in humans, I think the distinction is actually not that clear-cut.
In other words, if everyone agrees that reward likely becomes a strong instrumental value, then this seems like a prima facie reason to think that it's also plausible as a terminal value, unless you think the processes which give rise to terminal values are very different from the processes which give rise to instrumental values.
Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like "catch agents when they lie to us") seem very much like common-sense improvements.
+1 on this comment, I feel pretty confused about the excerpt from Paul that Steve quoted above. And even without the agent deliberately deciding where to avoid exploring, incomplete exploration may lead to agents which learn non-reward goals before convergence - so if Paul's statement is intended to refer to optimal policies, I'd be curious why he thinks that's the most important case to focus on.
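To illustrate the incomplete-exploration point: a purely greedy learner can converge to a policy that never even samples the highest-reward option, so arguments that apply only to optimal policies need not apply to what training actually produces. A minimal hypothetical sketch (the bandit values and learning rate are illustrative, not from the discussion above):

```python
# Hypothetical sketch: with no exploration, a greedy bandit learner can
# lock in a policy that is not reward-optimal. Values are illustrative.

def greedy_bandit(true_rewards, steps=100):
    """Pure greedy value learning on a deterministic two-armed bandit."""
    q = [0.0, 0.0]
    for _ in range(steps):
        arm = 0 if q[0] >= q[1] else 1  # greedy action selection, no exploration
        q[arm] += 0.5 * (true_rewards[arm] - q[arm])
    return q

q = greedy_bandit([0.1, 1.0])  # arm 1 pays more but is never tried
# q[1] remains 0.0: the converged policy always pulls arm 0
```

Here the learner's converged behavior diverges from reward maximization purely because of which states it happened to visit, with no deliberate avoidance involved.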
I have relatively little idea how to "improve" a reward function so that it improves the inner cognition chiseled into the policy, because I don't know the mapping from outer reward schedules to inner cognition within the agent.
You don't need to know the full mapping in order to suspect that, when we reward agents for doing undesirable things, we tend to get more undesirable cognition. For example, if we reward agents for lying to us, then we'll tend to get less honest agents. We can construct examples where this isn't true, but it seems like a pretty reasonable working hypothesis. It's possible that discarding this working hypothesis will lead to better research, but I don't think your arguments manage to establish that; they only establish that we might in theory find ourselves in a situation where it's reasonable to discard this working hypothesis.
The way I attempt to avoid confusion is to distinguish between the RL algorithm's optimization target and the RL policy's optimization target, and then avoid talking about the "RL agent's" optimization target, since that's ambiguous between the two meanings. I dislike the title of this post because it implies that there's only one optimization target, which exacerbates this ambiguity. I predict that if you switch to using this terminology, and then start asking a bunch of RL researchers questions, they'll tend to give broadly sensible answers (conditional on taking on the idea of "RL policy's optimization target" as a reasonable concept).
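The distinction can be made concrete with a minimal sketch (illustrative, not from the post): in REINFORCE, reward appears only in the *algorithm's* update rule, while the *policy* it produces is just a parameterized function with no explicit representation of reward.

```python
# Hypothetical sketch: the RL algorithm's optimization target (expected
# reward) appears in the update rule; the policy itself is just a function.
import math
import random

def softmax_policy(theta):
    """The policy: maps parameters to action probabilities. No reward here."""
    exps = [math.exp(t) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(theta, action, reward, lr=0.1):
    """The algorithm: reward appears only in this update rule."""
    probs = softmax_policy(theta)
    for a in range(len(theta)):
        grad = (1.0 if a == action else 0.0) - probs[a]
        theta[a] += lr * reward * grad
    return theta

random.seed(0)
theta = [0.0, 0.0]
for _ in range(500):
    probs = softmax_policy(theta)
    action = 0 if random.random() < probs[0] else 1
    reward = 1.0 if action == 0 else 0.0  # only action 0 is rewarded
    theta = reinforce_update(theta, action, reward)

probs = softmax_policy(theta)
# The trained policy strongly favors the rewarded action, even though the
# policy function itself contains no concept of "reward".
```

Whatever the trained policy ends up "aiming at" internally is a further empirical question; the sketch just shows that the two optimization targets live in different places.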
Authors' summary of the "reward is enough" paper:
In this paper we hypothesise that the objective of maximising reward is enough to drive behaviour that exhibits most if not all attributes of intelligence that are studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language and generalisation. This is in contrast to the view that specialised problem formulations are needed for each attribute of intelligence, based on other signals or objectives. The reward-is-enough hypothesis suggests that agents with powerful reinforcement learning algorithms when placed in rich environments with simple rewards could develop the kind of broad, multi-attribute intelligence that constitutes an artificial general intelligence.
I think this is consistent with your claims, because reward can be enough to drive intelligent-seeming behavior whether or not it is the target of learned optimization. Can you point to the specific claim in this summary that you disagree with? (or a part of the paper, if your disagreement isn't captured in this summary).
More generally, consider the analogy to evolution. I view your position as analogous to saying: "hey, genetic fitness is not the optimization target of humans, therefore genetic fitness is not the optimization target of evolution". The idea that genetic fitness is not the optimization target of humans is an important insight, but it's clearly unhelpful to jump to "and therefore evolutionary biologists who talk about evolution optimizing for genetic fitness just don't get it", which seems analogous to what you're doing in this post.
Importantly, reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!
Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won't get embedded as a terminal goal, but the idea that it needs to be "magically spawned" is very strawmanny.
Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won't have them changed by small amounts of training text.
Worrying about which alignment writing ends up in the training data feels like a very small lever for affecting alignment; my general heuristic is that we should try to focus on much bigger levers.
Great post. Two comments:
That can be folded into the utility function, however. Just make the ratings of the deferential person mostly copy the ratings of their partner.
Can you say more specifically how this is done?
the axiom of Independence of Irrelevant Alternatives... is not really a desideratum at all, it's actually an extremely baffling property.
The reason it's a desideratum is because it makes bargaining more robust to variation in how the game is defined. I agree it's counterintuitive within the context of a given game though. So maybe the best approach is to take it out, but then specify that we should think of games as being defined via some unbiased meta-bargaining-process...
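To make the robustness point concrete, here's a hypothetical sketch using the Nash bargaining solution (maximize the product of gains over the disagreement point) on a finite outcome set; the payoff values are illustrative. IIA guarantees that dropping an unchosen alternative from the game's definition doesn't change the solution:

```python
# Hypothetical sketch: the Nash bargaining solution satisfies IIA, so it is
# robust to removing unchosen outcomes from the feasible set.

def nash_solution(outcomes, disagreement=(0.0, 0.0)):
    """Pick the feasible outcome maximizing the Nash product of gains."""
    d1, d2 = disagreement
    return max(outcomes, key=lambda o: (o[0] - d1) * (o[1] - d2))

full_game = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (2.5, 0.5)]
chosen = nash_solution(full_game)  # (2.0, 2.0), Nash product 4

# Drop an "irrelevant" alternative that was not chosen:
smaller_game = [o for o in full_game if o != (2.5, 0.5)]
# nash_solution(smaller_game) == chosen: the solution is unchanged
```

This is the sense in which IIA makes the solution insensitive to exactly how the feasible set is carved up, even if it looks counterintuitive within a single fixed game.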
Stop worrying about finding “outer objectives” which are safe to maximize. I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function). Instead, focus on building good cognition within the agent. In my ontology, there's only an inner alignment problem: How do we grow good cognition inside of the trained agent?
This feels very strongly reminiscent of an update I made a while back, and which I tried to convey in this section of AGI safety from first principles. But I think you've stated it far too strongly; and I think fewer other people were making this mistake than you expect (including people in mainstream RL), for reasons that Paul laid out above. When you say things like "Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported", this assumes that the people doing this reasoning were using the premise in the mistaken way that you (and some other people, including past Richard) were. Before drawing these conclusions wholesale, I'd suggest trying to identify ways in which the things other people are saying are consistent with the insight this post identifies. E.g. does this post actually generate specific disagreements with Ajeya's threat model?
Edited to add: these sentences in particular feel very strawmanny of what I claim is the standard position:
My explanation for why my current position is consistent with both being aware of this core claim, and also disagreeing with most of this post:
I now think that, even though there's some sense in which in theory "building good cognition within the agent" is the only goal we care about, in practice this claim is somewhat misleading, because incrementally improving reward functions (including by doing things like making rewards depend on activations, or amplification in general) is a very good mechanism for moving agents towards the type of cognition we'd like them to do - and we have very few other mechanisms for doing so.
In other words, the claim that there's "only an inner alignment problem" in principle may or may not be a useful one, depending on how far improving rewards (i.e. making progress on the outer alignment problem) gets you in practice. And I agree that RL people are less aware of the inner alignment problem/goal misgeneralization problem than they should be, but saying that inner misalignment is the only problem seems like a significant overcorrection.
Relevant excerpt from AGI safety from first principles:
In trying to ensure that AGI will be aligned, we have a range of tools available to us - we can choose the neural architectures, RL algorithms, environments, optimisers, etc, that are used in the training procedure. We should think about our ability to specify an objective function as the most powerful such tool. Yet it’s not powerful because the objective function defines an agent’s motivations, but rather because samples drawn from it shape that agent’s motivations and cognition.
From this perspective, we should be less concerned about what the extreme optima of our objective functions look like, because they won’t ever come up during training (and because they’d likely involve tampering). Instead, we should focus on how objective functions, in conjunction with other parts of the training setup, create selection pressures towards agents which think in the ways we want, and therefore have desirable motivations in a wide range of circumstances.