
We’ve written up a blog post about our recent paper that I’ve been linking to but haven’t really announced or explained. The key idea is that since we’ve optimized the world towards our preferences, we can infer these preferences just from the state of the world. We present an algorithm called Reward Learning by Simulating the Past (RLSP) that can do this in simple environments, but my primary goal is simply to show that there is a lot to be gained by inferring preferences from the world state.

The rest of this post assumes that you’ve read at least the non-technical part of the linked blog post. This post is entirely my own and may not reflect the views of my coauthors.

## Other sources of intuition

The story in the blog post is that when you look at the state of the world, you can figure out what humans have put effort into, and thus what they care about. There are other intuition pumps that you can use as well:

• The world state is “surprisingly” ordered and low-entropy. Anywhere you see such order, you can bet that a human was responsible for it, and that the human cared about it.
• If you look across the world, you’ll see many patterns recurring again and again -- vases are usually intact, glasses are usually upright, and laptops are usually on desks. Patterns that wouldn’t have happened without humans are likely something humans care about.

## How can a single state do so much?

You might be wondering how a single state could possibly contain so much information, and you would be right to wonder. This method depends crucially on two assumptions: known dynamics (i.e. a model of “how the world works”) and a good featurization.

Known dynamics. This is what allows you to simulate the past, and figure out what “must have happened”. Using the dynamics, the robot can figure out that breaking a vase is irreversible, and that Alice must have taken special care to avoid doing so. This is also what allows us to distinguish between effects caused by humans (which we care about) and effects caused by the environment (which we don’t care about).
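To make “simulating the past” concrete, here is a minimal sketch in a two-state toy world. It is not the RLSP implementation from the paper: it simply enumerates three candidate rewards and compares how likely the observed state is under each. Everything in it is an assumption made for illustration: the state and action names, the single “vase is intact” feature weighted by `theta`, the known start state, the horizon of 10 steps, and the model of Alice as Boltzmann-rational over a finite horizon.

```python
import math

STATES = ["intact", "broken"]
ACTIONS = ["walk_around", "walk_through"]


def step(state, action):
    """Known dynamics (assumed): walking through the vase breaks it for good."""
    if state == "intact" and action == "walk_through":
        return "broken"
    return state


def reward(state, theta):
    """One feature, 'vase is intact', weighted by theta."""
    return theta if state == "intact" else 0.0


def soft_policies(theta, horizon):
    """Finite-horizon soft value iteration; returns one policy per time step."""
    value = {s: reward(s, theta) for s in STATES}  # value at the final state
    policies = []
    for _ in range(horizon):
        policy, new_value = {}, {}
        for s in STATES:
            q = {a: reward(s, theta) + value[step(s, a)] for a in ACTIONS}
            top = max(q.values())
            log_z = top + math.log(sum(math.exp(v - top) for v in q.values()))
            policy[s] = {a: math.exp(q[a] - log_z) for a in ACTIONS}
            new_value[s] = log_z
        value = new_value
        policies.append(policy)
    policies.reverse()  # computed backwards in time, so flip to forward order
    return policies


def final_state_distribution(theta, horizon, start="intact"):
    """Simulate forward from the assumed start state under Alice's policy."""
    dist = {s: 0.0 for s in STATES}
    dist[start] = 1.0
    for policy in soft_policies(theta, horizon):
        nxt = {s: 0.0 for s in STATES}
        for s, p in dist.items():
            for a in ACTIONS:
                nxt[step(s, a)] += p * policy[s][a]
        dist = nxt
    return dist


def posterior(observed, thetas, horizon):
    """Uniform prior over thetas; likelihood is P(observed final state | theta)."""
    likelihood = {t: final_state_distribution(t, horizon)[observed] for t in thetas}
    total = sum(likelihood.values())
    return {t: l / total for t, l in likelihood.items()}


THETAS = [-1.0, 0.0, 1.0]
post = posterior("intact", THETAS, horizon=10)
print(post)
```

Under the indifferent hypothesis (`theta = 0`) Alice acts uniformly at random, so the vase survives ten steps with probability `0.5 ** 10`. An intact vase is therefore strong evidence for `theta = 1`, and that evidence comes entirely from the dynamics: because breaking is irreversible, an intact vase means Alice avoided breaking it at every step.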

If you take away the knowledge of dynamics, much of the oomph of this method is gone. You could still look for and preserve repetitions in the state -- maybe there are a lot of intact vases and no broken vases, so you try to keep vases intact. But this might also lead you to making sure that nobody puts warning signs near cliffs, since most cliffs don’t have warning signs near them.
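A sketch of what that dynamics-free fallback might look like, and how it goes wrong. The observation counts and the 0.8 threshold are made up for illustration, and `majority_patterns` is a hypothetical helper rather than anything from the paper.

```python
from collections import Counter


def majority_patterns(observations, threshold=0.8):
    """Return {object_kind: value} for every pattern at least `threshold` common."""
    by_kind = {}
    for kind, value in observations:
        by_kind.setdefault(kind, Counter())[value] += 1
    patterns = {}
    for kind, counts in by_kind.items():
        value, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= threshold:
            patterns[kind] = value
    return patterns


# Made-up observations: every vase is intact, and most cliffs have no sign.
observations = (
    [("vase", "intact")] * 10
    + [("cliff", "no_sign")] * 19
    + [("cliff", "warning_sign")]
)

norms = majority_patterns(observations)
print(norms)  # {'vase': 'intact', 'cliff': 'no_sign'}
```

The heuristic treats both patterns identically. Without dynamics it has no way to tell that intact vases take effort to maintain, while unsigned cliffs are just the default state of the world.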

But notice that dynamics are an empirical fact about the world, and do not depend on “values”. We should expect powerful AI systems to have a good understanding of dynamics. So I’m not too worried about the fact that we need to know dynamics for this to work well.

Features. A good featurization, on the other hand, allows you to focus on reward functions that are “reasonable” or “about the important parts”. It eliminates a vast swathe of strange, implausible reward functions that you otherwise would not be able to eliminate. If you didn’t have a good featurization and instead allowed rewards to be any function mapping from states to rewards, then you would typically learn some degenerate reward, such as mapping the observed state to reward 1 and mapping everything else to reward 0. (IRL faces the same problem of degenerate rewards. Since we observe strictly less than IRL does, we face the same problem.)

I’m not sure whether features are more like empirical facts, or more like values. It sure seems like there are very natural ways to understand the world that imply a certain set of features, and that a powerful AI system is likely to have these features; but maybe it only feels this way because we humans actually use those features to understand the world. I hope to test this in future work by trying out RLSP-like algorithms in more realistic environments where we first learn features in an unsupervised manner.

## Connection to impact measures

Preferences inferred from the state of the world are kind of like impact measures in that they allow us to infer all of the “common sense” rules that humans follow that tell us what not to do. The original motivating example for this work was a more complicated version of the vase environment, which is the standard example for negative side effects. (It was more complicated because at the time I thought it was important for there to be “repetitions” in the environment, e.g. multiple intact vases.)

Desiderata. I think that there are three desiderata for impact measures that are very hard to meet in concert. Let us say that an impact measure must also specify the set of reward functions it is compatible with. For example, attainable utility preservation (AUP) aims to be compatible with rewards whose codomain is [0, 1]. Then the desiderata are:

• Prevent catastrophe: The impact measure prevents all catastrophic outcomes, regardless of which compatible reward function the AI system optimizes.
• Do what we want: There exists some compatible reward function such that the AI system does the things that we want, despite the impact measure.
• Value agnostic: The design of the impact measure (both the penalty and the set of compatible rewards) should be agnostic to human values.

Note that the first two desiderata are about what the impact measure actually does, as opposed to what we can prove about it. The second one is an addition I’ve argued for before.

With both relative reachability and AUP, I worry that any setting of the hyperparameters will lead to a violation of either the first desideratum (if the penalty is not large enough) or the second one (if the penalty is too large). For intermediate settings, both desiderata would be violated.
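Here is one way this worry could play out, as a toy sketch. It does not model relative reachability or AUP. It assumes a generic agent that maximizes reward minus `lam` times a measured impact, with compatible rewards taking values in [0, 1]. The plan names and impact numbers are invented, and they are chosen so that the task we want registers as more impactful than the catastrophe, which is exactly the situation the worry is about.

```python
# Measured impact of each plan under some hypothetical impact measure.
# The numbers are made up so that the task we want looks MORE impactful
# than the catastrophe.
IMPACT = {"noop": 0.0, "desired_task": 5.0, "catastrophe": 2.0}


def chosen_plan(reward, lam):
    """Plan maximizing reward minus lam times measured impact."""
    return max(IMPACT, key=lambda plan: reward[plan] - lam * IMPACT[plan])


def indicator_reward(target):
    """A compatible reward (values in [0, 1]) that only likes `target`."""
    return {plan: 1.0 if plan == target else 0.0 for plan in IMPACT}


def prevents_catastrophe(lam):
    # The reward that pushes hardest toward catastrophe is the worst case.
    return chosen_plan(indicator_reward("catastrophe"), lam) != "catastrophe"


def does_what_we_want(lam):
    # The reward that pushes hardest toward the task is the best case.
    return chosen_plan(indicator_reward("desired_task"), lam) == "desired_task"


for lam in (0.1, 0.3, 0.7):
    print(lam, prevents_catastrophe(lam), does_what_we_want(lam))
```

With these numbers, a small penalty (`lam = 0.1`) allows the catastrophe, a large one (`lam = 0.7`) blocks the desired task, and the intermediate value (`lam = 0.3`) fails on both counts. If the impact measure ranked the catastrophe above the task instead, a workable range of `lam` would exist, so the sketch only shows what happens when that ranking goes wrong.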

When we infer preferences from the state of the world, we are definitely giving up on being value agnostic, but we are gaining significantly on the “do what we want” desideratum: the point of inferring preferences is that we do not also penalize positive impacts that we want to happen.

Test cases. You might wonder why we didn’t try using RLSP on the environments in relative reachability. The main problem is that those environments don’t satisfy our key assumption: that a human has been acting to optimize their preferences for some time. So if you try to run RLSP in that setting, it is very likely to fail. I think this is fine, because RLSP is exploiting a fact about reality that those environments fail to model.

(This is a general problem with benchmarks: they often do not include important aspects of the real problem under consideration, because the benchmark designers didn’t realize that those aspects were important for a solution.)

This is kind of related to the fact that we are not trying to be value agnostic -- if you’re trying to come up with a value agnostic, objective measure of impact, then it would make sense that you could create some simple gridworld environments and claim that any objective measure of impact should give the same result on that environment, since one action is clearly more impactful than the other. However, since we’re not trying to be value agnostic, that argument doesn’t apply.

If you take the test cases, put them in a more realistic context, make your model of the world sufficiently large and powerful, don’t worry about compute, and imagine a variant of RLSP that somehow learns good features of the world, then I would expect that RLSP could solve most of the impact measure test cases.

## What’s the point?

Before people start pointing out how a superintelligent AI system would game the preferences learned in this way, let me be clear: the goal is not to use the inferred preferences as a utility function. There are many reasons this is a bad idea, but one argument is that unless you have a good mistake model, you can’t exceed human performance -- which means that (for the most part) you want to leave the state the way it already is.

In other words, we are also not trying to achieve the “Prevent catastrophe” desideratum above. We are instead going for the weaker goal of preventing some bad outcomes, and learning more of human preferences without increasing the burden on the human overseer.

You can also think of this as a contribution to the overall paradigm of value learning: the state of the world is an especially good source of information about our preferences on what not to do, which are particularly hard to get feedback on.

If I had to point towards a particular concrete path to a good future, it would be the one that I outlined in Following human norms. We build AI systems that have a good understanding of “common sense” or “how to behave normally in human society”; they accelerate technological development and improve decision-making; if we really want to have a goal-directed AI that is not under our control but that optimizes for our values, then we solve the full alignment problem in the future. Inferring preferences or norms from the world state could be a crucial part of helping our AI systems understand “common sense”.

## Limitations

There are a bunch of reasons why you couldn’t take RLSP, run it on the real world and hope to get a set of preferences that prevent you from causing negative impacts. Many of these are interesting directions for future work:

Things we don’t affect. We can’t affect quasars even if we wanted to, and so quasars are not optimized for our preferences, and RLSP will not be able to infer anything about our preferences about quasars.

We are optimized for the environment. You might reply that we don’t really have strong preferences about quasars (but don’t we?), but even then evolution has optimized us to prefer our environment, even though we haven’t optimized it. For example, you could imagine that RLSP infers that we don’t care about the composition of the atmosphere, or infers that we prefer there to be more carbon dioxide in the atmosphere. Thanks to Daniel Filan for making this point way back at the genesis of this project.

Multiple agents. RLSP assumes that there is exactly one human acting in the environment; in reality there are billions, and they do not have the same preferences.

Non-static preferences. Or as Stuart Armstrong likes to put it, our values are underdefined, changeable, and manipulable, whereas RLSP assumes they are static.

Not robust to misspecification and imperfect models. If you have an incorrect model of the dynamics, or a bad featurization, you can get very bad results. For example, if you can tell the difference between dusty vases and clean vases, but you don’t realize that by default dust accumulates on vases over time, then you infer that Alice actively wants her vase to be dusty.
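The dusty vase example can be sketched in the same brute-force style. This is an illustration under loud assumptions, not the paper's code: the `add_dust` action, the three candidate weights on a single “vase is dusty” feature, the clean start state, and the horizon of 3 are all invented, and the “true” dynamics are simplified so that dust settles within one step no matter what Alice does.

```python
import math

STATES = ["clean", "dusty"]
ACTIONS = ["leave", "add_dust"]


def true_step(state, action):
    """Assumed true dynamics: dust settles whatever Alice does."""
    return "dusty"


def wrong_step(state, action):
    """Misspecified model: a vase only gets dusty if Alice adds dust."""
    return "dusty" if action == "add_dust" else state


def likelihood_dusty(theta, step, horizon=3):
    """P(final state is dusty | theta) for a Boltzmann-rational Alice."""
    def reward(s):
        return theta if s == "dusty" else 0.0

    # Backward pass: soft value iteration, collecting per-step policies.
    value = {s: reward(s) for s in STATES}
    policies = []
    for _ in range(horizon):
        policy, new_value = {}, {}
        for s in STATES:
            q = {a: reward(s) + value[step(s, a)] for a in ACTIONS}
            top = max(q.values())
            log_z = top + math.log(sum(math.exp(v - top) for v in q.values()))
            policy[s] = {a: math.exp(q[a] - log_z) for a in ACTIONS}
            new_value[s] = log_z
        value = new_value
        policies.append(policy)
    policies.reverse()

    # Forward pass, starting from a clean vase.
    dist = {"clean": 1.0, "dusty": 0.0}
    for policy in policies:
        nxt = {s: 0.0 for s in STATES}
        for s, p in dist.items():
            for a in ACTIONS:
                nxt[step(s, a)] += p * policy[s][a]
        dist = nxt
    return dist["dusty"]


def posterior(step, thetas=(-1.0, 0.0, 1.0)):
    """Uniform prior over thetas, updated on observing a dusty vase."""
    like = {t: likelihood_dusty(t, step) for t in thetas}
    total = sum(like.values())
    return {t: l / total for t, l in like.items()}


with_true_model = posterior(true_step)
with_wrong_model = posterior(wrong_step)
print(with_true_model)
print(with_wrong_model)
```

With the true dynamics, a dusty vase is equally likely under every hypothesis, so the posterior stays uniform: the dust says nothing about Alice. With the misspecified dynamics, the only route to a dusty vase is Alice adding dust herself, so the posterior shifts toward the hypothesis that she wants it dusty.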

Using finite-horizon policy for Alice instead of an infinite-horizon policy. The math in RLSP assumes that Alice was optimizing her reward over an episode that would end exactly when the robot is deployed, so that the observed state is Alice’s “final state”. This is clearly a bad model, since Alice will still be acting in the environment after the robot is deployed. For example, if the robot is deployed the day before Alice is scheduled to move, the robot might infer that Alice really wants there to be a lot of moving boxes in her living space (rather than realizing that this is an instrumental goal in a longer-term plan).

There’s no good reason for using a finite horizon policy for Alice. We were simply following Maximum Causal Entropy IRL, which makes this assumption (which is much more reasonable when you observe demonstrations rather than the state of the world), and didn’t realize our mistake until we were nearly done. The finite horizon version worked sufficiently well that we didn’t redo everything with the infinite horizon case, which would have been a significant amount of work.


> In other words, we are also not trying to achieve the “Prevent catastrophe” desideratum above.

I'm confused that this idea is framed as an alternative to impact measures, because I thought the main point of impact measures is "prevent catastrophe" and this doesn't aim to do that. In the AI that RLSP might be a component of, what is doing the "prevent catastrophe" part?

Can you also compare the pros and cons of this idea with other related ideas, for example large-scale IRL? (I'm imagining attaching recording devices to lots of people and recording their behavior over say months or years and feeding that to IRL.)

> This is a challenging problem – we have two sources of information, the inferred reward from s_0, and the specified reward θ_spec, and they will conflict.

It seems like there's gotta be a principled way to combine this idea with inverse reward design. Is that something you've thought about?

Oh, thanks for sharing the link to the paper review site. It's interesting to read the reviewers' comments, and encouraging to see some of the proposed reforms for academic peer review being implemented.

> I'm confused that this idea is framed as an alternative to impact measures, because I thought the main point of impact measures is "prevent catastrophe" and this doesn't aim to do that.

I didn't mean to frame it as an alternative to impact measures, but it is achieving some of the things that impact measures achieve. Partly I wrote this post to explicitly say that I don't imagine RLSP being a drop-in replacement for impact measures, even though it might seem like that could be true. I guess I didn't communicate that effectively.

> In the AI that RLSP might be a component of, what is doing the "prevent catastrophe" part?

That depends more on the AI part than on RLSP. I think the actual contribution here is the observation that the state of the world tells us a lot about what humans care about, and the RLSP algorithm is meant to demonstrate that it is in principle possible to extract those preferences.

If I were forced to give an answer to this question, it would be that RLSP would form a part of a norm-following AI, and that because the AI was following norms it wouldn't do anything too crazy. However, RLSP doesn't solve any of the theoretical problems with norm-following AI.

But the real answer is that this is an observation that seems important, but I don't have a story for how it leads to us solving AI safety.

> Can you also compare the pros and cons of this idea with other related ideas, for example large-scale IRL? (I'm imagining attaching recording devices to lots of people and recording their behavior over say months or years and feeding that to IRL.)

Any scenario I construct with RLSP has clear problems, and similarly large-scale IRL also has clear problems. If you provide particular scenarios I could analyze those.

For example, if you literally think just of running RLSP with a time horizon of a year vs. large-scale IRL over a year and optimizing the resulting utility function, large-scale IRL should do better because it has way more data to work with.

> It seems like there's gotta be a principled way to combine this idea with inverse reward design. Is that something you've thought about?

Yeah, I agree they feel very composable. The main issue is that the observation model in IRD requires a notion of a "training environment" that's separate from the real world, whereas RLSP assumes that there is one complex environment in which you are acting.

Certainly if you first trained your AI system in some training environments and then deployed it in the real world, you could use IRD during training to get a distribution over reward functions, and then use that distribution as your prior when running RLSP. It's maybe plausible that if you did this you could simply optimize the resulting reward function, rather than doing risk-averse planning (which is how IRD gets the robot to avoid lava); that would be cool. It's hard to test because none of the IRD environments satisfy the key assumption of RLSP (that humans have optimized the environment for their preferences).
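As a sketch of that combination: if the IRD output and the state-of-the-world likelihood are both defined over the same finite set of reward hypotheses, using one as the prior for the other is a single Bayes update. The hypothesis names and all the numbers below are placeholders, not results from either method.

```python
def bayes_update(prior, likelihood):
    """Posterior over hypotheses, proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("prior and likelihood have no overlapping support")
    return {h: w / total for h, w in unnormalized.items()}


# Placeholder distribution over rewards, standing in for what IRD might
# return after training.
ird_posterior = {"avoid_vases": 0.5, "ignore_vases": 0.5}

# Placeholder values of P(observed state | reward), standing in for what an
# RLSP-style simulation of the past might return.
state_likelihood = {"avoid_vases": 0.9, "ignore_vases": 0.3}

combined = bayes_update(ird_posterior, state_likelihood)
print(combined)
```

Here the training signal alone cannot separate the two hypotheses, and the observed state tips the balance three to one toward `avoid_vases`.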

> Partly I wrote this post to explicitly say that I don’t imagine RLSP being a drop-in replacement for impact measures, even though it might seem like that could be true.

Ah ok, it didn't occur to me in the first place that RLSP could be a replacement for impact measures (it seemed more closely related to IRL), so the comparisons you did with impact measures made me think you're trying to frame it as a possible replacement for impact measures.

> For example, if you literally think just of running RLSP with a time horizon of a year vs. large-scale IRL over a year and optimizing the resulting utility function, large-scale IRL should do better because it has way more data to work with.

I guess I wasn't thinking in terms of a fixed time horizon, but more like: given similar budgets, can you do more with RLSP or IRL? For example, it seems like increasing the time horizon of RLSP might be cheaper than doing large-scale IRL over a longer period of time, so maybe RLSP could actually work with more data at a similar budget.

Also, more data has diminishing returns after a while, so I was also asking: if you had the budget to do large-scale IRL over, say, a year, and you could instead do RLSP over a longer time horizon and get more data as a result, do you think that would give RLSP a significant advantage over IRL?

(But maybe these questions aren't very important if the main point here isn't offering RLSP as a concrete technique for people to use but more that "state of the world tells us a lot about what humans care about".)

> (But maybe these questions aren't very important if the main point here isn't offering RLSP as a concrete technique for people to use but more that "state of the world tells us a lot about what humans care about".)

Yeah, I think that's basically my position.

But to try to give an answer anyway, I suspect that the benefits of having a lot of data via large-scale IRL will make it significantly outperform RLSP, even if you could get a longer time horizon on RLSP. There might be weird effects where the RLSP reward is less Goodhart-able (since it tends to prioritize keeping the state the same) that make the RLSP reward better to maximize, even though it captures fewer aspects of "what humans care about". On the other hand, RLSP is much more fragile; slight errors in dynamics / features / action space will lead to big errors in the inferred reward; I would guess this is less true of large-scale IRL, so in practice I'd guess that large-scale IRL would still be better. But both would be bad.