Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter. http://rohinshah.com/

Rohin Shah's Comments

ACDT: a hack-y acausal decision theory

Thanks! I changed it to:

If the predictor is near-perfect, but the agent models its actions as independent of the predictor (since the prediction was made in the past), then the agent will have some belief about the prediction and will choose the less likely action for expected utility at least 1, and will continually lose.

The problem with the previous agent is that it never learns that it has the wrong causal model. If the agent is able to learn a better causal model from experience, then it can learn that the predictor can actually predict the agent successfully, and so will no longer expect a 50% chance of winning, and it will stop playing the game.
AI Safety Debate and Its Applications

Planned summary for the Alignment Newsletter:

This post defines the components of a <@debate@>(@AI safety via debate@) game, lists some of its applications, and defines truth-seeking as the property that we want. Assuming that the agent chooses randomly from the possible Nash equilibria, the truth-promoting likelihood is the probability that the agent picks the actually correct answer. The post then shows the results of experiments on MNIST and Fashion MNIST, seeing comparable results to the original paper.
ACDT: a hack-y acausal decision theory

Planned summary for the previous post for the Alignment Newsletter:

Consider a setting in which an agent can play a game against a predictor. The agent can choose to say zero or one. It gets 3 utility if it says something different from the predictor, and -1 utility if it says the same thing. If the predictor is near-perfect, but the agent models itself as having access to unpredictable randomness, then the agent will continually try to randomize (which it calculates has expected utility 1), and will continually lose.

Planned summary for this post:

The problem with the previous agent is that it never learns that it has the wrong causal model. If the agent is able to learn a better causal model from experience, then it can learn that it is not actually able to use unpredictable randomness, and so it will no longer expect a 50% chance of winning, and it will stop playing the game.
Embedded Agency via Abstraction

Asya's summary for the Alignment Newsletter:

<@Embedded agency problems@>(@Embedded Agents@) are a class of theoretical problems that arise as soon as an agent is part of the environment it is interacting with and modeling, rather than having a clearly-defined and separated relationship. This post makes the argument that before we can solve embedded agency problems, we first need to develop a theory of _abstraction_. _Abstraction_ refers to the problem of throwing out some information about a system while still being able to make predictions about it. This problem can also be referred to as the problem of constructing a map for some territory.
The post argues that abstraction is key for embedded agency problems because the underlying challenge of embedded world models is that the agent (the map) is smaller than the environment it is modeling (the territory), and so inherently has to throw some information away.
Some simple questions around abstraction that we might want to answer include:
- Given a map-making process, characterize the queries whose answers the map can reliably predict.
- Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa.
- Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself.
- Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.
The post argues that once we create the simple theory, we will have a natural way of looking at more challenging problems with embedded agency, like the problem of self-referential maps, the problem of other map-makers, and the problem of self-reasoning that arises when the produced map includes an abstraction of the map-making process itself.

Asya's opinion:

My impression is that embedded agency problems as a class of problems are very young, extremely entangled, and characterized by a lot of confusion. I am enthusiastic about attempts to decrease confusion and intuitively, abstraction does feel like a key component to doing that.
That being said, my guess is that it’s difficult to predictably suggest the most promising research directions in a space that’s so entangled. For example, one thread in the comments of this post discusses the fact that this theory of abstraction as presented looks at “one-shot” agency where the system takes in some data once and then outputs it, rather than “dynamic” agency where a system takes in data and outputs decisions repeatedly over time. Abram Demski argues that the “dynamic” nature of embedded agency is a central part of the problem and that it may be more valuable and neglected to put research emphasis there.
Malign generalization without internal search

Planned summary for the Alignment Newsletter:

This post argues that agents can have <@capability generalization without objective generalization@>(@2-D Robustness@), _without_ having an agent that does internal search in pursuit of a simple mesa objective. Consider an agent that learns different heuristics for different situations which it selects from using a switch statement. For example, in lunar lander, if at training time the landing pad is always red, the agent may learn a heuristic about which thrusters to apply based on the position of red ground relative to the lander. The post argues that this selection across heuristics could still happen with very complex agents (though the heuristics themselves may involve search).

Planned opinion:

I generally agree that you could get powerful agents that nonetheless are "following heuristics" rather than "doing search"; however, others with differing intuitions did not find this post convincing.
Clarifying "AI Alignment"
A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user's subjective perspective.

Sorry, that's right. Fwiw, I do think subjective regret bounds are significantly better than the thing I meant by definition-optimization.

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Why doesn't this also apply to subjective regret bounds?

My guess at your answer is that Alpha wouldn't take the irreversible action as long as the user believes that Alpha is not in Beta-simulation-world. I would amend that to say that Alpha has to know that [the user doesn't believe that Alpha is in Beta-simulation-world]. But if Alpha knows that, then surely Alpha can predict that the user would not confirm the irreversible action?

It seems like for subjective regret bounds, avoiding this scenario depends on your prior already "knowing" that the user thinks that Alpha is not in Beta-simulation-world (perhaps by excluding Beta-simulations). If that's true, you could do the same thing with intent alignment / corrigibility.

Besides the fact ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I'm missing something.

It isn't equivalent to intent alignment; but it is meant to be used as part of an argument for safety, though I guess it could be used in definition-optimization too, so never mind.

I am curious whether you can specify, as concretely as possible, what type of mathematical result would you have to see in order to significantly update away from this opinion.

That is hard to say. I would want to have the reaction "oh, if I built that system, I expect it to be safe and competitive". Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.

I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models.

Tbc, I also don't think intent alignment will lead to a mathematical formalization I'm happy with -- it "passes the buck" to the problem of defining what "trying" is, or what "corrigibility" is.

Clarifying "AI Alignment"
This opens the possibility of agents that with "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user.

Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.

In the above scenario, is Alpha "motivation-aligned"

If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.

But, such a concept would depend in complicated ways on the agent's internals.

That, or it could depend on the agent's counterfactual behavior in other situations. I agree it can't be just the action chosen in the particular state.

Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).

I guess you wouldn't count universality. Overall I agree. I'm relatively pessimistic about mathematical formalization. (Probably not worth debating this point; feels like people have talked about it at length in Realism about rationality without making much progress.)

it refers to the actual things that agent does, and the ways in which these things might have catastrophic consequences.

I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.

Realism about rationality
I disagree with the version that replaces 'MIRI's theories' with 'mathematical theories of embedded rationality'

Yeah, I think this is the sense in which realism about rationality is an important disagreement.

But also, to the extent that your theory is mathematisable and comes with 'error bars'

Yeah, I agree that this would make it easier to build multiple levels of abstractions "on top". I also would be surprised if mathematical theories of embedded rationality came with tight error bounds (where "tight" means "not so wide as to be useless"). For example, current theories of generalization in deep learning do not provide tight error bounds to my knowledge, except in special cases that don't apply to the main successes of deep learning.

When I read a MIRI paper, it typically seems to me that the theories discussed are pretty abstract, and as such there are more levels below than above. [...] They are also mathematised enough that I'm optimistic about upwards abstraction having the possibility of robustness.

Agreed.

The levels below seem mostly unproblematic (except for machine learning, which in the form of deep learning is often under-theorised).

I am basically only concerned about machine learning, when I say that you can't build on the theories. My understanding of MIRI's mainline story of impact is that they develop some theory that AI researchers use to change the way they do machine learning that leads to safe AI. This sounds to me like there are multiple levels of inference: "MIRI's theory" -> "machine learning" -> "AGI". This isn't exactly layers of abstraction, but I think the same principle applies, and this seems like too many layers.

You could imagine other stories of impact, and I'd have other questions about those, e.g. if the story was "MIRI's theory will tell us how to build aligned AGI without machine learning", I'd be asking when the theory was going to include computational complexity.

rohinmshah's Shortform

I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.

Define the reachability , where  is the optimal policy for getting from to , and is the length of the trajectory. This is the notion of reachability both in the original paper and the new one.

Then, for the new paper when using a baseline, the future task value is:

where is the baseline state and is the future goal.

In a deterministic environment, this can be rewritten as:

Here, is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.

Note that the first term only depends on the number of timesteps, since it only depends on the baseline state s'. So for a fixed time step, the first term is a constant.

The optimal value function in the new paper is (page 3, and using my notation of instead of their ):

.

This is the regular Bellman equation, but with the following augmented reward (here is the baseline state at time t):

Terminal states:

Non-terminal states:

For comparison, the original relative reachability reward is:

The first and third terms in are very similar to the two terms in . The second term in only depends on the baseline.

All of these rewards so far are for finite-horizon MDPs (at least, that's what it sounds like from the paper, and if not, they could be anyway). Let's convert them to infinite-horizon MDPs (which will make things simpler, though that's not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of (to account for the fact that the agent gets that reward infinitely often, rather than just once as in the original MDP). Also define for convenience. Then, we have:

Non-terminal states:

What used to be terminal states that are now self-loop states:

Note that all of the transformations I've done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We're ready for analysis. There are exactly two differences between relative reachability and future state rewards:

First, the future state rewards have an extra term, .

This term depends only on the baseline . For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn't matter.

For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals that involve sushi.

Second, in non-terminal states, relative reachability weights the penalty by instead of . Really since and thus is an arbitrary hyperparameter, the actual big deal is that in relative reachability, the weight on the penalty switches from in non-terminal states to the smaller in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down faster. (This is also clear from the original paper: since it's a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)

Summary: The actual effects of the new paper's framing 1. removes the "extra" incentive to finish the task quickly that relative reachability provided and 2. adds an extra reward term that does nothing for starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.

(That said, it starts from a very different place than the original RR paper, so it's interesting that they somewhat converge here.)

Load More