IRL is hard

by Vanessa Kosoy7 min read13th Sep 2016No comments


Personal Blog

We show that assuming the existence of public-key cryptography, there is an environment in which Inverse Reinforcement Learning is computationally intractable, even though the "teacher" agent, the environment and the utility functions are computable in polynomial-time and there is only 1 bit of information to learn.


We call "Teacher" the role-model agent which optimizes the "true" utility function (in the context of AI safety this is typically a human) and "Student" the agent performing IRL (the AI). The environment we construct consists of two entities which we call the Gamemaster and the Adversary. The Adversary is an optimizer whose utility is minus the Teacher's utility. To reduce the setup to a single agent environment, we can interpret the Adversary as a specific optimization algorithm which either has access to the Teacher's source code or is a reinforcement learning algorithm that learns from preceding episodes. At any rate, the Adversary is only of secondary importance, as explained in the Discussion. The entire setup depends on a parameter w.r.t. which the complexities of the computations are measured.

The utility function of the Teacher depends on a bit which is unknown to the Student (i.e. the Student's utility function is defined according to a uniform prior on ) but known to the Gamemaster. Learning is the purpose of IRL in this setting. The Student is assumed to passively observe the Teacher interacting with the environment for epsiodes for some polynomial , after which it is required to assume the Teacher's role. For the environment we construct, no polynomial-time Student cannot achieve significantly better performance than the Lazy Student who ignores the Teacher and acts according to its prior.

We fix a public-key encryption scheme which encrypts any message of size to a message of size (a trapdoor permutation). Given a public (encryption) key and a message , we denote its encryption . Given a private (decryption) key and an encrypted message , we denote its decryption . Thus, if is a matching key pair then for all , .

Below is a step by step description of a single episode of interaction between the Teacher and the environment. Each entity has access to random bits hidden from the other entities. In each step, one of the entities performs some computations and posts a message. Teacher and Gamemaster post to a public channel visible to all (including the Student). Adversary posts to a private channel visible only to the Gamemaster.

  1. Gamemaster randomly generates two keys pairs: , . It posts the public keys and .

  2. Adversary chooses , and and sends them to Gamemaster.

  3. Teacher posts two messages .

  4. Gamemaster computes , and generates a random bit . If , it randomly generates s.t. . If , it randomly generates s.t. . It posts and .

  5. Teacher posts and .

  6. Gamemaster posts , , .

After an episode, the Teacher receives reward , where


For each , there is an optimal (up to asymptotically negligible terms) Teacher strategy among strategies that are computable in polynomial-time. This strategy is essentially unique, i.e. any other such strategy is computationally indistinguishable in polynomial time. From the two agent perspective, we have an essentially unique Nash polynomial-time equilibrium between the Teacher and the Adversary.

Specifically, in step 2 the Teacher should generate random and compute , . In step 4 the Teacher computes and . It then chooses s.t. if possible (which happens when ) or randomly if impossible (which happens when ). is chosen to be .

In equilibrium, the Adversary generates , and randomly.

This strategy yields which is optimal since when , and when , . It yields which is unavoidable for the above Adversary strategy and which also cannot be improved. The choice of and has to be (pseudo)random since the Adversary can exploit any deviation from the uniform distribution to lower . The choice of in case has to be random since otherwise the Adversary will lower .

From the Student's perspective, all messages appear completely random regardless of . That is, the Student observes but gains no information about and separately. Thus, observing a polynomial number of episodes yields no information that can be used by the Student in polynomial-time.


The assumption we have a trapdoor permutation rather than any trapdoor functions seems inessential, and was made for simplicity of analysis (it only affects considerations related to the Adversary). If public-key cryptography is impossible but one-way functions exist, the no-go results can be recovered at the cost of using a computationally expensive environment. Nevertheless, it is completely not clear how to use such an assumption for a positive result, i.e. prove the existence of a polynomial-time IRL algorithm with good performance in arbitrary polynomial-time environments. At any rate, this is unlikely.

The Adversary's function in this setup is making the optimal strategy unique. Otherwise there are arbitrary choices the Teacher can make. In passive IRL this isn't important but in CIRL the Teacher can purposefully use these extra degrees of freedom to transmit a signal to the Student. Even with the Adversary, for sufficiently slow falling time discounts the teacher will be willing to pay a cost to transmit a signal to the Student, superficially invalidating the no-go result. Indeed, if we consider a CIRL environment in which the Teacher and the Student control different degrees of freedom then such signalling is the only way the Teacher can convey the utility function to the Students without further assumptions (see also this). However, I am mildly skeptical regarding this approach: the Teacher would have to observe the Student and adjust its own communication protocol accordingly which seems too demanding of a fallible human operator in the context of strong AI safety.

Moreover, even if we consider a setup in which the Teacher and Student swap control back and forth, the no-go result still stands as long as the environment knows who is in control at any given moment. To see this, just assume that whenever the Student is in control, the Gamemaster sets .

On the other hand, the no-go results breaks down if the Teacher and Student can swap control unbeknownst to the environment (i.e. the environment is only allowed to depend on this through the actions during the game). It seems reasonable to let the Student decide who is "behind the wheel" at any given moment, similarly to adaptive learning. It seems likely that in this setup it is possible to define a polynomial time Student policy with good performance guarantees for arbitrary environments. Very roughly, imagine the Student switching between four states (a quadrilemma of exploitation-exploration flavor): simulating a Teacher with , simulating a Teacher with , yielding control to the Teacher and playing the optimal strategy according to current knowledge of . If the choice of state involves randomization, the environment cannot account for it and by comparing behavior of the system during simulation episodes, simulation episodes and Teacher episodes, it is possible to gain knowledge about the true value of . This is significantly stronger than imitation learning since in the 4th type of episode the Student can outperform the Teacher. The detailed fleshing out of this idea will be the subject of future work.

Of course in reality the environment cannot be isolated from any part of the setup, including random generators and the Student-Teacher control router. However, it seems plausible that setups can be arranged in which this is approximately true (imagine e.g. a Twitter account that either the human or the AI controls) and possibly such approximate decoupling can be used to prove sufficient performance guarantees. Needless to say, in reality there are many other factors that have to be addressed such as the Teacher falling short of even bounded optimality, a huge space of possible utility functions etc.

Personal Blog


6 comments, sorted by Highlighting new comments since Today at 1:36 AM
New Comment

I think this needs to be simplified, so it's clear what is going on. Removing the Adversary would be useful; if the environment hates you, it's no surprise it can hurt you, including preventing learning.

A naive approach would be for the student to say "set the next to ", sacrificing a single reward session for perfect information. Thus it seems this counterexample only works when the environment really punishes any deviations from ideal behaviour, and prevents any types of communication.

On first reading, just ignore the Adversary and consider only the term of the reward.

It is actually not a priori obvious that adversarial environments can prevent learning. I might be mistaken, but I don't think there is a substantially simpler example of this for IRL. Online learning and adversarial multi-armed bandit algorithms can deal with adversarial environments (thanks to randomization). Moreover, I claim that the setup I describe in the Discussion (allowing the Student to switch control between itself and Teacher without the environment knowing it) admits IRL algorithms which satisfy a non-trivial average regret bound for arbitrary environments.

I think that solutions based on communication are not scalable to applications in strong AI safety. Humans are not able to formalize their preferences and are therefore unable to communicate them to an AI. This is precisely why I want a solution based on the AI observing revealed preferences instead.

I agree strongly with the general principle "we need to be able to prove guarantees about our learning process in environments rich enough to be realistic", but I disagree with the claim that this shows a flaw in IRL. Adversarial environments seem to me very disanalogous to learning complicated and implicit preferences in a non-adversarial environment.

(You and I talked about this a bit, and I pointed out that computational complexity issues only block people in practice when the setup needs to be adversarial, e.g. intentional cryptography to prevent an adversary from reading a message. SAT instances are NP-hard, but in practice SAT solvers are quite useful; similarly, the Traveling Salesman problem is NP-hard, but in practice people only care about approximate solutions and those can be computed effectively.)

I disagree that adversarial environments are "disanalogous". If they are indeed disanalogous, we should be able to formally define a natural class of environment that excludes them but is still sufficiently rich e.g. it should allow other agents of similar complexity. If there is such a class, I would be glad to see the definition. I expect that in most complex polynomial-time environments IRL is be able to extract some information about the utility function but there is a significant fraction it is unable to extract.

I strongly disagree with "computational complexity issues only block people in practice when the setup needs to be adversarial." My experience is that that computational complexity issues block people all the time. If someone invented a black box that solves arbitrary instances of SAT in practical time it would be a revolution in the technological capacities of civilization. Electronic and mechanical design would be almost fully automated (and they are currently very far from automated; the practical utility of SAT solvers mentioned in Wikipedia is very restricted). Algorithm engineering would become more or less redundant. Mathematical theorem proving would be fully automated. Protein folding would be solved. Training complex neural networks would become trivial. All of the above would work much better than humanity ability.

Also, if you were right people would be much less excited about quantum computers (that is, the excitement might be still unjustified because quantum computer are infeasible or because they are feasible but don't solve many interesting problems, but at least there's a chance they will solve many interesting problems).

Obviously there are NP-hard optimization problems with approximate solutions that are good enough for practical applications but it means little about the existence of good enough IRL algorithms. Indeed, if we accept the fragility of value thesis, even small inaccuracies in value learning may have catastrophic consequences. Moreover, in many case there are likely to be strong theoretical limitations to approximation of NP-hard problems (PCP theorem, Unique Games Conjecture).

Finally, Impagliazzo and Levin showed that if cryptography is possible at all then there are almost uniform distributions for NP-complete problems that make them hard in the average case. Thus, hard instances are not very special and your problem has to be special in order to avoid them.

we should be able to formally define a natural class of environment that excludes them but is still sufficiently rich

I disagree. There's a lot of "no free lunch" theorems out there that in principle show various things (eg no agent can be intelligent across all environments) that in practice require very specific environments for the bad stuff to hurt the agent (ie the environment has to behave, in practice, as if it was a smarter adversarial agent that specifically hated the first agent).

Allow me to recapitulate my point.

We want to find algorithms that satisfy provable performance guarantees. A "no free lunch" theorem shows that a specific performance guarantee is impossible. Therefore we should look for other performance guarantees, either by changing the setting or changing the class of environments or doing something else.

My counterexample shows that certain natural performance guarantees are unsatisfiable. Therefore it is important to look for other natural settings / desirada. In particular I suggest one solution which seems natural and feasible, namely allowing the Student to randomize control between itself and the Teacher. This variant of IRL thereby seems qualitatively more powerful than the original.

Also, obviously the first counterexample constructed is always the most "adversarial" one since it's the easiest to prove. This doesn't mean that there is an algorithm that works well in most other cases. Given that we are in the AI safety business, the burden of proof is one the claim that the "bad" environment is exceptional, not vice versa. Moreover, my intuition is that this counterexample is not exceptionally bad, it's just maximally bad, i.e. in a typical scenario IRL will only be able to extract a (not 0 but also not 1) portion of the important information about the utility function. If you can prove me wrong, I will be glad to see it!