

## Highlights

Solving the Rubik's Cube Without Human Knowledge (Stephen McAleer, Forest Agostinelli, Alexander Shmakov et al): This paper proposes Autodidactic Iteration (ADI), which is a technique that can be combined with the techniques in AlphaGo and expert iteration to solve problems with only one goal state, such as the Rubik's cube. MCTS with value and policy networks will not suffice, because when starting from a randomly scrambled cube, MCTS will never find a path to the goal state, and so there will never be any reward signal. (Whereas with Go, even if you play randomly the game will end relatively quickly, giving you some reward signal.) To get around this, they start from the goal state and generate states that are near the goal state. This gives them a training dataset of states for which they know (a good approximation to) the value and the best action, which they can use to train a value and policy network. They then use this with MCTS to solve the full problem, as in AlphaGo.

My opinion: This general idea has been proposed in robotics as well, in Reverse Curriculum Generation for Reinforcement Learning, where there is a single goal state. However, in this setting we have the added benefit of perfect inverse dynamics, that is, for any action a that moves us from state s to s', we can find the inverse action a' that moves us from state s' to s. This allows the authors to start from the goal state, generate nearby states, and automatically know the value of those states (or at least a very good approximation to it). Hindsight Experience Replay also tackles similar issues -- I'd be interested to see if it could solve the Rubik's cube. Overall, the problem of sparse rewards is very difficult, and it seems like we now have another solution in the case where we have a single goal state and perfect (or perhaps just sufficiently good?) inverse dynamics.
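To make the "start from the goal and work backwards" idea concrete, here is a minimal sketch on a toy puzzle rather than the Rubik's cube. Everything in it is an assumption made for illustration: the puzzle (sorting four tiles with adjacent swaps), the reward scheme (+1 for a move that reaches the goal, -1 otherwise), the function names, and the use of lookup tables and a greedy rollout in place of the paper's neural networks and MCTS.

```python
import random

GOAL = (0, 1, 2, 3)
# Adjacent swaps. Each move is its own inverse, so scrambling from GOAL
# only produces states that can be brought back to GOAL.
MOVES = [(0, 1), (1, 2), (2, 3)]

def apply_move(state, move):
    i, j = move
    tiles = list(state)
    tiles[i], tiles[j] = tiles[j], tiles[i]
    return tuple(tiles)

def scrambles_from_goal(depth, count, rng):
    """Random walks that start at GOAL; every non-goal state visited is a sample."""
    states = []
    for _ in range(count):
        state = GOAL
        for _ in range(depth):
            state = apply_move(state, rng.choice(MOVES))
            if state != GOAL:
                states.append(state)
    return states

def lookahead_target(state, value):
    """One-step lookahead giving a value target and a best-move target.

    Assumed reward: +1 if the move reaches GOAL, otherwise -1 plus the
    current value estimate of the child (0.0 for states not seen yet).
    """
    best_q, best_move = float("-inf"), None
    for move in MOVES:
        child = apply_move(state, move)
        q = 1.0 if child == GOAL else -1.0 + value.get(child, 0.0)
        if q > best_q:
            best_q, best_move = q, move
    return best_q, best_move

def train(iterations=50, depth=8, count=50, seed=0):
    """Tables stand in for the value and policy networks."""
    rng = random.Random(seed)
    value, policy = {}, {}
    for _ in range(iterations):
        for state in scrambles_from_goal(depth, count, rng):
            value[state], policy[state] = lookahead_target(state, value)
    return value, policy

def solve(state, policy, max_steps=10):
    """Greedy rollout of the learned policy (the paper uses MCTS here)."""
    path = []
    while state != GOAL and len(path) < max_steps:
        move = policy[state]
        path.append(move)
        state = apply_move(state, move)
    return path if state == GOAL else None

value, policy = train()
print(solve((3, 2, 1, 0), policy))
```

The point of the sketch is only that every training state comes with a usable target, because it was generated by walking away from the goal; no search from a random scramble is needed to get a reward signal.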

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior (Siddharth Reddy et al): Inverse reinforcement learning algorithms typically assume that the demonstrations come from an expert who is approximately optimal. However, this is often not the case, at least when the experts are fallible humans. This paper considers the case where the expert has an incorrect model of the dynamics (transition function) of the environment, and proposes learning the expert's model of the dynamics to improve reward function inference. However, this leads to severe unidentifiability problems, where many models of the dynamics are compatible with the observed behavior. To overcome this, they assume that they have multiple tasks with known reward functions, which they use to infer the expert's dynamics. This is then used to infer the reward function in a new task using an adaptation of max causal entropy IRL. The dynamics can be an arbitrary neural net while the reward function is a weighted linear combination of features. They evaluate the inference of the dynamics model with real humans on Lunar Lander. Given transcripts of humans playing Lunar Lander, they infer the underlying (incorrect) dynamics model. Then, when the human takes an action, they predict which next state the human wanted to achieve, and replace the human's action with the action that would actually get close to the state the human wanted.

My opinion: I really like that this paper has experiments with real humans. It's definitely a problem that IRL assumes that the expert is (approximately) optimal -- this means that you can't learn where the expert is likely to be wrong, and so it is hard to exceed the expert's performance. It's very difficult to figure out how to deal with the possibility of a biased expert, and I'm happy to see work that takes a shot at it.
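The action-replacement step described in the summary can be sketched as follows. This is a toy one-dimensional illustration, not the paper's Lunar Lander setup: `believed_step`, `true_step`, the action set, and the specific mismatch (the user thinks thrust is twice as strong as it is) are all hypothetical stand-ins for the learned internal dynamics model and the real environment.

```python
ACTIONS = [-2, -1, 0, 1, 2]

def believed_step(state, action):
    # Hypothetical inferred internal model: the user believes each unit
    # of thrust moves them 2 units.
    return state + 2.0 * action

def true_step(state, action):
    # Hypothetical real dynamics: each unit of thrust moves 1 unit.
    return state + 1.0 * action

def assist(state, human_action):
    """Replace the human's action with the one whose real outcome is
    closest to the outcome the human believed they were choosing."""
    intended = believed_step(state, human_action)
    return min(ACTIONS, key=lambda a: abs(true_step(state, a) - intended))

print(assist(0.0, 1))
```

With these assumed models, a user pressing `1` expects to land at 2.0, so the assistant substitutes the action that really gets there.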

# Technical AI alignment

### Problems

How the Enlightenment Ends (Henry A. Kissinger): This is an article about the dangers of AI written by a non-technologist, hitting some points that are relatively familiar.

My opinion: While there are many points that I disagree with (e.g. "what [AIs] do uniquely is not thinking as heretofore conceived and experienced. Rather, it is unprecedented memorization and computation"), overall there was a surprising amount of familiar material said in a different way (such as explainability and unintended consequences).

### Learning human intent

Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior (Siddharth Reddy et al): Summarized in the highlights!

A Framework and Method for Online Inverse Reinforcement Learning (Saurabh Arora et al): This paper introduces Incremental Inverse Reinforcement Learning (I2RL), where the agent continually gets new demonstrations from an expert, and has to update the estimate of the reward function in real time. The running example is a robot that has to navigate to a goal location without being seen by two guards that are patrolling. The robot needs to infer the rewards of the two guards in order to predict what they will do and plan around them. Since the guards are sometimes out of sight, we get demonstrations with occlusion, that is, some of the states in the demonstrations are hidden.

In the batch setting, this is solved with Latent Maximum Entropy IRL. To deal with occluded states Z, we define a probability distribution Pr(Z | Y, θ), where Y denotes the visible states and θ the reward weights. Then, you can use expectation maximization to find θ -- in the expectation step, you compute feature expectations of the demonstrations (taking an expectation over hidden states Z), and in the maximization step, you compute θ using the feature expectations as in standard maximum entropy IRL. The authors show how to extend this algorithm to the incremental setting where you only keep the reward weights, the feature expectations, and the number of past demonstrations as statistics. They show some convergence guarantees and evaluate on their running example of a robot that must evade guards.

My opinion: IRL algorithms are often more computationally expensive than state-of-the-art RL algorithms, so I'm happy to see work that's trying to make IRL more realistic. That said, this paper focuses on settings where IRL is used to infer other agents' preferences so we can plan around them (as opposed to imitation learning) -- this setting seems not very important for AI alignment. I'm also very confused by the experiments -- it seems in Figure 2 that if you ignore previous optimization and initialize the reward with random weights, it does better. (It isn't ignoring all previous data, because it still has access to past feature expectations.) They don't comment on this in the paper, but my guess is that they ran more iterations of expectation maximization (which is why the learning duration is higher) and that's why they got better performance.
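One way to see why the feature expectations and the demonstration count can be enough to carry between sessions is a running average. The sketch below is an assumption about the general shape of such an update, not the paper's actual algorithm: the class name, fields, and the count-weighted mean are all hypothetical, and the expectation step over hidden states that would produce `session_expectations` is left out.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IncrementalStats:
    """Statistics kept between sessions (hypothetical structure)."""
    feature_expectations: List[float]
    num_demos: int = 0

    def update(self, session_expectations: List[float], session_size: int) -> None:
        """Fold a new session's feature expectations into the running mean,
        weighting old and new by how many demonstrations each summarizes."""
        total = self.num_demos + session_size
        self.feature_expectations = [
            (self.num_demos * old + session_size * new) / total
            for old, new in zip(self.feature_expectations, session_expectations)
        ]
        self.num_demos = total

stats = IncrementalStats(feature_expectations=[0.0, 0.0])
stats.update([1.0, 3.0], session_size=1)
stats.update([3.0, 1.0], session_size=3)
print(stats)
```

Under this assumed update, the past demonstrations themselves never need to be stored; the maximization step would then fit θ to the current `feature_expectations`.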

Imitating Latent Policies from Observation (Ashley D. Edwards et al)

Machine Teaching for Inverse Reinforcement Learning: Algorithms and Applications (Daniel S. Brown et al)

Maximum Causal Tsallis Entropy Imitation Learning (Kyungjae Lee et al)

Safe Policy Learning from Observations (Elad Sarafian et al)

### Handling groups of agents

Learning to Teach in Cooperative Multiagent Reinforcement Learning (Shayegan Omidshafiei et al)

### Interpretability

Unsupervised Learning of Neural Networks to Explain Neural Networks (Quanshi Zhang et al)

### Verification

Verifiable Reinforcement Learning via Policy Extraction (Osbert Bastani et al): Since it is hard to verify properties of neural nets, we can instead first train a decision tree policy to mimic the policy learned by deep RL, and then verify properties about that. The authors generalize DAGGER to take advantage of the Q-function and extract decision tree policies. They then prove a correctness guarantee for a toy version of Pong (where the dynamics are known), a robustness guarantee for Pong (with symbolic states, not pixels) (which can be done without known dynamics), and stability of cartpole.

My opinion: Many people believe that ultimately we will need to prove theorems about the safety of our AIs. I don't understand yet what kind of theorems they have in mind, so I don't really want to speculate on how this relates to it. It does seem like the robustness guarantee is the most relevant one, since in general we won't have access to a perfect model of the dynamics.
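The extraction step can be illustrated with a plain DAGGER-style loop on a toy problem. This is a rough sketch under heavy assumptions: it does not use the Q-function weighting the authors add, the "decision tree" is a single-split stump fitted by brute force, and `oracle`, the one-dimensional dynamics, and all function names are hypothetical stand-ins for the deep RL policy and its environment.

```python
import random

def oracle(x):
    """Stand-in for the deep RL policy: push toward the origin."""
    return 1 if x < 0 else -1

def stump_action(threshold, x):
    """One-split decision tree: push right below the threshold, left otherwise."""
    return 1 if x < threshold else -1

def fit_stump(data):
    """Pick the split threshold with the fewest disagreements with the labels."""
    xs = sorted({x for x, _ in data})
    candidates = (
        [xs[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
        + [xs[-1] + 1.0]
    )

    def errors(t):
        return sum(stump_action(t, x) != label for x, label in data)

    return min(candidates, key=errors)

def rollout(policy, rng, steps=20):
    """Run a policy from a random start and return the states it visits."""
    x = rng.uniform(-1.0, 1.0)
    visited = []
    for _ in range(steps):
        visited.append(x)
        x += 0.1 * policy(x)
    return visited

def extract_stump(rounds=5, episodes=5, seed=0):
    rng = random.Random(seed)
    data, threshold = [], None
    for _ in range(rounds):
        # The first round follows the oracle; later rounds follow the
        # current stump, so the dataset covers states the stump reaches.
        if threshold is None:
            policy = oracle
        else:
            policy = lambda x, t=threshold: stump_action(t, x)
        for _ in range(episodes):
            # States come from the rollout policy, labels from the oracle.
            data += [(x, oracle(x)) for x in rollout(policy, rng)]
        threshold = fit_stump(data)
    return threshold

threshold = extract_stump()
print(threshold)
```

The resulting policy is a single comparison against `threshold`, which is the kind of object one can reason about directly, in contrast to the network it imitates.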

### Miscellaneous (Alignment)

When is unaligned AI morally valuable? (Paul Christiano): When might it be a good idea to hand the keys to the universe to an unaligned AI? This post looks more deeply at this question, which could be important as a backup plan if we don't think we can build an aligned AI. I can't easily summarize this, so you'll have to read the post.

A Psychopathological Approach to Safety Engineering in AI and AGI (Vahid Behzadan et al): Since AGI research aims for cognitive functions that are similar to those of humans, the resulting systems will be vulnerable to similar psychological issues. Some problems can be recast in this light -- for example, wireheading can be thought of as delusional or addictive behavior. This framework suggests new solutions to AI safety issues -- for example, analogous to behavioral therapy, we can retrain a malfunctioning agent in controlled environments to remove the negative effects of earlier experiences.

My opinion: The analogy is interesting but I'm not sure what to take away from the paper, and I think there are also big disanalogies. The biggest one is that we have to communicate our goals to an AI, whereas humans come equipped with some goals from birth (though arguably most of our goals come from the environment we grow up in). I'd be interested in seeing future work from this agenda, since I don't know how I could do work on the agenda laid out in this paper.

# AI strategy and policy

2018 White House Summit on Artificial Intelligence for American Industry (White House OSTP): See Import AI

France, China, and the EU All Have an AI Strategy. Shouldn’t the US? (John K. Delaney): See Import AI

Read more: FUTURE of AI Act

# AI capabilities

### Reinforcement learning

Solving the Rubik's Cube Without Human Knowledge (Stephen McAleer, Forest Agostinelli, Alexander Shmakov et al): Summarized in the highlights!

Gym Retro, again (Vicki Pfau et al): OpenAI is releasing the full version of Gym Retro, with over a thousand games, and a tool for integrating new games into the framework. And of course we see new games in which RL agents find infinite loops that give them lots of reward -- Cheese Cat-Astrophe and Blades of Vengeance.

Feedback-Based Tree Search for Reinforcement Learning (Daniel R. Jiang et al): See Import AI

Evolutionary Reinforcement Learning (Shauharda Khadka et al)

Learning Time-Sensitive Strategies in Space Fortress (Akshat Agarwal et al)

Learning Real-World Robot Policies by Dreaming (AJ Piergiovanni et al)

Episodic Memory Deep Q-Networks (Zichuan Lin et al)

### Meta learning

Meta-learning with differentiable closed-form solvers (Luca Bertinetto et al)

### Hierarchical RL

Hierarchical Reinforcement Learning with Deep Nested Agents (Marc Brittain et al)

Hierarchical Reinforcement Learning with Hindsight (Andrew Levy et al)

Data-Efficient Hierarchical Reinforcement Learning (Ofir Nachum et al)

### Miscellaneous (Capabilities)

The Blessings of Multiple Causes (Yixin Wang et al)
