Alignment Newsletter #15: 07/16/18

Rohin Shah

Highlights

Feature-wise transformations (Vincent Dumoulin et al): This Distill article is about transformations on features using FiLM (feature-wise linear modulation). A FiLM layer is used to "condition" a neural network on auxiliary information, which just means providing the input to the neural network in a way that it can use it effectively. This can be used to integrate multiple sources of information -- for example, in visual question answering (VQA), the main part of the network can be an image processing pipeline, and FiLM can be used to turn the natural language question about the image into a task representation and integrate it into the pipeline, and the full network can be trained end-to-end. The FiLM layer works by first using a subnetwork to turn the auxiliary information (such as the question in VQA) into a "task representation" (a new representation chosen by the neural network), which is then used as the parameters for an affine transformation of the features in the main pipeline. Importantly, each feature is treated independently of other features, so the FiLM layer can't create interactions between features. Yet, this still works well in many different contexts.
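To make the mechanics concrete, here is a minimal sketch of a FiLM layer in plain Python. It is not the article's code: the single linear layer standing in for the conditioning subnetwork, and every name and number below, are assumptions chosen for illustration.

```python
def film_generator(condition, w_gamma, b_gamma, w_beta, b_beta):
    """Map the auxiliary input to one (gamma, beta) pair per feature.

    A single linear layer stands in for the conditioning subnetwork
    (an assumption; in practice this could be any network).
    """
    gamma = [sum(w * c for w, c in zip(row, condition)) + b
             for row, b in zip(w_gamma, b_gamma)]
    beta = [sum(w * c for w, c in zip(row, condition)) + b
            for row, b in zip(w_beta, b_beta)]
    return gamma, beta


def film(features, gamma, beta):
    """Feature-wise affine transformation: gamma_i * x_i + beta_i.

    Each feature is scaled and shifted on its own; no feature mixes with another.
    """
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


# Hypothetical toy values: 3 features in the main pipeline, 2-dim conditioning input.
features = [1.0, -2.0, 0.5]
question = [1.0, 0.0]
w_gamma = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [0.0, 1.0, 0.0]
w_beta = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
b_beta = [0.5, 0.0, 0.0]

gamma, beta = film_generator(question, w_gamma, b_gamma, w_beta, b_beta)
modulated = film(features, gamma, beta)
```

Note that `film` only ever touches feature i with `gamma[i]` and `beta[i]`, which is the "no interactions between features" property mentioned above.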

Since it is a Distill article, it then covers many interesting details, such as how architectures in a variety of ML tasks can be thought of as FiLM, how FiLM relates to other ideas such as attention, how we can often interpolate between different auxiliary information by taking a weighted combination of the corresponding task representations, how conditioning through concatenation is equivalent to FiLM with only a bias and no scaling, etc.
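The last point, about concatenation, can be checked numerically. The sketch below assumes the concatenated vector is fed into a single linear layer; all weights and inputs are made-up toy values. Splitting that layer's weight matrix into a feature part and a conditioning part shows that the conditioning input only contributes an additive, input-dependent bias.

```python
def linear(weights, bias, inputs):
    """Plain linear layer: weights @ inputs + bias."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


x = [1.0, 2.0]                    # features from the main pipeline (toy values)
z = [3.0]                         # conditioning input (toy value)
w_x = [[1.0, 0.0], [0.5, 0.5]]    # weight columns that act on the features
w_z = [[2.0], [-1.0]]             # weight columns that act on the conditioning
b = [0.0, 1.0]

# Conditioning by concatenation: one linear layer applied to [x, z].
w_concat = [row_x + row_z for row_x, row_z in zip(w_x, w_z)]
out_concat = linear(w_concat, b, x + z)

# The same computation viewed as bias-only FiLM: beta(z) = w_z @ z, scale fixed at 1.
beta = linear(w_z, [0.0, 0.0], z)
out_film = [h + shift for h, shift in zip(linear(w_x, b, x), beta)]
```

Both routes give the same output, so nothing in the concatenation route ever rescales a feature as a function of `z`.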

My opinion: I really enjoy Distill articles; they are consistently far more readable and understandable than typical papers (or even blog posts), even without including the interactive visualizations. This article is no exception. I didn't have particularly strong opinions on how to condition neural nets before, but now I expect to consider FiLM and how it could apply.

Troubling Trends in ML Scholarship (Zachary C. Lipton and Jacob Steinhardt): This is a position paper arguing that ML research would benefit from more rigor, as part of the ICML debates. It identifies four trends in ML papers. First, papers often don't make clear whether they are providing an (authoritative) explanation or a speculation, in which case speculations can accidentally be cited as proven facts in other papers. Second, researchers often don't perform ablation studies, which makes it hard to figure out whether performance gains come from eg. a better algorithm or hyperparameter tuning. Third, papers often include math for the sake of conveying technical depth and impressiveness, not actual exposition, including eg. spurious theorems that are not particularly related to the main claims of the paper. Fourth, papers often misuse language by using suggestive definitions (eg. "curiosity", "fear"), overloading existing terminology, and suitcase words (words which combine many different meanings into one, leading to a very vague concept). The authors speculate on the causes (which I'm not summarizing) and have some suggestions for the community. For authors, they recommend asking what worked, and why, rather than just quantifying performance. For reviewers, they recommend asking "Might I have accepted this paper if the authors had done a worse job?" For example, if the authors hadn't done the ablation study that showed that two things didn't work, and instead just showed a combination of methods that gave a performance improvement, would I have accepted the paper?

My opinion: I strongly agree with this paper. Mathiness in particular is really annoying; often when I spend the time to actually understand the math in a paper, I come away disappointed at how it is saying something trivial or unimportant, and at this point I typically ignore the theorems unless I can't understand what the paper is saying without them. It's also really helpful to have ablation studies -- in fact, for last week's Capture the Flag paper, I probably would have written off the learned reward shaping as unimportant if the ablation study wasn't there to show it was important, after which I dug deeper and figured out what I had misunderstood. And suggestive language has in the past convinced me to read a paper, and then be surprised when the paper ended, because the actual content of the paper contained so much less than I expected. I'm a big fan of the recommendation to reviewers -- while it seems so obvious in hindsight, I've never actually asked myself that question when reviewing a paper.

Technical AI alignment

Technical agendas and prioritization

A Summary of Concrete Problems in AI Safety (Shagun Sodhani): A nice summary of Concrete Problems in AI Safety that's a lot quicker to read than the original paper.

My opinion: I like it -- I think I will send this to newer researchers as a precursor to the full paper.

Read more: Concrete Problems in AI Safety

Mechanistic Transparency for Machine Learning (Daniel Filan): One useful thread of alignment research would be to figure out how to take a neural net, and distill parts or all of it into pseudocode or actual code that describes how the neural net actually works. This could then be read and analyzed by developers to make sure the neural net is doing the right thing. Key quote: "I'm excited about this agenda because I see it as giving the developers of AI systems tools to detect and correct properties of their AI systems that they see as undesirable, without having to deploy the system in a test environment that they must laboriously ensure is adequately sandboxed."

My opinion: I would be really excited to see good work on this agenda; it would be a big step forward in how good our design process for neural nets is.

Iterated distillation and amplification

A comment on the IDA-AlphaGoZero metaphor; capabilities versus alignment (Alex Mennen): Paul Christiano has compared iterated distillation and amplification (IDA) to AlphaGo Zero. However, we usually don't think of AlphaGo Zero as having any alignment issues. Alex points out that we could think of this another way -- we could imagine that the value network represents the "goals" of AlphaGo Zero. In that case, if states get an incorrect value, that is misalignment. AlphaGo Zero corrects this misalignment through MCTS (analogous to amplification in IDA), which updates the values according to the ground truth win/loss reward (analogous to the human). This suggests that in IDA, we should be aiming for any reduction in alignment from distillation to be corrected by the next amplification step.

My opinion: I agree with this post.

Agent foundations

Bayesian Probability is for things that are Space-like Separated from You (Scott Garrabrant): When an agent has uncertainty about things that either influenced which algorithm the agent is running (the agent's "past") or about things that will be affected by the agent's actions (the agent's "future"), you may not want to use Bayesian probability. Key quote: "The problem is that the standard justifications of Bayesian probability are in a framework where the facts that you are uncertain about are not in any way affected by whether or not you believe them!" This is not the case for events in the agent's "past" or "future". So, you should only use Bayesian probability for everything else, which are "space-like separated" from you (in analogy with space-like separation in relativity).

My opinion: I don't know much about the justifications for Bayesianism. However, I would expect any justification to break down once you start to allow for sentences where the agent's degree of belief in the sentence affects its truth value, so the post makes sense given that intuition.

Complete Class: Consequentialist Foundations (Abram Demski): An introduction to "complete class theorems", which can be used to motivate the use of probabilities and decision theory.

My opinion: This is cool, and I do want to learn more about complete class theorems. The post doesn't go into great detail on any of the theorems, but from what's there it seems like these theorems would be useful for figuring out what things we can argue from first principles (akin to the VNM theorem and dutch book arguments).

An Agent is a Worldline in Tegmark V (komponisto): Tegmark IV consists of all possible consistent mathematical structures. Tegmark V is an extension that also considers "impossible possible worlds", such as the world where 1+1=3. Agents are reasoning at the level of Tegmark V, because counterfactuals are considering these impossible possible worlds.

My opinion: I'm not really sure what you gain by thinking of an agent this way.

Interpretability

Measuring abstract reasoning in neural networks (David G. T. Barrett, Felix Hill, Adam Santoro et al)

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) (Been Kim et al)

Forecasting

Interpreting AI Compute Trends (Ryan Carey): A previous OpenAI post showed that the amount of compute used in the most expensive AI experiments has been growing exponentially for six years, with a doubling time of 3.5 months. This is extraordinarily fast, and can be thought of as a combination of growth in the amount spent on an experiment, and a decrease in the cost of computation. Such a trend can only continue for a few more years, before the cost of the experiment exceeds the budget of even the richest actors (such as the US government). However, this might still be enough to reach some important milestones for compute, such as "enough compute to simulate a human brain for 18 years", which is plausibly enough to get to AGI. (This would not happen for some of the larger estimates of the amount of computation in the human brain, but would happen for some of the smaller estimates.) It is still an open question which milestone we should care about.
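To get a feel for why this cannot continue for long, here is a small back-of-the-envelope calculation. The only number taken from the summary is the 3.5 month doubling time; the function names and the 1000x example are my own.

```python
import math


def growth_factor(months, doubling_time_months=3.5):
    """How much compute grows over `months` at a fixed doubling time."""
    return 2 ** (months / doubling_time_months)


def months_to_grow(factor, doubling_time_months=3.5):
    """How many months it takes for compute to grow by `factor`."""
    return doubling_time_months * math.log2(factor)


per_year = growth_factor(12)              # roughly 11x per year
months_for_1000x = months_to_grow(1000)   # a little under three years
```

At roughly an order of magnitude per year, any fixed budget ceiling is reached within a few years unless the cost of computation falls equally fast.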

My opinion: I generally agree with the post and its conclusions. I'm not sure how much to care about the compute milestones -- it still seems likely to me that we are bottlenecked on algorithms for general intelligence, but I don't think about this very much.

Read more: AI and Compute

Miscellaneous (Alignment)

Troubling Trends in ML Scholarship (Zachary C. Lipton and Jacob Steinhardt): Summarized in the highlights!

AISFP blog posts

The AI Summer Fellows Program had a day on which all participants wrote a blog post. I categorized some of these, but most defied easy categorization, so I've collected the rest here. I've also not read and summarized them as carefully as usual, since there were a lot of them and they weren't as polished as typical posts.

Clarifying Consequentialists in the Solomonoff Prior (vlad_m): In the universal Solomonoff prior, since there is no time bound on how long the Turing machines can run for, some short Turing machines could encode universes in which life develops, figures out that it can influence us through the prior, and starts to predict strings that we care about but changes them slightly to influence our decisions. These could be much shorter than the intended "natural" Turing machine that "correctly" predicts the string.

My opinion: This is a more accessible introduction to the weirdness of the universal prior than the original post, but I think it is missing a lot of details that were present, so if you're confused by some aspect, it may be worth checking out the original post.

Monk Treehouse: some problems defining simulation (dranorter): Some approaches to AI alignment require you to identify copies of programs in the environment, and it is not clear how to do this in full generality. Proposals so far have attempted to define two programs to be equivalent if they do the same thing now and would also do the same thing in counterfactual worlds. This post argues that such definitions don't work using an analogy where there are monks computing by moving heavy stones in a treehouse, that could unbalance it. In this setting, there are lots of checks and balances to make sure that the program does one and only one thing; any counterfactual you specify would lead to weird results (like the treehouse falling over from unbalanced stones, or monks noticing that something is off and correcting the result, etc.) and so it wouldn't be considered equivalent to the same program on a silicon-based computer.

My opinion: I don't know where a proposed definition is supposed to be used so it's hard for me to comment on how relevant this objection is.

Agents That Learn From Human Behavior Can't Learn Human Values That Humans Haven't Learned Yet (steven0461): Suppose Alice has moral uncertainty over five utility functions, and so optimizes a weighted combination of them; while Bob's true utility function is the same weighted combination of the utility functions. Alice and Bob will mostly act the same, and so a value learning agent wouldn't be able to distinguish between them.

My opinion: The post, and a comment, note that the difference between Alice and Bob is that if Alice received further information (from a moral philosopher, maybe), she'd start maximizing a specific one of the utility functions. The value learning agent could notice this and correctly infer the utility function. It could also actively propose the information to Alice and see how she responds.

Bounding Goodhart's Law (Eric Langlois): Derives a bound on the regret from having a misspecified reward function. Essentially the regret comes from two main sources -- cases where the misspecified reward assigns too high a reward to states that the misspecified policy visits a lot, and cases where the misspecified reward assigns too low a reward to states that the true best policy would visit a lot. The post also proposes an algorithm for reward learning that takes into account these insights.
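The two sources of regret can be illustrated with a toy example. This is a standard decomposition and not necessarily the exact bound derived in the post; the three-state setup, the rewards, and the policies (each summarized by how often it visits each state) are all hypothetical.

```python
def expected_return(visitation, reward):
    """Return of a policy summarized by its state visitation distribution."""
    return sum(d * r for d, r in zip(visitation, reward))


# Hypothetical 3-state example.
true_reward = [1.0, 0.0, 0.5]
proxy_reward = [0.2, 0.9, 0.5]    # the misspecified reward
policies = {
    "a": [0.8, 0.1, 0.1],
    "b": [0.1, 0.8, 0.1],
    "c": [0.2, 0.2, 0.6],
}

best_true = max(policies, key=lambda p: expected_return(policies[p], true_reward))
best_proxy = max(policies, key=lambda p: expected_return(policies[p], proxy_reward))

# Regret: true return lost by optimizing the proxy instead of the true reward.
regret = (expected_return(policies[best_true], true_reward)
          - expected_return(policies[best_proxy], true_reward))

# Source 1: proxy is too high on states the proxy-optimal policy visits.
overestimate = sum(d * (rp - rt) for d, rp, rt
                   in zip(policies[best_proxy], proxy_reward, true_reward))
# Source 2: proxy is too low on states the truly optimal policy visits.
underestimate = sum(d * (rt - rp) for d, rp, rt
                    in zip(policies[best_true], proxy_reward, true_reward))
bound = overestimate + underestimate
```

The inequality `regret <= bound` holds in general here because the proxy-optimal policy scores at least as well as the truly optimal policy under the proxy reward, so the only remaining gaps are the two terms above.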

My opinion: This shows that, for a particular class of ways that the reward function might be wrong, the corresponding policy is still only slightly suboptimal. This doesn't make me feel much better about having an incorrect reward function, as it feels like this class of ways is a small subset of the ways in which we'll be wrong in practice. I do think the identification of what has to go wrong for really bad outcomes is useful, and I'd be interested to see experiments with the proposed reward learning algorithm.

An environment for studying counterfactuals (Nisan): Proposes a class of environments in which the agent is tasked with predicting the utility of every action, in addition to maximizing expected utility. It is evaluated on the utility achieved as well as correctly predicting the utility it gets. Epsilon-exploration is required, so for every action there is always some chance that the agent will be tested on predicting the utility of that action. The agent is also provided a prior P about the world, including what the agent will do (which exists due to a fixed-point theorem).

My opinion: I'm confused (I'm not an expert in this field), but I'm not sure what I'm confused about. Is there a dynamics model? Given that the agent gets access to a prior, can it find Pr(U | o, a) and choose the a with maximum expected utility? Why are we including reflection? There are often many fixed points, which one do we pick?

Dependent Type Theory and Zero-Shot Reasoning (evhub): Humans can do zero-shot reasoning (in the sense of writing down proofs) by "running a type checker in their head" (analogous to theorem provers like Lean). The post gives an example of this, using Lean syntax. However, humans seem to have very different ways of thinking -- for example, you could either generate ideas for solutions to a problem, see if they work, and iterate, or you could start proving some facts about the problem, and keep on proving things until you have proved a solution. These feel like many-shot reasoning and zero-shot reasoning respectively, even though they are both attempting a zero-shot task. This is one way to understand the difference between Iterated distillation and amplification, and Agent foundations -- the former is many-shot and the latter is zero-shot, even though both are attempting a zero-shot task.

My opinion: I found the part about how people prove things to be the most interesting part of the post, because my own method seems different from both. I usually alternate between searching for solutions, counterexamples to solutions, and proving that solutions must satisfy some property.

Conditioning, Counterfactuals, Exploration, and Gears (Diffractor): One way that you can think about counterfactuals is to condition on some low probability state, and then look at the probability distribution that implies. This seems like the most general version of counterfactuals, but it doesn't match what we intuitively mean by counterfactuals, which is more like "suppose that by fiat this constraint were met, but don't consider what would have caused it, now predict the consequences". This sort of imputing only works because there are very simple rules governing our universe, so that there are strong correlations between different experiences and so it actually is possible to generalize to very new situations. It seems very important to use this idea in order to advance beyond epsilon-exploration for new situations.

My opinion: I agree that this is an important idea, and it has arisen elsewhere -- in ML, this is part of the thinking on the problem of generalization. There are no-free-lunch theorems that say you cannot do well in arbitrary environments, where the constructions typically violate the "strong correlation between different experiences" heuristic. In philosophy, this is the problem of induction.

Read more: Don't Condition on no Catastrophes

A framework for thinking about wireheading (theotherotheralex): Humans don't wirehead (take heroin, which gives huge positive reward) because it does not further their current goals. Maybe analogously we could design an AI that realizes that wireheading would not help it achieve its current goals and so wouldn't wirehead.

My opinion: I think this is anthropomorphizing the AI too much. To the extent that a (current) reinforcement learning system can be said to "have goals", the goal is to maximize reward, so wireheading actually is furthering its current goal. It might be that in the future the systems we design are more analogous to humans and then such an approach might be useful.

Logical Uncertainty and Functional Decision Theory (swordsintoploughshares): If an agent has logical uncertainty about what action it will take, then the agent seems more likely to reason about counterfactuals correctly. For example, this would likely solve the 5-and-10 problem. Without logical uncertainty, an agent that knows about itself can be one of many different fixed points, many of which can be quite bad.

My opinion: This isn't my area of expertise, but it seems right. It feels very weird to claim that having more knowledge makes you worse off in general, but doesn't seem impossible.

Choosing to Choose? (Whispermute): If it is possible for your utility function to change, then should you optimize for your current utility function, or your expected future utility function? The post gives an argument for both sides, and ultimately says that you should optimize for your current utility function, but notes some problems with the proposed argument for it.

My opinion: I think that it is correct to optimize for your current utility function, and I didn't find the argument for the other side convincing (and wrote a comment on the post with more details).

No, I won't go there, it feels like you're trying to Pascal-mug me (Rupert): One explanation for why Pascal's mugging feels intuitively wrong is that if we were to pay the mugger, we would open ourselves up to exploitation by any other agent. Logical induction puts uncertainties on statements in such a way that it isn't exploitable by polynomial-time traders. Perhaps there is a connection here that can help us create AIs that don't get mugged.

My opinion: Non-exploitability is my preferred resolution to Pascal's mugging. However, it seems like such an obvious solution, yet there's very little discussion of it, which makes me think that there's some fatal flaw that I'm not seeing.

Conditions under which misaligned subagents can (not) arise in classifiers (anon1): Agents or subagents with "goals" are only likely to arise when you are considering tasks where it is important to keep state/memory, because past inputs are informative about future inputs. So, unaligned subagents are unlikely to arise for eg. classification tasks where it is not necessary to model how things change over time.

My opinion: I do think that classifiers with a bounded task that run for a bounded amount of time are unlikely to develop unaligned subagents with memory. However, I still feel very unclear on the term "unaligned subagent", so I'm not very confident in this assessment.

Probability is fake, frequency is real and Repeated (and improved) Sleeping Beauty problem (Linda Linsefors): Attacks the Sleeping Beauty problem in anthropics.

My opinion: Anthropics confuses me and I haven't prioritized understanding it yet, so I'm going to abstain.

Decision-theoretic problems and Theories; An (Incomplete) comparative list (somervta): It's just what it says in the title -- a list of problems in decision theory, and what particular decision theories recommend for those problems.

Mathematical Mindset (komponisto): Introduces a new term, "mathematical mindset", which is about finding good definitions or models that make it easier for you to reason about them. For example, you expect proofs with a newer definition to be shorter or more general. Key quote: "Having a “mathematical mindset” means being comfortable with words being redefined. This is because it means being comfortable with models being upgraded -- in particular, with models being related and compared to each other: the activity of theorization."

My opinion: I'm all for having better definitions that make things clearer and easier to reason about. I don't know if "ease of proofs" is the right thing to aim for -- "ease of reasoning" is closer to what I care about, even if it's informal reasoning.

The Intentional Agency Experiment (Self-Embedded Agent): In order to determine whether an agent has some intention, we can check to see whether the agent would take actions that achieve the intent under a wide range of circumstances (either counterfactuals, or actual changes to the environment). For example, to show that an ant has agency and intends to find sugar, we could block its route to the sugar and notice that it finds a path around the obstacle.

My opinion: The motivation was to use this to deduce the intentions of a superintelligent AI system, but it seems that such an AI system could figure out it is being tested and respond in the "expected" way.

Two agents can have the same source code and optimise different utility functions (Joar Skalse): Even if you have two agents with identical source code, their goals are in relation to themselves, so each agent will, for example, try to gain resources for itself. Since the two agents are now competing, they clearly have different utility functions.

My opinion: I'm somewhat confused -- I'm not sure what the point is here.

Alignment problems for economists (Chavam): What AI alignment problems could we outsource to economists? There are some who would be interested in working on alignment, but don't because it would be too much of a career risk.

My opinion: Unfortunately, the "desirable properties" for these problems all seem to conspire to make any particular problem fairly low impact.

On the Role of Counterfactuals in Learning (Max Kanwal): This post hypothesizes that since humans are computationally bounded, we infer causal models using approximate inference (eg. Gibbs sampling), as opposed to a full Bayesian update. However, approximate inference algorithms depend a lot on choosing a good initialization. Counterfactuals fill this role.

My opinion: I think I've summarized this post badly, because I didn't really understand it. In particular, I didn't understand the jump from "humans do approximate inference over the space of models" to "counterfactuals form the initialization".

A universal score for optimizers (levin): We can measure the optimization power of an agent as the log probability that a random agent matches the outcome that the agent achieves.
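As a rough sketch of what such a score could look like, the code below makes several assumptions that the summary does not pin down: the random agent picks uniformly from a finite action set, "matches" means "does at least as well as", and the score is the negative log (base 2) of that probability so that stronger optimizers get larger scores.

```python
import math


def optimization_score(achieved_utility, action_utilities):
    """Negative log2 probability that a uniformly random action does at least as well.

    Assumes a finite action set and a uniform random agent (both assumptions).
    """
    matching = sum(1 for u in action_utilities if u >= achieved_utility)
    if matching == 0:
        raise ValueError("achieved utility is not attainable by any listed action")
    return -math.log2(matching / len(action_utilities))


# Hypothetical action set: eight actions with utilities 0..7.
utilities = [0, 1, 2, 3, 4, 5, 6, 7]
score_best = optimization_score(7, utilities)     # only 1 of 8 actions matches
score_middle = optimization_score(4, utilities)   # 4 of 8 actions match
score_worst = optimization_score(0, utilities)    # every action matches
```

The score only uses comparisons between utilities, and it changes if actions are added to or removed from `utilities`.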

My opinion: Seems like a reasonable starting point to measure optimization power. As Alex Mennen notes, it's dependent on the specific action set chosen, and doesn't take into account the strength of preferences, only their ranking.

Conceptual problems with utility functions (Dacyn): It's strange to use utility functions to model agents, because often utility functions do not determine the outcome in games with multiple agents, such as the Ultimatum game, and we have to resolve the situation with "meta-values" like fairness. So, instead of using utility functions, we need to have a conception of agency where the "values" are part of the decisionmaking process.

My opinion: In any formally defined game where the agent has perfect information (including about other agents), utility functions do in fact determine what an agent should do -- but in many cases, this seems to go against our intuitions (as in the Ultimatum game, for example). I don't think that the way to resolve this is to introduce more values; I think it is that the maximization step in maximizing expected utility depends a lot on the environment you're in, and any formalization is going to miss out on some important aspects of the real world, leading to different answers. (For example, in a repeated Ultimatum game, I would expect fairness to arise naturally.)

Near-term concerns

Adversarial examples

Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations (Dan Hendrycks et al): See Import AI.

AI strategy and policy

State of AI (Nathan Benaich and Ian Hogarth)

AI capabilities

Reinforcement learning

The Pursuit of (Robotic) Happiness: How TRPO and PPO Stabilize Policy Gradient Methods (Cody Marie Wild): I barely looked at this -- I think it's an introduction to policy gradient methods for reinforcement learning. It assumes very little background (less than I assume in these summaries).

The Uncertainty Bellman Equation and Exploration (Brendan O’Donoghue et al)

Counterfactual Multi-Agent Policy Gradients (Jakob N. Foerster, Gregory Farquhar et al)

Deep learning

Feature-wise transformations (Vincent Dumoulin et al): Summarized in the highlights!

Glow: Better Reversible Generative Models (Prafulla Dhariwal et al): A generative model here means something that models the data distribution, including any underlying structure. For example, a generative model for images would let you generate new images that you hadn't seen during training. While we normally hear of GANs and VAEs for current generative models, this work builds on reversible or flow-based generative models. Similarly to word vectors, we can find directions in the learned embedding space corresponding to natural categories (such as "hair color"), and manipulate an image by first encoding to the embedding space, then adding one of these directions, and then decoding it back to the manipulated image.
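The encode, shift, decode recipe can be sketched with a toy stand-in. This is not Glow: a single invertible 2x2 affine map plays the role of the flow, "images" are 2-dimensional points, and the direction is taken to be the difference between the mean latents of examples with and without the attribute (an assumed, simple way to get such a direction).

```python
# Toy invertible "flow": z = A x + SHIFT, with A_INV the exact inverse of A.
A = [[2.0, 1.0], [1.0, 1.0]]
A_INV = [[1.0, -1.0], [-1.0, 2.0]]
SHIFT = [0.5, -0.5]


def encode(x):
    """Map a data point to the latent space."""
    return [sum(a * v for a, v in zip(row, x)) + s for row, s in zip(A, SHIFT)]


def decode(z):
    """Exact inverse of encode, which is what reversibility buys us."""
    centered = [v - s for v, s in zip(z, SHIFT)]
    return [sum(a * v for a, v in zip(row, centered)) for row in A_INV]


def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def attribute_direction(with_attribute, without_attribute):
    """Latent direction: mean latent with the attribute minus mean latent without."""
    positive = mean([encode(x) for x in with_attribute])
    negative = mean([encode(x) for x in without_attribute])
    return [p - n for p, n in zip(positive, negative)]


def manipulate(x, direction, strength=1.0):
    """Encode, move along the direction, and decode back to data space."""
    z = encode(x)
    return decode([v + strength * d for v, d in zip(z, direction)])


# Hypothetical data: the attribute corresponds to a larger first coordinate.
direction = attribute_direction([[2.0, 0.0], [2.0, 2.0]], [[0.0, 0.0], [0.0, 2.0]])
edited = manipulate([1.0, 1.0], direction)
```

Because the toy map is affine, the edit is just a shift in data space; with a deep nonlinear flow the same three steps would produce a more interesting change.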

My opinion: This seems cool but I'm not very familiar with this area so I don't have a strong opinion. The algorithm seemed weirdly complicated to me but I think it's based on previous work, and I only spent a couple of minutes looking at it.

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution (Rosanne Liu et al)

NLP's ImageNet moment has arrived (Sebastian Ruder)

News

Conference on Fairness, Accountability, and Transparency (FAT*): ... will be held early 2019 in Atlanta, Georgia. Abstract pre-registration deadline is August 16.

RAISE is hiring (Toon): ... for full-time content developers, to work at the EA Hotel in Blackpool.