Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.


HARK Side of Deep Learning -- From Grad Student Descent to Automated Machine Learning (Oguzhan Gencoglu et al): This paper focuses on the negative effects of Hypothesizing After the Results are Known (HARKing), a pattern in which researchers first conduct experiments and view the results, and only once the results have hit the bar to be publishable construct a hypothesis after the fact to explain them. It argues that HARKing is common in machine learning, and that this has negative effects on the field as a whole. First, improvements to state-of-the-art (SotA) may be questionable because they could have been caused by sufficient hyperparameter tuning via grad student descent, instead of the new idea in a paper to which the gain is attributed. Second, there is publication bias since only positive results are reported in conferences, which prevents us from learning from negative results. Third, hypotheses that are tailored to fit results for a single dataset or task are much less likely to generalize to new datasets or tasks. Fourth, while AutoML systems achieve good results, we cannot figure out what makes them work because the high compute requirements make ablation studies much harder to perform. Finally, the authors argue that we need to fix HARKing in order to achieve things like ethical AI, human-centric AI, reproducible AI, etc.
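To make the first point concrete, here is a minimal simulation sketch. It is not from the paper: it assumes that a baseline and a "new" method have exactly the same true accuracy, that each training run adds Gaussian noise, and that only the new method gets many tuning runs. The function names and all the numbers are hypothetical.

```python
import random

def tuned_score(true_accuracy, noise_std, num_trials, rng):
    """Best observed score over num_trials noisy runs of the same method."""
    return max(rng.gauss(true_accuracy, noise_std) for _ in range(num_trials))

def simulate(num_papers=2000, true_accuracy=0.90, noise_std=0.01,
             tuning_trials=50, seed=0):
    """Average apparent gain of a 'new' method that is no better than the baseline."""
    rng = random.Random(seed)
    gains = []
    for _ in range(num_papers):
        # The baseline is run once; the proposed method is tuned many times
        # and only the best run is reported.
        baseline = tuned_score(true_accuracy, noise_std, 1, rng)
        proposed = tuned_score(true_accuracy, noise_std, tuning_trials, rng)
        gains.append(proposed - baseline)
    return sum(gains) / len(gains)

average_gain = simulate()
no_tuning_gain = simulate(tuning_trials=1)
print(f"apparent gain with tuning:    {average_gain:.4f}")
print(f"apparent gain without tuning: {no_tuning_gain:.4f}")
```

Under these assumptions the reported gain comes entirely from taking the best of many runs, which is the kind of improvement that is hard to tell apart from a real one without controlled comparisons.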

Rohin's opinion: I believe that I found this paper the very first time I looked for generic new interesting papers after I started thinking about this problem, which was quite the coincidence. I'm really happy that the authors wrote the paper -- it's not in their incentives (as far as I can tell), but the topic seems crucial to address.

That said, I disagree with the paper on a few counts. The authors don't acknowledge the value of HARKing -- often it is useful to run many experiments and see what happens in order to develop a good theory. Humans are not ideal Bayesian reasoners who can consider all hypotheses at once; we often require many observations in order to even hypothesize a theory. The authors make the point that in other fields HARKing leads to bad results, but ML is significantly different in that we can run experiments much faster with a much higher iteration speed.

If we were instead forced to preregister studies, as the authors suggest, the iteration speed would drop by an order of magnitude or two; I seriously doubt that the benefits would outweigh the cost of lower iteration speed. Instead of preregistering all experiments, maybe researchers could run experiments and observe results, formulate a theory, and then preregister an experiment that would test the theory -- but in this case I would expect that researchers end up "preregistering" experiments that are very similar to the experiments that generated the theory, such that the results are very likely to come out in support of the theory.

(This does not require any active malice on the part of the researchers -- it's natural to think of predictions of the theory in the domain where you developed the theory. For example, in our recent paper (AN #45), we explicitly designed four environments where we expected our method to work and one where it wouldn't.)

Another point: I think that the underlying cause of HARKing is the incentive to chase SotA, and if I were writing this paper I would focus on that. For example, I believe that the bias towards SotA chasing causes HARKing, and not the other way around. (I'm not sure if the authors believe otherwise; the paper isn't very clear on this point.) This is also a more direct explanation of results being caused by grad student descent or hyperparameter tuning; the HARKing in such papers occurs because it isn't acceptable to say "we obtained this result via grad student descent", because that would not be a contribution to the field.

Although I've been critiquing the paper, overall I find my beliefs much closer to the authors' than the "beliefs of the field". (Not the beliefs of researchers in the field: I suspect many researchers would agree that HARKing has negative effects, even though the incentives force researchers to do so in order to get papers published.) I'd be interested in exploring the topic further, but don't have enough time to do so myself -- if you're interested in building toy models of the research field and modeling the effect of interventions on the field, reply to this email and we can see if it would make sense to collaborate.

Technical AI alignment


Agency Failure AI Apocalypse? (Robin Hanson): This is a response to More realistic tales of doom (AN #50), arguing that the scenarios described in the post are unrealistic given what we know about principal-agent problems. In a typical principal-agent problem, the principal doesn't know everything about the agent, and the agent can use this fact to gain "agency rents" where it can gain extra value for itself, or there could be an "agency failure" where the principal doesn't get as much as they want. For example, an employee might spend half of their day browsing the web, because their manager can't tell that that's what they are doing. Our economic literature on principal-agent problems suggests that agency problems get harder with more information asymmetry, more noise in outcomes, etc. but not with smarter agents, and in any case we typically see limited agency rents and failures. So, it's unlikely that the case for AI will be any different, and while it's good to have a couple of people keeping an eye on the problem, it's not worth the large investment of resources from future-oriented people that we currently see.

Rohin's opinion: I have a bunch of complicated thoughts on this post, many of which were said in Paul's comment reply to the post, but I'll say a few things. Firstly, I think that if you want to view the AI alignment problem in the context of the principal-agent literature, the natural way to think about it is with the principal being less rational than the agent. I claim that it is at least conceivable that an AI system could make humans worse off, but the standard principal-agent model cannot accommodate such a scenario because it assumes the principal is rational, which means the principal always does at least as well as not ceding any control to the agent at all. More importantly, although I'm not too familiar with the principal-agent literature, I'm guessing that the literature assumes the presence of norms, laws and institutions that constrain both the principal and the agent, and in such cases it makes sense that the loss that the principal could incur would be bounded -- but it's not obvious that this would hold for sufficiently powerful AI systems.

Learning human intent

Batch Active Preference-Based Learning of Reward Functions (Erdem Bıyık et al) (summarized by Cody): This paper builds on a trend of recent papers that try to learn human preferences, not through demonstrations of optimal behavior, but through a human expressing a preference over two possible trajectories, which has both pragmatic advantages (re limits of human optimality) and theoretic ones (better ability to extrapolate a reward function). Here, the task is framed as: we want to send humans batches of paired trajectories to rank, but which ones? Batch learning is preferable to single-sample active learning because it's more efficient to update a network after a batch of human judgments, rather than after each single one. This adds complexity to the problem because you'd prefer to not have a batch of samples that are individually high-expected-information, but which are redundant with one another. The authors define an information criterion (basically the examples about which we're most uncertain of the human's judgment) and then pick a batch of examples based on different heuristics for getting a set of trajectories with high information content that are separated from each other in feature space.
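As a rough illustration of the "informative but not redundant" idea, here is a minimal sketch of one possible greedy heuristic. It is not the authors' algorithm: it assumes each candidate query already comes with a feature vector and an uncertainty score, and the function name, the `min_distance` threshold and the example data are all hypothetical.

```python
import math

def select_batch(candidates, batch_size, min_distance):
    """Greedily pick high-uncertainty queries that are spread out in feature space.

    candidates: list of (feature_vector, uncertainty) pairs.
    Returns the indices of the chosen candidates.
    """
    # Consider the most uncertain queries first.
    ranked = sorted(range(len(candidates)),
                    key=lambda i: candidates[i][1], reverse=True)
    chosen = []
    for i in ranked:
        features = candidates[i][0]
        # Skip queries that sit too close to one we already picked.
        if all(math.dist(features, candidates[j][0]) >= min_distance
               for j in chosen):
            chosen.append(i)
        if len(chosen) == batch_size:
            break
    return chosen

queries = [
    ((0.0, 0.0), 0.95),
    ((0.1, 0.0), 0.94),  # nearly a duplicate of the first query
    ((1.0, 1.0), 0.80),
    ((1.0, 0.0), 0.70),
]
batch = select_batch(queries, batch_size=2, min_distance=0.5)
print(batch)  # [0, 2]
```

The second query is skipped even though it is almost as uncertain as the first, because answering it would tell us little that the first does not. Note that the sketch relies on Euclidean distance between feature vectors being a meaningful notion of similarity.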

Cody's opinion: This is an elegant paper that makes good use of the toolkit of active learning for human preference solicitation, but its batch heuristics are all very reliant on having a set of high level trajectory features in which Euclidean distance between points is a meaningful similarity metric, which feels like a constraint that is not impossible to generalize but still somewhat limiting.

Prerequisites: Active Preference-Based Learning of Reward Functions (Recon #5)

Training human models is an unsolved problem (Charlie Steiner)

Other progress in AI

Reinforcement learning

NeurIPS 2019 Competition: The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors (William H. Guss et al): In this challenge, which is slated to start on June 1, competitors will try to build agents that obtain a diamond in Minecraft, without using too much environment interaction. This is an incredibly difficult task: in order to make this feasible, the competition also provides a large amount of human demonstrations. They also have a list of simpler tasks that will likely be prerequisites to obtaining a diamond, such as navigating, chopping trees, obtaining an iron pickaxe, and obtaining cooked meat, for which they also collect demonstrations of human gameplay. As the name suggests, the authors hope that the competition will spur researchers into embedding human priors into general algorithms in order to get sample efficient learning.

Rohin's opinion: I really like the potential of Minecraft as a deep RL research environment, and I'm glad that there's finally a benchmark / competition that takes advantage of Minecraft being very open world and hierarchical. The tasks that they define are very challenging; there are ways in which it is harder than Dota (no self-play curriculum, learning from pixels instead of states, more explicit hierarchy) and ways in which it is easier (slightly shorter episodes, smaller action space, don't have to be adaptive based on opponents). Of course, the hope is that with demonstrations of human gameplay, it will not be necessary to use as much compute as was necessary to solve Dota (AN #54).

I also like the emphasis on how to leverage human priors within general learning algorithms: I share the authors' intuition that human priors can lead to significant gains in sample efficiency. I suspect that, at least for the near future, many of the most important applications of AI will either involve hardcoded structure imposed by humans, or will involve general algorithms that leverage human priors, rather than being learned "from scratch" via e.g. RL.

Toybox: A Suite of Environments for Experimental Evaluation of Deep Reinforcement Learning (Emma Tosch et al): Toybox is a reimplementation of three Atari games (Breakout, Amidar and Space Invaders) that enables researchers to customize the games themselves in order to perform better experimental evaluations of RL agents. They demonstrate its utility using a case study for each game. For example, in Breakout we often hear that the agents learn to "tunnel" through the layer of bricks so that the ball bounces around the top of the screen destroying many bricks. To test whether the agent has learned a robust tunneling behavior, they train an agent normally, and then at test time they remove all but one brick of a column and see if the agent quickly destroys the last brick to create a tunnel. It turns out that the agent only does this for the center column, and sometimes for the one directly to its left.

Rohin's opinion: I really like the idea of being able to easily test whether an agent has robustly learned a behavior or not. To some extent, all of the transfer learning environments are also doing this, such as CoinRun (AN #36) and the Retro Contest (AN #1): if the learned behavior is not robust, then the agent will not perform well in the transfer environment. But with Toybox it looks like researchers will be able to run much more granular experiments looking at specific behaviors.

Smoothing Policies and Safe Policy Gradients (Matteo Papini et al)

Deep learning

Generative Modeling with Sparse Transformers (Rewon Child et al) (summarized by Cody): I see this paper as trying to interpolate the space between convolution (fixed receptive field, number of layers needed to gain visibility to the whole sequence grows with sequence length) and attention (visibility to the entire sequence at each operation, but n^2 memory and compute scaling with sequence length, since each new element needs to query and be queried by each other element). This is done by creating chains of operations that are more efficient, and can offer visibility to the whole sequence in k steps, rather than in the single step (k=1) of normal attention. An example of this is one attention step that pulls in information from the last 7 elements, and then a second that pulls in information from each 7th element back in time (the "aggregation points" of the first operation).
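The two-step example can be checked with a small sketch. This is only an illustration of the connectivity pattern described above, not the paper's implementation: it assumes causal attention, treats "the last 7 elements" as including the current position, and uses hypothetical function names.

```python
def local_targets(i, stride):
    """Positions the first step attends to: the last `stride` elements, including i."""
    return set(range(max(0, i - stride + 1), i + 1))

def strided_targets(i, stride):
    """Positions the second step attends to: every `stride`-th element back from i."""
    return {j for j in range(i + 1) if (i - j) % stride == 0}

def reachable_in_two_steps(i, stride):
    """Positions whose information can reach i via a local step, then a strided step."""
    return {j
            for m in strided_targets(i, stride)
            for j in local_targets(m, stride)}

seq_len, stride = 49, 7

# Every position can still see its entire past after the two steps.
full_coverage = all(
    reachable_in_two_steps(i, stride) == set(range(i + 1))
    for i in range(seq_len)
)

# Count attention links used by the sparse pattern versus dense causal attention.
sparse_links = sum(
    len(local_targets(i, stride)) + len(strided_targets(i, stride))
    for i in range(seq_len)
)
dense_links = seq_len * (seq_len + 1) // 2

print(full_coverage, sparse_links, dense_links)
```

Each position attends to roughly stride + i/stride earlier positions rather than all i of them, so choosing a stride near the square root of the sequence length keeps the total number of links well below the quadratic cost of dense attention.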

Cody's opinion: I find this paper really clever and potentially quite high-impact, since Transformers are so widely used, and this paper could offer a substantial speedup without much theoretical loss of information. I also just enjoyed having to think more about the trade-offs between convolutions, RNNs, and transformers, and how to get access to different points along those tradeoff curves.

Introducing Translatotron: An End-to-End Speech-to-Speech Translation Model (Ye Jia et al): This post introduces Translatotron, a system that takes speech (not text!) in one language and translates it to another language. This is in contrast to most current "cascaded" systems, which typically go from speech to text, then translate to the other language, and then go back from text to speech. While Translatotron doesn't beat current systems, it demonstrates the feasibility of this approach.

Rohin's opinion: Machine translation used to be done in multiple stages (involving parse trees as an intermediate representation), and then it was done better using end-to-end training of a deep neural net. This looks like the beginning of the same process for speech-to-speech translation. I'm not sure how much people care about speech-to-speech translation, but if it's an important problem, I'd expect the direct speech-to-speech systems to outperform the cascaded approach relatively soon. I'm particularly interested to see whether you can "bootstrap" by using the cascaded approach to generate training data for the end-to-end approach, and then finetune the end-to-end approach on the direct speech-to-speech data that's available to improve performance further.

A Recipe for Training Neural Networks (Andrej Karpathy): This is a great post detailing how to train neural networks in practice when you want to do anything more complicated than training the most common architecture on the most common dataset. For all of you readers who are training neural nets, I strongly recommend this post; the reason I'm not summarizing it in depth is because a) it would be a really long summary and b) it's not that related to AI alignment.

Meta learning

Meta-learners' learning dynamics are unlike learners' (Neil C. Rabinowitz) (summarized by Cody): We've seen evidence in prior work that meta learning models can be trained to more quickly learn tasks drawn from some task distribution, by training a model in the inner loop and optimizing against generalization error. This paper suggests that meta learning doesn't just learn new tasks faster, but has a different ordered pattern of how it masters the task. Where a "normal" learner first learns the low-frequency modes (think SVD modes, or Fourier modes) of a simple regression task, and later the high-frequency ones, the meta learner makes progress on all the modes at the same relative rate. This meta learning behavior seems to theoretically match the way a learner would update on new information if it had the "correct" prior (i.e. the one actually used to generate the simulated tasks).
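For intuition about why a "normal" learner masters modes at different times, here is a minimal sketch. It is not taken from the paper: it assumes plain gradient descent on a quadratic loss that is already diagonal in its modes, so that each mode's error shrinks by a factor of (1 - learning_rate * curvature) per step. The curvature values and the function name are hypothetical, and which modes end up being the "early" ones depends on the task.

```python
def mode_errors(curvatures, learning_rate, steps):
    """Track the per-mode error under gradient descent on a diagonal quadratic loss.

    The loss is 0.5 * sum(s * e**2) over modes, so each gradient step multiplies
    the error e of a mode with curvature s by (1 - learning_rate * s).
    """
    errors = [1.0 for _ in curvatures]  # start with unit error in every mode
    history = [list(errors)]
    for _ in range(steps):
        errors = [e * (1.0 - learning_rate * s)
                  for e, s in zip(errors, curvatures)]
        history.append(list(errors))
    return history

history = mode_errors(curvatures=[1.0, 0.1], learning_rate=0.5, steps=20)
for step in (0, 5, 20):
    strong, weak = history[step]
    print(f"step {step:2d}: strong-mode error {strong:.4f}, "
          f"weak-mode error {weak:.4f}")
```

In this toy setting the strong mode is essentially learned before the weak mode has made much progress. The behavior described for the meta learner is different: progress on all modes at the same relative rate.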

Cody's opinion: Overall I like this paper's simplicity and focus on understanding how meta learning systems work. I did find the reinforcement learning experiment a bit more difficult to parse and connect to the linear and nonlinear regression experiments, and, of course, there's always the question with work on simpler problems like this of whether the intuition extends to more complex ones.

Read more: Cody's longer summary

Hierarchical RL

Multitask Soft Option Learning (Maximilian Igl et al) (summarized by Cody): This paper is a mix of variational inference and hierarchical reinforcement learning, in the context of learning skills that can be reused across tasks. Instead of learning a fixed set of options (read: skills/subpolicies), and a master task-specific policy to switch between them, this method learns cross-task priors for each skill, and then learns a task-specific posterior using reward signal from the task, but regularized towards the prior. The hope is that this will allow for an intermediary between cross-task transfer and single-task specificity.

Cody's opinion: I found this paper interesting, but also found it a bit tricky/unintuitive to read, since it used a different RL frame than I'm used to (the idea of minimizing the KL divergence between your trajectory distribution and the optimal trajectory distribution). Overall, it seems like a reasonable method, but it is a bit hard to intuitively tell how strong the theoretical advantages are on these relatively simple tasks.


Relevant to Robin Hanson's point: I've argued here that agents betraying their principals happens in politics all the time, sometimes with disastrous results. By restricting to the economic literature on this problem, we're only looking at a small subset of "agency problems", and implicitly assuming that institutions are sufficiently strong to detect and deter bad behaviour of very powerful AI agents - which is not at all evident.

Yay, I'm in the thing!

I have little idea if people have found my recent posts interesting or useful, or how they'd like them to be improved. I have a bunch of wilder speculation that piles up in unpublished drafts, and once I see an idea getting used or restated in multiple drafts, that's what I actually post.

Fyi, I included it because it argues for a conclusion that is relevant to AI alignment. The reason I didn't summarize it is because it seemed to say the same things as The easy goal inference problem is still hard (particularly the part about mistake models and how training for predictive accuracy gets you to human performance and no more) and the many posts on how human brains probably are not goals + optimization but are better modeled as e.g. systems of competing agents.

Thanks, this is actually really useful feedback. As the author, I "see" the differences and the ways in which I'm responding to other people, but it also makes sense to me why you'd say they're very similar. The only time I explicitly contrast what I'm saying in that post with anything else is... contrasting with my own earlier view.

From my perspective where I already know what I'm thinking, I'm building up from the basics to frame problems in what I think is a useful way that immediately suggests some of the other things I've been thinking. From your perspective, if there's something novel there, I clearly need to turn up the contrast knob.

Instead of preregistering all experiments, maybe researchers could run experiments and observe results, formulate a theory, and then preregister an experiment that would test the theory—but in this case I would expect that researchers end up “preregistering” experiments that are very similar to the experiments that generated the theory, such that the results are very likely to come out in support of the theory.

Why would you expect this? Assuming you are not suggesting "what if the researchers lie and say they did the experiment again when they didn't", then doing a similar experiment again is called "replication". If the initial result was caused by p-hacking, then the similar experiment won't support the theory. This is why we do replication.

Also, I notice the term "p-hacking" appears nowhere in your post.

So I don't really like the terminology of "p-hacking" and "replication", and I'm going to use different terminology here that I find more precise and accurate; I don't know how much I need to explain so you should ask if any particular term is unclear.

Also, I notice the term "p-hacking" appears nowhere in your post.

Indeed, because I don't think it's the relevant concept for ML research. I'm more concerned about the garden of forking paths. (I also strongly recommend Andrew Gelman's blog, which has shaped my opinions on this topic a lot.)

Doing a similar experiment again is called "replication". If the initial result was caused by p-hacking, then the similar experiment won't support the theory. This is why we do replication.

Almost all ML experiments will replicate, for some definition of "replicate" -- if you run the same code, even with different random seeds, you will usually get the same results. (I would guess it is fairly rare for researchers to overfit to particular seeds, though it can happen.)

In research involving humans, usually some experimental details will vary, just by accident, and that variation leads to a natural "robustness check" that gives us a tiny bit of information about how externally valid the result is. We might not get even that with ML.

It's also worth noting that even outside of ML, what does or doesn't count as a "replication" varies widely, so I prefer not to use the term at all, and instead talk about how a separate experiment gives us information about the validity and generalization properties of a particular theory.

Why would you expect this?

Hopefully it's a bit clearer now. Given what I've said in this comment, I might now restate the sentence as "I expect that the preregistered experiments will not be sufficiently different from the experiments the researchers ran themselves. So, I expect the results to be of limited use in informing me about the external validity of the experiment and the generalization properties of their theory."

It seems to me that if you expect that the results of your experiment can be useful in and generalized to other situations, then it has to be possible to replicate it. Or to put it another way, if the principle you discovered is useful for more than running the same program with a different seed, shouldn't it be possible to test it by some means other than running the same program with a different seed?

Or to put it another way, if the principle you discovered is useful for more than running the same program with a different seed, shouldn't it be possible to test it by some means other than running the same program with a different seed?

Certainly. But even if the results are not useful and can't be generalized to other situations, it's probably possible to replicate it, in a way that's slightly different from running the same program with a different seed. (E.g. you could run the same algorithm on a different environment that was constructed to be the kind of environment that algorithm could solve.) So this wouldn't work as a test to distinguish between useful results and non-useful results.

Relevant recent Andrew Gelman blog post