Lawrence Chan


(Most of my comment was ninja'ed by Paul)

I'll add that I'm pretty sure that RL is doing something. The authors claim that no one has applied search methods to 4x4 matrix multiplication or larger, and the branching factor of brute-force search without a strong heuristic grows something like the 6th power of n? So it seems doubtful that search methods will scale.

That being said, I agree that it's a bit odd not to do a head-to-head comparison at equal compute. The authors just cite related work (which uses much less compute) and claim superiority over it.

None of the environments and datasets you mention are actually like this.

Every single algorithmic IRL paper on video games does this, at least with Deep RL demonstrators. (Here's a list of 4 examples:,,,,

If you care about human demonstrations, it seems like Atari-HEAD and the CrowdPlay Atari dataset both do exactly this? And while there hasn't been much work in this area, a quick Google search let me find two papers that do analyze IRL variants on Atari-HEAD: and .

My guess is that there hasn't been much recent work in this area because there just aren't many people who think that value learning from demonstrations is interesting (instead, people have moved to pairwise comparisons of trajectories or language feedback). In addition, as LMs have become more capable, most of the existing value learning researchers have also moved on from working with video games to working on LMs.

It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one.

My guess is the issue here is a lack of a single standard, as opposed to there not being any? The closest thing there is to a standard in IRL/RLHF work is the Mujoco Gym and Atari environments. People also often make variants of Mujoco environments, like Assistive Gym, when they have a specific task in mind. Or they just use a real robot, or maybe a VR one.

If your concern is that researchers are using other policies as experts and not humans, well, there's always Atari-HEAD or the CrowdPlay Atari dataset (there's no equivalent for Mujoco envs because humans can't really do well on those environments without assistance or a lot of practice). If you want something else, there's always D4RL.

One possible problem is that while we might expect log(P(scary coherent behavior)) to go up in general as we scale models, this doesn't mean that log(P(scary coherent behavior)) - log(P(coherent behavior)) goes up - it could simply be that the models are getting better at being coherent in general. In some cases, it could even be that the model becomes less overconfident!

For example, in figure six of the Wei et al emergence paper, the log probabilities assigned to the correct multiple choice answers and to the incorrect answers both go up slowly, until they diverge at a bit over 10^22 flops.

The authors explain:

The reason is that larger models produce less-extreme probabilities (i.e., values approaching 0 or 1) and therefore the average log-probabilities have fewer extremely small values.

Also, the same figure suggests that log(P(behavior)) trends don't always continue forever (log(P(incorrect)) certainly doesn't), so I'd caution against reading too much into just log-likelihood/cross entropy loss.
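A toy calculation can make the quoted explanation concrete. The sketch below is an illustrative model, not the analysis from the paper: it assumes a 4-way multiple choice task where the model puts probability `top_prob` on one favored option, splits the rest evenly, and picks the correct option as its favorite with probability `accuracy`. The function name, parameters, and numbers are all hypothetical.

```python
import math

def mean_log_probs(accuracy, top_prob, n_choices=4):
    """Expected log-prob of the correct answer and of a single incorrect answer.

    Toy assumption: the model gives `top_prob` to one favored option, spreads
    the rest evenly, and the favored option is correct with prob `accuracy`.
    """
    rest = (1.0 - top_prob) / (n_choices - 1)
    # Chance that one particular incorrect option is the favored one.
    wrong_favored = (1.0 - accuracy) / (n_choices - 1)
    correct = accuracy * math.log(top_prob) + (1.0 - accuracy) * math.log(rest)
    incorrect = (wrong_favored * math.log(top_prob)
                 + (1.0 - wrong_favored) * math.log(rest))
    return correct, incorrect

# Chance-level accuracy with extreme probabilities, then with milder ones.
extreme = mean_log_probs(accuracy=0.25, top_prob=0.97)
mild = mean_log_probs(accuracy=0.25, top_prob=0.40)
# Higher accuracy: the two averages separate.
skilled = mean_log_probs(accuracy=0.80, top_prob=0.70)

for name, (correct, incorrect) in [("extreme", extreme), ("mild", mild), ("skilled", skilled)]:
    print(f"{name:8s} correct={correct:.2f} incorrect={incorrect:.2f}")
```

In this toy setup, moving from extreme to milder probabilities at chance accuracy raises both averages together, and only a gain in accuracy makes them diverge, with the incorrect-answer average falling again.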

That being said, I still think we should try to come up with a better understanding of smooth underlying changes, as well as try to come up with a theory of the "critical thresholds". As a start, someone should probably try to retrodict either when model capabilities emerge given the log-likelihoods mentioned in this post, or when grokking occurs using the metrics given in Neel Nanda's modular arithmetic post.

It seems like a core part of this initial framing relies on the operationalisation of "competent", yet you don't really point to what you mean. Notably, "competent" cannot mean "high-reward" (because of category 4) and "competent" cannot mean "desirable" (because of category 3 and 4). Instead you point at something like "Whatever it's incentivized to do, it's reasonably good at accomplishing it".

I think here, competent can probably be defined in one of two (perhaps equivalent) ways:
1. Restricted reward spaces/informative priors over reward functions: as the appropriate folk theorem goes, any policy is optimal according to some reward function. "Most" policies are incompetent; consequently, many reward functions incentivize behavior that seems incoherent/incompetent to us. It seems that when I refer to a particular agent's behavior as "competent", I'm often making reference to the fact that it achieves high reward according to a "reasonable" reward function that I can imagine. Otherwise, the behavior just looks incoherent. This is similar to the definition used in Langosco, Koch, Sharkey et al's goal misgeneralization paper, which depends on a non-trivial prior over reward functions. 
2. Demonstrates instrumental convergence/power seeking behavior. In environments with regularities, certain behaviors are instrumentally convergent/power seeking. That is, they're likely to occur for a large class of reward functions. To evaluate if behavior is competent, we can look for behavior that seems power-seeking to us (e.g., not dying in a game). Incompetent behavior is that which doesn't exhibit power-seeking or instrumentally convergent drives.

The reason these two can be equivalent is the aforementioned folk theorem: as every policy has a reward function that rationalizes it, there exist priors over reward functions where the implied prior over optimal policies doesn't demonstrate power seeking behavior.
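The folk theorem mentioned above can be checked numerically with the usual indicator-reward construction: reward 1 for the action the policy takes, 0 for everything else. The sketch below assumes a small random tabular MDP and a random deterministic policy; the MDP, the sizes, and the helper names are all made up for illustration.

```python
import random

def rationalizing_reward(policy, n_actions):
    """Reward 1 for the action the policy takes in each state, 0 otherwise."""
    return [[1.0 if a == policy[s] else 0.0 for a in range(n_actions)]
            for s in range(len(policy))]

def greedy_policy(transitions, reward, gamma=0.9, iters=200):
    """Run value iteration, then return the greedy action in each state."""
    n_states, n_actions = len(reward), len(reward[0])
    values = [0.0] * n_states
    for _ in range(iters):
        q = [[reward[s][a]
              + gamma * sum(p * v for p, v in zip(transitions[s][a], values))
              for a in range(n_actions)] for s in range(n_states)]
        values = [max(row) for row in q]
    return [max(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]

rng = random.Random(0)
n_states, n_actions = 5, 3

# transitions[s][a] is a probability distribution over next states.
transitions = []
for s in range(n_states):
    per_action = []
    for a in range(n_actions):
        weights = [rng.random() for _ in range(n_states)]
        total = sum(weights)
        per_action.append([w / total for w in weights])
    transitions.append(per_action)

# An arbitrary deterministic policy, however "incompetent" it might look.
policy = [rng.randrange(n_actions) for _ in range(n_states)]
reward = rationalizing_reward(policy, n_actions)
recovered = greedy_policy(transitions, reward)
print(policy, recovered)
```

Because following the policy earns the maximum reward at every step, the optimal policy under this reward is the original policy, whatever the transition dynamics are. This is why "competent" needs a restricted reward space or a non-trivial prior to mean anything.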

Ah, I see your point. That being said, I think calling the task we train our LMs to do (learn a probabilistic model of language) "language modeling" seems quite reasonable to me - in my opinion, it seems far more unreasonable to call "generating high quality output" "language modeling". For one thing, there are many LM applications that aren't just "generate high quality text"! There's a whole class of LMs like BERT that can't really be used for text generation at all. 

One reason we at Redwood care about this result is that we want to interpret modern LMs. As outlined in the linked Chris Olah argument, we might intuitively expect that AIs get more interpretable as they get to human level performance, then less interpretable as their performance becomes more and more superhuman. If LMs were ~human level at the task they are trained on, we might hope that they contain mainly crisp abstractions that humans find useful for next token prediction. However, since even small LMs are superhuman at next token prediction, they probably contain alien abstractions that humans can't easily understand, which might pose a serious problem for interpretability efforts. 

Thanks for the response! We've rewritten the paragraph starting "The limitations detailed..." for clarity.  

Some brief responses to your points:

Shannon's estimate was about a different quantity. [...]

We agree that Shannon was interested in something else - the "true" entropy of English, using an ideal predictor for English. However, as his estimates of entropy used his wife Mary Shannon and Bernard Oliver as substitutes for his ideal predictor, we think it's still fair to treat this as an estimate of the entropy/perplexity of humans on English text.

This article cites a paper saying that [...]

As you point out, there's definitely been a bunch of follow up work which find various estimates of the entropy/perplexity of human predictors. The Cover and King source you find above does give a higher estimate consistent with our results. Note their estimator shares many of the same pitfalls of our estimator - for example, if subjects aren't calibrated, they'll do quite poorly with respect to both the Cover and King estimator and our estimator. We don't really make any claims that our results are surprising relative to all other results in this area, merely noting that our estimate is inconsistent with perhaps the most widely known one. 

Separately, in my opinion, a far better measure of human-level performance at language modeling is the perplexity level at which a human judge can no longer reliably distinguish between a long sequence of generated text and a real sequence of natural language.

The measure you suggest is similar to the methodology used by Shen et al 2017 to get a human level perplexity estimate of 12, which we did mention and criticize in our writeup.

We disagree that this measure is better. Our goal here isn't to compare the quality of Language Models to the quality of human-generated text; we aimed to compare LMs and humans on the metric that LMs were trained on (minimize log loss/perplexity when predicting the next token). As our work shows, Language Models whose output has significantly worse quality than human text (such as GPT-2 small) can still significantly outperform humans on next token prediction. We think it's interesting that this happens, and speculated a bit more on the takeaways in the conclusion. 
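For readers less familiar with the metric, perplexity is the exponential of the average log loss on the tokens that actually occurred. The sketch below uses made-up probabilities (not data from the writeup) and also illustrates the calibration pitfall mentioned above: two hypothetical predictors whose top guess is right 70% of the time, one reporting 70% confidence and one reporting 99%. As a simplification, the true token is assumed to get the leftover mass whenever the top guess is wrong.

```python
import math

def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    log_loss = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(log_loss)

def simulated_probs(confidence, n_right=70, n_wrong=30):
    """Probabilities placed on the true token by a hypothetical predictor.

    Simplification: when its top guess is wrong, the true token gets the
    leftover mass 1 - confidence.
    """
    return [confidence] * n_right + [1.0 - confidence] * n_wrong

calibrated = perplexity(simulated_probs(0.70))
overconfident = perplexity(simulated_probs(0.99))
print(f"calibrated: {calibrated:.2f}, overconfident: {overconfident:.2f}")
```

Both predictors pick the right token equally often, but the overconfident one ends up with a much higher perplexity, since log loss punishes confident misses heavily.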

A common justification I hear for adversarial training is robustness to underspecification/slight misspecification. For example, we might not know what parameters the real world has, and our best guesses are probably wrong. We want our AI to perform acceptably regardless of what parameters the world has, so we perform adversarial domain randomization. This falls under 1. as written insofar as we can say that the AI learnt a stupid heuristic (for example, use a policy that depends on the specific parameters of the environment) and we want it to learn something more clever (infer the parameters of the world, then do the optimal thing under those parameters). We want a model whose performance is invariant to (or at least robust to) underspecified parameters, but by default our models don't know which parameters are underspecified. Adversarial training helps insofar as it teaches our models which things they should not pay attention to.

In the image classification example you gave for 1., I'd frame it as:
* The training data doesn't disambiguate between "the intended classification boundary is whether or not the image contains a dog" and "the intended classification boundary is whether or not the image contains a high frequency artifact that results from how the researchers preprocessed their dog images", as the two features are extremely correlated on the training data. The training data has fixed a parameter that we wanted to leave unspecified (which image preprocessing we did, or equivalently, which high frequency artifacts we left in the data).
* Over the course of normal (non-adversarial) training, we learn a model that depends on the specific settings of the parameters that we wanted to leave unspecified (that is, the specific artifacts). 
* We can grant an adversary the ability to generate images that break the high frequency artifacts while preserving the ground truth of whether the image contains a dog: for example, the Lp norm-ball attacks often studied in the adversarial image classification literature.
* By training the model on data generated by said adversary, the model can now learn that the specific pattern of high frequency artifacts doesn't matter, and can now spend more of its capacity paying attention to whether or not the image contains dog features.
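The steps above can be sketched on a two-feature toy problem. This is a minimal illustration under stated assumptions, not an image classifier: a noisy "dog" feature stands in for real dog content, a clean "artifact" feature stands in for the preprocessing artifact, and the adversary is a simplified stand-in for an Lp norm-ball attack that may set the artifact to any value in [-1, 1] while leaving the label untouched. All names and numbers are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_data(n, rng):
    """Each example is (dog_feature, artifact_feature, label in {-1, +1})."""
    data = []
    for i in range(n):
        y = 1 if i % 2 == 0 else -1
        dog = y + rng.gauss(0.0, 1.0)   # noisy, but tracks the true label
        artifact = float(y)             # clean shortcut fixed by preprocessing
        data.append((dog, artifact, y))
    return data

def train(data, adversarial, steps=300, lr=0.1):
    """Logistic regression by full-batch gradient descent."""
    w_dog, w_art, b = 0.0, 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        g_dog = g_art = g_b = 0.0
        for dog, art, y in data:
            if adversarial:
                # Worst-case artifact value in [-1, 1] for the current model;
                # the ground-truth label is preserved.
                art = -y if w_art >= 0 else y
            margin = y * (w_dog * dog + w_art * art + b)
            coeff = -y * sigmoid(-margin)   # d(logistic loss) / d(logit)
            g_dog += coeff * dog
            g_art += coeff * art
            g_b += coeff
        w_dog -= lr * g_dog / n
        w_art -= lr * g_art / n
        b -= lr * g_b / n
    return w_dog, w_art, b

rng = random.Random(0)
data = make_data(200, rng)
normal = train(data, adversarial=False)
robust = train(data, adversarial=True)
print("normal: w_dog=%.2f w_art=%.2f" % normal[:2])
print("robust: w_dog=%.2f w_art=%.2f" % robust[:2])
```

Under normal training the weight on the artifact keeps growing because it is the cleaner signal. With the adversary in the loop, the artifact weight is pushed back toward zero on every step, so the model has to rely on the dog feature instead.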

Thanks for the detailed response.

On reflection, I agree with what you said - I think the amount of work it takes to translate a nice sounding idea into anything that actually works on an experimental domain is significant, and what exact work you need is generally not predictable in advance. In particular, I resonated a lot with this paragraph:

I'm also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong but nonetheless the first thing we tried didn't work.

At least from my vantage point, “having a strong story for why a result should be X” is insufficient for ex ante predictions of what exactly the results would be. (Once you condition on that being the story told in a paper, however, the prediction task does become trivial.)

I’m now curious what the MIRI response is, as well as how well their intuitive judgments of the results are calibrated.

EDIT: Here’s another toy model I came up with: you might imagine there are two regimes for science - an experiment driven regime, and a theory driven regime. In the former, it’s easy to generate many “plausible sounding” ideas and hard to be justified in holding on to any of them without experiments. The role of scientists is to be (low credence) idea generators and idea testers, and the purpose of experimentation is primarily to discover new facts that are surprising to the scientist finding them. In the second regime, the key is to come up with the right theory/deep model of AI that predicts lots of facts correctly ex ante, and then the purpose of experiments is to convince other scientists of the correctness of your idea. Good scientists in the second regime are those who discover the right deep models much faster than others. Obviously this is an oversimplification, and no one believes it’s only one or the other, but I suspect both MIRI and Stuart Russell lie more on the “have the right idea, and the paper experiments are there to convince others/apply the idea in a useful domain” view, while most ML researchers hold the more experimentalist view of research?

I actually think this particular view is worth fleshing out, since it seems to come up over and over again in discussions of what AI alignment work is valuable (versus not).

For example, it does seem to me that >80% of the work in actually writing a published paper (at least amongst papers at CHAI) (EDIT: no longer believe this on reflection, see Rohin’s comment below) involves doing work with results that are predictable to the author after the concept (for example, actually getting your algorithm to run, writing code for experiments, running said experiments, writing up the results into a paper, etc.)
