At a high level, research feedback I give to more junior research collaborators often can fall into one of three categories:
Doing quick sanity checks
Saying precisely what you want to say
Asking why one more time
In each case, I think the advice can be taken to an extreme I no longer endorse. Accordingly, I’ve tried to spell out the degree to which you should implement the advice, as well as what “taking it too far” might look like.
This piece covers doing quick sanity checks, which is the most common advice I give to junior researchers. I’ll cover the other two pieces of advice in a subsequent piece.
Doing quick sanity checks
Research is hard (almost by definition) and people are often wrong. Every researcher has wasted countless hours or days if not weeks or months chasing fruitless lines of investigation.[1] Oftentimes, this time could’ve been saved with a few basic sanity checks. Does your idea make sense at all? Does your data have obvious sources of bias (e.g. forms of selection bias) or other issues (e.g. using the wrong prompt)? Does your theorem make nontrivial claims, or is it vacuous?[2]
When doing data analysis one instance of this is to check for basic correlations between key variables. For example, if you believe that less capable language models cannot covertly perform tasks in part because they mention the existence of a side task directly in their output, how often do they include phrases such as “hidden task” or “alerting the observer” in their output, and how often are they caught when they do or don’t include those phrases? (This gives you some evidence of if this is the determinative factor, or if the weaker models have other tells.)
More generally, it’s good to quantitatively know your data on a high level, which can help you spot obvious errors. What’s the mean and standard deviation of key summary statistics? What are the key dimensions along which your data vary? For example, if you’re studying LLM agents in your scaffold, how many tool calls are the models using, and how many of them are successful? (I’ve seen many examples, especially a year or two ago, where the scaffold was broken or the LLM agent just completely fails to understand how to use it.) If you’re using LLMs with reasoning, how long are the reasoning chains? (I’ve personally been involved in research where reasoning was off by accident.)
What does the “typical” example of your dataset look like, and what are the outliers? For example, if you notice that LLMs get zero performance for writing out the steps in your n=10Tower of Hanoi problem, what does its chain of thought look like? Is it making errors or getting confused about the basic algorithm? (Often, LLMs “fail” at tasks not necessarily because they lack the capability, but because they refuse to perform the task at all.)
Example response from Claude Opus 4, where it calls the n=10 Tower of Hanoi task "extremely tedious and error prone" and refuses to do it. However, the response demonstrates that it can implement the algorithm required to solve the problem, and strongly suggests that its failure at the task does not result from a lack of understanding. See a previous post of mine for more discussion of this specific example.
Another specific instance of this advice is to make up small concrete examples. A classic piece of advice when trying to check if you’re implementing a simple algorithm correctly (e.g. when debugging code or doing coding interviews) is to walk through your code line-by-line on a small example. For example, if you’re implementing A* search, does it work on a small 4-node graph with a few edges with integer costs? (When I was TA’ing an intro to AI class, this would’ve caught maybe half of the bugs brought to me in office hours.) A related piece of advice is to make up small concrete examples when doing theoretical research.[3] For example, if you're claiming your measure of similarity is a distance metric, is it symmetrical, and does it satisfy the triangle inequality on three concrete points? (Notably, the KL divergence is not a metric!)[4]
Taking this too far. The “quick” part of “quick sanity checks” is an important part of this advice.This means making tradeoffs in the direction of speed rather than rigor. We could instead spend 30 minutes having another AI score if each output mentions a hidden task, or run A* search on a real problem, but doing so in any of these cases would likely be premature when a cheaper sanity check would catch many of the issues. The focus is to perform a sanity check, not to rigorously address all possible objections to your work, nor to create a grand unified theory of your entire discipline. If you sat down to do a five-minute spot check and find yourself three hours later building a massive data processing pipeline to use an AI to classify every single possible variable you could conceive of, you've probably taken it too far.
At a larger scale, there’s a longer post about how entire fields get lost and end up basically doing cargo cult science, and its implications for AI safety, which I may write about in the future.
Conversely, when presenting your research, you should aim to provide enough information that other people can perform quick sanity checks to trust your results. A basic version of this is to open source your data. Other ways to do this include including examples of your data in the appendix, as well as many tables and figures showing the relationship between key variables.
I had a scheme, which I still use today when somebody is explaining something that I'm trying to understand: I keep making up examples. For instance, the mathematicians would come in with a terrific theorem, and they're all excited. As they're telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)--disjoint (two halls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on. Finally they state the theorem, which is some dumb thing about the ball which isn't true for my hairy green ball thing, so I say, "False!"
The original example included here was about the Hessian as a way of measuring basin broadness in loss landscapes:
One hypothesis for why some neural networks generalize better than others: models that land in "broad" minima of the loss landscape are less sensitive to small perturbations of the weights, and therefore generalize better. A standard way researchers attempted to model the "broadness" of a basin is by looking at the Hessian (the matrix of second derivatives of the loss). But how does it work for basic examples such as x^2 + y^2 and x^4 + y^4, and what happens when the Hessian is singular? (Spoilers: the Hessian is indeed singular at the origin for x^4 + y^4 (in fact, it’s all zeroes), and the Hessian fails to characterize the broadness of this basin entirely.)
Written quickly as part of the Inkhaven Residency.
At a high level, research feedback I give to more junior research collaborators often can fall into one of three categories:
In each case, I think the advice can be taken to an extreme I no longer endorse. Accordingly, I’ve tried to spell out the degree to which you should implement the advice, as well as what “taking it too far” might look like.
This piece covers doing quick sanity checks, which is the most common advice I give to junior researchers. I’ll cover the other two pieces of advice in a subsequent piece.
Doing quick sanity checks
Research is hard (almost by definition) and people are often wrong. Every researcher has wasted countless hours or days if not weeks or months chasing fruitless lines of investigation.[1] Oftentimes, this time could’ve been saved with a few basic sanity checks. Does your idea make sense at all? Does your data have obvious sources of bias (e.g. forms of selection bias) or other issues (e.g. using the wrong prompt)? Does your theorem make nontrivial claims, or is it vacuous?[2]
When doing data analysis one instance of this is to check for basic correlations between key variables. For example, if you believe that less capable language models cannot covertly perform tasks in part because they mention the existence of a side task directly in their output, how often do they include phrases such as “hidden task” or “alerting the observer” in their output, and how often are they caught when they do or don’t include those phrases? (This gives you some evidence of if this is the determinative factor, or if the weaker models have other tells.)
More generally, it’s good to quantitatively know your data on a high level, which can help you spot obvious errors. What’s the mean and standard deviation of key summary statistics? What are the key dimensions along which your data vary? For example, if you’re studying LLM agents in your scaffold, how many tool calls are the models using, and how many of them are successful? (I’ve seen many examples, especially a year or two ago, where the scaffold was broken or the LLM agent just completely fails to understand how to use it.) If you’re using LLMs with reasoning, how long are the reasoning chains? (I’ve personally been involved in research where reasoning was off by accident.)
What does the “typical” example of your dataset look like, and what are the outliers? For example, if you notice that LLMs get zero performance for writing out the steps in your n=10 Tower of Hanoi problem, what does its chain of thought look like? Is it making errors or getting confused about the basic algorithm? (Often, LLMs “fail” at tasks not necessarily because they lack the capability, but because they refuse to perform the task at all.)
Example response from Claude Opus 4, where it calls the n=10 Tower of Hanoi task "extremely tedious and error prone" and refuses to do it. However, the response demonstrates that it can implement the algorithm required to solve the problem, and strongly suggests that its failure at the task does not result from a lack of understanding. See a previous post of mine for more discussion of this specific example.
Another specific instance of this advice is to make up small concrete examples. A classic piece of advice when trying to check if you’re implementing a simple algorithm correctly (e.g. when debugging code or doing coding interviews) is to walk through your code line-by-line on a small example. For example, if you’re implementing A* search, does it work on a small 4-node graph with a few edges with integer costs? (When I was TA’ing an intro to AI class, this would’ve caught maybe half of the bugs brought to me in office hours.) A related piece of advice is to make up small concrete examples when doing theoretical research.[3] For example, if you're claiming your measure of similarity is a distance metric, is it symmetrical, and does it satisfy the triangle inequality on three concrete points? (Notably, the KL divergence is not a metric!)[4]
Taking this too far. The “quick” part of “quick sanity checks” is an important part of this advice. This means making tradeoffs in the direction of speed rather than rigor. We could instead spend 30 minutes having another AI score if each output mentions a hidden task, or run A* search on a real problem, but doing so in any of these cases would likely be premature when a cheaper sanity check would catch many of the issues. The focus is to perform a sanity check, not to rigorously address all possible objections to your work, nor to create a grand unified theory of your entire discipline. If you sat down to do a five-minute spot check and find yourself three hours later building a massive data processing pipeline to use an AI to classify every single possible variable you could conceive of, you've probably taken it too far.
At a larger scale, there’s a longer post about how entire fields get lost and end up basically doing cargo cult science, and its implications for AI safety, which I may write about in the future.
For examples of this in the field of human psychology, see Scott Alexander’s critical review of the field of 5-HTTLPR studies, or Bertram Gawronski’s much more polite critique of implicit association tests.
Conversely, when presenting your research, you should aim to provide enough information that other people can perform quick sanity checks to trust your results. A basic version of this is to open source your data. Other ways to do this include including examples of your data in the appendix, as well as many tables and figures showing the relationship between key variables.
See also this excerpt from Richard Feynman’s autobiography, Surely You’re Joking, Mr. Feynman!:
I had a scheme, which I still use today when somebody is explaining something that I'm trying to understand: I keep making up examples. For instance, the mathematicians would come in with a terrific theorem, and they're all excited. As they're telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)--disjoint (two halls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on. Finally they state the theorem, which is some dumb thing about the ball which isn't true for my hairy green ball thing, so I say, "False!"
The original example included here was about the Hessian as a way of measuring basin broadness in loss landscapes:
One hypothesis for why some neural networks generalize better than others: models that land in "broad" minima of the loss landscape are less sensitive to small perturbations of the weights, and therefore generalize better. A standard way researchers attempted to model the "broadness" of a basin is by looking at the Hessian (the matrix of second derivatives of the loss). But how does it work for basic examples such as x^2 + y^2 and x^4 + y^4, and what happens when the Hessian is singular? (Spoilers: the Hessian is indeed singular at the origin for x^4 + y^4 (in fact, it’s all zeroes), and the Hessian fails to characterize the broadness of this basin entirely.)
For more discussion of this issue and related examples, see an intro to Singular Learning Theory post.