(This question comes from a discussion with Carl Shulman.)
In this post I describe an experiment that I'd like to see run. I'm posting a $1,000 - $10,000 prize for a convincing implementation of these experiments. I also post a number of smaller prizes for relevant desk research or important corrections to this request.
In order to understand the dynamics of the singularity, I'd like to understand how easy it is to improve algorithms and software.
We can learn something about this from looking at chess engines. It's not the most relevant domain to future AI, but it's one with an unusually long history and unusually clear (and consistent) performance metrics.
In order to quantify the quality of a chess engine, we can fix a level of play and ask "How much compute is needed for the engine to play at that level?"
One complication in evaluating the rate of progress is that it depends on what level of play we use for evaluation. In particular, newer algorithms are generally designed to play at a much higher level than older algorithms. So if we quantify the compute needed to reach modern levels of play, we will capture both absolute improvements and also "adaptation" to the new higher amounts of compute.
So we'd like to attribute progress in chess engines to three factors:
Understanding the size of factor #1 is important for extrapolating progress given massive R&D investments in software. While it is easy to separate factors #1 and #2 from publicly available information, it is not easy to evaluate factor #3.
Pick two (or more) software engines from very different times. They should both be roughly state of the art, running on "typical" machines from the era (i.e. the machines for which R&D is mostly targeted).
We then carry out two matches:
With some effort, we can estimate a quantitative ratio of "ops needed" for each of these experiments. For example, we may find that the new engine is able to beat the old engine using only 1% of the "old hardware." Whereas we may find that the old engine would require 10,000x the "new hardware" in order to compete with the new engine.
The first experiment tells us about the absolute improvements in chess engines on the task for which the old engine was optimized. (This understates the rate of software progress to the extent that people stopped working on this task.) The second experiment gives us the combination of absolute improvements + adaptation to new hardware. Typical measures of "rate of software progress" will be somewhere in between, and are sensitive to the hardware on which the evaluation is carried out.
I believe that understanding these two numbers would give us a significantly clearer picture of what's really going on with software progress in chess engines.
Here's some guesses about how to run this experiment well. I don't know much about computer chess, so you may be able to make a better proposal.
I'm not spending long evaluating any of this and my decisions are going to be completely unaccountable. Dealing with me may be a waste of time. Your only recourse will be to make an angry comment on this post.
I know that all makes it less appealing to participate. But please have it in mind before spending any time on this project, and consider yourself warned!
I'm planning to give away at least $1,000 sometime in the next 3 months if anyone runs any plausible version of this experiment, or even points me to public information from which it's possible to back out a similar number. I'm fine if that means I have to give $1,000 to a random LW comment.
I'd give away $10,000 if someone ran a clean version of the experiment, I found it convincing, and they were transparent about their methods. Before giving away the full prize I'd likely have a waiting period where I offered a separate prize for people to post replications or caveats or so on.
I'm generally expecting/hoping to wrap this up over the next couple of months, but will adjust based on responses.
If multiple people do experiments I'll make some arbitrary call about how to allocate prize money. Timing will matter a bit but it's a secondary consideration to quality of experiment. Earlier submissions can get paid based if they helped pave the way for later submissions.
If you are planning to spend time on this I encourage you to make a comment. I'm very happy to provide thoughts on a proposed experiment, e.g. to tell you what size of prize I'd expect to give it or what concerns I'd have about a proposal. None of this is binding.
If receiving a prize would be weird or problematic for some reason, I'm still interested in the results and you can opt to receive a comparable quantity of gratitude instead of $.
I'd also pay out $100-$1000 at my discretion for any of the following:
Thanks for the link (and thanks to hippke for doing the experiments), that's great.
To clarify my stance on prizes:
Thank you for your interest: It's good to see people asking similar questions! Also thank-you for incentivizing research with rewards.
Yes, I think closing the gaps will be straightforward. I still have the raw data, scripts, etc. to pick it up.
i) old engines on new hardware - can be done; needs definition of which engines/hardware
ii) raw data + reproduction - perhaps everything can be scripted and put on GitHub
iii) controls for memory + endgame tables - can be done, needs definition of requirements
iv) Perhaps the community can already agree on a set of experiments before they are performed, e.g. memory? I mean, I can look up "typical" values of past years, but I'm open for other values.
i) I'm interested in any good+scalable old engine. I think it's reasonable to focus on something easy, the most important constraint is that it is really state of the art and scales up pretty gracefully. I'd prefer 2000 or earlier.
ii) It would be great if where was at least a complete description (stuff like: these numbers were looked up from this source with links, the population was made of the following engines with implementations from this link, here's the big table of game results and the elo calculation, here was the code that was run to estimate nodes/sec).
iii) For the "old" experiment I'd like to use memory from the reference machine from the old period. I'd prefer basically remove endgame tables and opening book.
My ideal would be to pick a particular "old" year as the focus. Ideally that would be a year for which we (a) have an implementation of the engine, (b) have representative hardware from the period that we can use to compute nodes/sec for each of our engines. Then I'm interested in:
i) To pick a reference year, it seems reasonable to take the mid/late 1990s:- Almost all chess engines before ~1996 lacked (or had serious inefficiencies) using multi-cores (very lengthy discussion here).- Chess protocols became available, so that the engine and the GUI separated. That makes it straightforward to automate games for benchmarking.- Modern engines should work on machines of that age, considering RAM constraints.- The most famous human-computer games took place in 1997: Kasparov-Deep Blue. That's almost a quarter of a century ago (nice round number...). Also, at the time, commercial algorithms were considerably below human-level play.ii) Sounds goodiii) The influence of endgames tables and opening books is typically small. It is reasonable to neglect it in our experiments.iv) Yes, the 4-case-test is a good idea:- 1997 PC with 1997 engine: ELO XXXX- 1997 PC with 2021 engine: ELO XXXX- 2021 PC with 1997 engine: ELO XXXX- 2021 PC with 2021 engine: ELO XXXX
One main result of these experiments will be the split: Where does the ELO gain come from? Is it the compute, or the algo improvement? And the answer will be about 70% compute, 30% algo (give or take 10 percentage points) over the last 25 years. Without serious experiments, have a look at the Stockfish evolution at constant compute. That's a gain of +700 ELO points over ~8 years (on the high side, historically). For comparison, you gain ~70 ELO per double compute. Over 8 years one has on average gained ~400x compute, yielding +375 ELO. That's 700:375 ELO for compute:algo, or a rounded 70%-30% (SF has improved rather fast).To baseline the old machine, we don't need to boot up old hardware. There is plenty of trustworthy old benchmarking still available that has these numbers.As the modern baseline, I would certainly recommend Stockfish:- It is the best (or amongst the very top) for the last decade or so- It is open source and has a very large dev community. Steps in improvements can be explained.- Open source means it can be compiled on any machine that has a C++ compilerOther modern engines will perform similarly, because they use similar methods. After all, SF is open source.As a bonus, one could benchmark a Neural Network-based engine like LC0. There will be issues when using it without a GPU, however.As for the old engine, it is more difficult to choose. Most engines were commercial programs, not open source. There is an old version of Fritz 5 (from 1998) freely available that supports protocols. I got it installed on a modern Windows with some headache. Perhaps that could be used. Fritz was, at the time of the Kasparov-Deep Blue match, the strongest commercial engine.
I like using Fritz.
It sounds like we are on basically the same page about what experiments would be interesting.
Very tangential to the discussion so feel free to ignore, but given that you have put some though before on prize structures I am curious about the reasoning for why you would award a different prize for something done in the past versus something done in the future
I don't think this research, if done, would give you strong information about the field of AI as a whole.
I think that, of the many topics researched by AI researchers, chess playing is far from the typical case.
It's [chess] not the most relevant domain to future AI, but it's one with an unusually long history and unusually clear (and consistent) performance metrics.
An unusually long history implies unusually slow progress. There are problems that computers couldn't do at all a few years ago that they can do fairly efficiently now. Are there problems where people basically figured out how to do that decades ago and no significant progress has been made since?
The consistency of chess performance looks like more selection bias. You aren't choosing a problem domain where there was one huge breakthrough that. You are choosing a problem domain that has had slow consistent progress.
For most of the development of chess AI (All the way from Alpha Beta pruning to Alpha Zero) Chess AI's improved by an accumulation of narrow, chess specific tricks. (And more compute) How to represent chess states in memory in a fast and efficient manor. Better evaluation functions. Tables of starting and ending games. Progress on chess AI's contained no breakthroughs, no fundamental insights, only a slow accumulation of little tricks.
There are cases of problems that we basically knew how to solve from the early days of computers, any performance improvements are almost purely hardware improvements.
There are problems where one paper reduces the compute requirements by 20 orders of magnitude. Or gets us from couldn't do X at all, to able to do X easily.
The pattern of which algorithms are considered AI and which are considered maths and which are considered just programming is somewhat arbitrary. A chess playing algorithm is AI, a prime factoring algorithm is maths, a sorting algorithm is programming or computer science. Why? Well those are the names of the academic departments that work on them.
You have a spectrum of possible reference classes for transformative AI that range from the almost purely software driven progress, to the almost totally hardware driven progress.
To gain more info about transformative AI, someone would have to make either a good case for why it should be at a particular position on the scale, or a good case for why its position on the scale should be similar to the position of some previous piece of past research. In the latter case, we can gain from examining the position of that research topic. If hypothetically that topic was chess, then the research you propose would be useful. If the reason you chose chess was purely that you thought it was easier to measure, then the results are likely useless.
Is your prediction that e.g. the behavior of chess will be unrelated to the behavior of SAT solving, or to factoring? Or that "those kinds of things" can be related to each other but not to image classification? Or is your prediction that the "new regime" for chess (now that ML is involved) will look qualitatively different than the old regime?
I'm aware of very few examples of that occurring for problems that anyone cared about (i.e. in all such cases we found the breakthroughs before they mattered, not after). Are you aware of any?
a prime factoring algorithm is maths
Factoring algorithms, or primality checking, seem like fine domains to study to me. I'm also interested in those and would be happy to offer similar bounties for similar analyses.
I think it's pretty easy to talk about what distinguishes chess, SAT, classification, or factoring from multiplication. And I'm very comfortable predicting that the kind of AI that helps with R&D is more like the first four than like the last (though these things are surely on a spectrum).
You may have different intuitions, I think that's fine, in which case this explains part of why this data is more interesting to me than you.
Progress on chess AI's contained no breakthroughs, no fundamental insights, only a slow accumulation of little tricks.
Can you point to a domain where increasing R&D led to big insights that improved performance?
Perhaps more importantly, machine learning is also "a slow accumulation of little tricks," so the analogy seems fine to me. (You might think that future AI is totally different, which is fine and not something I want to argue about here.)
If Alice says this and so never learns about anything, and Bob instead learns a bunch of facts about a bunch of domains, I'm pretty comfortable betting on Bob being more accurate about most topics.
I think the general point is: different domains differ from one another. You want to learn about a bunch of them and see what's going on, in order to reason about a new domain.
I agree with the basic point that board games are selected to be domains where there is an obvious simple thing to do, and so progress started early. I think in that way they are similar to SAT solving and factoring, and (slightly) different from image classification, for which it's arguably hard to measure progress before the 90s.
I think that the more important difference between chess and image classification (in terms of making the chess data clean) is that there is a homogeneous measure of performance for chess, whereas image classification has moved on to harder and harder tasks. I think this does slightly change the nature of the task, but mostly it just makes the data clean.
I think that the main difference between chess and SAT solving is mostly that chess is more naturally interesting to people so they've been working on it longer, and that is a real factor that makes the data cleaner (without making it less useful as an analogy). SAT solving also has some of the image classification problem, of depending a lot on the distribution of instances.
(With respect to "aren't choosing a domain where there was one huge breakthrough," I'm definitely interested in domains with such breakthroughs.)