This paper came out about a week ago. I am not the author - it was published anonymously.

I learned about the paper when JJ Hepburn shared it on Slack. It seemed potentially really important, and I hadn't seen it discussed on this forum yet:

Paper title: "Large Language Models Can Self-improve"

Author: Anonymous

Abstract: "Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that finetuning on reasoning is critical for self-improvement."



What they are doing here is:

  1. Train a large language model unsupervised in the standard way
  2. Repeat:
    1. Take some supervised learning problem, throw away the supervised outputs, and repeat:
      1. Pick a supervised input and use the language model to sample N different outputs
      2. Select a "consensus" output among the N sampled outputs
      3. Store this "consensus" output as the new "target output" for the selected input
    2. Fine-tune the language model in a supervised way using the input/output pairs generated above
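
To make the inner loop concrete, here is a minimal Python sketch of the sampling and consensus steps. It is my own illustration, not the authors' code: `sample` is a hypothetical stand-in for drawing one chain-of-thought completion from the model, and the `min_agreement` threshold is my guess at what "high-confidence" might mean in practice. The fine-tuning step is left to whatever training setup you use.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: given a question, return one sampled
# (rationale, final_answer) pair from the language model.
Sampler = Callable[[str], Tuple[str, str]]


def build_self_training_set(
    questions: List[str],
    sample: Sampler,
    n_samples: int = 32,
    min_agreement: float = 0.5,  # assumed confidence cutoff, not from the paper
) -> List[Tuple[str, str, str]]:
    """Return (question, rationale, consensus_answer) triples for fine-tuning."""
    dataset = []
    for question in questions:
        # Step 1: sample N different outputs for the same input.
        samples = [sample(question) for _ in range(n_samples)]

        # Step 2: pick the "consensus" answer by majority vote.
        votes = Counter(answer for _, answer in samples)
        consensus, count = votes.most_common(1)[0]

        # Skip questions where the model does not agree with itself enough.
        if count / n_samples < min_agreement:
            continue

        # Step 3: store one rationale that reached the consensus answer
        # as the new target output for this input.
        rationale = next(r for r, a in samples if a == consensus)
        dataset.append((question, rationale, consensus))
    return dataset
```

The returned triples would then play the role of the input/output pairs in the fine-tuning step. No ground-truth label is consulted anywhere in the function.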

It's quite reasonable for them to call this "self-improvement" but it's only vaguely related to the kind of self-improvement that was much discussed in 2010s-era AI safety circles. That kind of self-improvement was about an AI that can improve its own architecture. The kind of self-improvement discussed in this paper is not about improving its own architecture.

Still, it's bizarre that the thing being done in this paper works at all. It's as if the mere act of narrowing the distribution of outputs for a new supervised learning task is innately homing in on the truth... whatever that means. Strange.
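
One possible intuition, under toy assumptions that are mine and not the paper's: if a single sample is right more often than it lands on any one particular wrong answer, and the wrong answers are scattered, then the majority vote over N samples is right more often than a single sample. The simulation below sketches that. All the parameters (`p_correct`, `n_wrong`, and so on) are made up for illustration, and real model errors may be correlated, in which case the vote could just as well reinforce a mistake.

```python
import random
from collections import Counter
from typing import Tuple


def simulate(
    p_correct: float = 0.6,  # assumed chance that one sample is correct
    n_wrong: int = 9,        # assumed number of distinct wrong answers
    n_samples: int = 32,
    trials: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (single-sample accuracy, majority-vote accuracy) in a toy model."""
    rng = random.Random(seed)

    def one_sample() -> str:
        # Correct with probability p_correct, otherwise a uniformly
        # chosen wrong answer (errors are independent by construction).
        if rng.random() < p_correct:
            return "correct"
        return f"wrong_{rng.randrange(n_wrong)}"

    single_hits = 0
    vote_hits = 0
    for _ in range(trials):
        samples = [one_sample() for _ in range(n_samples)]
        single_hits += samples[0] == "correct"
        vote_hits += Counter(samples).most_common(1)[0][0] == "correct"
    return single_hits / trials, vote_hits / trials


if __name__ == "__main__":
    single, vote = simulate()
    print(f"single sample: {single:.3f}, majority vote: {vote:.3f}")
```

In this toy setting, fine-tuning on the voted answers amounts to training on labels that are cleaner than the model's own single-shot outputs, which would make the result a little less mysterious.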

The paper describes a method for self-improvement in LLMs. But does it work for recursive self-improvement? I haven't found any mention of recursion or multiple iterations in the paper.

The most relevant section seems to be 5.2 PUSHING THE LIMIT OF SELF-IMPROVEMENTS. Here the authors talk about their attempts to have the model use self-generated questions and self-generated few-shot Chain-of-Thought prompting. They did measure self-improvement when using self-generated questions, but the self-improvement wasn't as great as when they used training-set questions. But who cares if it's lesser self-improvement if it's an unsupervised process and you can just iterate on it to get more and more self-improvement?

I'm assuming their numbers show the self-improvement they measured for one iteration. Or is it actually the maximum self-improvement possible for any number of iterations? 

This paper is now on arXiv (in addition to OpenReview) and published non-anonymously there by Jiaxin Huang et al. from the University of Illinois Urbana-Champaign and Google.

It seems like this could benefit the smaller labs working on LLMs and toward AGI.

Chinchilla basically made it seem like only the big-data companies would have the means to produce competitive models going forward. But if generative models can produce their own data for reliable self-improvement, that shows a way forward for companies like Anthropic who don't have massive private data sources to train on (e.g. data from YouTube or Facebook Messenger).