Most of my programmer friends believe that Language Models trained on code will not affect their day job anytime soon. In this post, I make the case that 1) code generation is already useful (assuming minimal prompt engineering skills), and 2) even if you do not believe 1), code generation will increase programmers' throughput way sooner than it will fully automate them.

Language Models trained on Code do not bring us closer to Full Code Automation

This misconception comes from thinking linearly instead of exponentially. Language models are good enough at generating code to make the very engineers building such models slightly more productive, for instance when dealing with a new API. In other words, the returns (aka the improvements in the algorithm) from investing more resources in code generation directly help (through better developer tools) create a better code-generating algorithm.

Code generation does not automate the part of my workday where I think hard

  • It still accelerates “glue code” or “API work”—a substantial fraction of large codebases.
  • Besides, only a set of privileged engineers get to think about the broad picture every day.
  • Plus, hard thinking is mostly required at the start, when designing the architecture.
  • And thinking seldom happens in a silo. It instead requires many iterations, through coding.

I asked a model to generate code but it doesn't seem to be able to solve it

More often than not, the issue is not about the model. Try another prompt. (Example)

The output is outdated code from average programmers

Code quality (length, variable naming, taste) is prompt and hyperparameter dependent. Generally, language models use variables from the prompt and you can rename those yourself.

Only developers who repeat the same tasks will be automated so it will not affect me

You might still see gains in productivity by learning how to use a more advanced version.

My job does not involve solving simple coding tests from docstrings

You should be capable of splitting your code into smaller functions and writing docstrings.
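As a minimal sketch of what that looks like in practice: a small, single-purpose function whose signature and docstring state the contract precisely enough to serve as a prompt. The function name `moving_average` and its behavior are hypothetical, chosen only for illustration, and the body is hand-written here rather than actual model output.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Return the simple moving average of `values` over a sliding `window`.

    The result has len(values) - window + 1 entries. Raise ValueError if
    `window` is not between 1 and len(values).
    """
    # Everything below is what a code model would be asked to complete
    # from the signature and docstring above (hand-written in this sketch).
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

The narrower the function and the more explicit the docstring (inputs, outputs, edge cases), the less the model has to guess.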

Codex cannot solve my problem since it has only access to a limited training set

GitHub Copilot stores your data. Supposedly, the same applies to the Codex beta.

Current Language Models still make silly mistakes

If the mistake is silly, then fixing it is trivial.

Anyway, it is error prone so it cannot be used for critical software

It generates fewer errors than I do when writing code for the first time.

I would strongly suggest applying for GitHub Copilot or OpenAI Codex access to check for yourself, rather than relying on cherry-picked examples on the internet (both good and bad). Indeed, if you search online, you might run into outdated reviews, where it turns out that the highlighted errors actually work now. If you cannot wait for beta access, I recommend asking a friend for a demo (I'm happy to showcase it to everyone), trying genji python or reading this up-to-date review.

More generally, programmers should seriously consider learning prompt engineering to avoid being left behind, and, I believe, any future forecast about AI progress should include this shorter loop between deep learning models and programmer productivity.

13 comments

[Note: I use Copilot and like it. The 'aha' moment for me was when I needed to calculate the intersection of two lines, a thing that I would normally just copy/paste from Stack Overflow, and instead Copilot wrote the function for me. Of course I then wrote tests and it passed the tests, which seemed like an altogether better workflow.]
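For readers who want to picture that workflow, here is a minimal sketch of such a function together with the kind of checks one might write afterwards. It is not the code Copilot produced; the name `line_intersection`, the point representation, and the choice to return `None` for parallel lines are all assumptions made for illustration. It uses the standard determinant formula for two infinite lines, each given by two points.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def line_intersection(p1: Point, p2: Point, p3: Point, p4: Point) -> Optional[Point]:
    """Intersect the infinite line through p1, p2 with the one through p3, p4.

    Returns None when the lines are parallel or coincident.
    """
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    x4, y4 = p4
    # The denominator is zero exactly when the direction vectors are parallel.
    # (Exact comparison; a tolerance may be preferable for noisy float input.)
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        return None
    det12 = x1 * y2 - y1 * x2
    det34 = x3 * y4 - y3 * x4
    x = (det12 * (x3 - x4) - (x1 - x2) * det34) / denom
    y = (det12 * (y3 - y4) - (y1 - y2) * det34) / denom
    return (x, y)


# Quick checks in the spirit of "then I wrote tests":
assert line_intersection((0, 0), (2, 2), (0, 2), (2, 0)) == (1.0, 1.0)
assert line_intersection((0, 0), (1, 1), (0, 1), (1, 2)) is None
```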

Language models are good enough at generating code to make the very engineers building such models slightly more productive

How much of this is 'quality of code' vs. 'quality of data'? I would naively expect that the sort of algorithmic improvements generated from OpenAI engineers using Copilot/Codex/etc. are relatively low-impact compared to the sort of benefits you get from adding your company's codebase to the corpus (or whatever is actually the appropriate version of that). I'm somewhat pessimistic about the benefits of adding Copilot-generated code to the corpus as a method of improving Copilot.

I buy that "generated code" will not add anything to the training set, and that Copilot doesn't help for having good data or (directly) better algorithms. However, the feedback loop I am pointing at is when you accept suggestions on Copilot. I think it is learning from human feedback on what solutions people select. If the model is "finetuned" to the specific dev's coding style, I would expect Codex to suggest even better code (because of high quality of finetuning data) to someone at OAI than me or you.

How much of this is 'quality of code' vs. 'quality of data'?

I'm pointing at overall gains in devs' productivity. This could be used for collecting more data, which, AFAIK, happens by automatically collecting data from the internet using code (although possibly the business collaboration between OAI and GitHub helped). Most of the dev work would then be iteratively cleaning that data, running trainings, changing the architecture, etc. before getting to the performance they'd want, and those cycles would be a tiny bit faster using such tools.

To be clear, I'm not saying that talented engineers are coding much faster today. They're probably doing creative work at the edge of what Codex has seen. However, we're using the first version of something that, down the line, might end up giving us decent speed increases (I've become more productive the more I've learned how to use it). A company owning such a model would certainly have private access to better versions to use internally, and there are some strategic considerations in not sharing the next version of its code-generating model, in order to win a race, while collecting feedback from millions of developers.

What's your take on the licensing issue?  I know for sure that Codex won't affect my day job in the near term, because I'm not allowed to use it; I guess most large companies, and open-source projects large enough to care about IP assignment, will have the same problem.

This indirectly influences the speed of model improvement we should expect: the market for this kind of tool is smaller than you might think from the size of github's userbase, so there'll be less incentive to invest in it.

Wait, did they plainly forbid you from using it at all during work time, or did they forbid using its outputs because of IP issues? Surely, using Codex for inspiration, giving it a natural language prompt and looking at which functions it calls, does not seem to infringe any copyright rules?

  • 1) If you start with your own variable names, it would auto-complete with those, maybe using something it learned online. Would that count as plagiarism in your sense? How would that differ from copy-pasting from Stack Overflow and changing the variable names (I'm not an expert in SO copyright terms, but you should probably cite SO if doing so, and there might be some rules about distributing it commercially)?
  • 2) Imagine you are using line-by-line auto-complete, and sometimes you re-arrange the ordering of the lines, adding your own code, even modifying it a bit. At what point does it become your own code?
  • 3) In cases 1) and 2) above, even if some of the outputs were verbatim (which apparently happens a tiny fraction of the time) and had exactly the same (probably conventional) variable names, would "I have some line of code with exactly the same normal variable naming as something on the internet" be enough for going to court?
  • 4) Assuming that developers are, or will be, more productive using such tools, don't you think they would still use Copilot-like software to a) get inspiration, or b) copy-paste code that they would later modify to bypass IP infringement if they are smart enough about it, even though their companies "forbid" them from using it?

I'm extremely keen to hear from people who have used Codex a decent amount (or tried to) and decided it isn't worth it. Specifically, people who wouldn't pay $15/mo for a subscription to it. Anyone?

For context, GitHub has 60,000,000 users. If 10% of them buy a $15/mo subscription, that's about a billion dollars in annual revenue. A billion dollars is about a thousand times more than the cost to create Codex. (The cost to train the model was negligible since it's only a fine-tuned 12B-parameter version of GPT-3. The main cost would be the salaries of the engineers involved, I imagine.)

Maybe I'm wrong, but my first reaction to your initial number is that "users" doesn't mean "active users". I would expect a difference of an order of magnitude, which keeps your conclusion but with a hundred times more instead of a thousand times more.
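The back-of-the-envelope numbers in this exchange can be reproduced with a few lines. This is only a sketch of the arithmetic: the 60M user count, the 10% conversion rate, and the $15/mo price come from the comments above, and the factor of ten between users and active users is the commenter's guess, not a measured figure. The function and constant names are made up for illustration.

```python
GITHUB_USERS = 60_000_000      # figure quoted in the thread
CONVERSION_PERCENT = 10        # share of users assumed to subscribe
PRICE_PER_MONTH = 15           # dollars, as assumed in the thread
ACTIVE_USER_DIVISOR = 10       # commenter's guess: ~10x fewer active users


def annual_revenue(users: int, conversion_percent: int, price_per_month: int) -> int:
    """Yearly subscription revenue in dollars under the stated assumptions."""
    subscribers = users * conversion_percent // 100
    return subscribers * price_per_month * 12


all_users_estimate = annual_revenue(GITHUB_USERS, CONVERSION_PERCENT, PRICE_PER_MONTH)
active_users_estimate = annual_revenue(
    GITHUB_USERS // ACTIVE_USER_DIVISOR, CONVERSION_PERCENT, PRICE_PER_MONTH
)

print(f"All users:    ${all_users_estimate:,}")     # $1,080,000,000
print(f"Active users: ${active_users_estimate:,}")  # $108,000,000
```

Counting only active users shrinks the estimate by the same factor of ten, which is why the "thousand times" ratio becomes roughly a hundred.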

That's reasonable. OTOH if Codex is as useful as some people say it is, it won't just be 10% of active users buying subscriptions and/or subscriptions might cost more than $15/mo, and/or people who aren't active on GitHub might also buy subscriptions.

Agreed. Part of the difficulty here is that you want to find who will buy a subscription and keep it. I expect a lot of people to try it, and most of them to drop it (either because they don't like it or because it doesn't help them enough for their taste) but no idea how to Fermi estimate that number.

Regarding your first point, I think when people say that language models "don't bring us closer to full code automation" they mean there's no way of improving/upgrading language models such that they implement full code automation. I think it would be better to argue against that claim directly instead of bringing up language models' productivity-boosting effects. There are many things that could potentially boost programmers' productivity -- better nootropics, say -- but it seems overly broad to say that they all "bring us closer to full code automation", even if it might be causally true that they reduce the time to automation in expectation.

The problem with arguing against that claim is that nobody knows whether transformers/scaling language models are sufficient for full code automation. To take your nootropics example, the analogy would be a world where nootropics were legal and had no negative side effects, where a single company gave "beta access" (for now) to a new nootropic in unlimited amounts at no cost to a market of tens of millions of users, where the data from using this nootropic was collected by the company to improve the product, where there actually were 100k peer-reviewed publications per year in the field of nootropics, and where most of the innovation behind the tech came from a >100B-parameter model trained on open-source nootropic chemistry instructions. Would such advancements be evidence for something major we're not certain about (e.g. high bandwidth brain computer interface) or just evidence for increased productivity that would be reinjected into more nootropic investments?

Thinking about it more there's another, more serious restriction, at least for now: Codex can't write code that depends on the rest of your codebase.  Consider the following lightly-anonymized code from a real-world codebase I contribute to:

def unobserved(self) -> Set[str]:
    """Returns a set of all unobserved random variable names inside this Distribution -- that is,
    those that are neither observed nor marginalized over.
    """
    return set(self.components) - self.observed - self.marginalized

I don't think Codex could write that code from just that function signature and that docstring, because a human couldn't do it: they wouldn't know how to find the names of the observed and marginalized random variables, and they wouldn't know that self.components exists or has to be explicitly converted into a set.  And if the human didn't know what random variables are, or what marginalization is, they'd have an even tougher time.

One can imagine a prompt that might elicit this implementation, something like "this is a Python function that returns the elements of set(self.components) that do not appear in self.observed or self.marginalized", but then we are not really writing docstrings anymore, we are writing programs in a novel language with a stochastic compiler.

This should be fixable with larger context windows, I think.  If the prompt for a method could include the whole class definition aside from that method, Codex could at least in principle use other class methods and variables in sensible ways.  But this will have to wait on the practical realization of O(n) or O(n log n) context window scaling.

I created a class initializing the attributes you mentioned, and when I added your docstring to your function signature it gave me exactly the answer you were looking for. Note that it was all on the first try, and that I did not think at all about the initialization for components, marginalized or observed; I simply auto-completed.

from typing import Set

class Distribution:
    def __init__(self):
        self.components = []
        self.marginalized = None
        self.observed = None

    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names inside this Distribution -- that is,
        those that are neither observed nor marginalized over.
        """
        return set(self.components) - set(self.observed) - set(self.marginalized)
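One caveat when trying this at home: with `observed` and `marginalized` left as `None`, calling `unobserved()` raises a `TypeError`, because `set(None)` is not valid. Below is a minimal runnable variant, assuming the attributes are meant to hold collections of variable names. The constructor signature is hypothetical and is not taken from the real codebase discussed above.

```python
from typing import Iterable, Optional, Set


class Distribution:
    def __init__(
        self,
        components: Iterable[str],
        observed: Optional[Iterable[str]] = None,
        marginalized: Optional[Iterable[str]] = None,
    ):
        # Hypothetical initialization: store names as concrete collections
        # so that the set arithmetic in unobserved() always works.
        self.components = list(components)
        self.observed = set(observed or ())
        self.marginalized = set(marginalized or ())

    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names inside this
        Distribution -- that is, those that are neither observed nor
        marginalized over.
        """
        return set(self.components) - self.observed - self.marginalized


dist = Distribution(["a", "b", "c", "d"], observed={"a"}, marginalized={"b"})
print(dist.unobserved())  # the set containing "c" and "d"
```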