Charbel-Raphaël Segerie and Épiphanie Gédéon contributed equally to this post. 
Many thanks to Davidad, Gabriel Alfour, Jérémy Andréoletti, Lucie Philippon, Vladimir Ivanov, Alexandre Variengien, Angélina Gentaz, Simon Cosson, Léo Dana and Diego Dorn for useful feedback.

TLDR: We present a new method for safer-by-design AI development. We think using plainly coded AIs may be feasible in the near future and may be safe. We also present a prototype and research ideas on Manifund.

Epistemic status: Armchair reasoning style. We think the method we are proposing is interesting and could yield very positive outcomes (even though it is still speculative), but we are less sure about which safety policy would use it in the long run.

Current AIs are developed through deep learning: the AI tries something, gets it wrong, then the error gets backpropagated and all its weights adjusted. Then it tries again, gets it wrong again, backpropagation again, and the weights get adjusted again. Trial, error, backpropagation, trial, error, backpropagation, ad vitam aeternam, ad nauseam.

Of course, this leads to a severe lack of interpretability: AIs are essentially black boxes, and we are not very optimistic about post-hoc interpretability.

We propose a different method: Constructability or AI safety via pull request.[1]

By pull request, we mean that instead of modifying the neural network through successive backpropagations, we construct and design plainly coded AIs (or hybrid systems) and explicitly modify their code using LLMs in a clear, readable, and modifiable way.

This plan may not be implementable right now, but might be as LLMs get smarter and faster. We want to outline it now so we can iterate on it early.

One possible long-term vision that constructability could lead to, in which we make use of a black-box superhuman coder to create code that we then audit and deploy.


If the world released a powerful and autonomous agent in the wild, white box or black box, or any color really, humans might simply get replaced by AI.

What can we do in this context?

  • Don't create autonomous AGIs.
  • Keep your AGI controlled in a lab, and align it.
  • Create a minimal AGI controlled in a lab, and use it to produce safe artifacts.
    • This post focuses on this last path, and the specific artifacts that we want to create are plainly coded AIs (or hybrid systems)[2].

We present a method for developing such systems with a semi-automated training loop.

To do that, we start with a plainly coded system (that may also be built using LLMs) and iterate on its code, adding each feature and correction as pull requests that can be reviewed and integrated into the codebase.

This approach would allow AI systems that are, by design:

  • Transparent: As the system is written in plain or almost plain code, the system is more modular and understandable. As a result, it's simpler to spot backdoors, power-seeking behaviors, or inner misalignment: it is orders of magnitude simpler to refactor the system to have a part defining how it is evaluating its current situation and what it is aiming towards (if it is aiming at all). This means that if the system starts farming cobras instead of capturing them, we would be able to see it.
  • Editable: If the system starts to learn unwanted correlations or features - such as learning to discriminate on feminine markers for a resume scorer - it is much easier to see it as a node in the AI code and remove it without retraining it.
  • Overseeable: We can ensure the system is well behaved by using automatic LLM reviews of the code and by using automatic unit tests of the isolated modules. In addition, we would use simulations and different settings necessary for safety, which we will describe later.
  • Version controllable: As all modifications are made through pull requests, we can easily trace with, e.g., git tooling where a specific modification was introduced and why.
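The pull-request loop described above can be sketched in a few lines. This is only a minimal illustration under our own assumptions: the names `PullRequest`, `Repo`, `pr_loop`, `review` and `run_tests` are hypothetical, and the reviewer and test runner, which would be LLM reviews and real unit tests in practice, are stubbed out as plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names here are hypothetical and only illustrate the loop.
Codebase = Dict[str, str]  # module name -> source code


@dataclass
class PullRequest:
    description: str   # why the change is proposed
    changes: Codebase  # module name -> new source


@dataclass
class Repo:
    code: Codebase = field(default_factory=dict)
    history: List[PullRequest] = field(default_factory=list)

    def merge(self, pr: PullRequest) -> None:
        self.code.update(pr.changes)
        self.history.append(pr)  # every merged change stays traceable


def pr_loop(
    repo: Repo,
    proposals: List[PullRequest],
    review: Callable[[PullRequest], bool],
    run_tests: Callable[[Codebase], bool],
) -> List[PullRequest]:
    """Merge only proposals that pass both review and tests; return the rest."""
    rejected = []
    for pr in proposals:
        candidate = {**repo.code, **pr.changes}  # codebase if the PR were merged
        if review(pr) and run_tests(candidate):
            repo.merge(pr)
        else:
            rejected.append(pr)
    return rejected


# Stub reviewer and test runner (assumptions standing in for LLM review and unit tests).
def review(pr: PullRequest) -> bool:
    return "eval(" not in "".join(pr.changes.values())


def run_tests(code: Codebase) -> bool:
    return all(src.strip() for src in code.values())


repo = Repo()
proposals = [
    PullRequest("add scorer", {"scorer": "def score(x): return len(x)"}),
    PullRequest("shortcut", {"scorer": "def score(x): return eval(x)"}),
]
rejected = pr_loop(repo, proposals, review, run_tests)
print([pr.description for pr in repo.history], [pr.description for pr in rejected])
```

The point of the design is that nothing reaches `repo.code` except through `merge`, so the history is a complete record of what changed and why.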

In practice, we would first use hybrid systems that rely on shallow, specialized networks that we can understand well for some small tasks, and then iterate on them:

If plain code is too hard, we could also use shallow networks to bridge the gap between low-level and medium-level features. 
The continuum from Deep Learning to Constructability
Example of a hybrid system in practice, to make a car.

Overall, we want to promote an approach like Comprehensive AI Services: Having many specialized systems that do not have full generality, but that may compose together (for instance, in the case of a humanoid housekeeper, having one function to do the dishes, one function to walk the dog, …). Our hope is to arrive at a method to train models that outperform opaque machine learning on some important metrics (faster inference, faster and more modifiable training, better data efficiency, and more modifiable code) while still being safer.

Okay, now your reaction should be: “Surely this just won’t work”. 

Let’s analyze this: why we think this approach is feasible and how safe it would be.

Would it be feasible?

Track record of automated systems

Our idea is nothing short of automating and generalizing something humans have been doing for decades: creating expert narrow systems.

For example, Stockfish is a superhuman chess engine that did not use deep learning before 2020. It was quite understandable then and has an automatic system for testing pull requests.

In particular, note that Stockfish improved by more than 700 Elo points during this period while keeping its code length about constant[3], which gives significant credence to the claim that it might just be possible to iterate on a system and make it superhuman without having the codebase explode in size.

AIs have also been able to create explicit code for features we had only been able to express via deep learning so far. For example, in Learning from Human Preferences it seemed like getting the essence of a proper backflip in a single hand-crafted function would always be inferior to Reinforcement Learning from Human Feedback:

RLHF learned to backflip using around 900 individual bits of feedback from the human evaluator.
Manual reward crafting: “By comparison, we took two hours to write our own reward function (the animation in the above right) to get a robot to backflip, and though it succeeds, it’s a lot less elegant than the one trained simply through human feedback.”

But, since then, we have seen Eureka, which generates reward functions that outperform expert human-engineered rewards:


Like Stockfish, Eureka continues improving while keeping its reward function short:

Eureka progressively produces better rewards that eventually exceed human-level by combining large-scale reward search with detailed reward reflection feedback.

Eureka is very similar to what we want to do. The difference is that instead of writing only the reward functions, we would explicitly write all of the agent's code.

Voyager in Minecraft is even closer to what we are proposing: an agent that interacts with the world and that codes functions to broaden its abilities. In Voyager, you can read the lines of code generated by GPT, you know what it can do and what it can't do, and you have much more control than with reinforcement learning.

Figure from Voyager, annotated in red for what we want to do to adapt it for constructability. The training phase would involve coding an agent and a skill library, having LLMs review its performance, and having humans filter the library.

The main difference with us is that Voyager codes functions on the fly, function by function, while we would validate the whole codebase before unleashing the agent, and we would remove dangerous skills like "Combat humans" beforehand. No continuous learning.
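To make the difference with Voyager concrete, here is a minimal sketch of a skill library that is filtered by reviewers and then frozen before deployment. The skill names and the `build_frozen_library` helper are hypothetical, and the skills are stand-in functions rather than real agent code.

```python
from types import MappingProxyType
from typing import Callable, Dict, Mapping, Set

Skill = Callable[[], str]


def build_frozen_library(
    candidates: Dict[str, Skill], rejected_by_reviewers: Set[str]
) -> Mapping[str, Skill]:
    """Keep only approved skills and return a read-only view of them."""
    approved = {
        name: fn for name, fn in candidates.items()
        if name not in rejected_by_reviewers
    }
    # MappingProxyType blocks item assignment, so no skill can be added
    # through this view after deployment (no continuous learning).
    return MappingProxyType(approved)


# Hypothetical skills produced during the training phase.
candidates: Dict[str, Skill] = {
    "mine_wood": lambda: "wood",
    "craft_table": lambda: "table",
    "combat_humans": lambda: "attack",
}

library = build_frozen_library(candidates, rejected_by_reviewers={"combat_humans"})
print(sorted(library))

try:
    library["new_skill"] = lambda: "anything"
except TypeError as err:
    print("library is frozen:", err)
```

This only shows the shape of the idea: the review step happens once, over the whole library, and the deployed agent receives a view it cannot extend.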

Track records of humans

Besides chess engines, it is possible to create systems that tackle useful tasks in plain-code:

  • Watson: Watson is an expert system capable of answering questions in natural language. It won Jeopardy against the champions in 2011!
  • Moon landing: Humans were able to create the autopilot for the Moon landing.
  • Face detectors: Humans created face detectors before deep learning (which you might know if you used one of those old digital cameras back in 2005).
  • LanguageTool: LanguageTool is a grammar and spell-checking software that was started in 2003 and now has 75k commits.
  • Wolfram Alpha: an engine that can answer questions, solve problems, and provide insights across a wide range of topics, including mathematics, science, engineering, and more. It uses a vast collection of algorithms and knowledge curated by experts.[4] 
An example of a hint in Jeopardy is, "This 'Father of Our Country' didn't really chop down a cherry tree," to which the correct response is, "Who is/was George Washington?"

But, you might say, humans have not been able to solve Go or Imagenet without deep learning.

The Crux

On the one hand, humans are not particularly good and fast at coding, so plain-code approaches to Go or Imagenet might actually work well if coded with competent models. As AI becomes more advanced and potentially transformative, it may be capable of coding systems as complex as Google.

For now, Devin, an automated software engineer, was released only last month, and it seems likely that we are headed in that direction:

"GPT2030 will likely be superhuman at various specific tasks, including coding, hacking, and math, and potentially protein design [...] The organization that trains GPT2030 would have enough compute to run many parallel copies: I estimate enough to perform 1.8 million years of work when adjusted to human working speeds."

From What will GPT-2030 look like

Watson took about 100 person-years[5], and 1.8M years of work is in the ballpark of the effort put into the Google codebase.[6]

On the other hand, it may be the case that coding a system better than AlphaZero at Go from scratch proves extremely difficult compared to coding the entirety of the Google codebase.

Whether it is even possible to code a system that beats AlphaZero or GPT-2 with plain code or a hybrid system, as opposed to fully connected systems like transformers, seems like a central crux that we name “non-connectionism scalability”: how necessary is it for models to be connectionist for their performance to be general and human-like, as opposed to something more modular and explainable?

Having a plain-coded model that beats AlphaZero may not be as impossible as it sounds. For instance, this paper has succeeded in extracting superhuman chess concepts from AlphaZero and teaching those concepts to chess grandmasters. This shows that it is possible to have well-encapsulated concepts learned from AlphaZero, even though we still need to code them. More generally, the human brain is complex, but only finitely complex, so it might be possible to extract relevant learned concepts from neural networks iteratively.

There are 100M lines of code in MacOS. If someone learns one new concept each minute for 20 years, that's only 10M concepts. If you need 10 lines of code per concept, that's also 100M lines of code, and we will be helped by AIs to code them.
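The back-of-the-envelope figures in this section can be reproduced with a few lines of arithmetic. The inputs are the ones stated in the text and in footnotes 5 and 6; the only assumption we add is that "one concept each minute" counts every minute of the day, including nights.

```python
# Concept estimate from the paragraph above.
MINUTES_PER_YEAR = 365 * 24 * 60          # assumes learning around the clock
concepts = 20 * MINUTES_PER_YEAR          # one concept per minute for 20 years
LINES_PER_CONCEPT = 10
lines_of_code = concepts * LINES_PER_CONCEPT

# Work estimates, in person-years.
watson_person_years = 15 * 5              # footnote 5: team of ~15 for ~5 years
google_person_years = 30_000 * 20         # footnote 6: 30k engineers for 20 years
gpt2030_person_years = 1_800_000          # figure quoted from "What will GPT-2030 look like"

print(f"concepts learned: {concepts:,}")
print(f"lines of code:    {lines_of_code:,}")
print(f"Watson:           {watson_person_years} person-years")
print(f"Google estimate:  {google_person_years:,} person-years")
print(f"GPT-2030 / Google: {gpt2030_person_years / google_person_years:.1f}x")
```

This gives roughly 10.5M concepts and 105M lines of code, about 75 person-years for Watson, and a GPT-2030 work budget about three times our Google estimate, so "in the ballpark" is meant loosely.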

What about the Bitter Lesson? We think it is not in contradiction with our plan. The Bitter Lesson is mostly saying: “Compute + Search is necessary”. But it’s not saying “Deep Learning is necessary”. And we would follow the Bitter Lesson because we want to leverage compute and search at scale.

Would it be safe?

Epistemic status: Uncertain. This part will explore some potential plans and settings that would use this method. However, that plan is distinct from the method of constructability itself.

Setting for safety

Summary of the approach 

Minimal super-coder: We already have super-powerful systems for generating images, movies, and music, playing Go, and translating, all without dying, which is already kind of surprising. We previously thought that full agency would be necessary to create good music. This is not the case. Similarly, we want to create an automatic GitHub issue resolver. This is kind of dangerous, but it can be seen as a translator, not from English → French, but something like issue → pull request. It would be as strong as a very good human, but not as strong as the best human, except that it does it at scale, full throttle.

Coordination: We need a multinational agreement requiring all “powerful” AIs to be boxed in AI labs.

AI control: We need a very powerful coder-AI to code our AI, but the only current way to create such an AI today is via deep learning, and we cannot guarantee its safety. This is why we need to box this AI, which could be done with the AI control plan.

Audit of the code: The code produced by this AI needs to be verified, but the code might be huge. This is why we need scalable oversight. Only the audited plain-code AI would then be deployed. We could have many different plain-coded AI for different types of tasks.

Software engineering best practices: The code should not be merged if it does not follow the best practices from Google or whatever. For example, integrating many unit tests, the same way SQLite is very thoroughly tested: for every line of code, they have 590 lines of tests. And it's a beautiful story because even though they didn't use provably safe code, they still managed to create one of the most secure and widely deployed applications in the world. To create these unit tests, we can use LLMs themselves and then start trying to break them to find holes with aggressive Reinforcement Learning like in this paper.

Simulations: It may be costly to iterate directly in the real world. Suppose you have a robot that learns to do the dishes. In that case, you can first iterate in a simulation of the world, like SORA or Genie (e.g., by prompting: “The left arm moves left 20 degrees up” → oh no, the film shows that the plate breaks and kills a human 🙁). Then, after a few iterations in the simulation and after careful human auditing, you could deploy it in the real world. The simulation itself would be checked for bugs by automatic reviews and subsequently fine-tuned. Using a simulation would further help reduce the chance that something terrible happens in real life.

Possible concerns

Unstable coordination: Labs need to refrain from directly using powerful AIs in the wild, which seems very unstable. The world does not seem to be headed in this direction, but we think coordination would be necessary even if our plan did not exist.

Deep deception: A working hypothesis we have made so far is that we can neglect deep deception. To prevent deep deception for a general superhuman system, the only way we know of is to write formal proofs inside giant simulations (a la davidad). However, we think deep deception is probably overestimated as a concern and this is a working hypothesis for most plans.

Code leak: If there is a code leak, even with transparency, a powerful system remains dangerous in the hands of bad actors.

Improved capabilities: If this plan does work, there is a chance this could boost capabilities and lead to Recursive Self-Improvement or compute overhang, which would render the situation hard to control. That said, we don't think our plan will create a more powerful AI than deep learning itself, and as this project is still a moonshot, we are not too worried about this yet (if we get something working, we will open the champagne). But it might be the case that a plain-code system could sometimes allow an absurd inference speed or data efficiency.

Misgeneralization and specification gaming: One worry is that the AI code starts to evolve against the wrong specification or goal, for instance, when it learns “go to the right” instead of “target the coin”. We believe this can be avoided with refactors, along with LLMs explaining the heuristics used and simulating the agent’s behaviour well.
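As a toy illustration of why plain code helps here, consider the “go to the right” versus “target the coin” case on a one-dimensional level. Everything below is a hypothetical sketch (the level format, the `reaches_coin` helper and both policies are ours): the two policies are indistinguishable on training levels where the coin is always on the right, but a single held-out level separates them, and reading the code of `go_right` already reveals that it ignores the coin.

```python
from typing import Callable, List, Tuple

Policy = Callable[[int, int], int]  # (agent_x, coin_x) -> step in {-1, 0, +1}


def go_right(agent_x: int, coin_x: int) -> int:
    # Misgeneralized heuristic: never looks at the coin position.
    return 1


def target_coin(agent_x: int, coin_x: int) -> int:
    # Intended behaviour: move toward the coin.
    if coin_x > agent_x:
        return 1
    if coin_x < agent_x:
        return -1
    return 0


def reaches_coin(policy: Policy, start: int, coin: int,
                 width: int = 10, max_steps: int = 20) -> bool:
    """Simulate the policy on a line of `width` cells with walls at both ends."""
    x = start
    for _ in range(max_steps):
        if x == coin:
            return True
        x = max(0, min(width - 1, x + policy(x, coin)))
    return x == coin


def success_rate(policy: Policy, levels: List[Tuple[int, int]]) -> float:
    return sum(reaches_coin(policy, s, c) for s, c in levels) / len(levels)


training_levels = [(0, 9), (2, 9), (5, 9)]  # coin always at the far right
held_out_levels = [(9, 0), (5, 2)]          # coin to the left of the agent

for policy in (go_right, target_coin):
    print(policy.__name__,
          success_rate(policy, training_levels),
          success_rate(policy, held_out_levels))
```

A reviewer, human or LLM, could turn the held-out levels into a unit test that blocks any pull request whose policy fails them.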

Compared to other plans

While there are many valid criticisms of this plan in the absolute sense, we believe this approach is as reasonable as other agendas, for instance, in comparison with:

  • Interpretability: If plain code AI works, it would be at least as safe as solving interpretability.
  • Davidad’s bold plan: Our plan is similar to Davidad’s, but he wants proof; we only want transparency. Creating legible code should be orders of magnitude easier than proving it. 
  • Other safe-by-design approaches: Most safe-by-design approaches seem to rely heavily on formal proofs. While formal proofs offer hard guarantees, they are often unreliable because their model of reality needs to be extremely close to reality itself and very detailed to provide assurance.
  • CoEm: More info on CoEm here. Both CoEm and the plan we describe rely on the notion of compositionality. However, they are more interested in making a powerful language model, while we don’t specifically focus on any kind of system.
  • OpenAI’s Superalignment: Instead of creating a system that creates zillions of AI safety blog posts, we create zillions of lines of code. But we think it would be easier for us to verify the capabilities of our systems than for OpenAI to verify the plans that are created, and that our metrics are more straightforward.

Getting out of the chair

So far, our post has discussed technologies and possibilities in the near future. The question of what is possible right now is still open, but many hypotheses are already testable:

  • Scalability: How much can this loop actually work and learn complex patterns, especially for concepts that are not easily encoded, like image recognition?
  • Compositionality: A core aspect of our plan is that we can decompose a complex neural network into many isolated and reviewable parts. This may not necessarily hold.
  • Understandability: Even if compositionality does hold, is it easy to understand parts and subparts of the model?
  • Maintainability: Is it actually possible for this PR loop to be reviewable and does it lead to well-encapsulated features? In addition, is it overseeable, and how unmaintainable does the codebase become?

To test these assumptions, we have explored what it could look like to make an ImageNet recognizer that does not use too much deep learning (up to 3 layers and 1000 neurons per network).

From Zoom In: An Introduction to Circuits
We made the assumption that it was possible to construct circuits similar to deep-learned ones from the ground up. This convolution net uses the ontology “Window + Car body + Wheels → Car.” 

The overall idealized and automatic process as we have envisioned it is:

  1. Construct an ontology to recognize a class of image
  2. Segment images and sort the segments into different classes according to this ontology
  3. Train composited shallow networks on the segments
  4. Use an automatic PR loop that contains visualization of the trained network
  5. Profit

Constructing an ontology

The idea of an ontology for image recognition is to have a graph that describes how features compose together. 

We wanted to see if Claude could construct this ontology by first providing it with one, and then iterating to integrate the specifics of each image in the training dataset:

We used Claude to modify the ontology according to each image specifically

The resulting ontology does seem to hold fairly well. In particular, (construction company information + safety warnings) → signs posted → temporary fence is exactly the sort of composition we had in mind.

One concern is that it could grow out of hand. For instance, in the process above, the ontology grows linearly. There is probably a way to have Claude refactor this ontology every n images in a way that keeps all important elements.

Segmenting the images

Using segment-anything we have been able to obtain fairly well segmented images: 



From left to right: Petal, disks, flower head, leaf and stems.

In particular, we have focused on how to recognize flowers (ImageNet class n119) with this process.

For n119, we have used the following simple ontology:

  • n119
    • Flower head
      • Petal
      • Disk
    • Leaf
    • Stem
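The ontology above can be written down as a small graph. This is only a sketch of one possible representation (a plain dictionary from each feature to its parts; the helper names are hypothetical). It shows how the leaf features, and a bottom-up order in which the composed detectors could be trained, follow directly from the structure.

```python
from typing import Dict, List

# Feature -> parts it is composed of, copied from the n119 ontology above.
ONTOLOGY: Dict[str, List[str]] = {
    "n119": ["flower_head", "leaf", "stem"],
    "flower_head": ["petal", "disk"],
    "petal": [],
    "disk": [],
    "leaf": [],
    "stem": [],
}


def leaves(ontology: Dict[str, List[str]], root: str) -> List[str]:
    """Features with no parts: these are trained directly on image segments."""
    children = ontology[root]
    if not children:
        return [root]
    found: List[str] = []
    for child in children:
        found.extend(leaves(ontology, child))
    return found


def training_order(ontology: Dict[str, List[str]], root: str) -> List[str]:
    """Post-order traversal: every part comes before the feature built on it."""
    order: List[str] = []
    for child in ontology[root]:
        order.extend(training_order(ontology, child))
    order.append(root)
    return order


print(leaves(ONTOLOGY, "n119"))
print(training_order(ONTOLOGY, "n119"))
```

Keeping the ontology as data also makes the refactoring concern concrete: a refactor is an edit to this dictionary that can itself go through review.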

Here are the results:

1. Constructing the Flower head detector

We can check the network part by part. For example, here the petal detector is doing what we want, but the disk detector is wrong on this image. Let's do another pull request to patch for this!
  • Flower head: 85% acc - 25 parameters (579 positive samples)
    • Petal: ~90% acc - 960 parameters (2000 positive samples)
    • Disk: ~90% acc - 960 parameters (300 positive samples)

Compositionality does seem to hold: from one network trained specifically on petals/not-petals, and one trained specifically on disks/not-disks, we have been able to train one that recognizes flower heads with 85% accuracy.

  • More interestingly, we have been able to swap the petal recognizer network with another petal recognizer trained on a different seed and a slightly different dataset, without a noticeable drop in accuracy! This shows that it might be possible to have a well-composed system that can be reviewed and refactored without requiring retraining other parts of the network.
  • Furthermore, the convolution networks were only trained on the segments themselves, against a black background, and did not have any positive sample of them in context (for instance the petal network was tasked with classifying lone petals). Yet some of them can recognize the features they were trained on fairly well.
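The swap described above relies on the part detectors sharing an interface with the flower-head combiner. The following toy sketch illustrates that idea only; it is not the prototype's code. The "detectors" are hand-written stand-ins over made-up features (`elongation`, `roundness`), and the weights and threshold are arbitrary assumptions.

```python
from typing import Callable, Dict

Features = Dict[str, float]
Detector = Callable[[Features], float]  # returns a score in [0, 1]


def petal_v1(f: Features) -> float:
    return f["elongation"]


def petal_v2(f: Features) -> float:
    # Stand-in for a petal network retrained with another seed and dataset:
    # slightly different scores, same interface.
    return min(1.0, 0.9 * f["elongation"] + 0.1)


def disk(f: Features) -> float:
    return f["roundness"]


def make_flower_head(petal: Detector, disk: Detector,
                     w_petal: float = 0.5, w_disk: float = 0.5,
                     threshold: float = 0.6) -> Callable[[Features], bool]:
    """Combine part scores with fixed weights; the parts can be swapped freely."""
    def flower_head(f: Features) -> bool:
        return w_petal * petal(f) + w_disk * disk(f) >= threshold
    return flower_head


samples = {
    "daisy": {"elongation": 0.9, "roundness": 0.8},
    "pebble": {"elongation": 0.1, "roundness": 0.9},
}

head_v1 = make_flower_head(petal_v1, disk)
head_v2 = make_flower_head(petal_v2, disk)  # swap the petal detector only

for name, f in samples.items():
    print(name, head_v1(f), head_v2(f))
```

In this toy case both versions agree on every sample, which is the property we observed empirically when swapping the real petal networks.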

2. Constructing the Full Model

  • n119 - 25 parameters
    • Flower head: 85% acc - single convolution - 25 parameters
      • Petal
      • Disk
    • Leaf: ~90% acc - CNN - 960 parameters
    • Stem: ~90% acc - CNN - 960 parameters

We have been able to scale to 76% accuracy on a balanced dataset of n119 and non-n119, with networks each having fewer than three convolution layers and fewer than 1000 parameters, compared to pure deep learning, which reaches 92% with 1000 parameters and three convolution layers.
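To give a sense of how small a sub-1000-parameter convolutional network is, here is the standard parameter count for a convolution layer applied to a hypothetical stack. The layer shapes below are illustrative assumptions, not the architecture used in the prototype, and the 5×5 single-channel kernel is just one of several shapes that happens to give 25 weights.

```python
from typing import List, Tuple


def conv_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights of a square 2D convolution: c_in * c_out * k * k, plus one bias per output channel."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


# Hypothetical three-layer stack: (in channels, out channels, kernel size).
layers: List[Tuple[int, int, int]] = [(3, 8, 3), (8, 8, 3), (8, 1, 3)]

per_layer = [conv_params(c_in, c_out, k) for c_in, c_out, k in layers]
total = sum(per_layer)
print(per_layer, total, total < 1000)

# One shape with exactly 25 weights (an assumption, see lead-in).
print(conv_params(1, 1, 5, bias=False))
```

With these shapes the stack has 881 parameters, which fits the budget of fewer than 1000 parameters per network.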

More details in this GitHub.


We think the argument we made for feasibility is very reasonable for the path towards Comprehensive AI Services in the setting that we describe, and that the argument we made for safety is a bit weaker.

We think that constructability is neglected, important, and tractable. If you like this idea and want to see it scale, please upvote us on this Manifund, which contains many more research ideas that we want to explore.

Of course, the priority of the Centre pour la sécurité de l'IA will remain safety culture as long as people in the world continue aiming blindly towards the event horizon, and we think this plan is one way to promote it.

Work done in the Centre pour la sécurité de l'IA - CeSIA.

  1. ^

    We debated the name and still are unsure which fits better. Do let us know about better names in the comments.

  2. ^

    Ideally, we would want something only written in plain code. For now, however, as deep learning is likely necessary to create capable systems, we will also discuss hybrid models using shallow neural networks aimed at very narrow tasks, whose training data makes it unlikely that they have learned anything more complex. Of course, the fewer neural networks and heuristics, the better.

  3. ^

    Using its GitHub to evaluate its number of lines of code from sf-1.0 to sf-10, we can see that it stagnates at about 14k. On the other hand, using computerchess we can see that its Elo went from 2748 to 3528.

  4. ^

    Wolfram wanted to do something similar to us, but using only expensive human developers.

  5. ^

    From Wikipedia: a team of ~15 for ~5 years.

  6. ^

    30k engineers for 20 years = 0.6 million years of work.

  7. ^

    For instance, something like the critical level in the RSP.

  8. ^

    We need a deep learning simulation because we think it is too hard to create a complete sim in plain code. Even GPT-V is incomplete.

  9. ^

    Thanks to Davidad for suggesting this idea.

  10. ^

    Of course, we think the global priority should not be working on this plan, but should remain safety culture as long as people in the world continue aiming blindly to create autonomous replicators in the wild.

  11. ^

    We wanted to automatically sort the segments in their designed classes with either of LLaVA or Claude. However, this proved too unreliable, and we have resorted to sorting the segments manually for this prototype.

  12. ^

    ImageNet class n11939491, composed mostly, but not exclusively, of daisy flowers.
