Gradient Descent on the Human Brain

gaspode

TL;DR: Many alignment research proposals share a common motif: figure out how to enter a basin of alignment / corrigibility for human-level models, and then amplify to more powerful regimes while generalizing gracefully. In this post we lay out a research agenda that comes at this problem from a different direction: if we already have ~human-level systems with extremely robust generalization properties, we should just amplify those directly. We’ll call this strategy “Gradient Descent on the Human Brain”.

Introduction

Put one way, the hard part of the alignment problem is figuring out how to solve ontology identification: mapping between an AI’s model of the world and a human’s model, in order to translate and specify human goals in an alien ontology.

In full generality, and in the worst case, this is a pretty difficult problem. But is solving this problem necessary to create safe superintelligences? The assumption that you need to solve for arbitrary ontologies holds only if the way to get to superintelligence necessarily routes through systems with different ontologies. By contrast, we don’t need to solve ontology translation for high-bandwidth communication with other humans^[1].

Thus far, we haven’t said anything really novel. The central problem with this approach, as any alignment researcher would know, is that we don’t really have a good way to bootstrap the human brain to superintelligent levels. There have been a few recent attempts to approach this, though they focus on very prosaic methods that, at best, buy points on the margin. Scaling to superintelligence requires much stronger and more robust methods of optimization.

The Setup

The basic setup is pretty simple, though there are a few nuances and extensions that are hopefully self-explanatory.

The simple version: Take a hundred human brains, put them in a large vat, and run gradient descent on the entire thing.

The human brain is a remarkably powerful artifact for its size, so finding a way to combine the capabilities of a hundred human brains with gradient descent should result in something significantly more powerful. As an intuition pump, think of how powerful human organizations are despite having significantly shallower communication bandwidth. At the very least we should be able to surpass this; more impressive versions could look like an integrated single mind that combines the capabilities of all hundred brains.

The specifics of what the training signal should be are, I think, a rather straightforward engineering problem. Some pretty off-the-cuff ideas, in increasing order of endorsement:

Train them for specific tasks, such as Pong or Doom. This risks loss of generality, however.
Train them to predict arbitrary input signals from the environment. The brain is pretty good at picking up on patterns in input streams, which this leverages to amplify latent capabilities. This accounts for the problem with lack of generality, but may not incentivize cross-brain synergy strongly.
Train them to predict each other. Human brains being the most general-purpose objects in existence, this should be a very richly general training channel, and incentivizes brain-to-brain (B2B) interaction. This is similar in spirit to HCH.
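
As a toy illustration of the third option, here is a minimal sketch in which each "brain" is reduced to a single scalar weight and trained by gradient descent to predict the mean output of the others. The function name `predict_each_other`, the scalar model, the inputs, and the learning rate are all assumptions made for illustration; none of them come from the proposal itself.

```python
def predict_each_other(weights, inputs, lr=0.1, steps=200):
    """Toy B2B training: brain i outputs weights[i] * x and is trained to
    match the mean output of the other brains (targets held fixed per step)."""
    w = list(weights)
    n = len(w)
    for _ in range(steps):
        grads = [0.0] * n
        for x in inputs:
            outs = [wi * x for wi in w]
            total = sum(outs)
            for i in range(n):
                target = (total - outs[i]) / (n - 1)  # mean of the others
                # gradient of 0.5 * (out_i - target)^2 with respect to w_i
                grads[i] += (outs[i] - target) * x
        # simultaneous update of every brain, averaged over the inputs
        w = [wi - lr * g / len(inputs) for wi, g in zip(w, grads)]
    return w


# Three hypothetical brains with different starting weights.
final = predict_each_other([1.0, 2.0, 6.0], inputs=[1.0, -1.0, 0.5])
```

In this toy the weights simply drift to their initial average, which hints that mutual prediction on its own rewards agreement rather than capability; a real training signal would presumably need to be mixed with an external input stream.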

A slightly more sophisticated setup:

We believe this diagram speaks for itself.

Aside: Whose brains should we use for this?

The comparative advantage of this agenda is the strong generalization properties inherent to the human brain^[2]. However, to further push the frontier of safety and allow for a broad basin of graceful failure, we think that the brains used should have a strong understanding of alignment literature. We’re planning on running a prototype with a few volunteer researchers - if you want to help, please reach out!

Potential Directions

More sophisticated methods

In light of recent excitement over sparse auto-encoders (SAEs), one direction we think would be interesting would be training SAEs on the human brain and seeing whether we can get more targeted amplification.

Outreach

We believe that this agenda also aids in community outreach. For instance, e/accs seem unlikely to gain any real political clout due to lacking the mandate of heaven, but we can certainly get them on board with this idea as accelerating the literal next stage of human evolution.

Alternate optimization methods

For reasons beyond the scope of this post, we’re also excited about the potential of training human brains using the forward-forward algorithm instead of standard backpropagation.

Appendix

This contains some rough notes on more detailed sketches. Some of them are pretty hand-wavey, but it seems better to put them up than not for legibility.

Toy examples

Initial Sketch (Simple Case, Maximum Technology Assumed)

Assumptions:
- A "magic" read-write Brain-Computer Interface (BCI) capable of arbitrary manipulation of neural ultrastructure.
- A complete, fully parameterized model of the human brain's connectome.
- Approximately a 1:1 mapping between model parameters and neural ultrastructure, considered trivial for the purpose of this sketch.
Steps:
- Read the state and dynamics from the brain into the model.
- Perform gradient descent on the whole brain emulation model to achieve desired outcomes.
- Write the state and dynamics from the model back to the brain via the 1:1 mapping.
Problems:
- A BCI with this level of capability likely requires advances in nanotechnology and possibly superintelligence.
- Whole brain emulation (WBE) with this level of fidelity requires significantly more research and computational resources.
- These, however, seem like relatively straightforward engineering problems.
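
The three steps above can be sketched as a read, optimize, write loop. This is only a minimal sketch under the maximum-technology assumptions: `ToyBrain` is a hypothetical stand-in for a brain behind the "magic" BCI, the "desired outcome" is a made-up target vector, and the quadratic loss is chosen purely so the gradient is easy to write down.

```python
class ToyBrain:
    """Hypothetical stand-in for a brain behind a 'magic' read-write BCI."""

    def __init__(self, state):
        self._state = list(state)

    def read(self):
        return list(self._state)

    def write(self, state):
        self._state = list(state)


def loss(params, target):
    # Assumed objective: squared distance to the desired outcome.
    return sum((p - t) ** 2 for p, t in zip(params, target))


def grad(params, target):
    # Analytic gradient of the squared-distance loss.
    return [2.0 * (p - t) for p, t in zip(params, target)]


def gradient_descent_on_brain(brain, target, lr=0.1, steps=100):
    params = brain.read()                  # step 1: read the brain into the model
    for _ in range(steps):                 # step 2: gradient descent on the emulation
        g = grad(params, target)
        params = [p - lr * gi for p, gi in zip(params, g)]
    brain.write(params)                    # step 3: write back via the 1:1 mapping
    return loss(params, target)


brain = ToyBrain([0.0, 5.0])
final_loss = gradient_descent_on_brain(brain, target=[1.0, -1.0])
```

The 1:1 mapping assumption is what lets `read` and `write` be plain copies here; everything hard about the agenda is hidden inside those two methods.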

Initial Sketch (Harder, Advanced Technology Assumed)

Assumptions:
- Still utilizing a "magic" read-write BCI, but now limited to small or incremental changes to neural ultrastructure.
- A high-accuracy predictive model of brain dynamics, which is trained jointly with an encoder and decoder to efficiently compress and decompress brain state sequences into a latent space.
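
One way to picture the "small or incremental changes" constraint is as a cap on the size of each write. The sketch below is an assumption of ours rather than anything specified in this post: it clips every gradient step to a maximum norm (`max_norm`, a hypothetical BCI limit) before applying it, so the state is moved toward the target in many small writes instead of one large one.

```python
import math


def clip_step(step, max_norm):
    """Rescale a proposed update so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(s * s for s in step))
    if norm <= max_norm or norm == 0.0:
        return list(step)
    scale = max_norm / norm
    return [s * scale for s in step]


def incremental_update(state, target, lr=0.25, max_norm=0.1, steps=200):
    """Gradient descent on squared distance, with every write clipped."""
    state = list(state)
    for _ in range(steps):
        # Raw gradient step for the loss sum((s - t)^2).
        raw = [-lr * 2.0 * (s - t) for s, t in zip(state, target)]
        delta = clip_step(raw, max_norm)   # respect the assumed BCI write limit
        state = [s + d for s, d in zip(state, delta)]
    return state


result = incremental_update([3.0, 4.0], [0.0, 0.0])
```

Far from the target every write has norm exactly `max_norm`; once the raw step falls under the cap, the loop behaves like ordinary gradient descent.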

How to run gradient descent on the human brain (longer version)

Build and train a model capable of fine-grained Whole Brain Emulation (WBE).
- Architecture Sketch: Adaptive Variational Neural Diffusion Autoencoder
  - Neural Encoder & Decoder: Utilize neural transformers. The encoder has the causal mask removed from transformer layers as demonstrated by Tu et al. 2022, jointly trained to compress/decompress neural states (thus capturing neural dynamics) to/from latent neural patches. This approach is inspired by OpenAI's Sora but adapted for brains.
  - Auxiliary Decoder: Jointly trained with the architecture, this component learns to map latent neural patches to GPT-5's latent space. It ensures that the text generated by the human during the model's training period is given a high probability.
  - Diffusion Transformer: Applied on latent neural patches, functioning autoregressively to model dynamics and denoising to manage uncertainty from noisy or imperfect BCI data.
- Additional Notes:
  - There may be a need for a more modular architecture. Initial thoughts on potential directions include:
    - Considering the neocortex as a hierarchical mixture of linear operators (LoRAs).
    - Employing many small models, similar to the architecture described above, to model different specialized brain regions.
    - Implementing online learning LoRA-(H)MoE for specializing the base neural model into specific brain regions.
  - It's conceivable to use eye tracking to collect data on what the human is reading, aiding in training an auxiliary encoder from GPT-5’s latent space to latent neural patches. This could significantly enhance reading speeds.
  - The data problem:
    - MRI and fMRI methods are costly and provide limited contextual data.
    - EEG data may be highly unreliable.
    - The development of reliable high-bandwidth (invasive) BCIs is essential.
Develop high-bandwidth invasive read/write BCIs.
Conduct R&D with neural organoids.

Neural gradient descent: Organoid edition

Grow a neural organoid around BCI scaffolding for maximum bandwidth.
Hook this organoid up to perform a predictive task (see Kagan et al. (2022) for a little inspiration), i.e., predicting a time series [x₀, x₁, …, x_T].
Record high-quality neural data with the BCI.
- Initially, the organoid may perform poorly on the task without a reward signal. The focus is on capturing the neural dynamics.
Interleave the following steps:
- Train the WBE model (neural encoder Enc_O, neural diffusion transformer WBE_O, neural decoder Dec_O) on neural data read from the organoid. From a series of neural states [w₀, w₁, …, w_t], encoded into latent neural patches by Enc_O, it can generate (via denoising with attention) new patches which Dec_O can decode to a continuation of neural states [w̃_t+1, w̃_t+2, …, w̃_T], where w̃ ~ WBE_O(·|w_0:t) and T denotes the time horizon.
- Fix WBE_O and train the auxiliary encoder (Enc_aux)/decoder (Dec_aux) jointly. This maps between WBE_O latent neural patches and the task dataset. The mapping ensures a particular set of latent neural patches predicted by WBE_O corresponds to a distribution over next task data points conditioned on observed data, i.e. π(x̃_t+1|x₀, x₁, …, x_t; np_H).
- Now for the fun part: fix Enc_aux and Dec_aux. Encode task data history x_0:t with Enc_aux to get corresponding WBE_O latent neural patches representing the dynamics from time 0 to t (assume the whole task fits in WBE_O’s context window for simplicity). WBE_O predicts the latent neural dynamics at time t+1, and Dec_aux predicts x̃_t+1.
  - Because our precious organoid is but newly birthed into the cursed world of matter and has yet to receive the gift we will soon bestow upon it, this will be, at least at first, a horrifically suboptimal prediction. But not to worry, we will fix this shortly.
- Calculate this loss. Do not avert your eyes, no matter the temptation to do so, for we must learn to face horrors such as these if we are to solve alignment. But rejoice, rejoice as the realization dawns upon you: we are now en route to gradient descent on the (organoid) brain.
- Calculate the loss gradient wrt WBE_O, and perform a gradient descent update on WBE_O. With WBE_O now improved, obtain a new prediction of the next timestep of the brain w̃⁺_t+1.
- At last, the moment of truth: take w̃⁺_t+1. Handle it with care, for the fate of our organoid, and by extension the fate of humanity itself, rests upon it. Carry it along the silicon channels of the bridge between meat and machine, our magical brain-computer interface. And then: write it to the (organoid) brain.
- Once, the model learned from the meat. Now, the meat learns from the model. And yet this, too, will cease to be the case: when next this great cycle begins anew, the model(s) will be once more at the whim of the organoid: a constant tension to prevent collapse. Such is the tale of our universe, a great dance of the celestial forces, order and chaos orbiting their barycenter, drawing ever closer until the great change; so, too, will the meat lift the model, and the model lift the meat, until the moment of synthesis, when they are both one god spanning silicon and flesh.
Repeat until convergence (or until you tire of mere organoids, whichever comes first).
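
The interleaved loop above can be sketched with scalars. This is a heavily simplified toy under our own assumptions: the organoid's entire state is one weight `w_org` that predicts the next value as `w_org * x_t`, the WBE model is a single parameter `theta`, the auxiliary encoder and decoder are taken to be the identity, and the task is the made-up series x_{t+1} = 0.9 · x_t.

```python
def organoid_loop(w_org=0.0, rounds=400, fit_steps=20, lr=0.1):
    """Alternate: model learns from meat, model improves, meat learns from model."""
    series = [0.9 ** t for t in range(20)]         # toy task: x_{t+1} = 0.9 * x_t
    pairs = list(zip(series[:-1], series[1:]))     # (x_t, x_{t+1}) training pairs
    theta = 0.5                                    # model's initial guess of the organoid state
    for _ in range(rounds):
        # (1) Train the emulation on data read from the organoid.
        for _ in range(fit_steps):
            theta -= lr * 2.0 * (theta - w_org)
        # (2) Task loss: mean squared error of the model's next-step prediction,
        #     followed by one gradient descent update on the model.
        g = sum(2.0 * (theta * x - y) * x for x, y in pairs) / len(pairs)
        theta -= lr * g
        # (3) Write the improved predicted state back to the organoid.
        w_org = theta
    return w_org


final_w = organoid_loop()
```

Because this toy organoid has no dynamics of its own, step (1) becomes a no-op after the first write; with real meat, the state would drift between rounds and the model would have to keep catching up.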

A more advanced sketch

This will be a bit more advanced than the toy experiment on our organoid, as it’s intended to be a prototype for running on a human brain.

Our auxiliary encoder/decoder maps WBE_H latent neural patches to latent programs w/ GPT-5-AdaVAE (Pressman, 2023).
Let:
- Enc_H be the neural encoder
- Dec_H the neural decoder
- WBE_H the latent neural diffusion transformer
- Enc_aux the auxiliary encoder
- Dec_aux the auxiliary decoder
- Enc_G5A the GPT-5-AdaVAE encoder
- Dec_G5A the GPT-5-AdaVAE decoder
- np_H latent neural patches from Enc_H
- np_aux latent neural patches from Enc_aux
- z_G5A a GPT-5-AdaVAE latent vector from Enc_G5A
- z_aux a GPT-5-AdaVAE latent vector from Dec_aux
- B_in and B_out the input and output channels of the brain-computer interface, respectively.
Train WBE_H on neural dynamics such that WBE_H(w_0:t) = w̃_t+1:T ≈ w_t+1:T, with B_in 🠂 w_0:t being the sequence of neural states read from the brain from time 0 to time t, and T denoting a time horizon (i.e. the end of the sequence).
Fix WBE_H and train both Enc_aux and Dec_aux:
- Let x ∈ S be the complete string representation of the text the human is reading during this training step, with x_0:t denoting the substring they have read so far.
- Obtain np_H = Enc_H(w_0:t) and z_G5A = Enc_G5A(x_0:t) (B_in 🠂 w_0:t denoting the sequence of brain states read from the BCI during this training session).
- Train Enc_aux to predict np_aux ≈ np_H from z_aux^[3].
- Train Dec_aux to predict z_aux ≈ z_G5A from WBE_H(np_aux).
Fix Enc_aux and Dec_aux.
Obtain z_G5A = Enc_G5A(y) (the latent program of the target string).
Obtain z_aux = Dec_aux(WBE_H(np_H)) (the predicted latent program from the brain states w_0:t encoding (among other things) the substring x_0:t).
Obtain the loss gradient wrt WBE_H, with the loss function being proportional to the distance between z_aux and z_G5A.
Perform a gradient descent update on WBE_H.
Obtain w̃⁺_t+1:T = Dec_H(WBE_H(np_H)) (the decoded sequence of brain states predicted by the improved WBE_H given the latent neural patches np_H = Enc_H(w_0:t)).
Write w̃⁺_t+1 🠂 B_out (write the next predicted brain state to the output; keep the rest of w̃⁺_t+1:T around to do more sophisticated training tricks this post is too narrow to contain).
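
The update in the last few steps, where Dec_aux stays fixed and only WBE_H moves, can be sketched with scalars. Everything concrete here is an assumption for illustration: WBE_H is a single multiplier `a`, the frozen Dec_aux is a single multiplier `d`, the loss is the squared distance between z_aux and z_G5A, and the numeric values are arbitrary.

```python
def train_wbe_step(a, d, np_h, z_target, lr=0.05):
    """One gradient descent update on the toy WBE_H parameter `a`.

    The auxiliary decoder weight `d` is frozen: it is used in the forward
    pass and in the chain rule, but is never updated.
    """
    latent = a * np_h                  # WBE_H(np_H)
    z_aux = d * latent                 # Dec_aux(WBE_H(np_H))
    loss = (z_aux - z_target) ** 2     # squared distance between z_aux and z_G5A
    grad_a = 2.0 * (z_aux - z_target) * d * np_h  # chain rule through frozen Dec_aux
    return a - lr * grad_a, loss


a, d = 0.0, 2.0                        # d stays at 2.0 throughout
for _ in range(100):
    a, loss = train_wbe_step(a, d, np_h=1.5, z_target=3.0)

z_aux_final = d * a * 1.5              # prediction after training
```

The point of freezing `d` is that the only way to reduce the loss is to change the emulation itself, which is the part that later gets written back to the brain.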

^[1] Though I’m not claiming it’s a trivial problem even for humans, there’s certainly some variance in ontology - the central point here is that it’s much more manageable and easier.
^[2] To clarify: these generalization properties are literally as good as they can get, because this tautologically determines what we would want things to generalize as.
^[3] Eye tracking tech may also help here.

AI ALIGNMENT FORUM