Nathan Helm-Burger

AI alignment researcher, ML engineer. Master's in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability, trained on censored data (simulations with no mention of humans or computer technology). I think that current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I expect that this automated process will mine neuroscience for insights and quickly become far more effective and efficient. It would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation, so I am trying to warn the world about this possibility.

See my prediction markets here: 

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

Relevant quotes:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." 


"The prospect for the human race is sombre beyond all precedent. Mankind are faced with a clear-cut alternative: either we shall all perish, or we shall have to acquire some slight degree of common sense. A great deal of new political thinking will be necessary if utter disaster is to be averted." - Bertrand Russell, The Bomb and Civilization, 1945.08.18




One point I’ve seen raised by people in the latter group is along the lines of: “It’s very unlikely that we’ll be in a situation where we’re forced to build AI systems vastly more capable than their supervisors. Even if we have a very fast takeoff - say, going from being unable to create human-level AI systems to being able to create very superhuman systems ~overnight - there will probably still be some way to create systems that are only slightly more powerful than our current trusted systems and/or humans; to use these to supervise and align systems slightly more powerful than them; etc. (For example, we could take a very powerful, general algorithm and simply run it on a relatively low amount of compute in order to get a system that isn’t too powerful.)” This seems like a plausible argument that we’re unlikely to be stuck with a large gap between AI systems’ capabilities and their supervisors’ capabilities; I’m not currently clear on what the counter-argument is.


I agree that this is a very promising advantage for Team Safety. I do think that, in order to make good use of this potential advantage, the AI creators need to be cautious going into the process. 

One way that I've come up with to 'turn down' the power of an AI system is to simply inject small amounts of noise into its activations. 
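As a minimal sketch of this idea (not any particular implementation), the function below adds zero-mean Gaussian noise to an array standing in for a layer's activations. The function name `noisy_activations` and the choice to scale the noise by the layer's own activation standard deviation are illustrative assumptions; in a real model this would typically be applied via a forward hook on selected layers.

```python
import numpy as np

def noisy_activations(activations, noise_scale, rng=None):
    """Return a copy of `activations` perturbed by zero-mean Gaussian noise.

    `noise_scale` is a fraction of the activations' own standard deviation,
    so the perturbation automatically tracks the layer's scale.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = noise_scale * activations.std()
    return activations + rng.normal(0.0, sigma, size=activations.shape)

# Toy demonstration: small noise barely perturbs, large noise dominates.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))           # stand-in for a layer's activations
mild = noisy_activations(h, 0.05, rng)  # gentle capability reduction
heavy = noisy_activations(h, 1.0, rng)  # severe degradation
```

The appeal of this knob is that it is continuous: sweeping `noise_scale` from 0 upward gives a graded family of progressively weaker versions of the same model.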

I agree with Steve Byrnes here. I think I have a better way to describe this.
I would say that the missing piece is 'mastery'. Specifically, learning mastery over a piece of reality. By mastery I am referring to the skillful ability to model, predict, and purposefully manipulate that subset of reality.
I don't think this is an algorithmic limitation, exactly.

Look at the work Deepmind has been doing, particularly with Gato and more recently AutoRT, SARA-RT, RT-Trajectory, UniSim, and Q-Transformer. Look at the work being done with the help of Nvidia's new Robot Simulation Gym Environment. Look at OpenAI's recent foray into robotics with Figure AI. This work is held back from being highly impactful (so far) by the difficulty of accurately simulating novel interesting things, the difficulty of learning the pairing of action -> consequence compared to learning a static pattern of data, and the hardware difficulties of robotics.

This is what I think our current multimodal frontier models are mostly lacking. They can regurgitate, and to a lesser extent synthesize, facts that humans wrote about, but not develop novel mastery of subjects and then report back on their findings. This is the difference between being able to write a good scientific paper given a dataset of experimental results and rough description of the experiment, versus being able to gather that data yourself. The line here is blurry, and will probably get blurrier before collapsing entirely. It's about not just doing the experiment, but doing the pilot studies and observations and playing around with the parameters to build a crude initial model about how this particular piece of the universe might work. Building your own new models rather than absorbing models built by others. Moving beyond student to scientist.

This is in large part a limitation of training expense. It's difficult to have enough on-topic information available in parallel to feed the data-inefficient current algorithms many lifetimes' worth of experience.

So, while it is possible to improve the skill of mastery-of-reality by scaling up current models and training systems, it gets much, much easier if the algorithms get more compute-efficient and data-sample-efficient to train.

That is what I think is coming.

I've done my own in-depth research into the state of the field of machine learning and potential novel algorithmic advances which have not yet been incorporated into frontier models, and in-depth research into the state of neuroscience's understanding of the brain. I have written a report detailing the ways in which I think Joe Carlsmith's and Ajeya Cotra's estimates are overestimating the AGI-relevant compute of the human brain by somewhere between 10x to 100x.

Furthermore, I think that there are compelling arguments for why the compute in frontier algorithms is not being deployed as efficiently as it could be, resulting in higher training costs and data requirements than is theoretically possible.

In combination, these findings lead me to believe we are primarily algorithm-constrained, not hardware- or data-constrained. This, in turn, means that once frontier models have progressed to the point of being able to automate research for improved algorithms, I expect that substantial progress will follow. This progress will, if I am correct, be untethered from further increases in compute hardware or training data.

My best guess is that a frontier model of the approximate expected capability of GPT-5 or GPT-6 (equivalently Claude 4 or 5, or similar advances in Gemini) will be sufficient for the automation of algorithmic exploration to an extent that the necessary algorithmic breakthroughs will be made. I don't expect the search process to take more than a year. So I think we should expect a time of algorithmic discovery in the next 2-3 years which leads to a strong increase in AGI capabilities even holding compute and data constant.

I expect that 'mastery of novel pieces of reality' will continue to lag behind ability to regurgitate and recombine recorded knowledge. Indeed, recombining information clearly seems to be lagging behind regurgitation or creative extrapolation. Not as far behind as mastery, so in some middle range. 

If you imagine the whole skillset remaining in its relative configuration of peaks and valleys, but shifted upwards such that the currently lagging 'mastery' skill is at human level and a lot of other skills are well beyond, then you will be picturing something similar to what I am picturing.


This is what I mean when I say it isn't a limit of the algorithm per se. Change the framing of the data, and you change the distribution of the outputs.



I think this is an important piece of the strategic solution. The lack of control options for open-weight models is a big part of why I think there is a much lower capability threshold for catastrophic danger from an open-weight model.

There’s no sharp line between the helper AIs of Vision 1 and the truly-autonomous AIs of Vision 2.


This post seems like it doesn't quite cleave reality at the joints, from how I'm seeing things. 

Vision 1 style models can be turned into Vision 2 autonomous models very easily. So, as you say, there's no sharp line there.

For me, Vision 3 shouldn't depend on biological neurons. I think it's more like 'brain-like AGI that is so brain-like that it is basically an accurate whole brain emulation, and thus you can trust it as much as you can trust a human (which isn't necessarily all that much)'.

So again, no sharp line there from my point of view. 

Since there are lots of different people in the world with different beliefs and goals, I expect that lots of variations with similarities to #1, #2, and #3 will be active in the world. So anyone who has a hope of just one of the visions coming true needs to include very strict worldwide governance enforcement as part of their vision.

I think my vision is some weird mashup of these. Like, I'm hoping for a powerful set of semi-aligned tool AI (type-1) to assist worldwide enforcement in stamping out dangerous type-2 rogue AI in the hands of bad actors, giving us a temporary safe window in which we can achieve either better alignment of type-1 or type-3 (Bio-enhancement and Whole Brain Emulation).

Ok, so this is definitely not a human thing, so probably a bit of a tangent. One of the topics that came up in a neuroscience class once was goose imprinting. There have apparently been studies (see Eckhard Hess for the early ones) showing that the strength of the imprinting onto a given target (measured by following behavior after the close of the critical period) is related to how much running toward the target the baby geese do. The hand-wavey explanation was something like 'probably this makes sense, since if you have to run a lot to keep up with your mother-goose for safety, you'll need a strong mother-goose-following behavioral tendency to keep you safe through early development'.

I think this is an excellent description of GPT-like models. It both fits with my observations and clarifies my thinking. It also leads me to examine in a new light questions which have been on my mind recently: 

What is the limit of power of simulation that our current architectures (with some iterative improvements) can achieve when scaled to greater power (via additional computation, improved datasets, etc)?

Is a Simulator model really what we want? Can we trust the outputs we get from it to help us with things like accelerating alignment research? What might failure modes look like?

Super handy-seeming intro for newcomers.

I recommend adding Jade Leung to your list of governance people.

As for the list of AI safety people, I'd like to add that there are some people who've written interesting and much-discussed content that it would be worth having some familiarity with.

John Wentworth

Steven Byrnes

Vanessa Kosoy

And personally I'm quite excited about the school of thought developing under the 'Shard theory' banner.

For shard theory info: 

I'm excited to participate in this, and feel like the mental exercise of exploring this scenario would be useful for my education on AI safety. Since I'm currently funded by a grant from the Long Term Future Fund for reorienting my career to AI safety, and feel that this would be a reasonable use of my time, you don't need to pay me. I'd be happy to be a full-time volunteer for the next couple weeks.

Edit: I participated and was paid, but only briefly. Turns out I was too distracted thinking and talking about how the process could be improved and the larger world implications to actually be useful as an object-level worker. I feel like the experience was indeed useful for me, but not as useful to Beth as I had hoped. So... thanks and sorry!

Thanks Rohin. I also feel that interviewing after my 3 more months of independent work is probably the correct call.
