Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.



The math in the post is super hand-wavy, so I don't expect the result to be exactly correct. However, in your example, l up to 100 should be ok, since there is no superposition. 2.7 is almost 2 orders of magnitude off, which is not great.

Looking into what is going on: I'm basing my results on the Johnson–Lindenstrauss lemma, which gives an upper bound on the interference. In the post I'm assuming that the actual interference is of the same order of magnitude as this upper bound. This assumption clearly fails in your example, since the interference between features is zero, and nothing is the same order of magnitude as zero.
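For reference, the bound in question can be written out as below. This is the standard textbook form of the lemma together with a well-known consequence for nearly orthogonal vectors, not the notation used in the post. The symbols m (number of features), n (number of dimensions), ε (maximum interference) and the unspecified constant C are assumptions introduced here for illustration.

```latex
% Johnson–Lindenstrauss lemma (standard form): for any 0 < \varepsilon < 1 and any
% set X of m points in R^N, there is a linear map f : R^N -> R^n with
% n = O(\varepsilon^{-2} \log m) such that
\[
(1-\varepsilon)\,\lVert u-v\rVert^{2}
\;\le\; \lVert f(u)-f(v)\rVert^{2}
\;\le\; (1+\varepsilon)\,\lVert u-v\rVert^{2}
\qquad \text{for all } u, v \in X .
\]
% Consequence used as the interference bound: there exist m unit vectors
% v_1, ..., v_m in R^n whose pairwise overlaps are all small,
\[
\lvert \langle v_i, v_j \rangle \rvert \;\le\; \varepsilon
\quad (i \neq j)
\qquad \text{whenever} \qquad
n \;\ge\; C\,\frac{\ln m}{\varepsilon^{2}} .
\]
```

Note that this only caps the overlap from above. When m ≤ n the features can be exactly orthogonal, so the true interference is zero and sits far below the cap.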

I might try to do the math more carefully, unless someone else gets there first. No promises though. 

I expect that my qualitative claims will still hold. This is based on more than the math, but the math seemed easier to write down. I think it would be worth doing the math properly, both to confirm my claims and because it may be useful to have more accurate quantitative formulas. I might do this if I get some spare time, but no promises.

my qualitative claims = my claims about what types of things the network is trading away when using superposition

quantitative formulas = how much of these things are traded away for what amount of superposition.


Recently someone either suggested to me (or maybe told me they or someone else were going to do this?) that we should train AI on legal texts, to teach it human values. Ignoring the technical problem of how to do this, I'm pretty sure legal texts are not the right training data. But at the time, I could not clearly put into words why. Today's SMBC explains this for me:

Saturday Morning Breakfast Cereal - Law

Law is not a good representation or explanation of most of what we care about, because it's not trying to be. Law is mainly focused on the contentious edge cases. 

Training an AI on trolley problems and other ethical dilemmas is even worse, for the same reason.

Did you forget to provide links to research project outputs in the appendix? Or is there some other reason for this?

I think it's reasonable to think about what can be stored in a way that can be read off in a linear way (by the next layer), since those are the features that can be directly used in the next layer.

storing them nonlinearly (in one of the host of ways it takes multiple nn layers to decode)

If it takes multiple nn layers to decode, then the nn needs to unpack it before using it, and represent it as a linearly readable feature later.

Good point. I need to think about this a bit more. Thanks

Just quickly writing up my thoughts for now...

What I think is going on here is that the Johnson–Lindenstrauss lemma gives a bound on how well you can do, so it's more like a worst-case scenario. I.e. the Johnson–Lindenstrauss lemma gives you the worst-case error for the best possible feature embedding.

I've assumed that the typical noise would be the same order of magnitude as the worst case, but now I think I was wrong about this for large

I'll have to think about which matters more, the worst case or the typical case. When adding up noise one should probably use the typical case. But when calculating how many features to fit in, one should probably use the worst case.
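To make the worst-case versus typical-case distinction concrete, here is a minimal sketch that draws random unit vectors and compares the root-mean-square overlap between pairs with the largest overlap. It rests on a loud assumption: random directions stand in for the feature embedding, which is not the best possible embedding that the lemma talks about. The function names and the parameters `num_features` and `dim` are hypothetical, chosen only for this illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    # Normalised Gaussian components give a direction uniform on the sphere.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def interference_stats(num_features, dim, seed=0):
    """Return (rms, worst) absolute overlap between distinct feature directions.

    Assumption: features are independent random directions, not an optimised
    embedding, so this only illustrates typical vs. maximal overlap.
    """
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(num_features)]
    overlaps = []
    for i in range(num_features):
        for j in range(i + 1, num_features):
            dot = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            overlaps.append(abs(dot))
    rms = math.sqrt(sum(o * o for o in overlaps) / len(overlaps))
    return rms, max(overlaps)


rms, worst = interference_stats(num_features=200, dim=50)
print(f"typical (rms) overlap: {rms:.3f}   1/sqrt(dim): {1 / math.sqrt(50):.3f}")
print(f"worst-case overlap:    {worst:.3f}")
```

Under this assumption the rms overlap should sit close to 1/sqrt(dim), while the largest overlap among all pairs is noticeably bigger, which is the gap between the typical case and the worst case discussed above.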


I timed how long it took me to fill in the survey. It took 30 min. I could probably have done it in 15 min if I had skipped the optional text questions. This is to be expected, however. Every time I've seen someone guess how long it will take to respond to their survey, it's off by a factor of 2-5.

Current Interpretability results suggest that roughly the first half of the layers in an LLM correspond to understanding the context at increasingly abstract levels, and the second half to figuring out what to say and turning that back from abstractions into concrete tokens. It's further been observed that in the second half, figuring out what to say generally seems to occur in stages: first working out the baseline relevant facts, then figuring out how to appropriately slant/color those in the current context, then converting these into the correct language, and last getting the nitty-gritty details of tokenization right.

How do we know this? This claim seems plausible, but also I did not know that mech-interp was advanced enough to verify something like this. Where can I read more?

It looks like this to me:

Where's the colourful text?
Is it broken or am I doing something wrong?

Potentially we might be ok with it if the expected timescale is long enough (or the probability of it happening in a given timescale is low enough).

Agreed. I'd love for someone to investigate the possibility of slowing down substrate-convergence enough for the problem to be basically solved.

If that's true then that is a super important finding! And also an important thing to communicate to people! I hear a lot of people who say the opposite and that we need lots of competing AIs.

Hm, to me this conclusion seems fairly obvious. I don't know how to communicate it though, since I don't know what the crux is. I'd be up for participating in a public debate about this, if you can find me an opponent. Although, not until after AISC research lead applications are over and I've had some time to recover. So maybe late November at the earliest.

  • An approach could be to say under what conditions natural selection will and will not sneak in. 


  • Natural selection requires variation. Information theory tells us that all information is subject to noise and therefore variation across time. However, we can reduce error rates to arbitrarily low probabilities using coding schemes. Essentially this means that it is possible to propagate information across finite timescales with arbitrary precision. If there is no variation then there is no natural selection. 

Yes! The big question to me is whether we can reduce error rates enough. And "error rates" here is not just hardware signal error, but also randomness that comes from interacting with the environment.
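As a small illustration of the coding-scheme point, here is a sketch using the simplest possible scheme, a repetition code with majority vote. This is a stand-in chosen for clarity, not one of the efficient codes used in practice. It assumes each copy of a bit is flipped independently with probability `p` below 0.5, and it covers only signal-type errors, not the randomness from interacting with the environment mentioned above. The function name `majority_error` is hypothetical.

```python
from math import comb


def majority_error(p, n):
    """Probability that a majority vote over n copies of a bit is wrong.

    Assumes each copy is flipped independently with probability p,
    and that n is odd so the vote can never tie.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of copies so there are no ties")
    # The vote fails when more than half of the copies are flipped.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


for n in (1, 3, 11, 51):
    print(f"copies={n:2d}  residual error={majority_error(0.1, n):.3e}")
```

The residual error shrinks quickly as copies are added, but it never reaches exactly zero for any finite number of copies, so whether it can be pushed low enough remains the open question.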

  • In abstract terms, evolutionary dynamics require either a smooth adaptive landscape such that incremental changes drive organisms towards adaptive peaks and/or unlikely leaps away from local optima into attraction basins of other optima. In principle AI systems could exist that stay in safe local optima and/or have very low probabilities of jumps to unsafe attraction basins. 

It has to be smooth relative to the jumps that can be achieved by whatever is generating the variation. Natural mutations don't typically make large jumps. But if you have a small change in motivation for an intelligent system, this may cause a large shift in behaviour.

  • I believe that natural selection requires a population of "agents" competing for resources. If we only had a single AI system then there is no competition and no immediate adaptive pressure.

I thought so too to start with. I still don't know what the right conclusion is, but I think that substrate-needs convergence is at least still a risk even with a singleton. Something that is smart enough to be a general intelligence is probably complex enough to have internal parts that operate semi-independently, and therefore these parts can compete with each other.

I think the singleton scenario is the most interesting, since I think that if we have several competing AIs, then we are just super doomed.

And by singleton I don't necessarily mean a single entity. It could also be a single alliance. The boundary between group and individual might not be as clear with AIs as with humans.

  • Other dynamics will be at play which may drown out natural selection. There may be dynamics that occur at much faster timescales that this kind of natural selection, such that adaptive pressure towards resource accumulation cannot get a foothold. 

This will probably be correct for a time. But will it be true forever? One of the possible end goals for Alignment research is to build the aligned super intelligence that saves us all. If substrate convergence is true, then this end goal is off the table. Because even if we reach this goal, it will inevitably start to either value drift towards self replication, or get eaten from the inside by parts that have mutated towards self replication (AI cancer), or something like that.

  • Other dynamics may be at play that can act against natural selection. We see existence-proofs of this in immune responses against tumours and cancers. Although these don't work perfectly in the biological world, perhaps an advanced AI could build a type of immune system that effectively prevents individual parts from undergoing runaway self-replication. 

Cancer is an excellent analogy. Humans defeat it in a few ways that work together:

  1. We have evolved to have cells that mostly don't defect
  2. We have an evolved immune system that attacks cancer when it does happen
  3. We have developed technology to help us find and fight cancer when it happens
  4. When someone gets cancer anyway and it can't be defeated, only they die; it doesn't spread to other individuals.

Point 4 is very important. If there is only one agent, this agent needs perfect cancer fighting ability to avoid being eaten by natural selection. The big question to me is: Is this possible?

If, on the other hand, you have several agents, then you definitely don't escape natural selection, because these entities will compete with each other.

