Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.



Today's thoughts:

I suspect it's not possible to build autonomous aligned AIs (low confidence). The best we can do is some type of hybrid humans-in-the-loop system. Such a system will be powerful enough to eventually give us everything we want, but it will also be much slower and intellectually inferior to what is possible without humans in the loop. I.e. the alignment tax will be enormous. The only way the safe system can compete is if the unsafe system is not built.

Therefore we need AI Governance. Fortunately, political action is getting a lot of attention right now, and the general public seems to be positively inclined to more cautious AI development. 

After getting an immediate stop/pause on larger models, I think the next step might be to use current AI to cure aging. I don't want to miss the singularity because I died first, and I think I'm not the only one who feels this way. It's much easier to be patient and cautious in a world where aging is a solved problem.

We probably need a strict ban on building autonomous superintelligent AI until we have reached technological maturity. It's probably not a great idea to build them after that either, but they will probably not pose the same risk any longer. This last claim is not at all obvious. The hardest attack vector to defend against would be manipulation. I think reaching technological maturity will make us able to defend against any military/hard-power attack. This includes, for example, having our own nanobot defence system to defend against hostile nanobots. Manipulation is harder, but I think there are ways to solve that, with enough time to set up our defences.

An important crux for what the end goal is, including whether there is some stable end state where we're out of danger, is to what extent technological maturity also leads to a stable cultural/political situation, or whether that keeps evolving in ever new directions.

Recently an AI safety researcher complained to me about some interaction they had with an AI Safety communicator. Very stylized, their interaction went something like this:

(X is some fact or topic related to AI Safety.)

Communicator: We don't know anything about X and there is currently no research on X.

Researcher: Actually, I'm working on X, and I do know some things about X.

Communicator: We don't know anything about X and there is currently no research on X.


I notice that I semi-frequently hear communicators saying things like the above. I think what they mean is that our understanding of X is far from the understanding that is needed, and that far fewer researchers are working on this than would be needed, and this gets rounded off to "we don't know anything and no one is doing anything about it". If this is what is going on, then I think this is bad.

I think that in some cases when someone says "We don't know anything about X and there is currently no research on X." they probably literally mean it. There are some people who think that approximately no one working on AI Safety is doing real AI Safety research. But I also think that most people who are saying "We don't know anything about X and there is currently no research on X." are doing some mixture of rounding off and some sort of unreflective imitation learning, i.e. picking up the sentence structure from others, especially from high-status people.

I think using a language that hides the existence of the research that does exist is bad. Primarily because it's misinformative. Do we want all new researchers to start from scratch? Because that is what happens if you tell them there is no pre-existing research and they believe you. 

I also don't think this exaggeration will help with recruitment. Why do you think people would prefer to join a completely empty research field instead of a small one? From a personal success perspective (where success can mean either impact or career success) a small research field is great: lots of low-hanging fruit around. But a completely untrodden research direction is terrible: you will probably just get lost and not get anything done, and even if you find something, there's nowhere to publish it.

Recording thought in progress...

I notice that I don't expect FOOM-like RSI, because I don't expect we'll get a mesa-optimiser with coherent goals. It's not hard to give the outer optimiser (e.g. gradient descent) a coherent goal. For the outer optimiser to have a coherent goal is the default. But I don't expect that to translate to the inner optimiser. The inner optimiser will just have a bunch of heuristics and proxy goals, and not be very coherent, just like humans.

The outer optimiser can't FOOM, since it doesn't do planning and doesn't have strategic self-awareness. It can only do some combination of hill climbing and random trial and error. If something is FOOMing it will be the inner optimiser, but I expect that one to be a mess.

I notice that this argument doesn't quite hold. More coherence is useful for RSI, but complete coherence is not necessary.

I also notice that I expect AIs to make fragile plans, but on reflection, I expect them to get better and better at this. By fragile I mean that the longer the plan is, the more likely it is to break. This is true for humans too, though. But we are self-aware enough about this fact to mostly compensate, i.e. make plans that don't have too many complicated steps, even if the plan spans a long time.

There is no study material since this is not a course. If you are accepted to one of the project teams, then you will work on that project.

You can read about the previous research outputs here: Research Outputs – AI Safety Camp

The most famous research to come out of AISC is the coin-run experiment.
We Were Right! Real Inner Misalignment - YouTube
[2105.14111] Goal Misgeneralization in Deep Reinforcement Learning

But the projects are different each year, so the best way to get an idea for what it's like is just to read the project descriptions. 

Second reply. And this time I actually read the link.
I'm not surprised by that result.

My original comment was a reaction to claims of the type [the best way to solve almost any task is to develop general intelligence, therefore there is a strong selection pressure to become generally intelligent]. I think this is wrong, but I have not yet figured out exactly what the correct view is. 

But to use an analogy, it's something like this: In the example you gave, the AI gets better at the sub-tasks by learning on a more general training set. It seems like general capabilities were useful. But suppose we instead trained on even more data for a single sub-task. Wouldn't it then develop general capabilities, since we just noticed that general capabilities were useful for that sub-task? I was planning to say "no", but I notice that I do expect some transfer learning. I.e. if you train on just one of the datasets, I expect it to be bad at the other ones, but I also expect it to learn them quicker than without any pre-training.

I seem to expect that AI will develop general capabilities when training on rich enough data, i.e. almost any real-world data. LLMs are a central example of this.

I think my disagreement with at least myself from some years ago, and probably some other people too (but I've been away a bit from the discourse so I'm not sure), is that I don't expect as much agentic long-term planning as I used to expect.

I agree that eventually, at some level of trying to solve enough different types of tasks, GI will be efficient, in terms of how much machinery you need, but it will never be able to compete on speed. 

Also, it's an open question what counts as "enough different types of tasks". Obviously, for a sufficiently broad class of problems GI will be more efficient (in the sense clarified above). Equally obviously, for a sufficiently narrow class of problems narrow capabilities will be more efficient.

Humans have GI to some extent, but we mostly don't use it. This is interesting. This means that a typical human environment is complex enough that it's worth carrying around the hardware for GI. But even though we have it, it is evolutionarily better to fall back on habits, imitation, or instinct in most situations.

Looking back at exactly what I wrote, I said there will not be any selection pressure for GI as long as other options are available. I'm not super confident in this. But I'm going to defend it here anyway by pointing out that "as long as other options are available" is doing a lot of the work here. Some problems are only solvable by noticing deep patterns in reality, and in this case a sufficiently deep NN with sufficient training will learn this, and that is GI.

I think we are in agreement.

I think the confusion is because it is not clear from that section of the post if you are saying
1) "you don't need to do all of these things"
2) "you don't need to do any of these things".

Because I think 1 goes without saying, I assumed you were saying 2. Also 2 probably is true in rare cases, but this is not backed up by your examples.

But if 1 doesn't go without saying, then this means that a lot of "doing science" is cargo-culting? Which is sort of what you are saying when you talk about cached methodologies.

So why would smart, curious, truth-seeking individuals use cached methodologies? Do I do this?

Some self-reflection: I did some of this as a PhD student, because I was new, and it was a way to hit the ground running. So, I did some science using the method my supervisor told me to use, while simultaneously working to understand the reason behind this method. I did spend less time than I would have wanted on understanding all the assumptions of the sub-sub-field of physics I was working in, because of the pressure to keep publishing and because I got carried away by various fun math I could do if I just accepted these assumptions. After my PhD I felt that if I was going to stay in Physics, I wanted to take a year or two just for learning, to actually understand Loop Quantum Gravity and all the other competing theories, but that's not how academia works unfortunately, which is one of the reasons I left.

I think that the foundation of good epistemics is to not have competing incentives.

In particular, four research activities were often highlighted as difficult and costly (here in order of decreasing frequency of mention):

  • Running experiments
  • Formalizing intuitions
  • Unifying disparate insights into a coherent frame
  • Proving theorems

I don't know what your first reaction to this list is, but for us, it was something like: "Oh, none of these activities seems strictly speaking necessary in knowledge-production." Indeed, a quick look at history presents us with cases where each of those activities was bypassed:

What these examples highlight is the classical failure when searching for the need of customers: to anchor too much on what people ask for explicitly, instead of what they actually need. 


I disagree that this conclusion follows from the examples. Every example you list uses at least one of the methods in your list. So, this might as well be used as evidence for why the methods in this list are important.

In addition, several of the listed examples benefited from division of labour. This is a common practice in Physics. Not everyone does experiments. Some people instead specialise in the other steps of science, such as 

  • Formalizing intuitions
  • Unifying disparate insights into a coherent frame
  • Proving theorems

This is very different from concluding that experiments are not necessary.


Similar but not exactly.

I mean that you take some known distribution (the training distribution) as a starting point. But when sampling actions you do so from a shifted or truncated distribution, to favour higher-reward policies.

In the decision transformers I linked, the AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the AI system predict the future reward, but with the distribution shifted towards higher rewards.
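
A minimal sketch of what "shifting a distribution towards higher rewards" could look like. Everything here is an assumption made for illustration: the exponential reweighting, the parameter name `kappa`, the function names, and the toy numbers are all hypothetical and are not taken from the linked work.

```python
import math
import random

def shift_towards_high_reward(returns, probs, kappa):
    """Reweight each predicted return R by exp(kappa * R), then renormalise.

    kappa = 0 gives back the original distribution; larger kappa puts
    more mass on higher returns. (kappa is a hypothetical knob.)
    """
    max_r = max(returns)  # subtract the max for numerical stability
    weights = [p * math.exp(kappa * (r - max_r)) for r, p in zip(returns, probs)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_return(returns, probs):
    return sum(r * p for r, p in zip(returns, probs))

# Toy predicted distribution over future reward (made-up numbers).
returns = [0.0, 1.0, 2.0, 3.0]
base_probs = [0.4, 0.3, 0.2, 0.1]

shifted_probs = shift_towards_high_reward(returns, base_probs, kappa=1.0)

# Sample the reward to condition on from the shifted distribution.
target_return = random.choices(returns, weights=shifted_probs, k=1)[0]
```

Note that this kind of reweighting can multiply the relative probability of a rare high-reward outcome by an arbitrarily large factor, depending on `kappa` and the reward scale.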

I discussed this a bit more after posting the above comment, and there is something I want to add about the comparison. 

In quantilizers, if you know the probability of DOOM from the base distribution, you get an upper bound on DOOM for the quantilizer. This is not the case for the type of probability shift used for the linked decision transformer.

DOOM = Unforeseen catastrophic outcome. Would not be labelled as very bad by the AI's reward function but is in reality VERY BAD.
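
To make the quantilizer bound concrete, here is a toy sketch. A q-quantilizer samples from the base distribution restricted to the top q fraction of probability mass (ranked by the reward function), so no action's probability can grow by more than a factor of 1/q. Hence P(DOOM) under the quantilizer is at most P(DOOM) under the base distribution divided by q. The action names, probabilities, and the helper `quantilize` are all made up for illustration.

```python
def quantilize(actions, base_probs, utility, q):
    """Return the q-quantilizer as a dict {action: probability}.

    Keeps the top-q probability mass of the base distribution, ranked by
    utility, and renormalises by dividing by q.
    """
    ranked = sorted(zip(actions, base_probs),
                    key=lambda ap: utility(ap[0]), reverse=True)
    remaining = q
    result = {}
    for action, p in ranked:
        if remaining <= 1e-12:  # top-q mass is used up
            break
        mass = min(p, remaining)
        result[action] = mass / q
        remaining -= mass
    return result

# Toy setup: action "d" looks best to the reward function but is DOOM.
actions = ["a", "b", "c", "d"]
base_probs = [0.5, 0.3, 0.15, 0.05]
proxy_reward = {"a": 0, "b": 1, "c": 2, "d": 3}
doom_actions = {"d"}

q = 0.2
quantilized = quantilize(actions, base_probs, proxy_reward.get, q)

doom_base = sum(p for a, p in zip(actions, base_probs) if a in doom_actions)
doom_quantilized = sum(p for a, p in quantilized.items() if a in doom_actions)

# The bound: doom_quantilized <= doom_base / q.
bound = doom_base / q
```

In this toy case the bound is tight, because the DOOM action is exactly the one the reward function ranks highest.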

Any policy can be modelled as a consequentialist agent, if you assume a contrived enough utility function. This statement is true, but not helpful.

The reason we care about the concept of agency is that there are certain things we expect from consequentialist agents, e.g. instrumentally convergent goals, or just optimisation pressure in some consistent direction. We care about the concept of agency because it holds some predictive power.

[... some steps of reasoning I don't know yet how to explain ...]

Therefore, it's better to use a concept of agency that depends on the internal properties of an algorithm/mind/policy-generator.

I don't think agency can be made into a crisp concept. It's either a fuzzy category or a leaky abstraction depending on how you apply the concept. But it does point to something important. I think it is worth tracking how agentic different systems are, because doing so has predictive power.
