All of Linda Linsefors's Comments + Replies

If you think it would be helpful, you are welcome to suggest a meta philpsophy topic for AI Safety Camp.

More info at (I'm typing on a phone, I'll add actuall link later if I remember too)

But I think orgs are more likely to be well-known to grant-makers on average given that they tend to have a higher research output,

I think your getting the causality backwards. You need money first, before there is an org. Unless you count informal multi people collaborations as orgs. 

I think people how are more well-known to grant-makers are more likely to start orgs. Where as people who are less known are more likely to get funding at all, if they aim for a smaller garant, i.e. as an independent researcher. 

Counter point. After the FTX collapse, OpenPhil said publicly (some EA Forum post)  that they where raising their bar for funding. I.e. there are things that would have been funded before that would now not be funded. The stated reason for this is that there are generally less money around, in total. To me this sounds like the thing you would do if money is the limitation. 

I don't know why OpenPhil don't spend more. Maybe they have long timelines and also don't expect any more big donors any time soon? And this is why they want to spend carefully?

From what I can tell, the field have been funding constrained since the FTX collapse.

What I think happened: 
FTX had lots of money and a low bar for funding, which meant they spread a lot of money around. This meant that more project got started, and probably even more people got generally encouraged to join. Probably some project got funded that should not have been, but probably also some really good projects got started that did not get money before because not clearing the bar before due to not having the right connections, or just bad att writing ... (read more)

Todays thoughts: 

I suspect it's not possible to build autonomous aligned AIs (low confidence). The best we can do is some type of hybrid humans-in-the-loop system. Such a system will be powerful enough to eventually give us everything we want, but it will also be much slower and intellectually inferior to what is possible with out humans-in-the-loop. I.e. the alignment tax will be enormous. The only way the safe system can compete, is by not building the unsafe system. 

Therefore we need AI Governance. Fortunately, political action is getting a lo... (read more)

Recently an AI safety researcher complained to me about some interaction they had with an AI Safety communicator. Very stylized, there interaction went something like this:

(X is some fact or topic related to AI Safety

Communicator: We don't know anything about X and there is currently no research on X.

Researcher: Actually, I'm working on X, and I do know some things about X.

Communicator: We don't know anything about X and there is currently no research on X.


I notice that I semi-frequently hear communicators saying things like the thing above. I think ... (read more)

Recording though in progress...

I notice that I don't expect FOOM like RSI, because I don't expect we'll get an mesa optimizer with coherent goals. It's not hard to give the outer optimiser (e.g. gradient decent) a coherent goal. For the outer optimiser to have a coherent goal is the default. But I don't expect that to translate to the inner optimiser. The inner optimiser will just have a bunch of heuristics and proxi-goals, and not be very coherent, just like humans. 

The outer optimiser can't FOOM, since it don't do planing, and don't have strategic s... (read more)

There is no study material since this is not a course. If you are accepted to one of the project teams they you will work on that project. 

You can read about the previous research outputs here: Research Outputs – AI Safety Camp

The most famous research to come out of AISC is the coin-run experiment.
(95) We Were Right! Real Inner Misalignment - YouTube
[2105.14111] Goal Misgeneralization in Deep Reinforcement Learning (

But the projects are different each year, so the best way to get an idea for what it's like is just to read the project descrip... (read more)

Second reply. And this time I actually read the link.
I'm not suppressed by that result.

My original comment was a reaction to claims of the type [the best way to solve almost any task is to develop general intelligence, therefore there is a strong selection pressure to become generally intelligent]. I think this is wrong, but I have not yet figured out exactly what the correct view is. 

But to use an analogy, it's something like this: In the example you gave, the AI get's better at the sub tasks by learning on a more general training set. It seems like ... (read more)

I agree that eventually, at some level of trying to solve enough different types of tasks, GI will be efficient, in terms of how much machinery you need, but it will never be able to compete on speed. 

Also, it's an open question what is "enough different types of tasks". Obviously, for a sufficient broad class of problems GI will be more efficient (in the sense clarified above). Equally obviously, for a sufficient narrow class of problems narrow capabilities will be more efficient. 

Humans have GI to some extent, but we mostly don't use it. This i... (read more)

I think we agreement.

I think the confusion is because it is not clear form that section of the post if you are saying 
1)"you don't need to do all of these things" 
2) "you don't need to do any of these things".

Because I think 1 goes without saying, I assumed you were saying 2. Also 2 probably is true in rare cases, but this is not backed up by your examples.

But if 1 don't go without saying, then this means that a lot of "doing science" is cargo-culting? Which is sort of what you are saying when you talk about cached methodologies.

So why would sm... (read more)

In particular, four research activities were often highlighted as difficult and costly (here in order of decreasing frequency of mention):

  • Running experiments
  • Formalizing intuitions
  • Unifying disparate insights into a coherent frame
  • Proving theorems

I don't know what your first reaction to this list is, but for us, it was something like: "Oh, none of these activities seems strictly speaking necessary in knowledge-production." Indeed, a quick look at history presents us with cases where each of those activities was bypassed:

  • Einstein figured out special and genera
... (read more)
2Adam Shimi10mo
Thanks for your comment! Actually, I don't think we really disagree. I might have just not made my position very clear in the original post. The point of the post is not to say that these activities are not often valuable, but instead to point out that they can easily turn into "To do science, I need to always do [activity]". And what I'm getting from the examples is that in some cases, you actually don't need to do [activity]. There's a shortcut, or maybe just you're in a different phase of the problem. Do you think there is still a disagreement after this clarification?

Similar but not exactly.

I mean that you take some known distribution (the training distribution) as a starting point. But when sampling actions you do so from shifted on truncated distribution to favour higher reward policies. 

The in the decision transformers I linked, AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the system AI predict the future reward, but with the distribution shifted towards higher rewards.

I discussed this a bit more after posting the above co... (read more)

From my reading of quantilizers, they might still choose "near-optimal" actions, just only with a small probability. Whereas a system based on decision transformers (possibly combined with a LLM) could be designed that we could then simply tell to "make me a tea of this quantity and quality within this time and with this probability" and it would attempt to do just that, without trying to make more or better tea or faster or with higher probability.

Any policy can be model as a consequentialist agent, if you assume a contrived enough utility function. This statement is true, but not helpful.

The reason we care about the concept agency, is because there are certain things we expect from consequentialist agents, e.g. instrumental convergent goals, or just optimisation pressure in some consistent direction. We care about the concept of agency because it holds some predictive power. 

[... some steps of reasoning I don't know yet how to explain ...]

Therefore, it's better to use a concept of agency that ... (read more)

You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?

Thanks :)
How are the completions provided? 
Are you just looking at the output probabilities for the two relevant completions?

1Ethan Perez1y
The completions are provided by the task authors (2 completions written for each example). We give those to the LM by evaluating the output probability of each completion given the input text. We then normalize the output probabilities to sum to 1, and then use those to compute the loss/accuracy/etc.

I'm confused why the uniform baseline is always 0.5.
This makes sense when the model is choosing between A and B, or Y or N. But I don't see why you consider 0.5 to be a baseline in the other two cases.

I think the baseline is useful for interpretation. In some of the examples the reason the smaller model does better is because it is just answer randomly, while the larger model is misled somehow. But if there is no clear baseline, then I suggest removing this line from the plot.

1Ethan Perez1y
These are all 2-way classification tasks (rather than e.g., free-form generation tasks), where the task authors provided 2 possible completions (1 correct and 1 incorrect), which is why we have a baseline!

In this particular experiment, the small models did not have an object-level hypotheses. It just had no clue and answered randomly.

I think the experiment shows that sometimes smaller models are too dumb to pick up the misleading correlation, which can though off bigger models.

Todays hot takes (or something)

  • There is nothing special about human level intelligence, unless you have imitation learning, in which case human level capabilities are very special.
  • General intelligence is not very efficient. Therefore there will not be any selection pressure for general intelligence as long as other options are available.
  • The no free lunch theorem only says that you can’t learn to predict noise.


GI is very efficient, if you consider that you can reuse a lot machinery that you learn, rather than needing to relearn it over and over again. 

LM memetics:

LM = language model (e.g. GPT-3)

If LMs reads each others text we can get LM-memetics. A LM meme is a pattern which, if it exists in the training data, the LM will output at higher frequency that in the training data. If the meme is strong enough and LLMs are trained on enough text from other LMs, the prevalence of the meme can grow exponentially. This has not happened yet.

There can also be memes that has a more complicated life cycle, involving both humans and LMs. If the LM output a pattern that humans are extra interested in, then the humans ... (read more)

Lets say that 

U_A = 3x + y

Then (I think) for your inequality to hold, it must be that 

U_B = f(3x+y), where f' >= 0

If U_B care about x and y in any other proportion, then B can make trade-offs between x and y which makes things better for B, but worse for A. 

This will be true (in theory) even if both A and B are satisfisers. You can see this by assuming replacing y and x with sigmoids of some other variables.

Yes, I like this one. We don't want the AI to find a way to give it self utility while making things worse for us. And if we are trying to make things better for us, we don't want the AI to resist us.

Do you want to find out what these inequalities implies about the utility functions? Can you find examples where your condition is true for non-identical functions? 

I don't have a specific example right now but some things that come to mind:

  • Both utility functions ultimately depend in some way on a subset of background conditions, i.e. the world state
  • The world state influences the utility functions through latent variables in the agents' world models, to which they are inputs.
  • changes only when (A's world model) changes which is ultimately caused by new observations, i.e. changes in the world state (let's assume that both A and B perceive the world quite accurately).
  • If whenever changes doesn't decrease,
... (read more)

This is a good question.

The not so operationalized answer is that a good operationalization is one that are helpful for achieving alignment.

An operationalization of [helpfulness of an operationalization] would give some sorts to gears level understanding of what shape the operationalization should have to be helpful. I don't have any good model for this, so I will just gesture vaguely.

I think that mathematical descriptions are good, since they are more precise. My first operationalization attempt is pretty mathematical which is good. It is also more "const... (read more)

Can't you restate the second one as the relationship between two utility functions and such that increasing one (holding background conditions constant) is guaranteed not to decrease the other? I.e. their respective derivatives are always non-negative for every background condition.

I recently updated how I view the alignment problem. The post that caused my update is this one form the shard sequence. Also worth mentioning is older post that points to the same thing, but I just happen to read it later.

Basically I used to think we needed to solve both outer and inner alignment separately. No I no longer think this is a good decomposition of the problem. 

It’s not obvious that alignment must factor in the way described above. There is room for trying to set up training in such a way to guarantee a friendly mesa-objective somehow wit

... (read more)

If something is good at replicating, then there will be more of that thing, this creates a selection effect for things that are good at replicating. The effects of this can be observed in biology and memetics. 

Maybe self replication can be seen as an agentic system with the goal of self replicating? In this particular question all uncertainty comes from "agent" being a fuzzy concept, and not from any uncertainty about the world. So answering this question will be a choice of perspective, not information about the world. 

Either way, the type of ag... (read more)

Related to 

infraBook Club I: Corrigibility is bad ashkually

One of my old blog posts I never wrote (I did not even list it in a "posts I will never write" document) is one about how corrigibility are anti correlated with goal security. 

Something like: If you build an AI that don't resist someone trying to change its goals, it will also not try to stop bad actors from changing its goal. (I don't think this particular worry applies to Paul's version of corrigibility, but this blog post idea was from before I learned about his definition.)

I'm not talking about recursive self-improvement. That's one way to take a sharp left turn, and it could happen, but note that humans have neither the understanding nor control over their own minds to recursively self-improve, and we outstrip the rest of the animals pretty handily. I'm talking about something more like “intelligence that is general enough to be dangerous”, the sort of thing that humans have and chimps don't.


Individual humans can't FOOM (at lest not yet), but humanity did. 

My best guess is that humanity took a sharp left turn whe... (read more)

(Just typing as I think...)

What if I push this line of thinking to the extreme. If I just pick agents randomly from the space of all agents, then this should be maximally random, and that should be even better. Now the part where we can mine information of alignment from the fact that humans are at least some what aligned is gone. So this seems wrong. What is wrong here? Probably the fact that if you pick agents randomly from the space of all agents, you don't get greater variation of aliment, compare to if you pick random humans, because probably all the ... (read more)

I mean that the information of what I value exists in my brain. Some of this information is pointers to things in the real world. So in a sense the information partly exist in the relation/correlation between me and the world. 

I defiantly don't mean that I can only care about my internal brain state. To me that is just obviously wrong. Although I have met people who disagree, so I see where the misunderstanding came from.

2Vladimir Nesov1y
That's not what I'm talking about. I'm not talking about what known goals are saying, or what they are speaking of, what they consider valuable or important. I'm talking about where the data to learn what they are is located, as we start out not knowing the goals at all and need to learn them. There is a particular thing, say a utility function, that is the intended formulation of goals. It could be the case that this intended utility function could be found somewhere in the brain. That doesn't mean that it's a utility function that cares about brains, the questions of where it's found and what it cares about are unrelated. Or it could be the case that it's recorded on an external hard drive, and the brain only contains the name of the drive (this name is a "pointer to value"). It's simply not the case that you can recover this utility function without actually looking at the drive, and only looking at the brain. So utility function u itself depends on environment E, that is there is some method of formulating utility functions t such that u=t(E). This is not the same as saying that utility of environment depends on environment, giving the utility value u(E)=t(E)(E) (there's no typo here). But if it's actually in the brain, and says that hard drives are extremely valuable, then you do get to know what it is without looking at the hard drives, and learn that it values hard drives.

Blogposts are the result of noticing difference in beliefs. 
Either between you and other of between you and you, across time.

I have lots of ideas that I don't communicate. Sometimes I read a blogpost and think "yea I knew that, why didn't I write this". And the answer is that I did not have an imagined audience.

My blogposts almost always span after I explained a thing ~3 times in meat space. Generalizing from these conversations I form an imagined audience which is some combination of the ~3 people I talked to. And then I can write. 

(In a convers... (read more)

I almost totally agree with this post. This comment is just nit picking and speculation.

Evolution has an other advantage, that is relate to "getting a lot's of tries" but also importantly different.

It's not just that evolution got to tinker a lot before landing on a fail proof solution. Evolution don't even need a fail proof solution. 

Evolution is "trying to find" a genome, which in interaction with reality, forms a brain that causes that human to have lots of kids. Evolution found a solution that mostly works, but sometimes don't. Some humans decided... (read more)

2Alex Turner1y
Here's a consideration which Quintin pointed out. It's actually a good thing that there is variance in human altruism/caring. Consider a uniform random sample of 1024 people, and grade them by how altruistic / caring they are (in whatever sense you care to consider). The most aligned and median-aligned people will have a large gap. Therefore, by applying only 10 bits of optimization pressure to the generators of human alignment (in the genome+life experiences), you can massively increase the alignment properties of the learned values. This implies that it's relatively easy to optimize for alignment (in the human architecture & if you know what you're doing). Conversely, people have ~zero variance in how well they can fly. If it were truly hard (in theory) to improve the alignment of a trained policy, people would exhibit far less variance in their altruism, which would be bad news for training an AI which is even more altruistic than people are.

This is probably too obvious to write, but I'm going to say it anyway. It's my short form, and approximately no-one reads short forms. Or so I'm told.

Human value formation is to a large part steered by other humans suggesting value systems for you. You get some hard to interpret reward signal from your brainstem, or something. There are lots of "hypothesis" for the "correct reward function" you should learn. 

(Quotation marks because there are no ground through for what values you should have. But this is mathematically equivalent to a learning the tru... (read more)

What is alignment? (operationalisation)

Toy model: Each agent has a utility function they want to maximise. The input to the utility function is a list of values describing the state of the world. Different agents can have different input vectors. Assume that every utility function monotonically increases, decreases or stays constant for changes in each impute variable (I did say it was a toy model!). An agent is said to value something if the utility function increases with increasing quantity of that thing. Note that if an agents utility function decrease... (read more)

Re second try: what would make a high-level operationalisation of that sort helpful? (operationalize the helpfulness of an operationalisation)

Here's what you wrote:

This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations.

Do you still agree with yourself?

I'm no longer sure. Thanks!

In that case I'm confused about this statement

This interpretation makes sense even in the absence of “agents” with “beliefs”

What is priors in the absence of  something like agents with beliefs?

Frequencies are one example.

We’ve shown that the probability  summarizes all the information in  relevant to , and throws out as much irrelevant information as possible. 

This seems correct.

Lets say two different points in the data configuration space, X_1 and X_2, provide equal evidence for q. Then P[q|X_1] = P[q|X_2]. The two different data possibilities are mapped to the same point in this compressed map. So far so good.

(I assume that I should interpret the object P[q|X] as a function over X, not as a point probability for a specific X.)


... (read more)
Yeah, I basically buy that complaint. None of this was intended to get rid of priors, because a correct model shouldn't get rid of priors.

I don't think this is true:

But there’s a biological analogy: classical conditioning. E.g. I can choose to do X right before Y, and then I’ll learn an association between X and Y which I wouldn’t have learned if I’d done X a long time before doing Y.

I could not find any study that test this directly, but I don't expect conditioning to work if you yourself causes the unconditioned stimuli (US), Y in your example. My understanding of conditioning is that if there is no surprise there is no learning. For example: If you first condition an animal to expect... (read more)

I agree that it's not exactly FDT. I think I actually meant updateless decision theory (UDT), but I'm not sure because I have some of uncertainty to exactly what others mean by UDT.

I claim that mutations + natural selection (evolution) selects for agents that acts according to the policy they would have wanted to pre-commit to, at the time of their birth (last mutation).

Yes, there are some details around who I recognize as a copy of me. In classical FDT this would be anyone who are running the same program (what ever that means). In evolution this would be anyone who are carrying the same genes. Both of these concept are complicated by "same program" and "same genes" are scalar (or more complicated?) and not Boolean values.

Edit: I'm not sure I agree with what I just said. I believe something in this direction, but I want to think some more. For example, people with similar genes probably don't cooperate because decision theory (my decision to cooperate with you is correlated with your decision to cooperate with me), but because shared goals (we both want to spread our shared genes).

Update: I'm shifting towards thinking that peer-review could be good if done right, because:  

  • I'm told it does work well in practice some times, creating a proof of concept.
  • I can see how being asked specifically to review some specific work could make me motivated to put in more work than I would do in a more public format (talk or blogpost).

Some thoughts on this question that you mention briefly in the talk.

What decision theory does evolution select for?

I think that evolution selects for functional decision theory (FDT). More specifically, it selects the best policy over a life time, and not the best action in a given situation. I don't mean that we actually cognitively calculate FDT, but that there is an evolutionary pressure to act as if we follow FDT 

Example: Revenge
By revenge I mean burning some of your utility just to get back at someone who hurt you. 

Revenge is equivalent to t... (read more)

This argument sounds roughly right to me, though I'm not sure FDT is exactly the right thing. If two organisms were functionally identical but had totally different genomes, then FDT would decide as though they're one unit, whereas I think evolution would select for deciding as though only the genetically-identical organisms are a unit? I'm not entirely sure about that.

I'm surprised by two implicit claims you seem to be making.

And if your research is confined to arXiv or the Alignment Forum, it can be really hard to get any sort of deep feedback on it.

Is your experience that peer-review is a good source of deep feedback? 

I have a few peer-reviewed physics publications. The only useful peer-review feedback I got was a reviewer who pointed out a typo in one of my equations. Everything else have been gatekeeping related comments, which is no surprise given that peer-review mainly is a status gate keeping function. I go... (read more)

4Adam Shimi1y
It's not that easy to justify a post from a year ago, but I think that what I meant was that the alignment forum has a certain style of alignment research, and thus only reading it means you don't see stuff like CHAI research or other works that are and aim at alignment without being shared that much on the AF.
7Linda Linsefors1y
Update: I'm shifting towards thinking that peer-review could be good if done right, because:   * I'm told it does work well in practice some times, creating a proof of concept. * I can see how being asked specifically to review some specific work could make me motivated to put in more work than I would do in a more public format (talk or blogpost).

I'm also interested to se the list of candidate instincts. 

Regarding funding, how much money do you need? Just order of magnitude. There lots of diffrent grants and where you want to appy depends on the size of your budget.

"Most AI reserch focus on building machines that do what we say. Aligment reserch is about building machines that do what we want."

Source: Me, probably heavely inspred by "Human Compatible" and that type of arguments. I used this argument in conversations to explain AI Alignment for a while, and I don't remember when I started. But the argument is very CIRL (cooperative inverse reinforcment learning).

I'm not sure if this works as a one liner explanation. But it does work as a conversation starter of why trying to speify goals directly is a bad idea. And ho... (read more)

I think one major reason why people don't tend to get hijacked by imagined adversaries is that you can't simulate someone who is smarter than you, and therefore you can defend against anything you can simulate in your mind.

This is not a perfect arugment since I can imagine someone that has power over me in the real world, and for example imagine how angry they would be at me if I did something they did not like. But then their power over me comes from their power in the real world, not their ability to outsmart me inside my own mind.

2Abram Demski1y
Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with "takeover without holding power over someone". Specifically this person described enlightenment in terms close to "I was ready to pack my things and leave. But the poison was already in me. My self died soon after that." It's possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).

The correct labeling of how violent a knifing is, is not 50.1%, or 49.9%. The correct label is 0 or 100%. There is no "ever so slightly" in the training data. The percentage is about the uncertanty of classifyer, it is not about degrees of violence in the sample. It it was the other way around, then I would mostsy agree with the current training scheem, as I said.

If the model is well calibrated then half the samples would be safe, and half violent at 50%. Moving a up the safe one is helpfull. Decreesing missclassification of safe samples will increas the c... (read more)

There’s one thing you can do that definitely works, which is to only get labels for snippets which are just barely considered safe enough by your classifier. Eg if your threshold is set to 99%, so that a completion won’t be accepted unless the classifier is 99% sure that it’s safe, then there’s no point looking at completions rated as <99% likely to be safe (because the classifier isn’t going to accept them), and also it’s probably a better bet to look at things that the model thinks are 99.1% likely to be safe rather than 99.9%, because (assuming the m

... (read more)
You don't want to 'maximize information' (or minimize variance). You want to minimize the number of errors you make at your decision-threshold. Your threshold is not at 50%, it's at 99%. Moving an evil sample from 50% to 0% is of zero intrinsic value (because you have changed the decision from 'Reject' to 'Reject' and avoided 0 errors). Moving an evil sample from 99.1% to 98.9% is very valuable (because you have changed the decision from 'Accept' to 'Reject' and avoided 1 error). Reducing the error on regions of data vastly far away from the decision threshold, such as deciding whether a description of a knifing is ever so slightly more 'violent' than a description of a shooting and should be 50.1% while the shooting is actually 49.9%, is an undesirable use of labeling time.

Not sure how usefull this is, but I think this counts as a selection theorem.
(Paper by Caspar Oesterheld, Joar Skalse, James Bell and me)

We played around with taking learning algorithms designed for multi armed bandit problems (your action matters but not your policy) and placing them in Newcomblike environments (both your acctual action and your probability distribution over actions matters). And then we proved some stuf about their behaviour.


That is definitely a selection theorem, and sounds like a really cool one! Well done.

Studying early cosmolog has a lot of the same epistemic problems as AI safety. You can't do direct experiment. You can't observe what's going on. You have to extrapolate anything you know far beyond where this knolage is trustworthy. 

By early cosmology I mean anything before recombination (when the matter transfomed from plasma to gas, and the uinverse became transparent to light) but especially anything to do with cosmic inflation or compeeting theories, or stuff about how it all started, or is cyclic, etc.

Unfortunatly I don't know what lessons we ca... (read more)

Load More