Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/rehearser.


Announcing Web-TAISU, May 13-17

So... apparently I underestimate the need to send out event reminders, but better late than never. Today is the 2:nd day (out of 4) of Web-TAISU, and it is not too late to join.

General information about the event:

Collaborative Schedule:

Let me know if you have any questions.

Using vector fields to visualise preferences and make them consistent

As mentioned, I did think of this of this model before, and I also disagree with Justin/Convergence on how to use it.

Lets say that the underlying space for the vector field is the state of the world. Should we really remove curl? I'd say no. It is completely valid to want to move along some particular path, even a circle, or more likely, a spiral.

Alternatively, lets say that the underlying space for the vector field is world histories. Now we should remove curl, becasue any circular preference in this space is inconsistent. But what even is the vector field in this picture?


My reason for considering values as a vector is becasue that is sort of how it feels to me on the inside. I have noticed that my own values are very different depending on my current mood and situation.

  • When I'm sand/depressed, I become a selfish hedonist. All I care about is for me to be happy again.
  • When I'm happy I have more complex and more altruistic values. I care about truth and the well-being of others.

It's like these wants are not tracking my global values at all, but just pointing out a direction in which I want to move. I doubt that I even have global values, because that would be very complicated, and also what would be the use of that? (Except when building a super intelligent AI, but that did not happen much in our ancestral environment.)

[Meta] Do you want AIS Webinars?

Let's do it!

If you pick a time and date and write up an abstract, then I will sort out the logistic. Worst case it's just you and me having a conversation, but most likely some more people will show up.

Linda Linsefors's Shortform

I'm basically ready to announce the next Technical AI Safety Unconference (TAISU). But I have hit a bit of decision paralysis as to what dates it should be.

If you are reasonably interested in attending, please help me by filling in this doodle

If you don't know what this is about, have a look at the information for the last one.

The venue will be EA Hotel in Blackpool UK again.

“embedded self-justification,” or something like that

The way I understand your division of floors and sealing, the sealing is simply the highest level meta there is, and the agent has *typically* no way of questioning it. The ceiling is just "what the algorithm is programed to do". Alpha Go is had programed to update the network weights in a certain way in response to the training data.

What you call floor for Alpha Go, i.e. the move evaluations, are not even boundaries (in the sense nostalgebraist define it), that would just be the object level (no meta at all) policy.

I think this structure will be the same for any known agent algorithm, where by "known" I mean "we know how it works", rather than "we know that it exists". However Humans seems to be different? When I try to introspect it all seem to be mixed up, with object level heuristics influencing meta level updates. The ceiling and the floor are all mixed together. Or maybe not? Maybe we are just the same, i.e. having a definite top level, hard coded, highest level meta. Some evidence of this is that sometimes I just notice emotional shifts and/or decisions being made in my brain, and I just know that no normal reasoning I can do will have any effect on this shift/decision.

Vanessa Kosoy's Shortform

I agree that you can assign what ever belief you want (e.g. what ever is useful for the agents decision making proses) for for what happens in the counterfactual when omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, becasue this is something that might actually be observed.

All I know is Goodhart

Weather this works or not is going to depend heavily on what looks like.

Given , i.e. , what does this say about ?

The answer depends on the amount of mutual information between , and . Unfortunately the the more generic is, (i.e. any function is possible) the less mutual information there will be. Therefore, unless we know some structure about , the restriction to is not going to do much. The agent will just find a very different policy that also actives very high in some very Goodharty way, but does not get penalized because low value for on is not correlated with low value on .

This could possibly be fixed by adding assumptions of the type for any that does too well on . That might yield something interesting, or it might just be a very complicated way of specifying as satisfiser, I don't know.

Conceptual Problems with UDT and Policy Selection

I think UDT1.1 have two fundamentally wrong assumptions built in.

1) Complete prior: UDT1.1 follows the policy that is optimal according to it's prior. This is incommutable in general settings and will have to be approximated some how. But even an approximation of UDT1.1 assumes that UDT1.1 is at least well defined. However in some multi agent settings or when the agent is being fully simulated by the environment, or any other setting where the environment is necessary bigger than the agent, then UDT1.1 is ill defined.

2) Free will: In the problem Agent Simulates Predictor, the environment is smaller than the agent, so it is falls outside the above point. Here instead I think the problem is that the agent assumes that it has free will, when in fact it behaves in a deterministic manner.

The problem of free will in Decision Problems is even clearer in the smoking lesion problem:

You want to smoke and you don't want Cancer. You know that people who smoke are more likely get cancer, but you also know that smoking does not cause cancer. Instead, there is a common cause, some gene, that happens to both increase the risk of cancer and make it more likely that a person with this gene are more likely to choose to smoke. You can not test if you have the gene.

Say that you decide to smoke, becasue ether you have the gene or not so you might as well enjoy smoking. But what if everyone though like this? Then there would be no correlation between the cancer gene and smoking. So where did the statistics about smokers getting cancer come from (in this made up version of reality).

If you are the sort of person who smokes no mater what, then ether:

a) You are sufficiently different from most people such that the statistics does not apply to you.


b) The cancer gene is correlated with being the sort of person that has a decision possess that leads to smoking.

If b is correct, then maybe you should be the sort of algorithm that decides not to smoke, as to increase the chance of being implemented into a brain that lives in a body with less risk of cancer. But if you start thinking like that, then you are also giving up your hope at affecting the universe, and resign to just choosing where you might find yourself, and I don't think that is what we want from a decision theory.

But there also seems to be no good way of thinking about how to steer the universe with out pretending to have free will. But since that is actually a falls assumption, there will be weird edge cases where you're reasoning breaks down.

TAISU - Technical AI Safety Unconference

The TAISU is now full. I might still accept exceptional applications. But don't expect to be accepted just becasue you meet the basic requirements.

Load More