## AI ALIGNMENT FORUM

Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.

# Wiki Contributions

LM memetics:

LM = language model (e.g. GPT-3)

If LMs read each other's text, we can get LM memetics. An LM meme is a pattern which, if it exists in the training data, the LM will output at a higher frequency than it appears in the training data. If the meme is strong enough and LMs are trained on enough text from other LMs, the prevalence of the meme can grow exponentially. This has not happened yet.
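A minimal sketch of the growth dynamic described above, under assumptions that are not in the original note: human-written text contains the pattern at a constant base rate `p0`, each LM over-produces the pattern by a fixed factor `amplification`, and a fixed share `lm_fraction` of every new training set is LM-written. All names and numbers are hypothetical.

```python
def simulate_meme(p0, amplification, lm_fraction, generations):
    """Track the prevalence of a pattern across successive training rounds.

    p0            -- prevalence of the pattern in human-written text (assumed constant)
    amplification -- factor by which an LM over-produces the pattern (> 1 for a meme)
    lm_fraction   -- share of each new training set that is LM-written
    """
    prevalence = [p0]
    for _ in range(generations):
        # The LM outputs the pattern more often than it saw it, capped at 100%.
        lm_output = min(1.0, amplification * prevalence[-1])
        # The next training set mixes human text with LM output.
        prevalence.append((1 - lm_fraction) * p0 + lm_fraction * lm_output)
    return prevalence


# Strong meme with lots of LM text: lm_fraction * amplification > 1.
strong = simulate_meme(p0=0.001, amplification=3.0, lm_fraction=0.5, generations=20)
# Same meme with little LM text: lm_fraction * amplification < 1.
weak = simulate_meme(p0=0.001, amplification=3.0, lm_fraction=0.1, generations=20)

print("strong:", strong[-1])
print("weak:  ", weak[-1])
```

In this toy model the prevalence grows roughly geometrically per round only while `lm_fraction * amplification` is above 1, and only until the output cap is reached. Below that threshold it settles close to the human base rate.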

There can also be memes that have a more complicated life cycle, involving both humans and LMs. If the LM outputs a pattern that humans are extra interested in, then the humans will multiply that pattern by quoting it in their blog posts, which some other LM will read, which will possibly make the pattern more prevalent in the output of that LM.

Generative models memetics:

The same thing can happen for any model trained to imitate the training distribution.

Let's say that

U_A = 3x + y

Then (I think) for your inequality to hold, it must be that

U_B = f(3x+y), where f' >= 0

If U_B cares about x and y in any other proportion, then B can make trade-offs between x and y that make things better for B, but worse for A.

This will be true (in theory) even if both A and B are satisficers. You can see this by replacing x and y with sigmoids of some other variables.
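The trade-off argument above can be checked with a small numeric sketch. The two candidate forms of U_B below are hypothetical examples chosen for illustration: one weighs x and y 1:1, the other is a non-decreasing function of 3x + y (here f(u) = u^3). The starting point and the step are arbitrary.

```python
def u_a(x, y):
    # A's utility, as in the comment above.
    return 3 * x + y


def u_b_other_ratio(x, y):
    # Hypothetical B that weighs x and y in a different proportion (1:1).
    return x + y


def u_b_monotone(x, y):
    # Hypothetical B of the form f(3x + y), with f(u) = u**3 (non-decreasing).
    return (3 * x + y) ** 3


def change(utility, start, step):
    """Change in `utility` when moving from `start` by `step`."""
    (x, y), (dx, dy) = start, step
    return utility(x + dx, y + dy) - utility(x, y)


start = (1.0, 1.0)
step = (-1.0, 2.0)  # give up one unit of x to gain two units of y

print("A:                ", change(u_a, start, step))
print("B (other ratio):  ", change(u_b_other_ratio, start, step))
print("B (f of 3x + y):  ", change(u_b_monotone, start, step))
```

With the 1:1 weighting, this step is a gain for B and a loss for A. With the monotone form, B's utility moves in the same direction as A's, so B has no reason to make the trade.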

Yes, I like this one. We don't want the AI to find a way to give itself utility while making things worse for us. And if we are trying to make things better for us, we don't want the AI to resist us.

Do you want to find out what these inequalities imply about the utility functions? Can you find examples where your condition is true for non-identical functions?

This is a good question.

The not-so-operationalized answer is that a good operationalization is one that is helpful for achieving alignment.

An operationalization of [helpfulness of an operationalization] would give some sort of gears-level understanding of what shape the operationalization should have to be helpful. I don't have any good model for this, so I will just gesture vaguely.

I think that mathematical descriptions are good, since they are more precise. My first operationalization attempt is pretty mathematical, which is good. It is also more "constructive" (not sure if this is the exact right word), i.e. it describes alignment in terms of internal properties, rather than outcomes. Internal properties are more useful as design guidelines, as long as they are correct. The big problem with my first operationalization is that it doesn't actually point to what we want.

The problem with the second attempt is that it just states what outcome we want. There is nothing in there to help us achieve it.

I recently updated how I view the alignment problem. The post that caused my update is this one from the shard sequence. Also worth mentioning is an older post that points to the same thing, but I just happened to read it later.

Basically, I used to think we needed to solve both outer and inner alignment separately. Now I no longer think this is a good decomposition of the problem.

It’s not obvious that alignment must factor in the way described above. There is room for trying to set up training in such a way to guarantee a friendly mesa-objective somehow without matching it to a friendly base-objective. That is: to align the AI directly to its human operator, instead of aligning the AI to the reward, and the reward to the human.

Quote from here

If something is good at replicating, then there will be more of that thing. This creates a selection effect for things that are good at replicating. The effects of this can be observed in biology and memetics.

Maybe self replication can be seen as an agentic system with the goal of self replicating? In this particular question all uncertainty comes from "agent" being a fuzzy concept, and not from any uncertainty about the world. So answering this question will be a choice of perspective, not information about the world.

Either way, the type of agency I'm mainly interested in is the type of agency that has other goals than just self replication. Although maybe there are things to be learned from the special case of having self replication as a goal?

If the AI learns my values then this is a replication of my values. But there are also examples of magic agentic force where my values are not copied at any point along the way.

Looking at how society is transferred between generations might give some clues to value learning? But I'm less optimistic about looking at what is similar between self replication in general, because I think I already know this, and also, it seems to be one abstraction level too high, i.e. the similarities are properties above the mechanistic details, and those details are what I want.

Related to

infraBook Club I: Corrigibility is bad ashkually

One of my old blog posts I never wrote (I did not even list it in a "posts I will never write" document) is one about how corrigibility is anti-correlated with goal security.

Something like: If you build an AI that doesn't resist someone trying to change its goals, it will also not try to stop bad actors from changing its goals. (I don't think this particular worry applies to Paul's version of corrigibility, but this blog post idea was from before I learned about his definition.)

I'm not talking about recursive self-improvement. That's one way to take a sharp left turn, and it could happen, but note that humans have neither the understanding nor control over their own minds to recursively self-improve, and we outstrip the rest of the animals pretty handily. I'm talking about something more like “intelligence that is general enough to be dangerous”, the sort of thing that humans have and chimps don't.

Individual humans can't FOOM (at least not yet), but humanity did.

My best guess is that humanity took a sharp left turn when we got a general enough language, and then again when we got writing, and possibly again when the skill of reading and writing spread to a majority of the population.

Before language, human intelligence was basically limited to what a single brain could do. When we got language, we got the ability to add compute (more humans) to the same problem-solving task. Humanity got parallel computing. These extra capabilities could be used to invent things to increase the population, i.e. recursive self-improvement.

Later, writing gave us external memory. Before, our computations were limited by human memory, but now we could start to fill up libraries, unlocking a new level of recursive self-improvement.

Every increase in literacy and communication technology (e.g. the internet) is humanity upgrading its capability.

(Just typing as I think...)

What if I push this line of thinking to the extreme? If I just pick agents randomly from the space of all agents, then this should be maximally random, and that should be even better. Now the part where we can mine information about alignment from the fact that humans are at least somewhat aligned is gone. So this seems wrong. What is wrong here? Probably the fact that if you pick agents randomly from the space of all agents, you don't get greater variation in alignment compared to if you pick random humans, because probably all the random agents you pick are just non-aligned.

So what is doing most of the work here is that humans are more aligned than random. Which I expect you to agree with. What you are also saying (I think) is that the tail-end level of alignment in humans is more important in some way than the mean or average level of alignment in humans. Because if we have the human distribution, we are just a few bits from locating the tail of the distribution. E.g. we are about 10 bits away from locating the top 0.1 percent. And because the tail is what matters, randomness is in our favor.
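As a quick check of the "10 bits" figure: singling out the top fraction q of a distribution takes about log2(1/q) bits of selection. The sketch below uses a hypothetical helper name and only does that arithmetic.

```python
import math


def bits_to_locate(top_fraction):
    """Bits of selection needed to single out the top `top_fraction` of a distribution."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    return -math.log2(top_fraction)


print(bits_to_locate(0.001))     # top 0.1 percent: just under 10 bits
print(bits_to_locate(1 / 1024))  # exactly 10 bits picks out 1 in 1024
```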

Does this capture what you are trying to say?

I mean that the information about what I value exists in my brain. Some of this information is pointers to things in the real world. So in a sense the information partly exists in the relation/correlation between me and the world.

I definitely don't mean that I can only care about my internal brain state. To me that is just obviously wrong. Although I have met people who disagree, so I see where the misunderstanding came from.