## AI ALIGNMENT FORUM

Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.

At the time of writing, www.aisafety.camp goes to our new website while aisafety.camp goes to our old website. We're working on fixing this.

I have two hypotheses for what is going on. I'm leaning towards 1, but I'm very unsure.

1)

king - man + woman = queen

is true for word2vec embeddings but not for LLaMa2 7B embeddings, because word2vec has far fewer embedding dimensions.

• LLaMa2 7B has 4096 embedding dimensions.
• This paper uses a variety of word2vec with 50, 150 and 300 embedding dimensions.

Possibly, when you have thousands of embedding dimensions, these dimensions will encode lots of different connotations of these words. These connotations will probably not line up with the simple relation [king - man + woman = queen], and therefore we get [king - man + woman ≠ queen] for high dimensional embeddings.

2)

king - man + woman = queen

isn't true for word2vec either. If you do it with word2vec embeddings you get more or less the same result I did with LLaMa2 7B.

(As I'm writing this, I'm realising that just getting my hands on some word2vec embeddings and testing this for myself seems much easier than decoding what the papers I found are actually saying.)

"▁king" - "▁man" + "▁woman" ≠ "▁queen" (for LLaMa2 7B token embeddings)

I tried to replicate the famous "king" - "man" + "woman" = "queen" result from word2vec using LLaMa2 token embeddings. To my surprise it did not work.

I.e., if I look for the token with the biggest cosine similarity to "▁king" - "▁man" + "▁woman", it is not "▁queen".

Top ten cosine similarities for

• "▁king" - "▁man" + "▁woman"
is ['▁king', '▁woman', '▁King', '▁queen', '▁women', '▁Woman', '▁Queen', '▁rey', '▁roi', 'peror']
• "▁king" + "▁woman"
is ['▁king', '▁woman', '▁King', '▁Woman', '▁women', '▁queen', '▁man', '▁girl', '▁lady', '▁mother']
• "▁king"
is ['▁king', '▁King', '▁queen', '▁rey', 'peror', '▁roi', '▁prince', '▁Kings', '▁Queen', '▁König']
• "▁woman"
is ['▁woman', '▁Woman', '▁women', '▁man', '▁girl', '▁mujer', '▁lady', '▁Women', 'oman', '▁female']
• projection of "▁queen" on span( "▁king", "▁man", "▁woman")
is ['▁king', '▁King', '▁woman', '▁queen', '▁rey', '▁Queen', 'peror', '▁prince', '▁roi', '▁König']
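
For anyone who wants to rerun this kind of check, here is a minimal sketch of how rankings like the ones above could be computed. It is not the code used for the results in this post. The function names, the `exclude` option and the toy 3-dimensional embeddings are all made up for illustration. With real embeddings, `E` would be the model's token embedding matrix and `vocab` the matching token list.

```python
import numpy as np

def top_k_cosine(query, embeddings, vocab, k=10, exclude=()):
    """Return the k tokens whose embeddings have the highest cosine similarity to query."""
    # Normalise the rows and the query so that dot products are cosine similarities.
    emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_unit = query / np.linalg.norm(query)
    sims = emb_unit @ q_unit
    order = np.argsort(-sims)  # indices sorted from most to least similar
    ranked = [vocab[i] for i in order if vocab[i] not in exclude]
    return ranked[:k]

def project_onto_span(target, basis):
    """Orthogonal projection of target onto the span of the rows of basis."""
    # Least squares finds the coefficients c that minimise ||basis.T @ c - target||.
    coeffs, *_ = np.linalg.lstsq(basis.T, target, rcond=None)
    return basis.T @ coeffs

# Toy embeddings, invented for illustration only (NOT real model weights).
# They are built so that the analogy holds exactly.
vocab = ["king", "queen", "man", "woman"]
E = np.array([
    [1.0, 1.0, 0.0],  # king
    [1.0, 0.0, 1.0],  # queen
    [0.0, 1.0, 0.0],  # man
    [0.0, 0.0, 1.0],  # woman
])
idx = {w: i for i, w in enumerate(vocab)}

analogy = E[idx["king"]] - E[idx["man"]] + E[idx["woman"]]
ranking = top_k_cosine(analogy, E, vocab, k=4)
ranking_excl = top_k_cosine(analogy, E, vocab, k=1,
                            exclude={"king", "man", "woman"})

# In this 3-dimensional toy the three basis vectors span the whole space,
# so the projection simply returns the queen vector unchanged.
basis = E[[idx["king"], idx["man"], idx["woman"]]]
queen_proj = project_onto_span(E[idx["queen"]], basis)
```

The `exclude` argument makes it easy to compare the ranking with and without the query tokens among the candidates, which is the distinction that matters for the result here.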

"▁queen" is the closest match only if you exclude any version of king and woman. But this seems to be only because "▁queen" is already the second-closest match for "▁king". Involving "▁man" and "▁woman" only makes things worse.

I then tried looking up exactly what the word2vec result is, and I'm still not sure.

Wikipedia cites Mikolov et al. (2013). This paper is about embeddings from RNN language models, not word2vec, which is OK for my purposes, because I'm also not using word2vec. More problematic is that I don't know how to interpret how strong their results are. I think the relevant result is this:

We see that the RNN vectors capture significantly more syntactic regularity than the LSA vectors, and do remarkably well in an absolute sense, answering more than one in three questions correctly.

which doesn't seem very strong. Also, I can't find any explanation of what LSA is.

I also found this other paper, which is about word2vec embeddings and has this promising figure

But the caption is just a citation to this third paper, which doesn't have that figure!

I've not yet read the last two papers in detail, and I'm not sure if or when I'll get back to this investigation.

If someone knows more about exactly what the word2vec embedding results are, please tell me.

I don't think seeing it as a one-dimensional dial is a good picture here.

The AI has lots and lots of sub-circuits, and many* can have more or less self-other-overlap. For “minimal self-other distinction while maintaining performance” to do anything, it's sufficient that you can increase self-other-overlap in some subset of these without hurting performance.

* All the circuits that have to do with agent behaviour or beliefs.

This already strongly suggests some connection between induction heads and in-context learning, but beyond just that, it appears this window is a pivotal point for the training process in general: whatever's occurring is visible as a bump on the training curve (figure below). It is in fact the only place in training where the loss is not convex (monotonically decreasing in slope).

I can see the bump, but it's not the only one. The two layer graph has a second similar bump, which also exists in the one layer model, and I think I can also see it very faintly in the three layer model. Did they ignore the second bump because it only exists in small models, while their bump continues to exist in bigger models?

I feel a bit behind on everything going on in alignment, so for the next weeks (or more) I'll focus on catching up on whatever I find interesting. I'll be using my short form to record my thoughts.

I make no promises that reading this is worth anyone's time.

What to focus on?

I do have some opinions on which alignment directions are more or less promising. I'll probably venture in other directions too, but my main focus is going to be around what I expect an alignment solution to look like.

1. I think that to have an aligned AI it is necessary (but not sufficient) that we have shared abstractions/ontology/concepts (whatever you want to call it) with the AI.
2. I think the way to make progress on the above is to understand what ontology/concepts/abstractions our current AIs are using, and the process that shapes these abstractions.
3. I think the way to do this is through mech-interp, mixed with philosophising and theorising. Currently I think the mech-interp part (i.e. looking at what is actually going on in a network) is the bottleneck, since I think that philosophising without data (i.e. agent foundations) has not made much progress lately.

Conclusion:

• I'll mainly focus on reading up on mech-interp and related areas such as dev-interp. I've started on the interp section of Lucius's alignment reading list.
• I should also read some John Wentworth, since his plan is pretty close to the path I think is most promising.

Feel free to throw other recommendations at me.

Some thoughts on things I've read so far

I really liked Understanding and controlling a maze-solving policy network. It's a good experiment and a good writeup.

But also, how interesting is this? Basically, they removed the cheese observation, and it made the agent act as if there were no cheese. This is not some sophisticated steering technique that we can use to align the AI's motivation.

I discussed this with Lucius, who pointed out that the interesting result is that the cheese location information is linearly separable from other information in the middle of the network. I.e. it's not scrambled in a completely opaque way.

Which brings me to Book Review: Design Principles of Biological Circuits

Alon’s book is the ideal counterargument to the idea that organisms are inherently human-opaque: it directly demonstrates the human-understandable structures which comprise real biological systems.

Both these posts are evidence for the hypothesis that we should expect evolved networks to be modular, in a way that is possible for us to decode.

By "evolved" I mean things in the same category as natural selection and gradient descent.

In the real network, there are a lot more than two activations. Our results involve a 32,768-dimensional cheese vector, subtracted from about halfway through the network:

Did you try other locations in the network?

I would expect it to work pretty much anywhere, and I'm interested to know if my prediction is correct.

I'm pretty sure that what happens is (as you also suggest) that the agent stops seeing the cheese.

Imagine you did the cheese subtraction on the input layer (i.e. the pixel values of the maze). In this case it would just trivially remove the cheese from the picture, resulting in behaviour that is identical to no cheese. So I expect something similar to happen at later layers, as long as what the network is mostly doing is just decoding the image. So at whatever layer this trick stops working, this should mean that the agent has started planning its moves.

The math in the post is super hand-wavy, so I don't expect the result to be exactly correct. However, in your example, l up to 100 should be OK, since there is no superposition. 2.7 is almost 2 orders of magnitude off, which is not great.

Looking into what is going on: I'm basing my results on the Johnson–Lindenstrauss lemma, which gives an upper bound on the interference. In the post I'm assuming that the actual interference is of the same order of magnitude as this upper bound. This assumption clearly fails in your example, since the interference between features is zero, and nothing is the same order of magnitude as zero.
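
To make the contrast concrete, here is a small numerical sketch, not taken from the post. It compares the worst-case interference (largest absolute cosine similarity between two different feature directions) for an orthogonal set of features and for a larger set of random directions squeezed into the same space. The dimension `d = 100`, the count of 1000 random features and the helper name `max_interference` are arbitrary choices for illustration.

```python
import numpy as np

def max_interference(features):
    """Largest |cosine similarity| between two distinct feature directions."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    gram = unit @ unit.T          # pairwise cosine similarities
    np.fill_diagonal(gram, 0.0)   # ignore each feature's similarity with itself
    return float(np.abs(gram).max())

rng = np.random.default_rng(0)
d = 100  # illustrative embedding dimension (an assumption, not from the post)

# No superposition: d features along the d coordinate axes.
orthogonal_feats = np.eye(d)

# Superposition: ten times more features than dimensions, in random directions.
random_feats = rng.standard_normal((10 * d, d))

orth_interference = max_interference(orthogonal_feats)
rand_interference = max_interference(random_feats)

print("orthogonal features:", orth_interference)
print("random features:    ", rand_interference)
```

The orthogonal case gives exactly zero interference, so any estimate that assumes the interference sits near a nonzero upper bound has to be off there. The random case gives something strictly between 0 and 1.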

I might try to do the math more carefully, unless someone else gets there first. No promises though.

I expect that my qualitative claims will still hold. This is based on more than the math, but the math seemed easier to write down. I think it would be worth doing the math properly, both to confirm my claims and because it may be useful to have more accurate quantitative formulas. I might do this if I get some spare time, but no promises.

my qualitative claims = my claims about what types of things the network is trading away when using superposition

quantitative formulas = how much of these things are traded away for what amount of superposition.

Recently someone either suggested to me (or maybe told me they or someone else were going to do this?) that we should train AI on legal texts, to teach it human values. Ignoring the technical problem of how to do this, I'm pretty sure legal texts are not the right training data. But at the time, I could not clearly put into words why. Today's SMBC explains this for me:

Saturday Morning Breakfast Cereal - Law (smbc-comics.com)

Law is not a good representation or explanation of most of what we care about, because it's not trying to be. Law is mainly focused on the contentious edge cases.

Training an AI on trolley problems and other ethical dilemmas is even worse, for the same reason.