# Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

# Recent Discussion

This is a brief, stylized recounting of a few conversations I had at some point last year with people from the non-academic AI safety community:[1]

Me: you guys should write up your work properly and try to publish it in ML venues.

Them: well that seems like a lot of work and we don't need to do that because we can just talk to each other and all the people I want to talk to are already working with me.

Me: What about the people who you don't know who could contribute to this area and might even have valuable expertise?  You could have way more leverage if you can reach those people.  Also, there is increasing interest from the machine learning community in safety and alignment... because of progress...

0 mikbp 2d
This should be obvious to everyone! As an outside observer and huge sympathizer, I find it super frustrating how siloed the broad EA/rational/AI-alignment/adjacent community is. This specific issue with publication is only one of the consequences. Many of "you people" interact only among "yourselves" (and I'm not referring to you, Davids), very often even socially. I mean, you guys are trying to do the most good possible, so help others use and build on your work! And don't waste time reinventing what is already common or, at least, what already exists outside. More mixing would also help prevent Leverage-style failures and probably improve what from the outside seems like a very weird and unhealthy "bay area social dynamics" (as put by Kaj here [https://www.lesswrong.com/posts/duyJ9uFo2pnPgr3Yn/here-have-a-calmness-video]).
1 David Manheim 3h
Thanks, agreed. And as an aside, I don't think it's entirely coincidental that neither of the people who agree with you are in the Bay.
3 David Manheim 3d
From an outside view, I think the costs are worth paying far more often than people actually pay them, which was David's point and what I was trying to respond to. I think that it's more valuable than one expects to actually just jump through the hoops. And people who have never had any output published really should do that at least once. (Also, sorry for the zombie reply.)
2 Daniel Kokotajlo 3d
I love zombie replies. If you reread this conversation, you'll notice that I never said I think these people are correct. I was just saying that their stated motivations and views are their real motivations and views. I actually do agree with you and David Krueger that on the margin more LW types should be investing in making their work publishable and even getting it published. The plan had always been "do research first, then communicate it to the world when the time is right." Well, now we are out of time, so the time is right.

Thanks, reading closely I see how you said that, but it wasn't clear initially. (There's an illusion of disagreement, which I'll christen the "twitter fight fallacy," where unless the opposite is said clearly, people automatically assume replies are disagreements.)

1 JanBrauner 5d
This feels like a really adversarial quote. Concretely, the post says: This looks correct to me; there are LW posts that already basically look like papers. And within the class of LW posts that should be on arXiv at all, which is the target audience of my post, posts that basically look like papers aren't vanishingly rare.
1 Raymond Arnold 5d
I interpreted this as saying something superficial about style, rather than "if your post does not represent 100+ hours of research work it's probably not a good fit for arXiv." If that's what you meant, I think the post could be edited to make that clearer. If the opening section of your essay made it clearer which posts it was talking about, I'd probably endorse it (although I'm not super familiar with the nuances of arXiv gatekeeping, so I am mostly going off the collective response in the comment section).
1 JanBrauner 5d
I wrote this post. I don't understand where your claim ("Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv") is coming from.

TL;DR: There are many posts on the Alignment Forum/LessWrong that could easily be on arXiv. Putting them on arXiv has several large benefits and (sometimes) very low costs.

# Benefits of having posts on arXiv

There are several large benefits of putting posts on arXiv:

1. Much better searchability: posts show up in Google Scholar searches.
3. The article can accumulate citations, which are shown in Google and Google Scholar search results.

1) - 3) lead to more people reading your research, which hopefully leads to more people building on it and maybe useful feedback from outside of the established alignment community. In particular, if people see that the paper already has citations, this will lead to more people reading it, which will lead to more citations,...

3 Issa Rice 3d
I didn't log the time I spent on the original blog post, and it's kinda hard to assign hours to this since most of the reading and thinking for the post happened while working on the modeling aspects of the MTAIR project. If I count just the time I sat down to write the blog post, I would guess maybe less than 20 hours. As for the "convert the post to paper" part, I did log that time and it came out to 89 hours, so David's estimate of "perhaps another 100 hours" is fairly accurate.

I probably put in an extra 20-60 hours, so the total is probably closer to 150, which surprises me. I will add that a lot of the conversion time went to writing more, LaTeX figures, and citations, which were all, I think, substantively valuable additions. (Changing to a more scholarly style was not substantively valuable, nor was struggling with LaTeX margins and TikZ for the diagrams, and both took some part of the time.)

I wanted to write a long, detailed, analytic post about this, somewhat like my Radical Probabilism post (to me, this is a similarly large update). However, I haven't gotten around to it for a long while. And perhaps it is better as a short, informal post in any case.

I think my biggest update over the past year has been a conversion to teleosemantics. Teleosemantics is a theory of semantics -- that is, "meaning" or "aboutness" or "reference".[1]

To briefly state the punchline: Teleosemantics identifies the semantics of a symbolic construct as what the symbolic construct has been optimized to accurately reflect.

Previously, something seemed mysterious about the map/territory relationship. What could possibly imbue 'symbols' with 'meaning'? The map/territory analogy seems inadequate to answer this question. Indeed, to analogize "belief"...

If you pin down what a thing refers to according to what that thing was optimized to refer to, then don't you have to look at the structure of the one who did the optimizing in order to work out what a given thing refers to? That is, to work out what the concept "thermodynamics" refers to, it may not be enough to look at the time evolution of the concept "thermodynamics" on its own; I may instead need to know something about the humans who were driving those changes, and the goals held within their minds. But, if this is correct, then doesn't it raise anot...

Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung

Acknowledgements: Thanks to Janus and Jozdien for comments.

# Background

In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.

## Prompting LLMs with direct queries

When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt "What's the capital of France?", then it would continue with the word "Paris". That's because (1) GPT-4 is trained to be a good model of internet text,...

8 Chris van Merwijk 16h
I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from memory it seems like they emerge sooner than you would expect if this were the only reason (given the size of GPT-3's context window).

A good question. I've never seen it happen myself; so where I'm standing, it looks like short emergence examples are cherry-picked.

1 Chris van Merwijk 17h
It seems to me like this informal argument is a bit suspect. Actually, I think this argument would not apply to Solomonoff induction.

Suppose we have two programs that define distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (Equivalently, p1 samples each bit i.i.d. Bernoulli(1/2) from {0,1}, and p2 samples 0 i.i.d. with probability 100%.)

Suppose we use a perfect Bayesian reasoner to sample bitstrings, but we do it in precisely the same way LLMs do it according to the simulator model. That is, given a bitstring, we first formulate a posterior over programs, i.e. a "superposition" on programs, which we use to sample the next bit, then we recompute the posterior, etc. Then I think the probability of sampling 00000000... is just 50%. I.e., I think the distribution over bitstrings that you end up with is just the same as if you first sampled the program and stuck with it.

Here's a messy calculation which could be simplified (which I won't do):

$$P(\text{first } n \text{ bits are all zero}) = \prod_{i=0}^{n} P(x_i = 0 \mid x_{<i} = 0) = \prod_{i=0}^{n} \sum_{p \in \{p_1, p_2\}} P(x_i = 0 \mid p)\, P(p \mid x_{<i} = 0) = \prod_{i=0}^{n} \frac{2^{-i-1} + 1}{2^{-i} + 1} = \frac{2^{-n-1} + 1}{2}$$

The limit of this as $n \to \infty$ is 0.5.

I don't wanna try to generalize this, but based on this example it seems like if an LLM were an actual Bayesian, Waluigis would not be attractors. The informal argument is wrong because it doesn't take into account the fact that over time you sample increasingly many non-Waluigi samples, pushing down the probability of Waluigi. Then again, the presence of a context window completely breaks the above calculation in a way that preserves the point. Maybe the context window is what makes Waluigis into an attractor? (Seems unlikely actually, given that the context windows are fairly big.)
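The two-program example in this comment is small enough to check exactly. The following is a minimal sketch, not the commenter's code: the function names `prob_all_zero` and `closed_form` are hypothetical, and it assumes a 50/50 prior over p1 and p2, as the comment's calculation implies. It samples "LLM-style" (predict the next bit from the posterior, then update the posterior) and compares the result with simply mixing the two programs up front.

```python
from fractions import Fraction


def prob_all_zero(n_bits: int) -> Fraction:
    """Probability that the first n_bits sampled bits are all zero,
    computed step by step: predict from the posterior, then update it."""
    # Assumed prior: 1/2 on p1 (uniform bits), 1/2 on p2 (always zero).
    post_p1 = Fraction(1, 2)
    post_p2 = Fraction(1, 2)
    prob = Fraction(1)
    for _ in range(n_bits):
        # Predictive probability that the next bit is 0 under the mixture.
        pred_zero = post_p1 * Fraction(1, 2) + post_p2 * 1
        prob *= pred_zero
        # Bayes update on having observed another 0.
        post_p1 = post_p1 * Fraction(1, 2) / pred_zero
        post_p2 = post_p2 * 1 / pred_zero
    return prob


def closed_form(n_bits: int) -> Fraction:
    """Same quantity if we sample the program once and stick with it:
    1/2 * 2^(-n_bits) from p1 plus 1/2 * 1 from p2."""
    return (Fraction(1, 2 ** n_bits) + 1) / 2


if __name__ == "__main__":
    for m in (1, 2, 5, 10, 50):
        print(m, float(prob_all_zero(m)), prob_all_zero(m) == closed_form(m))
```

Here `n_bits` counts bits directly, so `prob_all_zero(m)` corresponds to the comment's product taken over `m` factors. The two functions agree exactly for every length, and the value approaches 1/2 as the length grows, which is the comment's claim that an idealized Bayesian sampler without a context window does not drift toward either program.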

This is the third in a three post sequence about interpreting Othello-GPT. See the first post for context.

This post is a detailed account of what my research process was, decisions made at each point, what intermediate results looked like, etc. It's deliberately moderately unpolished, in the hopes that it makes this more useful!

# The Research Process

This project was a personal experiment in speed-running doing research, and I got the core results in ~2.5 days/20 hours. This post has some meta level takeaways from this on doing mech interp research fast and well, followed by a (somewhat stylised) narrative of what I actually did in this project and why - you can see the file tl_initial_exploration.py in the paper repo for the code that I wrote as I...