This is a brief, stylized recounting of a few conversations I had at some point last year with people from the non-academic AI safety community:[1]
Me: You guys should write up your work properly and try to publish it in ML venues.
Them: Well, that seems like a lot of work, and we don't need to do it; we can just talk to each other, and all the people I want to talk to are already working with me.
Me: What about the people you don't know who could contribute to this area and might even have valuable expertise? You could have way more leverage if you could reach those people. Also, there is increasing interest from the machine learning community in safety and alignment... because of progress...
Thanks; reading closely, I see how you said that, but it wasn't clear to me initially. (There's an illusion of disagreement, which I'll christen the "twitter fight fallacy": unless agreement is stated explicitly, people automatically assume replies are disagreements.)
TL;DR: There are many posts on the Alignment Forum/LessWrong that could easily be on arXiv. Putting them on arXiv has several large benefits and (sometimes) very low costs.
There are several large benefits of putting posts on arXiv:
Benefits 1)–3) lead to more people reading your research, which hopefully leads to more people building on it, and perhaps to useful feedback from outside the established alignment community. In particular, if people see that the paper already has citations, this will lead to more people reading it, which will lead to more citations,...
I probably put in an extra 20-60 hours, so the total is likely closer to 150, which surprises me. I will add that a lot of the conversion time went into writing more, LaTeX figures, and citations, all of which were, I think, substantive, valuable additions. (Changing to a more scholarly style was not substantively valuable, nor was struggling with LaTeX margins and TikZ for the diagrams, and both took some part of the time.)
I wanted to write a long, detailed, analytic post about this, somewhat like my Radical Probabilism post (to me, this is a similarly large update). However, I haven't gotten around to it for a long while. And perhaps it is better as a short, informal post in any case.
I think my biggest update over the past year has been a conversion to teleosemantics. Teleosemantics is a theory of semantics -- that is, "meaning" or "aboutness" or "reference".[1]
To briefly state the punchline: Teleosemantics identifies the semantics of a symbolic construct as what the symbolic construct has been optimized to accurately reflect.
Previously, something seemed mysterious about the map/territory relationship. What could possibly imbue 'symbols' with 'meaning'? The map/territory analogy seems inadequate to answer this question. Indeed, to analogize "belief"...
If you pin down what a thing refers to according to what that thing was optimized to refer to, then don't you have to look at the structure of the one who did the optimizing in order to work out what a given thing refers to? That is, to work out what the concept "thermodynamics" refers to, it may not be enough to look at the time evolution of the concept "thermodynamics" on its own; I may instead need to know something about the humans who were driving those changes, and the goals held within their minds. But, if this is correct, then doesn't it raise anot...
Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung
Acknowledgements: Thanks to Janus and Jozdien for comments.
In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt "What's the capital of France?", then it would continue with the word "Paris". That's because (1) GPT-4 is trained to be a good model of internet text,...
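As a concrete, hedged illustration of that "continue the prompt" behaviour (this is not code from the article; it uses the open-source transformers library with GPT-2 as a small stand-in for GPT-4):

```python
# Minimal sketch: next-token continuation with an open model (GPT-2 standing
# in for GPT-4). The point is only that the model extends the prompt with the
# tokens it judges most probable.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Q: What's the capital of France?\nA:"
completion = generator(prompt, max_new_tokens=5, do_sample=False)

# GPT-2 is far weaker than GPT-4, so the continuation may be poor, but the
# mechanism is the same one described above.
print(completion[0]["generated_text"])
```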
A good question. I've never seen it happen myself; so from where I'm standing, it looks like short emergence examples are cherry-picked.
This is the third in a three post sequence about interpreting Othello-GPT. See the first post for context.
This post is a detailed account of what my research process was, the decisions made at each point, what intermediate results looked like, etc. It's deliberately left somewhat unpolished, in the hope that this makes it more useful!
This project was a personal experiment in speed-running research, and I got the core results in ~2.5 days / 20 hours. This post has some meta-level takeaways from this on doing mech interp research fast and well, followed by a (somewhat stylised) narrative of what I actually did in this project and why; you can see the file tl_initial_exploration.py in the paper repo for the code that I wrote as I...
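For readers who want to try this kind of exploratory loop themselves, here is a minimal sketch of what an initial mech interp exploration typically looks like in TransformerLens (the library the `tl_` prefix suggests). It is not the author's code: GPT-2 small is used as a stand-in model, since the actual Othello-GPT loading code lives in the linked repo.

```python
# Hedged sketch of an initial-exploration loop in TransformerLens.
# GPT-2 small stands in for Othello-GPT; the real loading code is in
# tl_initial_exploration.py in the paper repo.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

tokens = model.to_tokens("The game begins with")

# Run the model once and cache every intermediate activation for inspection.
logits, cache = model.run_with_cache(tokens)

# Typical first questions: what does the model predict next, and what do the
# internal activations look like?
next_token = logits[0, -1].argmax().item()
print("Top next-token prediction:", model.to_string(next_token))

resid_post = cache["resid_post", 0]  # residual stream after layer 0
print("Residual stream shape:", tuple(resid_post.shape))  # (batch, seq, d_model)
```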