Coding day in and out on LessWrong 2.0. You can reach me at habryka@lesswrong.com
Mod note: I removed Dan H as a co-author, since it seems like that was used more as a convenience for posting to the AI Alignment Forum. Let me know if you want me to revert.
If the difference between these papers is "we do activations, they do weights", then I think that warrants more conceptual and empirical comparisons.
Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention, or where the existing comparison seems inaccurate. That seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which, my guess is, was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs).
The level of comparison between the present paper and this one seems about the same as what I see in papers you have been a co-author on.
E.g. in https://arxiv.org/pdf/2304.03279.pdf the Related Works section is basically just a list of papers, with maybe half a sentence describing each one's relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any specific comparisons are inaccurate; it's just a claim about the usual level of detail that related works sections tend to go into).
I don't understand this comment. I did a quick count of related works mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so it seems like this meets your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.
I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of this post seems totally adequate in terms of volume (and also the comments are generally a good place for people to drop links to related work, if they think there is interesting related work missing).
Also, linking to a related work in a footnote seems totally fine. It is somewhat sad that link text isn't searchable by default, so searching for the relevant arXiv link is harder than it has to be. It might make sense to add some kind of tech solution here.
Yeah, it sure does seem like we should update something here. I am planning to spend more time on AIAF stuff soon, but until then, if someone has a drop-in paragraph, I would probably lightly edit whatever you send me or post here and then just use it.
This is not a comment on the substance of this post, but I really feel like its title should be "The self-alignment problem".
Like, we talk about "The alignment problem" not "The unalignment problem". The current title makes me think that the problem is that I somehow have to unalign myself, which doesn't really make sense.
Direct optimizers typically have a very specific architecture requiring substantial iteration and search. Luckily, it appears that our current NN architectures, with a fixed-length forward pass and a lack of recurrence or support for the branching computations required in tree search, make the implementation of powerful mesa-optimizers inside the network quite challenging.
I think this is being too confident on what "direct optimizers" require.
There is an ontology, mostly inherited from the graph-search context, in which "direct optimizers" require recurrence and iteration, but at least I don't have any particularly strong beliefs about what a direct optimizer needs in terms of architecture, and don't think other people know either. The space of feed-forward networks is surprisingly large and rich and I am definitely not confident you can't find a direct optimizer in that space.
Current LLMs also get quite high scores at imitating pretty agentic and optimizer-y humans, which suggests the networks do perform something quite close to search or direct optimization somewhere within their forward pass.
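To make that a bit more tangible, here is a toy sketch of my own (not from the original exchange; the quadratic objective, step count, and function names are all made up for illustration): a bounded number of "search" iterations can always be unrolled into a fixed-length forward pass, which is one reason a fixed-depth, non-recurrent architecture doesn't cleanly rule out search-like computation.

```python
# Toy sketch (illustrative only): the same eight gradient-descent steps
# written once as an explicit loop (the "iteration/recurrence" picture of a
# direct optimizer) and once unrolled into a fixed-length forward pass.
import math

def objective_grad(x, target):
    # Gradient of the toy objective 0.5 * (x - target) ** 2
    return x - target

def iterative_search(x0, target, steps=8, lr=0.5):
    # Explicit iteration: a loop whose length could in principle vary.
    x = x0
    for _ in range(steps):
        x = x - lr * objective_grad(x, target)
    return x

def unrolled_forward_pass(x0, target, lr=0.5):
    # Identical computation written as eight fixed "layers": no loop,
    # no recurrence, just a fixed-depth forward pass.
    x = x0
    x = x - lr * objective_grad(x, target)  # layer 1
    x = x - lr * objective_grad(x, target)  # layer 2
    x = x - lr * objective_grad(x, target)  # layer 3
    x = x - lr * objective_grad(x, target)  # layer 4
    x = x - lr * objective_grad(x, target)  # layer 5
    x = x - lr * objective_grad(x, target)  # layer 6
    x = x - lr * objective_grad(x, target)  # layer 7
    x = x - lr * objective_grad(x, target)  # layer 8
    return x

x0, target = 3.0, 1.0
assert math.isclose(iterative_search(x0, target), unrolled_forward_pass(x0, target))
```

Of course this only shows that bounded iteration fits inside a fixed-depth pass, not that trained networks actually learn such computations; it's just meant to make the "surprisingly large and rich" point above more concrete.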
Perhaps I've simply been misreading John, and he's been intending to say "I have some beliefs, and separately I have some suggestive technical results, and they feel kinda related to me! Which is not to say that any onlooker is supposed to be able to read the technical results and then be persuaded of any of my claims; but it feels promising and exciting to me!".
For what it's worth, I ask John about once every month or two about his research progress, and his answer has so far been (paraphrased): "I think I am making progress. I don't think I have anything to show you that would definitely convince you of my progress, which is fine because this is a preparadigmatic field. I could give you some high-level summaries or we could try to dive into the math, though I don't think I have anything super robust in the math so far, though I do think I have interesting approaches."
You might have had a totally different experience, but I've definitely had the epistemic state so far that John's math was in the "trying to find remotely reasonable definitions with a tenuous connection of formalism to reality" stage, and not the "I have actually demonstrated a robust connection of math to reality" stage, so I feel very non-misled by John. A good chunk of this impression comes from random short social interactions I've had with John, so someone who engaged more with just his online writing might come away with a different impression (though I've also done that a lot and don't super feel like John has ever tried, in his writing, to sell me on having super robust math to back things up).
This is just false, because it does not take into account the cost of doing expected value maximization: giving consistent preferability scores is just very expensive and hard to do reliably.
I do really want to put emphasis on the parenthetical remark "(at least in some situations, though they may not arise)". Katja is totally aware that the coherence arguments require a bunch of preconditions that are not guaranteed to be the case for all situations, or even any situation ever, and her post is about how there is still a relevant argument here.
Mod note: It felt fine to do this once or twice, but it's not an intended use case of AI Alignment Forum membership to post content to the AI Alignment Forum that you didn't write.
I would have likely accepted this submission to the AI Alignment Forum anyways, so it seems best to just go via the usual submission channels. I don't want to set a precedent of weirdly confusing co-authorship for submission purposes. You can also ping me on Intercom in advance if you want advance notice of whether the post fits on the AIAF, or want to make sure it goes live there immediately.