Running Lightcone Infrastructure, which runs LessWrong. You can reach me at firstname.lastname@example.org
Would it be OK for me to just copy-paste the blogpost content here? It seems to all work formatting wise, and people rarely click through to links.
Chain of thought, simple decompositions, and imitations of human tool use (along comprehensible interfaces) are already important for LM performance.
I want to separate prompt-engineering from factored cognition. There are various nudges you can use to get LLMs to think in ways that are more productive or well-suited for the task at hand, but this seems quite different to me from truly factored cognition, where you spin up a sub-process that solves a sub-problem, and then propagate that back up to a higher-level process (like Auto-GPT). I don't currently know of any not-extremely-gerry-mandered task where doing this actually improves task performance compared to just good prompt engineering. I've been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.
Overall I think the world is shaping up extremely far in the direction of "AI systems learn to imitate human cognitive steps and then compose them into impressive performance."
I've seen little evidence of this so far, and don't think current LLM performance is even that well-characterized by this. This would be great, but I don't currently think its true.
For example, I don't really understand whether this model is surprised or unsurprised by the extreme breadth of knowledge that modern LLMs have. I don't see any "imitation of human cognitive steps" when an LLM is capable of remembering things from a much wider range of topics. It seems just that its way of acquiring knowledge is very different from humans, giving rise to a very different capability-landscape. This capability does not seem to be built out of "composition of imitations of human cognitive steps".
Similarly, when I use Codex for programming, I do not see any evidence that suggests that Codex is solving programming problems by composing imitations of human cognitive steps. Indeed, it mostly seems to just solve the problems in one-shot, vastly faster than I would be able to even type, and in a way that seems completely alien to me as a programmer.
E.g. I'd bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use.
I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over.
I have no strong opinions on tool-use. Seems like the LLMs will use APIs the same way as humans would. I do think if you train more on end-to-end tasks, the code they write to solve sub-problems will become less readable. I have thought less about this and wouldn't currently take a bet.
I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).
I am not super confident of this, but my best guess is that we will see more end-to-end optimization, and that those will make a big difference in task performance. It also seems like a natural endpoint of something like RLAF, where you have the AI guide a lot of the training process itself when given a highly-complicated objective, and then you do various forms of RL on self-evaluations.
Language model agents are built out of LM parts that solve human-comprehensible tasks, composed along human-comprehensible interfaces.
This seems like a very narrow and specific definition of languade model agents that doesn't even obviously apply to the most agentic language model systems we have right now. It is neither the case that human-comprehensible task decomposition actually improves performance on almost any task for current language models (Auto-GPT does not actually work), and it is not clear that current RLHF and RLAF trained-models are "solving a human-comprehensible task" when I chat with them. They seem to pursue a highly-complicated mixed-objective which includes optimizing for sycophancy, and various other messy things, and their behavior seems only weakly characterized as primarily solving a human-comprehensible task.
But even assuming that current systems are doing something that is accurately described as solving human-comprehensible tasks composed along human-comprehensible interfaces, this seems (to me) unlikely to continue much into the future. RLHF and RLAF already encourage the system to do more of its planning internally in a non-decomposed manner, due to the strong pressure towards giving short responses, and it seems likely we will train language model agents on various complicated game-like environments in order to train them explicitly to do long-term planning.
These systems would still meaningfully be "LM agents", but I don't see any reason to assume that those systems would continue to do things in the decomposed manner that you seem to be assuming here.
My guess is it might be best to clarify that you are not in-general in favor of advancing agents built on top of language models, which seem to me to be very hard to align in-general, but are only in favor of advancing a specific technology for making agents out of language models, which tries to leverage factored cognition and tries to actively avoid giving the agents complicated end-to-end tasks with reinforcement learning feedback.
And my guess is that we then have a disagreement in that you expect that by-default the ML field will develop in a direction that will leverage factored-cognition style approaches, which doesn't currently seem that likely to me. I expect more end-to-end reinforcement learning on complicated environments, and more RLHF-style optimization to internalize agentic computation. Might be worth trying to make a bet on the degree to which LM capabilities will meaningfully be characterized as doing transparent task-decomposition and acting along simple human-comprehensible interfaces.My current guess is that sharing info on language model capabilities is overall still good, but I disagree with the "Improvements in LM agents seem good for safety" section. My best guess is that the fastest path to LM agents is to just do more end-to-end training on complicated environments. This will not produce anything that is particularly easy to audit or align. Overall, this approach to building agents seems among the worst ways AI could develop in terms of making it easy to align, and my guess is there are many other ways that would be substantially better to push ahead instead.
I do feel a bit confused about the framing here. It seems, if I am reading these results correctly, that CoT provides pretty little safety guarantee at the relevant scales. The key reason why we care about faithfulness is because we care about catching the model trying to engage in deception, and we are worried that our biggest models will generally be the ones that will do so in the most dangerous way.
This means "choosing the model size" is not really an option. Probably all large models face some incentives/selection towards deception, and as such, the headline result here seems like it should be "Chain-of-Thought Reasoning seems like it will get less and less useful at detecting deception as model size increases, and seems unlikely to reliably detect deception when models pose large risk".
The other headline takeaway for me is "Faithfulness will go down as model size increases, this means that if we have a deception benchmark that leverages CoT, it will look like it will get better as models get bigger, but this will actually be the result of CoT becoming a worse measure as model size increases".
What is the use-case for choosing the size of a model such that CoT is faithful? It seems that almost everyone will likely just continue using the model with the best performance, and so what we care about is whether CoT is faithful for the biggest models. The use-case of choosing the model size specifically to make CoT faithful seems like a very narrow use-case that's unlikely to get used much.
This kind of experiment has been at the top of my list of "alignment research experiments I wish someone would run". I think the chess environment is one of the least interesting environments (compared to e.g. Go or Starcraft), but it does seem like a good place to start. Thank you so much for doing these experiments!
I do also think Gwern's concern about chess engines not really being trained on games with material advantage is an issue here. I expect a proper study of this kind of problem to involve at least finetuning engines.
Mod note: It felt fine to do this once or twice, but it's not an intended use-case of AI Alignment Forum membership to post to the AI Alignment Forum with content that you didn't write.
I would have likely accepted this submission to the AI Alignment Forum anyways, so it seems best to just go via the usual submission channels. I don't want to set a precedent of weirdly confusing co-authorship for submission purposes. You can also ping me on Intercom in-advance if you want to get an ahead notice of whether the post fits on the AIAF, or want to make sure it goes live there immediately.
Mod note: I removed Dan H as a co-author since it seems like that was more used as convenience for posting it to the AI Alignment Forum. Let me know if you want me to revert.
If the difference between these papers is: we do activations, they do weights, then I think that warrants more conceptual and empirical comparisons.
Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. This seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which my guess is was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs).
The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author in.
E.g. in https://arxiv.org/pdf/2304.03279.pdf the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any specific comparisons are inaccurate, it's just making a claim about the usual level of detail that related works section tend to go into).
I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.
I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of this post seems totally adequate in terms of volume (and also the comments are generally a good place for people to drop links to related work, if they think there is interesting related work missing).
Also, linking to a related work in a footnote seems totally fine. It is somewhat sad that link-text isn't searchable by-default, so searching for the relevant arxiv link is harder than it has to be. Might make sense to add some kind of tech solution here.