I think there needs to be individual decision-making (on the part of both organizations and individual researchers, especially in light of the unilateralists' curse), alongside a much broader discussion about how the world should handle unsafe machine learning and more advanced AI.
I very much don't think it would be wise for the AI safety community to debate and come up with shared, semi-public guidelines for, essentially, what to withhold from the broader public, without input from the wider ML / AI research community, who are impacted and whose work is a big part of what we are discussing. That community needs to be engaged in any such discussions.
There are some intermediate options available between "full secrecy" and "full publication"... and I haven't seen anyone mention them...
OpenAI's phased release of GPT-2 seems like a clear example of exactly this. There is also a forthcoming paper from Toby Shevlane looking at the internal deliberations around this, in addition to his existing work on the question of how disclosure potentially affects misuse.
The first thing I would note is that stakeholders need to be involved in making any guidelines, and that pushing for guidelines from the outside is unhelpful, if not harmful, since it pushes participants to be defensive about their work. There is also an extensive literature discussing the general issue of information dissemination hazards and the issues of regulation in other domains, such as nuclear weapons technology, biological and chemical weapons, and similar.
There is also a fair amount of ongoing work on synthesizing this literature and the implications for AI. Some of it is even on this site. For example, see: https://www.lesswrong.com/posts/RY9XYoqPeMc8W8zbH/mapping-downside-risks-and-information-hazards and https://www.lesswrong.com/posts/6ur8vDX6ApAXrRN3t/information-hazards-why-you-should-care-and-what-you-can-do
So there is already a great deal of discussion about this, and plenty you should read on the topic - I suggest starting with the paper that provided the name for your post, and continuing with the relevant sections of GovAI's research agenda.
Oh. Right. I should have gotten the reference, but wasn't thinking about it.
I'd focus even more, (per my comment to Vanniver's response,) and ask "What parts of OpenAI are most and least valuable, and how do these relate to their strategy - and what strategy is best?"
I would reemphasize that "does OpenAI increase risk?" is a counterfactual question. That means we need to be clearer about what we are asking - as a matter of predicting what the counterfactuals actually are - and consider strategy options for going forward. This is a major set of questions, and "increasing or decreasing risk" as a single metric isn't enough to capture much of interest.
For a taste of what we'd want to consider, what about the following:
Are we asking OpenAI to pick a different, "safer" strategy?
Perhaps they should focus more on hiring people to work on safety and strategy, and hire fewer capabilities researchers. That brings us to the Dr. Wily / Dr. Light question: perhaps Dr. Capabilities B. Wily shouldn't be hired, and Dr. Safety R. Light should be instead. That means Wily does capabilities research elsewhere, perhaps with more resources, and Light does safety research at OpenAI. But the counterfactual is that Light would do (perhaps slightly less well funded) safety research anyway, and Wily would work on (approximately as useful) capabilities research at OpenAI - advantaging OpenAI in any future capabilities races.
Are we asking OpenAI to be larger - and, if needed, should we find them funding?
Perhaps they should hire both, along with all of Dr. Light's and Dr. Wily's research teams. Fast growth will dilute OpenAI's culture, but give them an additional marginal advantage over other groups. Perhaps bringing both teams in would help OpenAI in race dynamics, but also make it more likely that they'd engage in such races.
How much funding would this need? Perhaps none - they have cash, they just need to do this. Or perhaps tons, and we need them to be profitable, and focus on that strategy, with all of the implications of that. Or perhaps a moderate amount, and we just need OpenPhil to give them another billion dollars, and then we need to ask about the counterfactual impact of that money.
Or OpenAI should focus on redirecting their capabilities staff to work on safety, and have a harder time hiring the best people who want to work on capabilities? Or OpenAI should be smaller and more focused, and reserve cash?
These are all important questions, but they need much more time than I - or, I suspect, most of the readers here - have available, and are probably already being discussed more usefully by both OpenAI and their advisors.
Now the perhaps harder step is trying to get traction on them.
Yes, very much so. We're working on a few parts of this now, as part of a different project, but I agree that it's tricky. And there are a number of other things that seem like potentially very useful projects if others are interested in collaborations, or just some ideas / suggestions about how they could be approached.
(On the tables: unfortunately, they were pasted in as images from another program. We should definitely see if we can get higher-resolution versions, even if we can't convert them to text easily.)
I'm unsure whether GPT-3 can output, say, an IPython notebook to get the values it wants.
That would be really interesting to try...
(I really like this post, as I said to Issa elsewhere, but) I realized after discussing this earlier that I don't agree with a key part of the precise vs. imprecise model distinction.
> A precise theory is one which can scale to 2+ levels of abstraction/indirection.
>
> An imprecise theory is one which can scale to at most 1 level of abstraction/indirection.
I think this is wrong. More levels of abstraction are worse, not better. Specifically, if a model exactly describes a system at one level, any abstraction will lose predictive power. (Ignoring computational cost, which I'll get back to.) Quantum theory is more precisely predictive than Newtonian physics. The reason we can move up and down levels is that we understand the system well enough to quantify how much precision we are losing, not that we can move further without losing precision.
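To make the point concrete, here's a toy sketch (my own illustration, not from the original post): Newtonian mechanics is a useful abstraction over relativistic mechanics precisely because we can quantify how much predictive power the abstraction loses at any given speed.

```python
import math

C = 299_792_458.0  # speed of light, m/s

def newtonian_ke(m, v):
    # Kinetic energy under the Newtonian abstraction.
    return 0.5 * m * v**2

def relativistic_ke(m, v):
    # Kinetic energy under the more precise relativistic theory.
    gamma = 1.0 / math.sqrt(1.0 - (v / C) ** 2)
    return (gamma - 1.0) * m * C**2

def relative_error(v):
    # Fractional precision lost by using the Newtonian abstraction at speed v.
    m = 1.0  # mass cancels out of the ratio
    exact = relativistic_ke(m, v)
    return abs(exact - newtonian_ke(m, v)) / exact

# The loss is quantifiable: negligible at everyday speeds, large near c.
for frac in (0.001, 0.1, 0.5):
    print(f"v = {frac}c -> relative error {relative_error(frac * C):.3g}")
```

The abstraction isn't "scalable" because it loses nothing; it's usable because the loss is characterized well enough that we know when it is safe to ignore.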
The reason that precise theories are better is that they are tractable enough to quantify how far we can move away from them, and how much we lose by doing so. The problem with economics isn't that we don't have accurate enough models of human behavior to aggregate them, but that the inaccuracy isn't characterized precisely enough to let us understand how the uncertainty from psychology shows up in economics. For example, behavioral economics is partly useless here because we can't build equilibrium models from it - and the reason is that we can't quantify how its models are wrong. For economics, we're better off with the worse model of rational agents, which we know is wrong but can at least start to quantify by how much, so we can do economic analyses.
I think this is covered in my view of optimization via selection, where "direct solution" is the third option. Any one-shot optimizer is implicitly relying on an internal model completely for decision making, rather than iterating, as I explain there. I think that is compatible with the model here, but it needs to be extended slightly to cover what I was trying to say there.