Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
(by Rob Wiblin)
(xkcd)
My EA Journey, depicted on the whiteboard at CLR:
(h/t Scott Alexander)
Thinking step by step:
I like the point that fundamentally the structure is tree-like, and insofar as terminal goals are a thing it's basically just that they are the leaf nodes on the tree instead of branches or roots. Note that this doesn't mean terminal goals aren't a thing; the distinction is real and potentially important.
I think an improvement on the analogy would be to compare to a human organization rather than to a tree. In a human organization (such as a company or a bureaucracy) at first there is one person or a small group of people, and then they hire more people to help them with stuff (and retain the option to fire them if they stop helping), and then those people hire people, etc., and eventually you have six layers of middle management. Perhaps goals are like this. Evolution and/or reinforcement learning gives us some goals, and then those goals create subgoals to help them, and then those subgoals create subgoals, etc. In general, when it starts to seem that a subgoal isn't going to help with the goal it's supposed to help with, it gets 'fired.' However, sometimes subgoals are 'sticky' and become terminal-ish goals, analogous to how it's sometimes hard to fire people & how the leaders of an organization can find themselves pushed around by the desires and culture of the mass of employees underneath them. Also, sometimes the original goals evolution or RL gave us might wither away (analogous to retiring) or be violently ousted (analogous to what happened to OpenAI's board) in some sort of explicit conscious conflict with other factions. (Example: someone is in a situation where they are incentivized to lie to succeed at their job, does some moral reasoning and decides that honesty is merely a means to an end, and then starts strategically deceiving people for the greater good, quashing the qualms of their conscience, which had evolved a top-level goal of being honest back in childhood.)
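Here's a minimal toy sketch of that picture, where subgoals get 'hired,' 'fired' when they stop helping, or stick around as terminal-ish goals. The class and method names are placeholders I'm making up for illustration, not anything anyone has actually built:

```python
# Toy sketch of the "goals as an organization" picture described above.
# All names here are hypothetical illustrations, not a model anyone uses.

from dataclasses import dataclass, field


@dataclass
class Goal:
    name: str
    sticky: bool = False          # a 'sticky' subgoal survives even if it stops helping
    subgoals: list["Goal"] = field(default_factory=list)

    def spawn(self, name: str, sticky: bool = False) -> "Goal":
        """'Hire' a subgoal to help pursue this goal."""
        child = Goal(name, sticky=sticky)
        self.subgoals.append(child)
        return child

    def prune(self, still_helping) -> None:
        """'Fire' subgoals that no longer help, unless they've become sticky
        (terminal-ish). Recurse so the whole hierarchy gets reviewed."""
        kept = []
        for g in self.subgoals:
            g.prune(still_helping)
            if g.sticky or still_helping(g):
                kept.append(g)
        self.subgoals = kept


# Example: RL/evolution installs a root goal, which spawns instrumental subgoals.
root = Goal("succeed at job")
honesty = root.spawn("be honest", sticky=True)   # became terminal-ish back in childhood
flattery = root.spawn("flatter the boss")

# Later, flattery stops helping and gets 'fired'; honesty survives because it's sticky.
root.prune(still_helping=lambda g: g.name != "flatter the boss")
print([g.name for g in root.subgoals])  # ['be honest']
```

The interesting dynamics are all in which subgoals end up sticky, and in whether the root goals themselves can get retired or ousted, which the sketch doesn't try to capture.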
w.r.t. Bostrom: I'm not convinced yet. I agree with "the process of instrumentally growing and gaining resources is also the process of constructing values." But you haven't argued that this is good or desirable. Analogy: "The process of scaling your organization from a small research nonprofit to a megacorp with an army of superintelligences is also the process of constructing values. This process leads to the emergence of new, rich, nuanced goals (like profit, and PR, and outcompeting rivals, and getting tentacles into governments, and paying huge bonuses to the greedy tech executives and corporate lawyers we've recruited) which satisfy our original goals while also going far beyond them..." Perhaps this is the way things typically are, but if hypothetically things didn't have to be that way -- if the original goals/leaders could press a button that would costlessly ensure that this value drift didn't happen, and that the accumulated resources would later be efficiently spent purely on the original goals of the org and not on all these other goals that were originally adopted as instrumentally useful... shouldn't they? If so, then the Bostromian picture is how things will be in the limit of increased rationality/intelligence/etc., but not necessarily how things are now.
Nice. Yeah, I'm also excited about coding uplift as a key metric to track; it would probably make time horizons obsolete (or at least constitute a significantly stronger source of evidence than time horizons). We at AIFP don't have the capacity to estimate the trend in uplift over time (I mean, we can do small-N polls of frontier AI company employees...), but we hope someone does.
I basically agree with everything you say here and wish we had a better way to try to ground AGI timelines forecasts. Do you recommend any other method? E.g. extrapolating revenue? Just thinking through arguments about whether the current paradigm will work, and then using intuition to make the final call? We discuss some methods that appeal to us here.
This parameter, “Doubling Difficulty Growth Factor”, can change the date of the first Automated Coder AI between 2028 and 2050.
Note that we allow it to go subexponential, so actually it can change the date arbitrarily far into the future if you really want it to. Also, I dunno what's happening with Eli's parameters, but with my parameter settings, putting the doubling difficulty growth factor to 1 (i.e. a pure exponential trend, neither super- nor sub-exponential) gets to AC in 2035. (Though I don't think we should put much weight on this number, as it depends on other parameters which are subjective & important too, such as the horizon length that corresponds to AC, which people disagree a lot about.)
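To illustrate what this parameter is doing, here's a toy back-of-the-envelope version of the extrapolation. All the specific numbers (starting horizon, target horizon, doubling time, start year) are made-up placeholders, not the actual model's parameters, and the real model is considerably more involved than this:

```python
# Toy illustration of how a "doubling difficulty growth factor" shifts the date at which
# an extrapolated time-horizon trend crosses an "Automated Coder" threshold.
# The numbers below are made-up placeholders, not the actual model's parameters.

def crossing_year(start_year=2025.0,
                  start_horizon_hours=1.0,
                  target_horizon_hours=10_000.0,   # stand-in for the horizon that counts as AC
                  first_doubling_years=0.5,
                  growth_factor=1.0):
    """Each successive doubling of the time horizon takes `growth_factor` times as long as
    the previous one: growth_factor == 1 is a pure exponential trend, > 1 is subexponential
    (doublings get harder), < 1 is superexponential (doublings get easier)."""
    year, horizon, doubling_time = start_year, start_horizon_hours, first_doubling_years
    while horizon < target_horizon_hours:
        year += doubling_time
        horizon *= 2
        doubling_time *= growth_factor
    return year

for gf in (0.9, 1.0, 1.1, 1.3):
    print(gf, round(crossing_year(growth_factor=gf), 1))
```

Even with placeholder numbers like these, the crossing year moves from the late 2020s (factor a bit below 1) to the early 2030s (exactly 1) to many decades out (well above 1), which is the qualitative sensitivity being described in the quoted comment.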
Interesting, thanks! Why though? Like, if a massive increase or decrease in investment would break the trend, shouldn't a moderate increase or decrease in investment bend the trend?
The first graph you share is fascinating to me, because normally I'd assume that the Wright's law / experience curve for a technology gets harder over time, i.e. you start out with some sort of "N doublings of performance for every doubling of cumulative investment" number that gradually gets smaller over time as you approach limits. But here it seems that N has actually been increasing over time!
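For concreteness, here's the arithmetic I mean by N, with placeholder numbers rather than anything read off the actual graph:

```python
import math

# Experience-curve reading: if performance P scales with cumulative investment I as
# P ∝ I^N, then N = Δlog2(P) / Δlog2(I), i.e. "N doublings of performance per
# doubling of cumulative investment".
def doublings_per_investment_doubling(i0, p0, i1, p1):
    return (math.log2(p1) - math.log2(p0)) / (math.log2(i1) - math.log2(i0))

# Placeholder numbers purely for illustration (not taken from the linked graph):
# a 10x increase in cumulative investment alongside a 100x increase in performance
# corresponds to N = 2 doublings of performance per doubling of investment.
print(doublings_per_investment_doubling(1.0, 1.0, 10.0, 100.0))  # 2.0
```

So the surprising thing about the graph is that this log-log slope seems to be getting steeper over time rather than flattening out.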
(adding to what bhalstead said: You are welcome to stop by our office sometime to chat about it, we'd love to discuss!)
Personally, I agree that 125 years is complete overkill for automating enough coding that the bottleneck shifts to experiments and research taste. That's a big part of why my parameter is set lower. However, I want to think about this more. You can find more of my thoughts here.
I basically agree with Eli, though I'll say that I don't think the gap between extrapolating the METR time-horizon trend specifically and extrapolating AI revenue is huge. I think ideally I'd do some sort of weighted mix of both, which is sorta what I'm doing in my ATC.
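By "weighted mix" I mean something in the spirit of the following toy sketch, where the two samplers and the weight are placeholder stand-ins rather than my actual ATC numbers:

```python
import random

# Minimal sketch of a weighted mix of two forecasting methods: treat each extrapolation
# as producing a distribution over AC arrival years, then sample from a weighted mixture.
# Both samplers and the weight below are hypothetical placeholders for illustration.

def sample_metr_extrapolation():
    return random.gauss(2031, 3)     # stand-in for a time-horizon-based forecast

def sample_revenue_extrapolation():
    return random.gauss(2034, 4)     # stand-in for a revenue-extrapolation forecast

def sample_mixture(w_metr=0.6, n=10_000):
    samples = []
    for _ in range(n):
        if random.random() < w_metr:
            samples.append(sample_metr_extrapolation())
        else:
            samples.append(sample_revenue_extrapolation())
    return samples

samples = sorted(sample_mixture())
print("median:", round(samples[len(samples) // 2], 1))
```

The weight is doing the "all things considered" work, and if the two methods don't disagree by much, the exact weight doesn't matter much either.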
“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody[1]
I don't think what you are doing makes the situation worse. Perhaps you do think that of me though; this would be understandable...
I don't think it's the most important or original or interesting thing I've done, but I'm proud of the ideas in here nevertheless. Basically, other researchers have now actually done many of the relevant experiments to explore the part of the tech tree I was advocating for in this post. See e.g. https://www.alignmentforum.org/posts/HuoyYQ6mFhS5pfZ4G/paper-output-supervision-can-obfuscate-the-cot
I'm very happy that those researchers are doing that research, and moreover, very happy that the big AI companies have sorta come together to agree on the importance of CoT monitorability! https://arxiv.org/abs/2507.11473
OK but are these ideas promising? How do they fit into the bigger picture?
Conveniently, my answer to those questions is illustrated in the AI 2027 scenario:
It's not the only answer though. I think that improving the monitorability of CoT is just amaaaazing for building up the science of AI and the science of AI alignment, plus also raising awareness about how AIs think and work, etc.
Another path to impact is that if neuralese finally arrives and we have no more CoTs to look at, (a) some of the techniques for making good neuralese interpreters might benefit from the ideas developed for keeping CoT faithful, and (b) having previously studied all sorts of examples of misaligned CoTs, it'll be easier to argue to people that there might be misaligned cognition happening in the neuralese.
(Oh also, to be clear, I'm not taking credit for all these ideas. Other people, like Fabian and Tamera for example, got into CoT monitorability before I did, as did MIRI arguably. I think of myself as having just picked up the torch and run with it for a bit, or more like, shouted at people from the sidelines to do so.)
Overall do I recommend this for the review? Well, idk, I don't think it's THAT important or great. I like it but I don't think it's super groundbreaking or anything. Also, it has low production value; it was a very low-effort dialogue that we banged out real quick after some good conversations.
In light of Anthropic's Constitution and xAI's attempts to be 'Based', now I wonder if my original prediction was correct after all. We'll see.