I've recently updated towards substantially shorter AI timelines and much faster progress in some areas. [1] The largest updates I've made are (1) an almost 2x higher probability of full AI R&D automation by EOY 2028 (I'm now a bit below 30% [2] while I was previously expecting around 15%; my guesses are pretty reflectively unstable) and (2) I expect much stronger short-term performance on massive and pretty difficult but easy-and-cheap-to-verify software engineering (SWE) tasks that don't require that much novel ideation [3] . For instance, I expect that by EOY 2026, AIs will have a 50%-reliability...
This post reflects my personal opinion and not necessarily that of other members of Apollo Research.
TLDR: I think funders should heavily incentivize AI safety work that enables spending $100M+ in compute or API budgets on automated AI labor that directly and differentially translates to safety.
I think we are in a short timeline world (and we should take the possibility seriously even if we don't have full confidence yet). This means that I think funders should aim to allocate large amounts of money (e.g. $1-50B per year across the ecosystem) on AI safety in the next 2-3 years.
I think that the AI safety funders have been allocating way too little funding and their spending has been far too conservative in the past 5 years. So, in my opinion,...
Datasets might be nice.
Written quickly as part of the Inkhaven Residency.
At a high level, research feedback I give to more junior research collaborators often can fall into one of three categories:
In each case, I think the advice can be taken to an extreme I no longer endorse. Accordingly, I’ve tried to spell out the degree to which you should implement the advice, as well as what “taking it too far” might look like.
This piece covers doing quick sanity checks, which is the most common advice I give to junior researchers. I’ll cover the other two pieces of advice in a subsequent piece.
Research is hard (almost by definition) and people are often wrong. Every...
Edited the title!
I recently wrote about complete feedback, an idea which I think is quite important for AI safety. However, my note was quite brief, explaining the idea only to my closest research-friends. This post aims to bridge one of the inferential gaps to that idea. I also expect that the perspective-shift described here has some value on its own.
In classical Bayesianism, prediction and evidence are two different sorts of things. A prediction is a probability (or, more generally, a probability distribution); evidence is an observation (or set of observations). These two things have different type signatures. They also fall on opposite sides of the agent-environment division: we think of predictions as supplied by agents, and evidence as supplied by environments.
In Radical Probabilism, this division is not so strict....
I missed this post when it came out and just came across it. Re "Logical Induction as Metaphilosophy" I have the same objection that I recently made against "reflective equilibrium":
Another argument against this form of reflective equilibrium is that it seems to imply anti-realism about normative decision theory, given differing intuitions between people. I think this is plausible but not likely, so it seems bad to bake it into our methodology of doing decision theory.
In other words, in Logical Induction as Metaphilosophy there seems to be nothing groundin...
Crossposted from the DeepMind Safety Research Medium Blog. Read our full paper about this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah.
Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, allows us to check what a model is thinking before it acts, often helping us catch concerning behaviors like reward hacking and scheming.
However, CoT monitoring can fail if a model’s chain-of-thought is not a good representation of the reasoning process we want to monitor. For example, training LLMs with reinforcement learning (RL) to avoid outputting problematic reasoning can result in a model learning to hide such reasoning without actually removing problematic behavior.
Previous work produced an inconsistent picture of whether RL training hurts CoT monitorability. In some prior...
AI stack + conflict parity requires lots of robots (or crazy novel tech) but doesn't require AIs as capable as TEDAI. TEDAI is a very high capabilities bar. So, in worlds without a software only singularity and especially with slower takeoff, I think you may reach AI stack + conflict parity prior to TEDAI. (It's certainly possible to have great military robots and robot industrial capacity with AIs that are well within the human range on key skills.) TEDAI probably follows reasonably quickly, because economic doubling times are so fast in such a world. In ... (read more)