I didn't believe the theory of change at the time and still don't. The post doesn't really make a full case for it, and I doubt it convinced anyone to work on this for the right reasons.
It may do a good job of giving the author's perspective, but given all these gaps it isn't very memorable today as an explanation of the risks OP cares about, even if we do end up worrying about them in 5-10 years.
To clarify, I'm not very confident that AI will be aligned; I still have a >5% p(takeover doom | 10% of AI investment is spent on safety). I'm not really sure why it feels different emotionally, but I guess this is just how brains are sometimes.
I'm glad to see this post come out. I've previously opined that solving these kinds of problems is what proves a field has become paradigmatic:
Paradigms gain their status because they are more successful than their competitors in solving a few problems that the group of practitioners has come to recognize as acute. – Thomas Kuhn
It has been shown many times across scientific fields that a method that can solve such proxy tasks is more likely to lead to a real application. The approaches sketched out here seem like a particularly good fit for a large lab like GDM, because the North Star can be somewhat legible and the team has enough resources to tackle a series of proxy tasks that are both relevant and impressive. Not that they would be a bad fit elsewhere either.
Seems difficult for three reasons:
Given these difficulties I'd put it below 50/50. But this challenge seems significantly harder than the one I think we will actually face, which is more like: having an AI we can defer to that doesn't try to sabotage the next stage of AI research, at each stage up to the capability level that COULD disempower 2025 humanity, plus maybe other things to keep the balance of power stable.
Also, I'm not sure what "drawn from the same distribution" means here. AI safety is trying dozens of theoretical and empirical directions, and red teaming and model organisms could get much more elaborate with resource investments that are impractical today, so things will look qualitatively very different in a couple of decades.
Thanks for writing this. I've suspected for a while that we're ahead, which is great but a bit emotionally difficult given that I'd spent basically my whole career with the goal of heroically solving an almost impossible problem. And this is a common view among AI safety people; e.g., Ethan Perez said recently that he's focused on problems other than takeover because he's not as worried about it.
I do expect some instrumental pressure towards misaligned power-seeking, but the number of tools we have to understand, detect, and prevent it is now large enough that it seems we'll be fine until the Dyson sphere is under construction. At that point things are a lot more uncertain, but we'll probably figure something out there too.
Agree that your research didn't make this mistake, and MIRI didn't make all the same mistakes as OpenAI. I was responding in the context of Wei Dai's OP about the early AI safety field. At that time, MIRI was absolutely being uncooperative: their research was closed, they didn't trust anyone else to build ASI, and their plan would end in a pivotal act that would probably disempower some world governments and possibly end with them taking over the world. Plus they descended from an org whose goal was to build ASI before Eliezer realized alignment should be the focus. Critch complained as late as 2022 that if there were two copies of MIRI, they wouldn't even cooperate with each other.
It's great that we have the FLI statement now. Maybe if MIRI had put more work into governance we could have gotten it a year or two earlier, but it took until Hendrycks got involved for the public statements to start.
We absolutely do need to "race to build a Friendly AI before someone builds an unFriendly AI". Yes, we should also try to ban unFriendly AI, but there is no contradiction between the two. Plans are allowed (and even encouraged) to involve multiple parallel efforts and disjunctive paths to success.
Disagree: the fact that there needs to be a friendly AI before an unfriendly AI doesn't mean building it should be plan A, or that we should race to do it. It's the same mistake OpenAI made when they let their mission drift from "ensure that artificial general intelligence benefits all of humanity" to being the ones who build an AGI that benefits all of humanity.
Plan A means it would deserve more resources than any other path, such as influencing people by various means to build FAI instead of UFAI.
Also mistakes, from my point of view anyway.
The Interwebs seem to indicate that that only works if you give it a laser spot to aim at, not with GPS alone.
Good catch.
Agree that grenade-sized munitions won't damage buildings. I think the conversation is drifting between FPVs and other kinds of drones, and also between various settings, so I'll just state my beliefs.
I'm giving this +1 review point despite not having originally been excited about this in 2024. Last year, I and many others were in a frame where alignment plausibly needed a brilliant idea. But since then, I've realized that execution and iteration on ideas we already have are highly valuable. Just look at how much has been done with probes and steering!
Ideas like this didn't match my mental picture of the "solution to alignment", and I still don't think this one is in my top 5 directions, but with how fast AI safety has been growing, we can assign 10 researchers to each of 20 "neglected approaches" like this, so it deserves +1 point.
The post has an empirical result that's sufficient to concretize the idea and show it has some validity, which is necessary. Adam Jones has a critique. However, the only paper on this so far didn't make it into a main conference and has only 3 citations, so the impact isn't large (yet).