Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
It comes up reasonably frequently when I talk to at least the safety people at frontier AI companies (e.g. it came up in a conversation I had with Rohin the other day, and again in a recent conversation with Fabien Roger).
This is a dumb question but... is this market supposed to resolve positively if a misaligned AI takes over, achieves superintelligence, and then solves the problem for itself (and maybe shares it with some captive humans)? Or any broader extension of that scenario?
My timelines are not that short, but I do currently think basically all of the ways I expect this to resolve positively will very heavily rely on AI assistance, and so various shades of this question feel cruxy to me.
I wish the interface updated faster, closer to 100ms than 1s, but this isn't a big deal. I can believe it's hard to speed up code that integrates these differential equations many times per user interaction.
Yeah, currently the rollout takes around a second and happens on a remote Python server, because JS isn't great for this kind of work. We already tried pretty hard to make it faster. I am sure there are ways to get it to be fully responsive, but that might take many engineering hours.
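To give a sense of the kind of computation involved, here is a minimal sketch of what a server-side rollout could look like if the model is a small ODE system integrated with SciPy. This is purely illustrative: the function names (`deriv`, `rollout`) and the toy dynamics are hypothetical and not taken from the actual codebase.

```python
# Minimal sketch (hypothetical, not the actual implementation): a server-side
# rollout that re-integrates a small ODE system on every user interaction.
import numpy as np
from scipy.integrate import solve_ivp


def deriv(t, y, growth_rate=0.05, decay_rate=0.01):
    """Toy two-variable system standing in for the real model's dynamics."""
    x, z = y
    dx = growth_rate * x - decay_rate * x * z
    dz = decay_rate * x * z - growth_rate * z
    return [dx, dz]


def rollout(y0, t_end=100.0, n_points=500):
    """Integrate from t=0 to t_end and return sampled trajectories.

    If each slider change on the frontend triggers one of these calls plus a
    network round trip, latency on the order of a second is easy to hit.
    """
    t_eval = np.linspace(0.0, t_end, n_points)
    sol = solve_ivp(deriv, (0.0, t_end), y0, t_eval=t_eval, method="RK45")
    return sol.t, sol.y


if __name__ == "__main__":
    ts, ys = rollout([1.0, 0.5])
    print(ys[:, -1])  # final state of the toy system
```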
Cool, this clarifies things a good amount for me. Still have some confusion about how you are modeling things, but I feel less confused. Thank you!
I am quite glad about the empirical analysis, but I really don't see how any of this is evidence for or against a software-only explosion.
Like, I am so confused about your definition of software-only intelligence explosion:
I do again appreciate the empirical analysis, but I really don't know what's going on in the rest. Somehow you must be using words very differently from how I would use them.
Edit: An additional complication:
My guess is most of the progress here is exogenous to the NanoGPT speedrun project. Like, the graph here is probably largely a reflection of how much the open source community has figured out how to speed up LLM training in general. This makes the analysis a bunch trickier.
Promoted to curated: I do think this post summarizes one of the basic intuition generators for predicting AI motivations. It's missing a lot of important stuff, especially as AIs get more competent (in particular, it doesn't cover reflection, which I expect to be among the primary dynamics shaping the motivations of powerful AIs), but it's still quite helpful to have it written up in more explicit form.
Yes, to be clear, I agree that insofar as this question makes sense, the extrapolated volition would indeed end up basically ideal by your lights.
Regardless, the whole point of my post is exactly that I think we shouldn't over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
Cool, that makes sense. FWIW, I interpreted the overall essay to be more like "Alignment remains a hard unsolved problem, but we are on pretty good track to solve it", and this sentence as evidence for the "pretty good track" part. I would be kind of surprised if that wasn't why you put that sentence there, but this kind of thing seems hard to adjudicate.
Do we then say that Claude's extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
But in that case, wouldn't a rock with "just ask Evan" written on it be even better than Claude? Like, I felt confident that you were talking about Claude's extrapolated volition in the absence of humans, since making Claude into a rock that, when asked about ethics, just has "ask Evan" written on it does not seem like any relevant evidence about the difficulty of alignment, or its historical success.
This comment was downvoted by a lot of people (at this time, 2 overall karma with 19 votes). It shouldn't have been, and I personally believe this is a sign of people being attached to AI x-risk ideas, and of those ideas contributing to their entire persona, rather than of strict disagreement. This is something I bring up in conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
I generally think it makes sense for people to have pretty complicated reasons why they think something should be downvoted. I think this goes even more for longer content, which would often require an enormous amount of effort to respond to explicitly.
I have some sympathy for being sad here if a comment ends up highly net-downvoted, but FWIW, I think 2 karma feels vaguely in the right vicinity for this comment. Maybe I would upvote it to +6, but I would indeed be sad to see it at +20 or whatever, since I do think it's doing something pretty tiring and hard to engage with. Directional downvoting is a totally fine use of downvoting: if you think a comment is overrated but not bad, please downvote it until its karma reflects where you want it to end up!
(This doesn't mean it doesn't make sense to do sociological analysis of cultural trends on LW using voting patterns, but I do want to maintain the cultural locus where people can have complicated reasons for downvoting, and where statements like "if you disagree strongly with the above comment you should force yourself to outline your views" aren't frequently made. The whole point of the vote system is to get signal from people without forcing them to do huge amounts of explanatory labor. Please don't break that part.)
In both cases it came up in the context of AI systems colluding with different instances of themselves, and how this applies to various monitoring setups. In that context, I think the general lesson is "yeah, collusion is probably pretty doable, and obviously the models won't end up in defect-defect equilibria, though how that will happen sure seems unclear!".