No, I think the blue-team will keep having the latest and best LLMs and be able to stop such attempts from randos. These AGIs won't be so magically superintelligent that they can take all the unethical actions needed to take over the world without other AGIs stopping them.
But it feels like you'd need to demonstrate this with some construction that's actually adversarially robust, which seems difficult.
I agree it's kind of difficult.
Have you seen Nicholas Carlini's Game of Life series? It starts by building logic gates and works up to a microprocessor that factors 15 into 3 x 5.
Depending on the adversarial robustness model (e.g., every tick the adversary can make one square behave the opposite of lawfully), it might be possible to make robust logic gates and circuits. In fact the existing circuits are already a little robust, though not to the tune of one square per tick; that's too much power for the adversary.
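To make that adversary model concrete, here's a rough sketch (my own illustration, not Carlini's construction; the function names are made up) of a Game of Life update in which the adversary gets to flip the lawful result of one chosen square each tick:

```python
# Sketch of the adversary model: one lawful Game of Life step, then the adversary
# makes one chosen square behave the opposite of lawfully. Illustrative only.
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One lawful Conway's Game of Life update on a 0/1 array (toroidal edges)."""
    neighbors = sum(
        np.roll(np.roll(grid, i, axis=0), j, axis=1)
        for i in (-1, 0, 1) for j in (-1, 0, 1)
        if (i, j) != (0, 0)
    )
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

def adversarial_step(grid: np.ndarray, flip: tuple) -> np.ndarray:
    """Lawful step, then the adversary flips the state of one square."""
    nxt = life_step(grid)
    nxt[flip] ^= 1
    return nxt
```

A robust gate would have to keep computing the right answer under some budget of such flips, which is why one flip per tick looks like too much power to grant the adversary.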
I am concerned that long chains of RL are just sufficiently fucked functions, with enough butterfly effects, that they wouldn't be well approximated by this process.
This is a concern. Two possible replies:
IMO at any level of sampling?
Vacuously true. The actual question is: how much do you need to sample? My guess is it's too much, but then we'd see the base model scaling better with sampling than the RL'd model, just like in this paper.
Fortunately, DeepSeek's Mathv2 just dropped, an open-source model that gets IMO gold. We can do the experiment: does it similarly fail to improve with sampling compared to its own base model? My guess is yes, the same will happen.
It's not impossible that we are in an alignment-by-default world. But, I claim that our current insight isn't enough to distinguish such a world from the gradual disempowerment/going out with a whimper world.
Well, yeah, I agree with that! You might notice this item in my candidate "What's next" list:
- Prevent economic incentives from destroying all value. Markets have been remarkably aligned so far but I fear their future effects. (Intelligence Curse, Gradual Disempowerment. Remarkably European takes.)
This is not classical alignment emphasizing scheming etc. Rather, it's "we get more slop, AIs outcompete humans, and so the humans who don't own AIs have no recourse." So I don't think that undermines my point at all.
Approximately never, I assume, because doing so isn't useful.
You should not assume such things. Humans invented scheming to take over; it might be the very reason we are intelligent.
Until then, I claim we have strong reasons to believe that we just don't know yet.
We don't know, but we never really know and must act under uncertainty. I put forward that we can make a good guess.
Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview; the disagreement isn't reducible to 1-5 double cruxes, but here are some candidates for my biggest cruxes. If any of these are wrong, it's bad news for my current view:
And here's another prediction where I really stick my neck out, which isn't load-bearing to the view, but still increases my confidence, so defeating it is important:
I still disagree with several of the points, but for time reasons I request that readers not update against Evan's points if he just doesn't reply to these.
I disagree that increasing capabilities are exponential in a capability sense. It's true that METR's time horizon plot increases exponentially, but this still corresponds to linear intuitive intelligence. (Like loudness (logarithmic) versus sound pressure (exponential); we handle huge ranges well.) Each new model has an exponentially larger time horizon but is not (intuitively, empirically) exponentially smarter.
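To spell out the loudness analogy with a toy calculation (my own sketch; the horizon values below are made up for illustration, not METR data): if "intuitive intelligence" tracks the log of the time horizon, then a horizon that doubles every generation reads as only a constant additive gain per generation.

```python
# Toy illustration: exponential time horizons vs. log-shaped "perceived" intelligence.
# The horizon values are made up for illustration; they are not METR measurements.
import math

horizons_minutes = [15, 30, 60, 120, 240]  # hypothetical successive model generations

for h in horizons_minutes:
    perceived = math.log2(h)  # assumed log-shaped perception, like decibels for loudness
    print(f"time horizon {h:>4} min -> perceived smartness {perceived:.1f} (arbitrary units)")
```

Each doubling adds exactly one unit, so a plot that looks exponential in minutes looks linear in "felt" intelligence.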
"we still extensively rely on direct human oversight and review to catch alignment issues" That's a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we'll probably have to wait till models are superhuman, or use weak models + less weak models and test it out. Unfortunately perhaps the weak models are still too weak. But we've reached the point where you can maybe just use the actual Opus 3 as the weak model?
If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also do expect to see sabotage show up in the CoT or in deception probes.
I updated way down on Goodharting on model internals due to Cundy and Gleave.
Again, readers please don't update down on these due to lack of a response.
I'm honestly very curious what Ethan is up to now; both you and Thomas Kwa implied that he's not doing alignment anymore. I'll have to reach out...
In fact, base models seem to be better than RL'd models at reasoning when you take best-of-N (with the same N for both the RL'd and the base model). Check out my post summarizing the research on the matter:
Yue, Chen et al. have a different hypothesis: what if the base model already knows all the reasoning trajectories, and all RL does is increase the frequency of reasoning, or the frequency of the trajectories that are likely to work? To test this, Yue, Chen et al. use pass@K: give the LLM a total of K attempts at a question, and if any of them succeeds, mark the question as answered correctly. They report the proportion of correctly answered questions in the dataset.
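As a concrete sketch of that metric (my own illustration, not the authors' code; `sample_answer` and `is_correct` are hypothetical stand-ins for the model sampler and the grader):

```python
# pass@K as described above: a question counts as solved if any of K sampled
# answers is graded correct; report the fraction of solved questions.
from typing import Callable, List

def pass_at_k(questions: List[dict],
              sample_answer: Callable[[dict], str],
              is_correct: Callable[[dict, str], bool],
              k: int) -> float:
    solved = 0
    for q in questions:
        if any(is_correct(q, sample_answer(q)) for _ in range(k)):
            solved += 1
    return solved / len(questions)
```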
If the RL model genuinely learns new reasoning skills, then over many questions its pass@K performance will remain higher than the base model's. As we increase K, the base model answers more and more of the easy questions, so its performance improves. But the RL model also answers more and more of the difficult questions. The performance of both increases in tandem with larger K.
What actually happened is neither of these two things. For large enough K, the base model does better than the RL model. (!!!)
Is there anyone who significantly disputes this?
I disputed this in the past.
I debated this informally at an Alignment Workshop with a very prominent scientist, and in my own assessment lost. (Keeping it vague because I'm unsure if it was under the Chatham House Rule.)
Yeah, true. It's gone so well for so long that I forgot. I didn't spend a lot of time thinking about this list.