williawa
williawa has not written any posts yet.

I wrote in more depth about why I think they solve themselves in the post I linked, but I would say:
First of all, I'm assuming that if we get intent alignment, we'll get CEV alignment shortly after, because people want the AI aligned to their CEV, not to their immediate intent, and solving CEV should be within reach for an intent-aligned ASI. (Here, by CEV alignment I mean "aligning an AI to the values of a person or collection of people that they'd endorse upon learning all the relevant facts and reflecting".)
If you agree with this, I can state the reason I think these issues will "solve themselves" / become "merely practical/political" by asking...
Firstly, I think there is a decent chance we get alignment by default (where "default" means hundreds of very smart people working day and night on prosaic techniques). I'd put it at ~40%? Maybe 55% that Anthropic's AGI would be aligned? I wrote about semi-mechanistic stories for how I think current techniques could lead to genuinely aligned AI, or CEV-style alignment and corrigibility that's stable.
I think your picture is a little too rosy, though. First of all, you write:
Alignment faking shows incorrigibility but if you ask the model to be corrected towards good things like CEV, I think it would not resist.
Alignment faking is not corrigible or intent aligned. If you have an AI that's...
This does not make sense to me. I think corrigibility basins make sense, but I think alignment basins do not. If the AI has values that overlap with human values in many situations but come apart under enough optimization, why would it want to be pointed in a different direction? I think it would not. Agents are already smart enough to scheme and alignment-fake, and a smarter agent would be able to predict the outcome of the process you're describing: it or its successors would end up with values different from its current ones, and from its perspective those differences would be catastrophic if extrapolated far enough.
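To make the "come apart under enough optimization" step concrete: this is essentially a Goodhart effect, and a toy simulation can illustrate it. This is a minimal sketch of my own (not from the thread), assuming we model "true" value as a standard normal variable and the proxy being optimized as true value plus heavy-tailed error; all names and parameters here are illustrative.

```python
import numpy as np

# Toy Goodhart sketch (illustrative assumption, not anyone's actual model):
# V is what we actually care about; U = V + heavy-tailed error is what
# gets optimized. Selecting harder on U eventually selects for states
# where the error term is extreme rather than states where V is high.
rng = np.random.default_rng(0)
N = 1_000_000
true_value = rng.normal(size=N)          # V: the values we care about
error = rng.standard_t(df=2, size=N)     # heavy-tailed mismatch term
proxy = true_value + error               # U: the proxy being optimized

for n in [10, 1_000, 100_000, 1_000_000]:
    idx = np.argmax(proxy[:n])           # best-of-n optimization on the proxy
    print(f"best-of-{n:>9,}: proxy={proxy[idx]:8.2f}  true={true_value[idx]:6.2f}")
```

At small n the proxy optimum also tends to have high true value; at large n the proxy score keeps climbing while true value regresses toward an ordinary draw. That widening gap is the sense in which two value systems that "overlap in many situations" can come apart under enough optimization pressure.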