Any updates to your model of the socioeconomic path to aligned AI deployment? Namely:
I expect there to be no major updates, but it seems worthwhile to keep an eye on this.
So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human
Still on the "figure out agency and train up an aligned AGI unilaterally" path?
"Train up an AGI unilaterally" doesn't quite carve my plans at the joints.
One of the most common ways I see people fail to have any effect at all is to think in terms of "we". They come up with plans which "we" could follow, for some "we" which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in "we" implementing the plan. (And also, usually, the "we" in question is to...
Any changes to your median timeline until AGI, i.e., do we actually have these 9-14 years?
Here's a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)
My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI, things foom relatively quickly.) That comes from an intuitive extrapolation: if something were about as much better as the models of the last 2-3 years, as the models of the last 2-3 yea...
Values steer optimization; they are not optimized against
I strongly disagree with the implication here. This statement is true for some agents, absolutely. It's not true universally.
It's a good description of how an average human behaves most of the time, yes. We're often puppeted by our shards like this, and some people spend the majority of their lives this way. I fully agree that this is a good description of most of human cognition, as well.
But it's not the only way humans can act, and it's not when we're at our most strategically powerful.
Consider if ...
I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness
Do you have anything specific in mind?
The crux is likely a disagreement over which approaches we think are viable. In particular:
You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities
What are the approaches you have in mind, that are both promising and don't require this? The most promising ones that come to my mind are the Shard Theory-inspired one and ELK. I've recently become much more skeptical of the former, and the latter IIRC didn't handle mesa-optimizers/the Sharp Left Turn well (though I haven't read Paul's lat...
Another point here is that "an empty room" doesn't mean "no context". Presumably when you're sitting in an empty room, your world-model is still active, it's still tracking events that you expect to be happening in the world outside the room, and your shards see them too. So, e.g., if you had a meeting scheduled in a week and you went into an empty room, after a few days there your world-model would start saying "the meeting is probably soon", and that would prompt your punctuality shard.
Similarly, your self-model is part of the world-model, so even if ...
Suggestion: make it a CYOA-style interactive piece, where the reader is tasked with aligning AI, and could choose from a variety of approaches which branch out into sub-approaches and so on. All of the paths, of course, bottom out in everyone dying, with detailed explanations of why. This project might then evolve based on feedback, adding new branches that counter counter-arguments made by people who played it and weren't convinced. Might also make several "modes", targeted at ML specialists, general public, etc., where the text makes different tradeoffs ...
Arbital was meant to support galaxy-brained attempts like this; Arbital failed.