All of Thane Ruthenis's Comments + Replies

Any updates to your model of the socioeconomic path to aligned AI deployment? Namely:

  • Any changes to your median timeline until AGI, i.e., do we actually have these 9-14 years?
  • Still on the "figure out agency and train up an aligned AGI unilaterally" path?
  • Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?

I expect there to be no major updates, but it seems worthwhile to keep an eye on this.

So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human

... (read more)
johnswentworth · 3 points · 4d
Basically no. I basically buy your argument, though there's still the question of how safe a target DWIM is.

Still on the "figure out agency and train up an aligned AGI unilaterally" path?

"Train up an AGI unilaterally" doesn't quite carve my plans at the joints.

One of the most common ways I see people fail to have any effect at all is to think in terms of "we". They come up with plans which "we" could follow, for some "we" which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in "we" implementing the plan. (And also, usually, the "we" in question is to... (read more)

Any changes to your median timeline until AGI, i.e., do we actually have these 9-14 years?

Here's a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)

My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI, things foom relatively quickly.) That comes from an intuitive extrapolation: if something were about as much better than the models of the last 2-3 years as the models of the last 2-3 yea... (read more)

Values steer optimization; they are not optimized against

I strongly disagree with the implication here. This statement is true for some agents, absolutely. It's not true universally.

It's a good description of how an average human behaves most of the time, yes. We're often puppeted by our shards like this, and some people spend the majority of their lives that way. I fully agree this describes most of human cognition as well.

But it's not the only way humans can act, and it's not how we act when we're at our most strategically powerful.

Consider if ... (read more)

Alex Turner · 3 points · 5d
As usual, you've left a very insightful comment. Strong-upvote, tentative weak-disagree, but I haven't read your linked post yet. Hope to get to that soon.

I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness

Do you have anything specific in mind?

Charlie Steiner · 1 point · 8d
One thing might be that I'd rather have an AI design that's more naturally self-reflective, i.e., using its whole model to reason about itself, rather than having pieces that we've manually retargeted to think about some other pieces. This reduces how much Cartesian doubt is happening on the object level all at the same time, which sorta takes the AI farther away from the spec. But this maybe isn't that great an example, because maybe it's more about not endorsing the "retargeting the search" agenda.

The crux is likely a disagreement about which approaches we think are viable. In particular:

You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities

What are the approaches you have in mind that are both promising and don't require this? The most promising ones that come to my mind are the Shard Theory-inspired one and ELK. I've recently become much more skeptical of the former, and the latter IIRC didn't handle mesa-optimizers/the Sharp Left Turn well (though I haven't read Paul's lat... (read more)

Another point here is that "an empty room" doesn't mean "no context". Presumably when you're sitting in an empty room, your world-model is still active; it's still tracking events that you expect to be happening in the world outside the room — and your shards see them too. So, e.g., if you have a meeting scheduled in a week and you go into an empty room, then after a few days there your world-model will start saying "the meeting is probably soon", and that will prompt your punctuality shard.

Similarly, your self-model is part of the world-model, so even if ... (read more)

Suggestion: make it a CYOA-style interactive piece, where the reader is tasked with aligning AI and can choose from a variety of approaches, which branch out into sub-approaches and so on. All of the paths, of course, bottom out in everyone dying, with detailed explanations of why. This project might then evolve based on feedback, adding new branches that counter counter-arguments made by people who played it and weren't convinced. Might also make several "modes", targeted at ML specialists, the general public, etc., where the text makes different tradeoffs ... (read more)
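
For concreteness, here's a minimal sketch of the branching structure such a piece could be built around. This is purely illustrative, assuming a simple choice-tree where every leaf is a failure explanation; the node texts and the `play` helper are hypothetical placeholders, not the actual content.

```python
# Hypothetical sketch of the CYOA branching structure: each node is either a
# choice point (with labeled children) or a terminal "here's why everyone dies"
# explanation. Illustrative only; the real arguments would fill in the texts.
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str                                     # prompt or explanation shown to the reader
    choices: dict = field(default_factory=dict)   # label -> child Node; empty means terminal

    def is_terminal(self) -> bool:
        return not self.choices


# Placeholder branch; new branches would be added in response to reader feedback.
tree = Node(
    "You are tasked with aligning AI. Pick an approach:",
    {
        "Perfect interpretability": Node(
            "Detailed explanation of why this path fails in time..."
        ),
        "Retarget the search": Node(
            "Pick a sub-approach:",
            {
                "Manual retargeting": Node("Why this path ends badly..."),
                "Automated retargeting": Node("Why this one ends badly too..."),
            },
        ),
    },
)


def play(node: Node) -> None:
    """Walk the tree interactively until a terminal explanation is reached."""
    while not node.is_terminal():
        print(node.text)
        labels = list(node.choices)
        for i, label in enumerate(labels, 1):
            print(f"  {i}. {label}")
        pick = int(input("> ")) - 1
        node = node.choices[labels[pick]]
    print(node.text)  # the detailed explanation of why everyone dies


if __name__ == "__main__":
    play(tree)
```

The different "modes" (ML specialists, general public, etc.) could then be alternate text layers over the same tree, so the branch structure is maintained once.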

Arbital was meant to support galaxy-brained attempts like this; Arbital failed.