Thane Ruthenis


I'd rather sacrifice even more corrigibility properties (like how this already isn't too worried about subagent stability) for better friendliness

Do you have anything specific in mind?

The crux is likely in a disagreement of which approaches we think are viable. In particular:

You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities

What are the approaches you have in mind, that are both promising and don't require this? The most promising ones that come to my mind are the Shard Theory-inspired one and ELK. I've recently become much more skeptical of the former, and the latter IIRC didn't handle mesa-optimizers/the Sharp Left Turn well (though I haven't read Paul's latest post yet, so I may be wrong on that).

The core issue, as I see it, is that we'll need to aim the AI at humans in some precise way — tell it to precisely translate for us, or care about us in some highly specific way, or interpret commands in the exact way humans intend them, or figure out how to point it directly at the human values, or something along those lines. Otherwise it doesn't handle capability jumps well, whether we crank it up to superintelligence straight away or try to carefully steer it along.

And the paradigm of loss functions and broad regularizers (e. g., speed/complexity penalties) seems to consist of tools too crude for this purpose. The way I see it, we'll need fine manipulation.
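To make "crude" concrete, here's a minimal sketch of what a loss with broad regularizers looks like. Everything in it is my own illustrative assumption rather than anyone's actual training setup: the function name, the L2 norm standing in for a complexity penalty, the step count standing in for a speed penalty, and the coefficients.

```python
def regularized_loss(task_loss, weights, num_steps,
                     complexity_coef=1e-3, speed_coef=1e-2):
    """Toy objective: task loss plus two broad, scalar penalties.

    All names and coefficients here are hypothetical, chosen for illustration.
    """
    # Complexity penalty: squared L2 norm of the weights (an assumed proxy).
    complexity_penalty = sum(w * w for w in weights)
    # Speed penalty: number of computation steps used (an assumed proxy).
    speed_penalty = num_steps
    return (task_loss
            + complexity_coef * complexity_penalty
            + speed_coef * speed_penalty)


# Two different weight vectors with the same norm and the same task loss
# receive exactly the same score: the objective cannot tell them apart.
loss_a = regularized_loss(1.0, [1.0, 2.0], num_steps=10)
loss_b = regularized_loss(1.0, [2.0, 1.0], num_steps=10)
```

The penalties only see aggregate statistics of the model, which is one way to read why they seem ill-suited to specifying anything precise about what the model does internally.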

Since writing the original post, I've been trying to come up with convincing-to-me ways to side-step this problem (as I allude to at the post's end), but I have no ideas so far.

You need to figure out the right thought similarity measure to bootstrap it, and there seem to be risks if you get it wrong

Yeah, that's a difficulty unique to this approach.

Another point here is that "an empty room" doesn't mean "no context". Presumably when you're sitting in an empty room, your world-model is still active, it's still tracking events that you expect to be happening in the world outside the room — and your shards see them too. So, e. g., if you have a meeting scheduled in a week, and you went into an empty room, after a few days there your world-model would start saying "the meeting is probably soon", and that would prompt your punctuality shard.

Similarly, your self-model is part of the world-model, so even if everything outside the empty room were wiped out, you'd still have your "internal context" — and there'd be some shards that activate in response to events in it as well.

It's actually pretty difficult to imagine what an actual "no context" situation for realistic agents would look like. I guess you can imagine surgically removing all input channels from the WM to shards, to model this?

Suggestion: make it a CYOA-style interactive piece, where the reader is tasked with aligning AI, and could choose from a variety of approaches which branch out into sub-approaches and so on. All of the paths, of course, bottom out in everyone dying, with detailed explanations of why. This project might then evolve based on feedback, adding new branches that counter counter-arguments made by people who played it and weren't convinced. Might also make several "modes", targeted at ML specialists, general public, etc., where the text makes different tradeoffs regarding technicality vs. vividness.

I'd do it myself (I'd had the idea of doing it before this post came out, and my preliminary notes covered much of the same ground, I feel the need to smugly say), but I'm not at all convinced that this is going to be particularly useful. Attempts to defeat the opposition by building up a massive evolving database of counter-arguments have been made in other fields, and so far as I know, they never convinced anybody.

The interactive factor would be novel (as far as I know), but I'm still skeptical.

(A... different implementation might be to use a fine-tuned language model for this; make it an AI Dungeon kind of setup, where it provides specialized counter-arguments for any suggestion. But I expect it to be less effective than a more coarse hand-written CYOA, since the readers/players would know that the thing they're talking to has no idea what it's talking about, so would disregard its words.)
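For the hand-written version, the branching structure could be as simple as a tree of nodes where every leaf is an ending. A rough sketch, with all class names, labels, and placeholder text being hypothetical stand-ins (the actual explanations would have to be written by hand):

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One passage of the story; a node with no choices is an ending."""
    text: str
    choices: dict[str, Node] = field(default_factory=dict)

    def is_ending(self) -> bool:
        return not self.choices


def endings(node, path=()):
    """Yield (path, ending_text) for every leaf reachable from `node`."""
    if node.is_ending():
        yield path, node.text
    else:
        for label, child in node.choices.items():
            yield from endings(child, path + (label,))


# Placeholder content only: real branches and explanations go here.
story = Node("You are tasked with aligning AI. Pick an approach.", {
    "Approach A": Node("Pick a sub-approach.", {
        "Sub-approach A1": Node("Everyone dies. (Explanation for A1.)"),
        "Sub-approach A2": Node("Everyone dies. (Explanation for A2.)"),
    }),
    "Approach B": Node("Everyone dies. (Explanation for B.)"),
})

all_endings = list(endings(story))
```

Adding a branch in response to a reader's counter-argument is then just attaching a new `Node` under the relevant choice, and walking `endings(story)` gives a quick check that every path still bottoms out where it's supposed to.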