johnswentworth's Comments

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I'm not really thinking about provable guarantees per se. I'm just thinking about how to point to the AI's concept of human values - directly point to it, not point to some proxy of it, because proxies break down under optimization pressure.

(Rough heuristic here: it is not possible to point directly at an abstract object in the territory. Even though a territory often supports certain natural abstractions, which are instrumentally convergent to learn/use, we still can't unambiguously point to that abstraction in the territory - only in the map.)

A proxy is probably good enough for a lot of applications with little scale and few corner cases. And if we're doing something like "train the AI to follow your instructions", then a proxy is exactly what we'll get. But if you want, say, an AI which "tries to help" - as opposed to e.g. an AI which tries to look like it's helping - then that means pointing directly to human values, not to a proxy.

Now, it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that's what you have in mind, and I do think it's plausible, even if it sounds a bit crazy. Of course, without better theoretical tools, we still wouldn't have a way to directly check, even in hindsight, whether the AI actually wound up pointing to human values or not. (Again, not talking about provable guarantees here; I just want to be able to look at the AI's own internal data structures and figure out (a) whether it has a notion of human values, and (b) whether it's actually trying to act in accordance with them, or just with something correlated with them.)
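
To gesture at what "look at the AI's own internal data structures" could mean in practice, here is a minimal toy sketch (all data, dimensions, and names are made up, and a linear probe is a far weaker check than what I'd actually want) of asking question (a) - whether some concept is even decodable from a model's internal activations:

```python
# Toy sketch (not a real check): probe fake "hidden activations" to see whether
# a concept is linearly decodable from them. Everything here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are hidden-layer activations on 1000 inputs, plus labels for
# whether each input actually involves the concept in question.
n, d = 1000, 64
activations = rng.normal(size=(n, d))
concept_labels = rng.integers(0, 2, size=n)

# Inject a weak signal, as a stand-in for "the model represents the concept".
activations[:, 0] += 2.0 * concept_labels

probe = LogisticRegression(max_iter=1000).fit(activations[:800], concept_labels[:800])
print("probe accuracy:", probe.score(activations[800:], concept_labels[800:]))
# High accuracy suggests the concept is (linearly) represented - question (a).
# It says nothing about whether the model acts on that concept - question (b).
```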

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
Yes, I would call that the central problem. (Though it would also be fine to build a pointer to a human and have the AI "help the human", without necessarily pointing to human values.)

How would we do either of those things without a workable theory of embedded agency, abstraction, some idea of what kind-of-structure human values have, etc?

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I'm going to start a fresh thread on this, it sounds more interesting (at least to me) than most of the other stuff being discussed here.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

No. The whole notion of a human "meaning things" presumes a certain level of abstraction. One could imagine an AI simply reasoning about molecules or fields (or at least individual neurons), without having any need for viewing certain chunks of matter as humans who mean things. In principle, no predictive power whatsoever would be lost in that view of the world.

That said, I do think that problem is less central/immediate than the problem of taking an AI which does know what we mean, and pointing at that AI's concept-of-what-we-mean - i.e. in order to program the AI to do what we mean. Even if an AI learns a concept of human values, we still need to be able to point to that concept within the AI's concept-space in order to actually align it - and that means translating between AI-notion-of-what-we-want and our-notion-of-what-we-want.
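
As a very loose toy of what "translating between concept-spaces" could look like - assuming, which is exactly the hard part, that the two spaces really do encode the same concepts in corresponding ways - here is a sketch with made-up embeddings, using an orthogonal Procrustes fit:

```python
# Toy sketch: "translate" between two representation spaces that encode the
# same underlying concepts, via a best-fit rotation (orthogonal Procrustes).
# All embeddings here are synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Made-up setup: 50 shared concepts embedded in two different 8-d spaces, where
# the second space is secretly a rotated, slightly noisy copy of the first.
concepts_ours = rng.normal(size=(50, 8))
hidden_rotation = np.linalg.qr(rng.normal(size=(8, 8)))[0]
concepts_theirs = concepts_ours @ hidden_rotation + 0.01 * rng.normal(size=(50, 8))

# Orthogonal Procrustes: the rotation minimizing ||ours @ R - theirs||_F.
u, _, vt = np.linalg.svd(concepts_ours.T @ concepts_theirs)
translation = u @ vt

error = np.linalg.norm(concepts_ours @ translation - concepts_theirs)
print("translation error:", error)  # small => the two concept-spaces line up
```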

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
but why must it have a confusing interface? Couldn't you just talk to it, and it would know what you mean?

That's where the Don Norman part comes in. Interfaces to complicated systems are confusing by default. The general problem of systematically building non-confusing interfaces is, in my mind at least, roughly equivalent to the full technical problem of AI alignment. (Writing a program which knows what you mean is also, in my mind, roughly equivalent to the full technical problem of AI alignment.) A wording which makes it more obvious:

  • The main problem of AI alignment is to translate what a human wants into a format usable by a machine
  • The main problem of user interface design is to help/allow a human to translate what they want into a format usable by a machine

Something like e.g. tool AI puts more of the translation burden on the human, rather than on the AI, but that doesn't make the translation itself any less difficult.

In a non-foomy world, the translation doesn't have to be perfect - humanity won't be wiped out if the AI doesn't quite perfectly understand what we mean. Extreme capabilities make high-quality translation more important, not just because of Goodhart, but because the translation itself will break down in scenarios very different from what humans are used to. So if the AI has the capabilities to reach such scenarios, then the translation needs to be quite good.
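
A toy numerical illustration of that last point (the objectives here are entirely made up): a proxy "translation" of what we want can look fine across the range of scenarios humans are used to, while an optimizer capable of reaching much more extreme states drives straight into the region where the proxy and the real objective come apart.

```python
# Toy sketch (all functions made up): a proxy that tracks what we want on the
# familiar range, but diverges badly in extreme scenarios.
import numpy as np

def true_value(x):
    # What we actually want: more is better at first, then it falls off hard.
    return x - 0.02 * x**3

def proxy_value(x):
    # A "translation" fit to the familiar range: just "more x is better".
    return x

familiar = np.linspace(0, 2, 100)    # the range humans are used to
extreme = np.linspace(0, 20, 1000)   # states a very capable AI could reach

print("worst proxy error on familiar range:",
      np.max(np.abs(proxy_value(familiar) - true_value(familiar))))  # ~0.16

picked = extreme[np.argmax(proxy_value(extreme))]   # what a proxy-optimizer picks
best = extreme[np.argmax(true_value(extreme))]      # what we'd actually want
print("true value at proxy-optimal state:", true_value(picked))   # ~ -140
print("true value at the actual optimum:", true_value(best))      # ~ 2.7
```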

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
Though I still have questions, like, how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?

You mentioned that, conditional on foom, you'd be confused about what the world looks like. Is this the main thing you're confused about in foom worlds, or are there other major things too?

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Cool, just wanted to make sure I'm engaging with the main argument here. With that out of the way...

  • I generally buy the "no foom => iterate => probably ok" scenario. There are some caveats and qualifications, but broadly-defined "no foom" is a crux for me - I expect at least some kind of decisive strategic advantage for early AGI, but would find the "aligned by default" scenario plausible in a no-foom world.
  • I do not think that a lack of goal-directedness is particularly relevant here. If an AI has extreme capabilities, then a lack of goals doesn't really make it any safer. At some point I'll probably write a post about Don Norman's fridge which talks about this in more depth, but the short version is: if we have an AI with extreme capabilities but a confusing interface, then there's a high chance that we all die, goal-direction or not. In the "no foom" scenario, we're assuming the AI won't have those extreme capabilities, but it's foom vs no foom which matters there, not goals vs no goals.
  • I also disagree that coordination has any hope whatsoever if there is a problem. There's a huge unilateralist problem there, with millions of people each easily able to push the shiny red button. I think straight-up solving all of the technical alignment problems would be much easier than that coordination problem.

Looking at both the first and third point, I suspect that a sub-crux might be expectations about the resource requirements (i.e. compute & data) needed for AGI. I expect that, once we have the key concepts, human-level AGI will be able to run in realtime on an ordinary laptop. (Training might require more resources, at least early on. That would reduce the unilateralist problem, but increase the chance of decisive strategic advantage due to the higher barrier to entry.)

EDIT: to clarify, those second two points are both conditioned on foom. Point being, the only thing which actually matters here is foom vs no foom:

  • if there's no foom, then we can probably iterate, and then we're probably fine anyway (regardless of goal-direction, coordination, etc).
  • if there's foom, then a lack of goal-direction won't help much, and coordination is unlikely to work.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
Part of the problem is that we have a really strong unilateralist's curse. It only takes one person, or a few people who don't realize the problem, to make something really dangerous.

This is a foom-ish assumption; remember that Rohin is explicitly talking about a non-foom scenario.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

While I might agree with the three options at the bottom, I don't agree with the reasoning to get there.

Abstractions are pretty heavily determined by the territory. Humans didn't look at the world and pick out "tree" as an abstract concept because of a bunch of human-specific factors. "Tree" is a recurring pattern on earth, and even aliens would notice that same cluster of things, assuming they paid attention. Even on the empathic front, you don't need a human-like mind in order to notice the common patterns of behavior in humans which we call "anger" or "sadness".
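
A toy sketch of that intuition (synthetic data, and obviously not a real argument): two "observers" clustering the same data from different random initializations recover essentially the same clusters - the clusters come from the territory, not from the observer.

```python
# Toy sketch: clusters in the data ("the territory") get recovered by two
# independently-initialized observers. All data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Three well-separated blobs: a stand-in for natural clusters like "trees".
centers = np.array([[0, 0], [10, 0], [0, 10]])
points = np.concatenate([c + rng.normal(size=(200, 2)) for c in centers])

# Two "observers" with different random initializations.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(points)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(points)

# Agreement up to relabeling; ~1.0 means they found the same clusters.
print("agreement between observers:", adjusted_rand_score(labels_a, labels_b))
```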
