Towards_Keeperhood

I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.

Comments

Thank you! I'll likely read your paper and get back to you. (Hopefully within a week.)

From reading your comment, my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. Like, for achieving something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that's how Yudkowsky means it, but I'm not sure if that's what most people mean when they use the term.) (Not that this implies you need so much consequentialism that we won't be able to shut down the AGI. But as I see it, a theoretical solution to corrigibility needs to deal with consequentialism. I haven't looked into your paper yet, so it's well possible that my comment here appears misguided.)

E.g. if we just built a gigantic transformer and trained it on all human knowledge (and say it had higher sample efficiency or so), it is possible that it could do almost everything humans can do. But it won't be able to just one-shot a solution to quantum gravity when we give it the prompt "solve quantum gravity". There is no runtime updating/optimization going on, i.e. the transformer is non-consequentialist. All optimization happened through the training data or gradient descent. Either the human training data was already sufficient to encode a solution to quantum gravity in the patterns of the transformer, or it wasn't. It is theoretically possible that the transformer learns somewhat deeper underlying patterns than humans have (though I don't expect that from something like the transformer architecture), and is thus able to generalize a bit further than humans. But it seems extremely unlikely that its understanding is deep enough to already have the solution to quantum gravity encoded, given that it was never explicitly trained to learn that and just read physics papers.

The transformer might be able to solve quantum gravity if it can recursively query itself to engineer better prompts, or if it can give itself feedback which is then somehow converted into gradient descent updates, and then try multiple times. But in those cases there is consequentialist reasoning again. The key point: consequentialism becomes necessary when you go beyond human level.
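To make the distinction concrete, here is a minimal sketch of what I mean by "single forward pass" versus "runtime search loop". The `generate` and `score` functions are hypothetical placeholders for a frozen model and some feedback signal, not a real API; the point is only where the optimization happens.

```python
# Hypothetical stand-ins: a frozen model call and a feedback signal.
def generate(prompt: str) -> str:
    """One forward pass of a fixed model: no runtime optimization."""
    return "candidate solution for: " + prompt  # placeholder

def score(answer: str) -> float:
    """Hypothetical feedback signal (e.g. self-critique turned into a number)."""
    return float(len(answer) % 7)  # placeholder

# Non-consequentialist use: a single pass. Whatever the training data
# and gradient descent already encoded is all you get.
one_shot = generate("solve quantum gravity")

# Consequentialist use: a runtime loop that keeps refining the prompt
# toward a target. The optimization now also happens at runtime, not
# only during training.
def search(task: str, steps: int = 10) -> str:
    best_answer, best_score = "", float("-inf")
    prompt = task
    for _ in range(steps):
        answer = generate(prompt)
        s = score(answer)
        if s > best_score:
            best_answer, best_score = answer, s
        # Recursively query the model to engineer a better prompt.
        prompt = generate("improve this prompt for the task '" + task + "': " + prompt)
    return best_answer

searched = search("solve quantum gravity")
```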

Out of interest, how much do you agree with what I just wrote?

Hi Koen, thank you very much for writing this list!

I must say I'm skeptical that the technical problem of corrigibility as I see it is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah, that's not at all a clear definition yet; I'm still deconfusing myself about that, and I'll likely publish a post clarifying the problem as I see it within the next month.)

So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I'd expect that the other solutions here may likewise only give you corrigible agents that cannot do very impressive new things (or, if they can, they might still kill us all).

But I may be wrong. I probably only have time to read one paper. So: what would you say is the strongest result we have here? If I looked at one paper/post and explained why it isn't a solution to corrigibility as I see it, for which paper would it be most interesting for you to see what I write? (I guess I'll do it sometime this week if you write me back, but no promises.)

Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (Best to take an example task that is a bit too hard for humans to do. That makes it harder to reason about, but I think this is where the difficulty is.)