It didn't bug me ¯\_(ツ)_/¯
Thanks for the post! FWIW, I found this quote particularly useful:
Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth noticeably accelerates!
The fact that it showed up right before an eye-catching image probably helped :)
This may be out of scope for the writeup, but I would love to get more detail on how this might be an important problem for IDA.
Thanks for the writeup! This Google Doc (linked near "raised this general problem" above) appears to be private: https://docs.google.com/document/u/1/d/1vJhrol4t4OwDLK8R8jLjZb8pbUg85ELWlgjBqcoS6gs/edit
This seems like a useful lens -- thanks for taking the time to post it!
Thanks for writing this -- I think it's a helpful kind of reflection for people to do!
Ah, gotcha. I'll think about those points -- I don't have a good response yet. (I'm actually adding "think about this", plus a link to this discussion, to my todo list.)
It seems to me that to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.
Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
These objections are all reasonable, and 3 is especially interesting to me -- it seems like the biggest objection to the structure of the argument I gave. Thanks.
I'm afraid the point I was trying to make didn't come across, or I'm not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul's are not amenable to any kind of argument for confidence -- that we will only ever be able to say "well, I ran out of ideas for how to break it" -- so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.
Do you think it's unlikely that we'll be able to make positive arguments for the safety of schemes like Paul's? If so, I'd be really interested in why -- apologies if you've already tried to explain this and I just haven't figured that out.
"naturally occurring" means "could be inputs to this AI system from the rest of the world"; naturally occurring inputs don't need to be recognized, they're here as a base case for the induction. Does that make sense?
If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren't, I'd guess that possible input single pages of text aren't value-corrupting in an hour. (I would certainly want a much better answer than "I guess it's fine" if we were really running something like this.)
To clarify my intent here: I wanted to show a possible structure of an argument that could make us confident that value drift wasn't going to kill us. If you think it's really unlikely that any argument of this inductive form could be run, I'd be interested in that (or in whether Paul or someone else thinks I'm on the wrong track / making some kind of fundamental mistake).
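In case it helps, here's a rough schematic of the inductive form I have in mind. The predicate and function names are mine, purely for illustration -- nothing here is from the original post:

```latex
% Base case: naturally occurring inputs are not value-corrupting.
\forall i \in \mathit{NaturalInputs}.\ \neg \mathit{Corrupting}(i)

% Inductive step: an uncorrupted agent, given only non-corrupting inputs,
% remains uncorrupted and produces only non-corrupting outputs.
\forall a, i.\ \neg \mathit{Corrupted}(a) \wedge \neg \mathit{Corrupting}(i)
  \Rightarrow \neg \mathit{Corrupted}(\mathit{step}(a, i))
     \wedge \neg \mathit{Corrupting}(\mathit{output}(a, i))

% Conclusion, by induction: starting from an uncorrupted agent exposed only
% to naturally occurring inputs (and to outputs of uncorrupted agents),
% no agent in the system ever becomes value-corrupted.
```

The hard work would obviously be in establishing the two premises for a real system; the schematic is only meant to show why a positive argument isn't structurally ruled out.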
This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.