Daniel Dewey


Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.

Against GDP as a metric for timelines and takeoff speeds

Thanks for the post! FWIW, I found this quote particularly useful:

Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth noticeably accelerates!

The fact that it showed up right before an eye-catching image probably helped :)

Debate update: Obfuscated arguments problem

This may be out-of-scope for the writeup, but I would love to get more detail on how this might be an important problem for IDA.

Debate update: Obfuscated arguments problem

Thanks for the writeup! This Google Doc (linked near "raised this general problem" above) appears to be private: https://docs.google.com/document/u/1/d/1vJhrol4t4OwDLK8R8jLjZb8pbUg85ELWlgjBqcoS6gs/edit

Verification and Transparency

This seems like a useful lens -- thanks for taking the time to post it!

AI safety: three human problems and one AI issue

Thanks for writing this -- I think it's a helpful kind of reflection for people to do!

Where's the first benign agent?

Ah, gotcha. I'll think about those points -- I don't have a good response. (Actually adding "think about"+(link to this discussion) to my todo list.)

It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.

Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?

Where's the first benign agent?

These objections are all reasonable, and 3 is especially interesting to me -- it seems like the biggest objection to the structure of the argument I gave. Thanks.

I'm afraid that the point I was trying to make didn't come across, or that I'm not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul's are not amenable to any kind of argument for confidence, and we will only ever be able to say "well, I ran out of ideas for how to break it", so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.

Do you think it's unlikely that we'll be able to make positive arguments for the safety of schemes like Paul's? If so, I'd be really interested in why -- apologies if you've already tried to explain this and I just haven't figured that out.

Where's the first benign agent?

"naturally occurring" means "could be inputs to this AI system from the rest of the world"; naturally occurring inputs don't need to be recognized, they're here as a base case for the induction. Does that make sense?

If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren't, I'd guess that possible input single pages of text aren't value-corrupting in an hour. (I would certainly want a much better answer than "I guess it's fine" if we were really running something like this.)

To clarify my intent here, I wanted to show a possible structure of an argument that could make us confident that value drift wasn't going to kill us. If you think it's really unlikely that any argument of this inductive form could be run, I'd be interested in that (or in hearing whether Paul or someone else thinks I'm on the wrong track or making some kind of fundamental mistake).
