Mark Xu

Sequences

Intermittent Distillations

Comments

AMA: Paul Christiano, alignment researcher

How would you teach someone to get better at the engine game?

AMA: Paul Christiano, alignment researcher

You've written multiple outer alignment failure stories. However, you've also commented that these aren't your best predictions. If you condition on humanity going extinct because of AI, why did it happen?

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

I'm curious what "put it in my SuperMemo" means. Quick googling only yielded SuperMemo as a language learning tool.

Transparency Trichotomy

I agree it's sort of the same problem under the hood, but I think knowing how you're going to go from "understanding understanding" to producing an understandable model controls what type of understanding you're looking for.

I also agree that this post makes ~0 progress on solving the "hard problem" of transparency, I just think it provides a potentially useful framing and creates a reference for me/others to link to in the future.

Open Problems with Myopia

One way of looking at DDT is "keeping it dumb in various ways." I think another way of thinking about it is just designing a different sort of agent, one that is "dumb" according to us but not really dumb in any intrinsic sense. You can imagine this DDT agent looking at agents that do engage in acausal trade and thinking they're just sacrificing utility for no reason.

There is some slight awkwardness in that, on the decision problems agents in this universe actually encounter, UDT agents will get higher utility than DDT agents.
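
To make that gap concrete, here is a toy counterfactual-mugging-style calculation (an illustrative sketch only; the payoff numbers and the expected_utility helper are made up, not anything from the thread):

```python
# Toy counterfactual-mugging-style problem (illustrative numbers only).
# Omega flips a fair coin. On heads it asks the agent to pay 100; on tails it
# pays 10_000, but only if it predicts the agent would have paid on heads.

P_HEADS = 0.5
PAYMENT = 100
REWARD = 10_000

def expected_utility(pays_when_asked: bool) -> float:
    """Ex ante expected utility of committing (or not) to pay on heads."""
    heads_outcome = -PAYMENT if pays_when_asked else 0
    tails_outcome = REWARD if pays_when_asked else 0  # Omega rewards only predicted payers
    return P_HEADS * heads_outcome + (1 - P_HEADS) * tails_outcome

udt_utility = expected_utility(True)   # 0.5 * -100 + 0.5 * 10_000 = 4950.0
ddt_utility = expected_utility(False)  # 0.0 -- keeps the 100 on heads, forfeits the reward

print(f"UDT-style policy: {udt_utility}, DDT-style policy: {ddt_utility}")
```

From the DDT agent's post-coin-flip perspective, paying the 100 looks like sacrificing utility for no reason, which is the intuition above.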

I agree that the maximum a posteriori world doesn't help that much, but I think there is some sense in which "having uncertainty" might be undesirable.

Open Problems with Myopia

has been changed to imitation, as suggested by Evan.

Open Problems with Myopia

Yeah, you're right that it's obviously unsafe. The words "in theory" were meant to gesture at that, but it could be much better worded. Changed to "A prototypical example is a time-limited myopic approval-maximizing agent. In theory, such an agent has some desirable safety properties because a human would only approve safe actions (although we still would consider it unsafe)."
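
For concreteness, here is a minimal sketch of what "time-limited myopic approval-maximizing" means, assuming a hypothetical human_approval(state, action) score (the function names and ratings are made up for illustration):

```python
from typing import Callable, Sequence

def myopic_approval_policy(
    state: object,
    actions: Sequence[str],
    human_approval: Callable[[object, str], float],
) -> str:
    """Pick the single action the human would rate highest right now.

    No lookahead and no sum over future rewards: each immediately available
    action is scored in isolation, which is what makes the agent myopic.
    """
    return max(actions, key=lambda action: human_approval(state, action))

# Stand-in approval ratings purely for illustration.
ratings = {"summarize honestly": 0.9, "flatter the user": 0.6, "hide an error": 0.1}
print(myopic_approval_policy(None, list(ratings), lambda s, a: ratings[a]))
```

The safety intuition is just that the agent's incentive extends no further than the human's immediate approval of each action.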
