Yonadav Shavit


Are limited-horizon agents a good heuristic for the off-switch problem?

To clarify, this is intended to be a test-time objective; I'm assuming the system was trained in simulation and/or by observing the environment. In general, this reward wouldn't need to be "trained"; it could just be hardcoded into the system. If you're asking how the system would understand its reward without having experienced it already, I'm assuming that sufficiently advanced AIs have the ability to "understand" their reward function and optimize on that basis. For example, "create two identical strawberries on the cellular level" can plausibly be achieved only by understanding the objective, not by encountering the reward often enough in simulation to learn from it, since success would be extremely rare even in simulation.

Many modern reinforcement learning setups give the agent a large positive reward (or, more commonly, an end to negative rewards) when the episode ends, which incentivizes it to end the episode quickly (sometimes suicidally). If you provide only this "shutdown reward", I'd expect to see the same behavior, but only after a certain time period.
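To make the intuition concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the names `episode_return` and `best_stop_step`, the discount factor `gamma`, and the reading of "after a certain time period" as a shutdown reward that only becomes available from step `earliest_shutdown` onward. It is not a model of any particular RL system.

```python
def episode_return(stop_step, step_penalty, shutdown_reward,
                   earliest_shutdown=0, gamma=0.99):
    """Discounted return for an agent that acts for `stop_step` steps, then ends the episode.

    Hypothetical toy model: `step_penalty` is paid on every step before stopping,
    and `shutdown_reward` is paid on stopping, but only if the agent stops at or
    after `earliest_shutdown` (an assumed time gate).
    """
    # Discounted per-step rewards collected before the episode ends.
    total = sum(gamma ** k * step_penalty for k in range(stop_step))
    # The shutdown reward is only available once the time gate has passed.
    if stop_step >= earliest_shutdown:
        total += gamma ** stop_step * shutdown_reward
    return total


def best_stop_step(max_steps=20, **reward_kwargs):
    """Brute-force the stopping time with the highest return."""
    return max(range(max_steps + 1),
               key=lambda t: episode_return(t, **reward_kwargs))


# Negative reward on every step: ending the episode immediately is optimal.
print(best_stop_step(step_penalty=-1.0, shutdown_reward=0.0))

# Only a time-gated shutdown reward: the agent waits for the gate, then stops.
print(best_stop_step(step_penalty=0.0, shutdown_reward=1.0, earliest_shutdown=5))
```

Under these assumptions the first call prints `0` and the second prints `5`: with discounting, the agent still prefers to end the episode as early as it can, but the time gate pushes "as early as it can" out to the chosen step.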