Alex Ray — AI Alignment Forum

(disclaimer: I worked on Safety Gym)

I think I largely agree with the comments here, and I'm not really attached to specific semantics around what exactly these terms mean. Here I'll try to use my understanding of evhub's meanings:

First: a disagreement on the separation.

A particular prediction I have now, though it is weakly held, is that episode boundaries are weak and permeable, and will probably be obsolete at some point. There are a bunch of reasons I think this, but maybe the easiest to explain is that humans learn and are generally intelligent, and we don't have episode boundaries.

Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

Second: a learning efficiency perspective.

Another prediction I have now (also weakly held) is that we'll have a smooth spectrum between "things we want" and "things we don't want" with lots of stuff in between, instead of a sharp boundary between "accident" and "not accident".

A small contemporary example is in robotics: accidents are often "robot arm crashes into the table", but a non-accident we still want to avoid is "motions which incur more long-term wear than alternative motions".
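A toy sketch of that distinction, under loud assumptions: the `Motion` type, the `wear` estimate, and the `wear_weight` value are all hypothetical and chosen only for illustration (nothing here comes from Safety Gym). It contrasts a sharp accident indicator with a graded cost that also penalizes undesirable non-accidents.

```python
from dataclasses import dataclass


@dataclass
class Motion:
    name: str
    hits_table: bool  # the sharp "accident" event
    wear: float       # hypothetical long-term wear estimate in [0, 1]


def binary_cost(m: Motion) -> float:
    """Sharp boundary: only a crash counts as unsafe."""
    return 1.0 if m.hits_table else 0.0


def graded_cost(m: Motion, wear_weight: float = 0.3) -> float:
    """Smooth spectrum: a crash is worst, but wear still costs something.

    wear_weight is an assumed value, not a recommended one.
    """
    if m.hits_table:
        return 1.0
    clipped_wear = min(max(m.wear, 0.0), 1.0)
    return wear_weight * clipped_wear


motions = [
    Motion("crash", hits_table=True, wear=0.9),
    Motion("jerky", hits_table=False, wear=0.8),
    Motion("smooth", hits_table=False, wear=0.1),
]

# Rank motions from most to least desirable under the graded cost.
ranked = sorted(motions, key=graded_cost)
for m in ranked:
    print(f"{m.name}: binary={binary_cost(m)}, graded={graded_cost(m):.2f}")
```

Under the binary cost the "jerky" and "smooth" motions are indistinguishable, while the graded cost still separates them, which is the kind of in-between signal the spectrum view asks for.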

An important goal of safe exploration is that we would like to learn this quality with high efficiency. A joke about current methods is that we learn not to run into a wall by running into a wall millions of times.

With this in mind, I think for the moment sample efficiency gains are probably also safe exploration gains, but at some point we might face a trade-off: increasing the learning efficiency of safety qualities at the cost of decreasing the learning efficiency of other qualities.
