Personal Blog

I present a simple Deep-RL-flavoured idea for learning an agent's impact that I'm thinking of trying out. I don't, ATM, think it's very satisfying from a safety point of view, but I think it's at least a bit relevant, so I'm posting here for feedback, if you're interested.

IDEA: Instead of learning the transition model

$$P(s_{t+1} \mid s_t, a_t)$$

with a single network, learn it as:

$$P(s_{t+1} \mid s_t, a_t) = f_{\text{impact}}(s_t, a_t) \oplus f_{\text{passive}}(s_t).$$

The $\oplus$ could mean mixing the distributions, adding the pre-activations, or adding the samples from $f_{\text{impact}}$ and $f_{\text{passive}}$. I think adding the samples probably makes the most sense in most cases.

Now, $f_{\text{impact}}$ is trained to capture the agent's impact, and $f_{\text{passive}}$ should learn the "passive dynamics". Apparently things like this have been tried before (not using DL, AFAIK, though), e.g. https://papers.nips.cc/paper/3002-linearly-solvable-markov-decision-problems.pdf
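For concreteness, here's a minimal sketch of the "adding the samples" variant (just an illustration, assuming a PyTorch framing, deterministic delta-prediction, and made-up names like `ImpactNet`/`PassiveNet`; none of that is fixed by the proposal):

```python
import torch
import torch.nn as nn

class PassiveNet(nn.Module):
    """Predicts a next-state delta from the state alone (no action input)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s):
        return self.net(s)

class ImpactNet(nn.Module):
    """Predicts the agent's contribution to the next state from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def predict_next_state(passive, impact, s, a):
    # "Adding the samples" variant of the (+) operator above:
    # next state = current state + passive delta + action-dependent impact delta.
    return s + passive(s) + impact(s, a)

def model_loss(passive, impact, s, a, s_next):
    # Both networks are trained jointly to predict the observed next state.
    return ((predict_next_state(passive, impact, s, a) - s_next) ** 2).mean()
```

One thing this sketch leaves open is what actually pushes the impact term towards zero when the action doesn't matter; presumably you'd want something extra (e.g. a small penalty on the impact output) to encourage the intended split.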

If we do a good job of disentangling an agent's impact from the passive dynamics, then we can do reduced-impact in a natural way.

This idea was inspired by internal discussions at MILA/RLLAB and the Advantage-function formulation of value-based RL.

6 comments

This feels like it's the same thing as the low-impact paper.

https://www.dropbox.com/s/cjt5t6ny5gwpcd8/Low_impact_S%2BB.pdf?raw=1

There the AI must maximise $\lambda u - R$, where $\lambda$ is a weight, $u$ is a positive goal, and $R$ is a penalty function for impact.

Am I mistaken in thinking that your $f_{\text{impact}}$ is the same as $R$ and your $f_{\text{passive}}$ is the same as $u$?

It's not the same (but similar), because my proposal is just about learning a model of impact, and has nothing to do with the agent's utility function.

You could use the learned impact function, $f_{\text{impact}}$, to help measure (and penalize) impact, however.
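For instance, a minimal sketch (reusing the hypothetical `ImpactNet` from the post; the `weight` hyperparameter is just a placeholder) of subtracting the magnitude of the predicted impact from the task reward:

```python
def shaped_reward(task_reward, impact, s, a, weight=0.1):
    # Penalize the magnitude of the predicted action-dependent change,
    # playing a role analogous to the impact penalty R above.
    penalty = impact(s, a).norm(dim=-1)
    return task_reward - weight * penalty
```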

This is a neat idea! I'd be interested to hear why you don't think it's satisfying from a safety point of view, if you have thoughts on that.

Seems to me like there are a bunch of challenges. For example, you need extra structure on your space to add things or tell what's small; and you really want to keep track of long-term impact, not just impact at the next time-step. Particularly the long-term one seems thorny (for low-impact in general, not just for this).

Nevertheless I think this idea looks promising enough to explore further, would also like to hear David's reasons.

It was mostly a gut feeling when I posted, but let me try to articulate a few reasons:

  1. It relies on having a good representation. Small problems with the representation might make it unworkable. Learning a good-enough representation and verifying that you've done so doesn't seem very feasible. Impact may be missed if the representation doesn't properly capture unobserved things and long-term dependencies. Things like the creation of sub-agents seem likely to crop up in subtle, hard-to-learn ways.

  2. I haven't looked into it, but ATM I have no theory about when this scheme could be expected to recover the "correct" model (I don't even know how that would be defined... I'm trying to "learn" my way around the problem :P)

To put #1 another way, I'm not sure that I've gained anything compared with proposals to penalize impact in the input space, or some learned representation space (with the learning not directed towards discovering impact).

On the other hand, I was inspired to consider this idea when thinking about Yoshua's proposal about causal disentangling mentioned at the end of his Asilomar talk here: https://www.youtube.com/watch?v=ZHYXp3gJCaI. This (and maybe some other similar work, e.g. on empowerment) seems to provide a way to direct an agent's learning towards maximizing its influence, which might help... although having an agent learn based on maximizing its influence seems like a bad idea... but I guess you might be able to then add a conflicting objective (like a regularizer) to actually limit the impact...

So then you'd end up with some sort of adversarial-ish set-up, where the agent is trying to both:

  1. maximize potential impact (i.e. by understanding its ability to influence the world)
  2. minimize actual impact (i.e. by refraining from taking actions which turn out (eventually) to have a large impact).

Having just finished typing this, I feel more optimistic about this last proposal than the original idea :D We want an agent to learn about how to maximize its impact in order to avoid doing so.
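Very roughly, the combined objective might look something like this (purely a sketch; `empowerment_bonus` and `predicted_impact` are placeholder quantities, not worked-out terms):

```python
def combined_objective(task_reward, empowerment_bonus, predicted_impact,
                       alpha=0.1, beta=1.0):
    # 1. reward learning about / increasing potential influence (empowerment-like bonus)
    # 2. penalize the impact the chosen actions are predicted to actually have
    return task_reward + alpha * empowerment_bonus - beta * predicted_impact
```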

(How) can an agent confidently predict its potential impact without trying potentially impactful actions?

I think it certainly can, because humans can. We use a powerful predictive model of the world to do this. … and that's all I have to say ATM.

Yes, as Owen points out, there are general problems with reduced impact that apply to this idea, e.g. measuring long-term impacts.