Specification gaming: the flip side of AI ingenuity

Vlad Mikulik; Matthew Rahtz; tom4everitt; Zac Kenton; janleike

In the TAISU unconference the original poster asked for some feedback:

I recently wrote a blog post with some others from the DM safety team on specification gaming. We were aiming for a framing of the problem that makes sense to reinforcement learning researchers as well as AI safety researchers. Haven't received much feedback on it since it came out, so it would be great to hear whether people here found it useful / interesting.

My thoughts: I feel that engaging/reaching out to the wider community of RL researchers is an open problem, in terms of scaling work on AGI safety. So it is great to see a blog post that also tries to frame this particular problem for an RL researcher audience.

As a member of the AGI safety researcher audience, I echo the comments of johnswentworth: well-written, great graphics, but mostly stuff that was already obvious. I do like the 'spectrum of unexpected solutions' picture a lot; it is an interesting way of framing the issues. So, can I read this post as a call to action for AGI safety researchers? Yes, because it identifies two open problem areas, 'reward design' and 'avoidance of reward tampering', with links.

Can I read the post as a call to action for RL researchers? Short answer: no.

If I try to read the post from the standpoint of an RL researcher, what I notice most is the implication carried by the 'aligned RL agent design' illustrations: 'RL algorithm design', on the right, has an arrow pointing to 'specification gaming is valid'. If I were an RL algorithm designer, I would read this as saying there is nothing I could contribute to the goal of 'aligned RL agent design' if I stay in my own area of RL algorithm design expertise.

So, is this the intended message that the blog post authors want to send to the RL researcher community? A non-call-to-action? Not sure. So this leaves me puzzled.

[Edited to add:]

In the TAISU discussion we concluded that there is indeed one call to action for RL algorithm designers: the message that, if they are ever making plans to deploy an RL-based system to the real world, it is a good idea to first talk to some AI/AGI safety people about specification gaming risks.

[-]Vika5y10

Thanks Koen for your feedback! You make a great point about a clearer call to action for RL researchers. I think an immediate call to action is to be aware of the following:

there is a broader scope of aligned RL agent design
there are difficult unsolved problems in this broader scope
for sufficiently advanced agents, these problems need general solutions rather than ad-hoc ones

Then a long-term call to action (if/when they are in the position to deploy an advanced AI system) is to consider the broader scope and look for general solutions to specification problems rather than deploying ad-hoc solutions. For those general solutions, they could refer to the safety literature and/or consult the safety community.

[-]adamShimi6y20

Like Koen, I'm here to give more detailed feedback on the post that was asked for by Victoria Krakovna at WebTaisu.

About the LEGO example, it's obvious after the second reading of the sentence, but I took some time to understand that "the bottom" didn't mean the face that was at lowest height, but the concave one (sort of). Also, I asked myself why the robot didn't put the brick upside down ON the other one, which would have maximized the height of the bottom. Is it because it was too costly compared to the local extremum of turning over the brick?

I like the two perspectives on specification gaming. One way I like to put it is that "we don't want to tell the agent how to do their task, but we still want them to accomplish it correctly".

For the coasters example, I think it would be clearer if the example was explained before the mention of potentials. Also, I would have liked a sentence or two explaining the potential part.

Lastly, I feel like the transition between the simulator bugs part and the reward tampering part is rough.

With all that being said, I still enjoyed the post, and I think it accomplishes its goal, without any specification gaming. ;)

[-]Vika5y20

Thanks Adam for the feedback - glad you enjoyed the post!

For the Lego example, the agent received a fixed shaping reward for grasping the red brick if the bottom face was above a certain height (3cm), rather than being rewarded in proportion to the height of the bottom face. Thus, it found an easy way to collect the shaping reward by flipping the brick, while stacking it upside down on the blue brick would be a more difficult way to get the same shaping reward. The current description of the example in the post does make it sound like the reward is proportional to the height - I'll make a note to fix this in a later version of the post.
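To make the difference concrete, here is a minimal sketch contrasting the two reward shapes. It is illustrative only and not the reward code from the actual experiment: the function names, the bonus magnitude, and the two example heights are all assumptions; only the 3cm threshold comes from the description above.

```python
# Illustrative sketch only: names, bonus size, and example heights are
# hypothetical. Only the 3 cm threshold is taken from the description above.

HEIGHT_THRESHOLD_M = 0.03  # 3 cm, as stated above
SHAPING_BONUS = 1.0        # assumed magnitude of the fixed shaping reward


def threshold_shaping_reward(bottom_face_height_m: float) -> float:
    """Fixed bonus once the brick's bottom face is above the threshold."""
    return SHAPING_BONUS if bottom_face_height_m > HEIGHT_THRESHOLD_M else 0.0


def proportional_shaping_reward(bottom_face_height_m: float) -> float:
    """Alternative reading: reward grows with the height of the bottom face."""
    return max(0.0, bottom_face_height_m)


# Hypothetical bottom-face heights for two behaviours.
flipped_in_place = 0.032   # brick turned over on the table
stacked_upside_down = 0.064  # brick placed upside down on the blue brick

for name, height in [("flipped", flipped_in_place),
                     ("stacked upside down", stacked_upside_down)]:
    print(name,
          threshold_shaping_reward(height),
          proportional_shaping_reward(height))
```

Under the threshold version both behaviours earn the same bonus, so the easier one (flipping) is preferred; under a proportional version, stacking upside down would earn more.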

[-]adamShimi5y20

Ok, that makes much more sense. I was indeed assuming a proportional reward.

[-]lemonhope2y10

I would watch a ten hour video of this. (It may also be more persuasive to skeptics.)

[-]johnswentworth6y*10

I'm not sure who the intended audience is for this post.

I would guess that for most people on LW, the content is mostly stuff that was already obvious (that was certainly the case for me). The one potentially-novel part is highlighting three particular barriers (faithfully capture the human concept, avoid mistaken implicit assumptions, and reward tampering), but it's not clear that this is a particularly natural way to break up the problem. Why this break-down, rather than some other? (For instance, if we can faithfully capture the human concept, why would we ever need to worry about any other sub-problems at all? Or is it only supposed to be a sufficient condition for a solution, rather than a necessary condition?)

On the other hand, if the intended audience is e.g. an undergraduate deep learning class full of people who've never thought about Goodhart at all, then this post is awesome. It gives a very accessible explanation of the problem, well-written, with very vivid examples including great visuals.

(EDIT: None of this is to say that the post shouldn't be here; it's a great post. I left this comment just because I heard the authors wanted more feedback on the OP.)

[-]habryka6y*30

Note: This post was originally posted to the DeepMind blog, so presumably the target audience is a broader audience of Machine Learning researchers and people in that broad orbit. I pinged Vika about crossposting it because it also seemed like a good reference post that I expected would get linked to a bunch more frequently if it was available on LessWrong and the AIAF.

[-]Vika5y20

Thanks John for the feedback! As Oliver mentioned, the target audience is ML researchers (particularly RL researchers). The post is intended as an accessible introduction to the specification gaming problem for an ML audience that connects their perspective with a safety perspective on the problem. It is not intended to introduce novel concepts or a principled breakdown of the problem (I've made a note to clarify this in a later version of the post).

Regarding your specific questions about the breakdown, I think faithfully capturing the human concept of the task in a reward function is complementary to the other subproblems (mistaken assumptions and reward tampering). If we had a reward function that perfectly captures the task concept, we would still need to implement it based on correct assumptions about the environment, and make sure the agent does not tamper with its implementation in the environment. We could say that capturing the task concept happens at the design specification level, while the other subproblems happen at the implementation specification level, as given in this post.

AI ALIGNMENT FORUM