Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog:


Tradeoff between desirable properties for baseline choices in impact measures

It was not my intention to imply that semantic structure is never needed - I was just saying that the pedestrian example does not indicate the need for semantic structure. I would generally like to minimize the use of semantic structure in impact measures, but I agree it's unlikely we can get away without it. 

There are some kinds of semantic structure that the agent can learn without explicit human input, e.g. by observing how humans have arranged the world (as in the RLSP paper). I think it's plausible that agents can learn the semantic structure that's needed for impact measures through unsupervised learning about the world, without relying on human input. This information could be incorporated in the weights assigned to reaching different states or satisfying different utility functions by the deviation measure (e.g. states where pigeons / cats are alive). 

Tradeoff between desirable properties for baseline choices in impact measures

Looks great, thanks! Minor point: in the sparse reward case, rather than "setting the baseline to the last state in which a reward was achieved", we set the initial state of the inaction baseline to be this last rewarded state, and then apply noops from this initial state to obtain the baseline state (otherwise this would be a starting state baseline rather than an inaction baseline). 
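
To make the distinction concrete, here is a minimal sketch under stated assumptions: `noop_step` is a hypothetical one-step model of what the environment does when the agent takes the noop action, and the toy integer "clock" environment is invented purely for illustration.

```python
from typing import Callable, TypeVar

State = TypeVar("State")


def inaction_baseline_state(
    last_rewarded_state: State,
    steps_since_reward: int,
    noop_step: Callable[[State], State],
) -> State:
    """Apply noops from the last rewarded state to obtain the baseline state.

    `noop_step` is an assumed model of the environment's response to a noop.
    """
    state = last_rewarded_state
    for _ in range(steps_since_reward):
        state = noop_step(state)
    return state


# Hypothetical toy environment: the state is a clock that advances by 1
# on every step, even when the agent does nothing.
def toy_noop_step(state: int) -> int:
    return state + 1


last_rewarded = 10
baseline = inaction_baseline_state(last_rewarded, 3, toy_noop_step)

# Using the last rewarded state directly would ignore what the environment
# does on its own, which is what a starting state baseline would give.
starting_state_style_baseline = last_rewarded
```

In this toy case the two baselines differ (13 vs 10), because the environment keeps changing while the agent does nothing.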

Tradeoff between desirable properties for baseline choices in impact measures

I would say that impact measures don't consider these kinds of judgments. The "doing nothing" baseline can be seen as analogous to the agent never being deployed, e.g. in the Low Impact AI paper. If the agent is never deployed, and someone dies in the meantime, then it's not the agent's responsibility and is not part of the agent's impact on the world.

I think the intuition you are describing partly arises from the choice of language: "killing someone by not doing something" vs "someone dying while you are doing nothing". The word "killing" is an active verb that carries a connotation of responsibility. If you taboo this word, does your question persist?

Tradeoff between desirable properties for baseline choices in impact measures

Thanks Flo for pointing this out. I agree with your reasoning for why we want the Markov property. For the second modification, we can sample a rollout from the agent policy rather than computing a penalty over all possible rollouts. For example, we could randomly choose an integer N, roll out the agent policy and the inaction policy for N steps, and then compare the resulting states. This requires a complete environment model (which makes it more complicated to apply standard RL), while inaction rollouts only require a partial environment model (predicting the outcome of the noop action in each state). If you don't have a complete environment model, then you can still use the first modification (sampling a baseline state from the inaction rollout). 
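
A rough sketch of the sampled-rollout idea, under assumptions: `step` stands in for a complete environment model, `deviation` for whatever deviation measure is in use, and the uniform choice of N up to `max_horizon` is just one possible sampling scheme. All names and the toy line-world are hypothetical.

```python
import random
from typing import Callable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def sampled_rollout_deviation(
    state: State,
    agent_policy: Callable[[State], Action],
    step: Callable[[State, Action], State],  # assumed complete environment model
    noop: Action,
    deviation: Callable[[State, State], float],
    max_horizon: int,
    rng: random.Random,
) -> float:
    """Sample N, roll out both policies for N steps, compare the end states."""
    n = rng.randint(1, max_horizon)  # inclusive on both ends
    agent_state = state
    inaction_state = state
    for _ in range(n):
        # Rolling out the agent policy needs the model for arbitrary actions.
        agent_state = step(agent_state, agent_policy(agent_state))
        # The inaction rollout only needs the model for the noop action.
        inaction_state = step(inaction_state, noop)
    return deviation(agent_state, inaction_state)


# Hypothetical toy line-world: the state is a position, actions are offsets.
def toy_step(position: int, action: int) -> int:
    return position + action


penalty = sampled_rollout_deviation(
    state=0,
    agent_policy=lambda s: 1,          # always move right
    step=toy_step,
    noop=0,
    deviation=lambda a, b: float(abs(a - b)),
    max_horizon=5,
    rng=random.Random(0),
)
```

In this toy world the sampled penalty equals the sampled N, since the agent drifts one step away from the inaction rollout at every step.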

Tradeoff between desirable properties for baseline choices in impact measures

I don't think the pedestrian example shows a need for semantic structure. The example is intended to illustrate that an agent with the stepwise inaction baseline has no incentive to undo the delayed effect that it has set up. We want the baseline to incentivize the agent to undo any delayed effect, whether it involves hitting a pedestrian or making a pigeon fly. 

The pedestrian and pigeon effects differ in the magnitude of impact, so it is the job of the deviation measure to distinguish between them and penalize the pedestrian effect more. Optionality-based deviation measures (AU and RR) capture this distinction because hitting the pedestrian eliminates more options than making the pigeon fly.
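
As a toy illustration only, the sketch below counts the states that are reachable from the baseline state but no longer reachable from the current state. This is a simplified, undiscounted stand-in for an optionality-based deviation measure, not the exact definition of RR or AU. The graph is hypothetical, including the assumption that the pigeon eventually returns while hitting the pedestrian is irreversible.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical transition graph: state -> states reachable in one step.
GRAPH: Dict[str, List[str]] = {
    "default": ["errand_done", "pigeon_flew", "pedestrian_hit"],
    "errand_done": ["default"],
    "pigeon_flew": ["default"],   # assumed reversible: the pigeon comes back
    "pedestrian_hit": [],         # assumed irreversible
}


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return all states reachable from `start` (including itself) via BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def options_lost(graph: Dict[str, List[str]], current: str, baseline: str) -> int:
    """Count states reachable from the baseline but not from the current state."""
    return len(reachable(graph, baseline) - reachable(graph, current))


pigeon_penalty = options_lost(GRAPH, "pigeon_flew", "default")
pedestrian_penalty = options_lost(GRAPH, "pedestrian_hit", "default")
```

Under these assumptions the pedestrian effect is penalized more than the pigeon effect, without the measure being told anything about pedestrians or pigeons.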

Tradeoff between desirable properties for baseline choices in impact measures

The baseline is not intended to indicate what should happen, but rather what happens by default. The role of the baseline is to filter out effects that were not caused by the agent, to avoid penalizing the agent for them (which would produce interference incentives). Explicitly specifying what should happen usually requires environment-specific human input, and impact measures generally try to avoid this.

Specification gaming: the flip side of AI ingenuity

Thanks Koen for your feedback! You make a great point about a clearer call to action for RL researchers. I think an immediate call to action is to be aware of the following:

  • there is a broader scope of aligned RL agent design
  • there are difficult unsolved problems in this broader scope
  • for sufficiently advanced agents, these problems need general solutions rather than ad-hoc ones

Then a long-term call to action (if/when they are in the position to deploy an advanced AI system) is to consider the broader scope and look for general solutions to specification problems rather than deploying ad-hoc solutions. For those general solutions, they could refer to the safety literature and/or consult the safety community.

Specification gaming: the flip side of AI ingenuity

Thanks John for the feedback! As Oliver mentioned, the target audience is ML researchers (particularly RL researchers). The post is intended as an accessible introduction to the specification gaming problem for an ML audience, connecting their perspective with a safety perspective on the problem. It is not intended to introduce novel concepts or a principled breakdown of the problem (I've made a note to clarify this in a later version of the post).

Regarding your specific questions about the breakdown, I think faithfully capturing the human concept of the task in a reward function is complementary to the other subproblems (mistaken assumptions and reward tampering). If we had a reward function that perfectly captures the task concept, we would still need to implement it based on correct assumptions about the environment, and make sure the agent does not tamper with its implementation in the environment. We could say that capturing the task concept happens at the design specification level, while the other subproblems happen at the implementation specification level, as given in this post.

Specification gaming: the flip side of AI ingenuity

Thanks Adam for the feedback - glad you enjoyed the post!

For the Lego example, the agent received a fixed shaping reward for grasping the red brick if the bottom face was above a certain height (3cm), rather than being rewarded in proportion to the height of the bottom face. Thus, it found an easy way to collect the shaping reward by flipping the brick, while stacking it upside down on the blue brick would be a more difficult way to get the same shaping reward. The current description of the example in the post does make it sound like the reward is proportional to the height - I'll make a note to fix this in a later version of the post.
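
The difference between the two reward shapes can be sketched as follows. Only the 3cm threshold comes from the example; the bonus value, the scale, and the two heights are hypothetical, and the sketch models only the height condition rather than the full reward used in the experiment.

```python
HEIGHT_THRESHOLD_CM = 3.0  # threshold from the example


def fixed_shaping_reward(bottom_face_height_cm: float, bonus: float = 1.0) -> float:
    """Fixed bonus once the bottom face is above the threshold (assumed bonus)."""
    return bonus if bottom_face_height_cm > HEIGHT_THRESHOLD_CM else 0.0


def proportional_shaping_reward(bottom_face_height_cm: float, scale: float = 1.0) -> float:
    """Reward that grows with the height of the bottom face (assumed scale)."""
    return scale * bottom_face_height_cm


# Hypothetical heights of the red brick's bottom face in the two outcomes.
flipped_height_cm = 3.2   # brick flipped over on the table
stacked_height_cm = 4.0   # brick stacked on top of the blue brick

fixed_flipped = fixed_shaping_reward(flipped_height_cm)
fixed_stacked = fixed_shaping_reward(stacked_height_cm)
prop_flipped = proportional_shaping_reward(flipped_height_cm)
prop_stacked = proportional_shaping_reward(stacked_height_cm)
```

With the fixed bonus, both outcomes earn the same shaping reward, so the easier one (flipping) is preferred. With these assumed heights, a proportional reward would distinguish the two outcomes.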

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Thanks Matthew for your interesting points! I agree that it's not clear whether the pandemic is a good analogy for slow takeoff. When I was drafting the post, I started with an analogy with "medium" takeoff (on the time scale of months), but later updated towards the slow takeoff scenario being a better match. The pandemic response in 2020 (since covid became apparent as a threat) is most relevant for the medium takeoff analogy, while the general level of readiness for a coronavirus pandemic prior to 2020 is most relevant for the slow takeoff analogy.

I agree with Ben's response to your comment. Covid did not spring into existence in a world where pandemics are irrelevant, since there have been many recent epidemics and experts have been sounding the alarm about the next one. You make a good point that epidemics don't gradually increase in severity, though I think they have been increasing in frequency and global reach as a result of international travel, and the possibility of a virus escaping from a lab also increases the chances of encountering more powerful pathogens in the future. Overall, I agree that we can probably expect AI systems to increase in competence more gradually in a slow takeoff scenario, which is a reason for optimism.

Your objections to the parallel with covid not being taken seriously seem reasonable to me, and I'm not very confident in this analogy overall. However, one could argue that the experience with previous epidemics should have resulted in a stronger prior on pandemics being a serious threat. I think it was clear from the outset of the covid epidemic that it was much more contagious than seasonal flu, which should have produced an update towards it being a serious threat as well.

I agree that the direct economic effects of advanced AI would be obvious to observers, but I don't think this would necessarily translate into widespread awareness that much more powerful AI systems, which could transform the world even more, are imminent. People are generally bad at reacting to exponential trends, as we've seen in the covid response. If we had general-purpose household robots in every home, I would expect some people to take the risks of general AI more seriously, and some other people to say "I don't see my household robot trying to take over the world, so these concerns about general AI are overblown". Overall, as more advanced AI systems are developed and have a large economic impact, I would expect the proportion of people who take the risks of general AI seriously to increase steadily, but wouldn't expect widespread consensus until relatively late in the game.
