Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog:

ELK contest submission: route understanding through the human ontology

I think our proposal addresses the "simple steganography" problem, as described in "ELK prize results / First counterexample: simple steganography":

By varying the phrasing and syntax of an answer without changing its meaning, a reporter could communicate large amounts of information to the auxiliary model. Similarly, there are many questions where a human is unsure about the answer and the reporter knows it. A reporter could encode information by answering each of these questions arbitrarily. Unless the true answers have maximum entropy, this strategy could encode more information than direct translation. 

An entropy penalty on the reporter's output would discourage the spurious variations in answers described above (assuming that steganographic output has higher entropy than the true answers). I agree that simple approaches like an entropy penalty do not address the general case of steganography, such as the one in "Harder counterexample: steganography using flexible degrees of freedom".
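As a rough illustration of what an entropy penalty could look like, here is a minimal sketch. It assumes the reporter outputs a discrete probability distribution over candidate answers; the names `shannon_entropy`, `penalized_loss`, and the coefficient `coeff` are hypothetical and not taken from the proposal.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution given as probabilities."""
    # Terms with p == 0 contribute nothing, so they are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0)


def penalized_loss(task_loss, answer_probs, coeff=0.1):
    """Add an entropy penalty to a task loss (coeff is an assumed hyperparameter)."""
    # Higher-entropy answer distributions are penalized more heavily.
    return task_loss + coeff * shannon_entropy(answer_probs)


# A reporter that answers arbitrarily (uniform over 4 phrasings) pays a larger
# penalty than one that commits to a single answer.
arbitrary = penalized_loss(1.0, [0.25, 0.25, 0.25, 0.25])
committed = penalized_loss(1.0, [1.0, 0.0, 0.0, 0.0])
```

Under these assumptions the penalty only helps when the steganographic strategy really does produce higher-entropy output than honest answers, which is the caveat stated above.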

Possible takeaways from the coronavirus pandemic for slow AI takeoff

I generally endorse the claims made in this post and the overall analogy. Since this post was written, there are a few more examples I can add to the categories for slow takeoff properties. 

Learning from experience

  • The UK procrastinated on locking down in response to the Alpha variant due to political considerations (not wanting to "cancel Christmas"), though it was known that timely lockdowns are much more effective.
  • Various countries reacted to Omicron with travel bans after they already had community transmission (e.g. Canada and the UK), while it was known that these measures would be ineffective.

Warning signs

  • Since there is a non-negligible possibility that covid-19 originated in a lab, the pandemic can be viewed as a warning sign about the dangers of gain-of-function research. As far as I know, this warning sign has not yet been acted upon (there is no significant new initiative to ban gain-of-function research).
  • I think there was some improvement at acting on warning signs for subsequent variants (e.g. I believe that measures in response to Omicron were generally taken faster than measures in response to the original outbreak). This gives me some hope that our institutions can get better at reacting to warning signs with practice (at least warning signs similar to those they have encountered before), and suggests that dealing with narrow AI disasters could improve institutions' ability to respond to warning signs.

Consensus on the problem

  • It took a long time to reach consensus on the importance of mask wearing and aerosol transmission.
  • We still don't seem to have widespread consensus that transmission through surfaces is insignificant, at least judging by the amount of effort that seems to go into disinfection and cleaning in various buildings that I visit.

More Is Different for AI

Really excited to read this sequence as well!

Optimization Concepts in the Game of Life

Ah I see, thanks for the clarification! The 'bottle cap' (block) example is robust to removing any one cell but not robust to adding cells next to it (as mentioned in Oscar's comment). So most random perturbations that overlap with the block will probably destroy it. 
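The robustness claim about the block can be checked with a small simulation. This is a minimal sketch using the standard Game of Life rules on an unbounded board; the helper `step` and the choice of `(0, 2)` as the added neighbouring cell are illustrative assumptions, and only a single generation is examined.

```python
from collections import Counter


def step(live):
    """Advance a set of live (row, col) cells by one Game of Life generation."""
    # Count, for every cell, how many live neighbours it has.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next generation with exactly 3 neighbours,
    # or with 2 neighbours if it is already alive.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}


BLOCK = {(0, 0), (0, 1), (1, 0), (1, 1)}

# Removing any single cell: the block is restored after one generation.
restored_after_removal = all(step(BLOCK - {cell}) == BLOCK for cell in BLOCK)

# Adding one live cell next to the block: the next generation is no longer the block.
after_addition = step(BLOCK | {(0, 2)})
```

Here `restored_after_removal` is true, while `after_addition` differs from `BLOCK`. A longer run would be needed to see what the perturbed pattern eventually settles into.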

Optimization Concepts in the Game of Life

  1. Actually, we realized that if we consider an empty board an optimizing system, then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.

Optimization Concepts in the Game of Life

Thanks for pointing this out! We realized that if we consider an empty board an optimizing system, then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.

The 'bottle cap' example would be an optimizing system if it were robust to cells colliding or interacting with it, e.g. being hit by a glider (similarly to the eater).

List of good AI safety project ideas?

Thanks Aryeh for collecting these! I added them to a new Project Ideas section in my AI Safety Resources list.

Classifying specification problems as variants of Goodhart's Law

Writing this post helped clarify my understanding of the concepts in both taxonomies - the different levels of specification and types of Goodhart effects. The parts of the taxonomies that I was not sure how to match up usually corresponded to the concepts I was most confused about. For example, I initially thought that adversarial Goodhart is an emergent specification problem, but upon further reflection this didn't seem right. Looking back, I think I still endorse the mapping described in this post.

I had hoped to get more comments on this post proposing other ways to match up these concepts, and I think the post would have had more impact if there had been more discussion of its claims. The low level of engagement with this post was an update for me that the exercise of connecting different maps of safety problems is less valuable than I thought.

Tradeoff between desirable properties for baseline choices in impact measures

It was not my intention to imply that semantic structure is never needed - I was just saying that the pedestrian example does not indicate the need for semantic structure. I would generally like to minimize the use of semantic structure in impact measures, but I agree it's unlikely we can get away without it. 

There are some kinds of semantic structure that the agent can learn without explicit human input, e.g. by observing how humans have arranged the world (as in the RLSP paper). I think it's plausible that agents can learn the semantic structure that's needed for impact measures through unsupervised learning about the world, without relying on human input. This information could be incorporated in the weights assigned to reaching different states or satisfying different utility functions by the deviation measure (e.g. states where pigeons / cats are alive). 

Tradeoff between desirable properties for baseline choices in impact measures

Looks great, thanks! Minor point: in the sparse reward case, rather than "setting the baseline to the last state in which a reward was achieved", we set the initial state of the inaction baseline to be this last rewarded state, and then apply noops from this initial state to obtain the baseline state (otherwise this would be a starting state baseline rather than an inaction baseline). 
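The distinction can be sketched in code. This is a toy illustration rather than the implementation from the paper: `transition`, `NOOP`, and the conveyor-belt dynamics are all hypothetical, chosen only so that the environment changes on its own when the agent does nothing.

```python
NOOP = "noop"


def transition(state, action):
    """Toy dynamics (assumed): an item on a conveyor belt moves 1 step per tick."""
    # The agent's action is ignored here except that "push" adds extra movement.
    return state + 1 + (1 if action == "push" else 0)


def inaction_baseline_state(transition_fn, last_rewarded_state, num_steps):
    """Roll out noops from the last rewarded state to obtain the baseline state."""
    state = last_rewarded_state
    for _ in range(num_steps):
        state = transition_fn(state, NOOP)
    return state


last_rewarded_state = 10
steps_since_reward = 3

# Inaction baseline: initialised at the last rewarded state, then noops applied.
inaction_baseline = inaction_baseline_state(transition, last_rewarded_state, steps_since_reward)

# Starting state baseline: the last rewarded state itself, with no rollout.
starting_state_baseline = last_rewarded_state
```

In this toy environment the two baselines differ (13 versus 10) because the belt keeps moving under noops, which is why using the last rewarded state directly would amount to a starting state baseline.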
