[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by commenting on this post.

Highlights

Designing robust & reliable AI systems and how to succeed in AI (Rob Wiblin and Pushmeet Kohli): (As is typical for large content, I'm only summarizing the most salient points, and ignoring entire sections of the podcast that didn't seem as relevant.)

In this podcast, Rob delves into the details of Pushmeet's work on making AI systems robust. Pushmeet doesn't view AI safety and AI capabilities as particularly distinct -- part of building a good AI system is ensuring that the system is safe, robust, reliable, and generalizes well. Otherwise, it won't do what we want, so why would we even bother using it. He aims to improve robustness by actively searching for behaviors that violate the specification, or by formally verifying particular properties of the neural net. That said, he also thinks that one of the major challenges here is in figuring out the specification of what to verify in the first place.

He sees the problems in AI as being similar to the ones that arise in programming and computer security. In programming, it is often the case that the program that one writes down does not accurately match the intended specification, leading to bugs. Often we simply accept that these bugs happen, but for security critical systems such as traffic lights we can use techniques like testing, fuzzing, symbolic execution, and formal verification that allow us to find these failures in programs. We now need to develop these techniques for machine learning systems.

The analogy can go much further. Static analysis involves understanding properties of a program separately from any inputs, while dynamic analysis involves understanding a program with a specific input. Similarly, we can have "static" interpretability, which understands the model as a whole (as in Feature visualization), or "dynamic" interpretability, which explains the model's output for a particular input. Another example is that the technique of abstract interpretation of programs is analogous to a particular method for verifying properties of neural nets.

This analogy suggests that we have faced the problems of AI safety before, and have made substantial progress on them; the challenge is now in doing it again but with machine learning systems. That said, there are some problems that are unique to AGI-type systems; it's just not the specification problem. For example, it is extremely unclear how we should communicate with such a system, which may have its own concepts and models that are very different from those of humans. We could try to use natural language, but if we do we need to ground the natural language in the way that humans do, and it's not clear how we could do that, though perhaps we could test if the learned concepts generalize to new settings. We could also try to look at the weights of our machine learning model and analyze whether it has learned the concept -- but only if we already have a formal specification of the concept, which seems hard to get.

Rohin's opinion: I really like the analogy between programming and AI; a lot of my thoughts have been shaped by thinking about this analogy myself. I agree that the analogy implies that we are trying to solve problems that we've attacked before in a different context, but I do think there are significant differences now. In particular, with long-term AI safety we are considering a setting in which mistakes can be extremely costly, and we can't provide a formal specification of what we want. Contrast this to traffic lights, where mistakes can be extremely costly but I'm guessing we can provide a formal specification of the safety constraints that need to be obeyed. To be fair, Pushmeet acknowledges this and highlights specification learning as a key area of research, but to me it feels like a qualitative difference from previous problems we've faced, whereas I think Pushmeet would disagree with that (but I'm not sure why).

Technical AI alignment

Learning human intent

Perceptual Values from Observation (Ashley D. Edwards et al) (summarized by Cody): This paper proposes a technique for learning from raw expert-trajectory observations by assuming that the last state in the trajectory is the state where the goal was achieved, and that other states have value in proportion to how close they are to a terminal state in demonstration trajectories. They use this as a grounding to train models predicting value and action-value, and then use these estimated values to determine actions.

Cody's opinion: This idea definitely gets points for being a clear and easy-to-implement heuristic, though I worry it may have trouble with videos that don't match its goal-directed assumption.

Delegative Reinforcement Learning (Vanessa Kosoy): Consider environments that have “traps”: states that permanently curtail the long-term value that an agent can achieve. A world without humans could be one such trap. Traps could also happen after any irreversible action, if the new state is not as useful for achieving high rewards as the old state.

In such an environment, an RL algorithm could simply take no actions, in which case it incurs regret that is linear in the number of timesteps so far. (Regret is the difference between the expected reward under the optimal policy and the policy actually executed, so if the average reward per timestep of the optimal policy is 2 and doing nothing is always reward 0, then the regret will be ~2T where T is the number of timesteps, so regret is linear in the number of timesteps.) Can we find an RL algorithm that will guarantee regret sublinear in the number of timesteps, regardless of the environment?

Unsurprisingly, this is impossible, since during exploration the RL agent could fall into a trap, which leads to linear regret. However, let's suppose that we could delegate to an advisor who knows the environment: what must be true about the advisor for us to do better? Clearly, the advisor must be able to always avoid traps (otherwise the same problem occurs). However, this is not enough: getting sublinear regret also requires us to explore enough to eventually find the optimal policy. So, the advisor must have at least some small probability of being optimal, which the agent can then learn from. This paper proves that with these assumptions there does exist an algorithm that is guaranteed to get sublinear regret.

Rohin's opinion: It's interesting to see what kinds of assumptions are necessary in order to get AI systems that can avoid catastrophically bad outcomes, and the notion of "traps" seems like a good way to formalize this. I worry about there being a Cartesian boundary between the agent and the environment, though perhaps even here as long as the advisor is aware of problems caused by such a boundary, they can be modeled as traps and thus avoided.

Of course, if we want the advisor to be a human, both of the assumptions are unrealistic, but I believe Vanessa's plan is to make the assumptions more realistic in order to see what assumptions are actually necessary.

One thing I wonder about is whether the focus on traps is necessary. With the presence of traps in the theoretical model, one of the main challenges is in preventing the agent from falling into a trap due to ignorance. However, it seems extremely unlikely that an AI system manages to take some irreversible catastrophic action by accident -- I'm much more worried about the case where the AI system is adversarially optimizing against us and intentionally takes an irreversible catastrophic action.

Reward learning theory

By default, avoid ambiguous distant situations (Stuart Armstrong)

Handling groups of agents

PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings (Nicholas Rhinehart et al) (summarized by Cody): This paper models a multi-agent self driving car scenario by developing a model of future states conditional on both its own action and the action of multiple humans, and picking the latent-space action that balances between the desiderata of reaching its goal and preferring trajectories seen in the expert multi-agent trajectories its shown (where, e.g., two human agents rarely crash into one another).

Miscellaneous (Alignment)

Reinforcement learning with imperceptible rewards (Vanessa Kosoy): Typically in reinforcement learning, the reward function is defined over observations and actions, rather than directly on states, which ensures that the reward can always be calculated. However, in reality we care about underlying aspects of the state that may not easily be computed from observations. We can't guarantee sublinear regret, since if you are unsure about the reward in some unobservable part of the state that your actions nonetheless affect, then you can never learn the reward and approach optimality.

To fix this, we can work with rewards that are restricted to instrumental states only. I don't understand exactly how these work, since I don't know the math used in the formalization, but I believe the idea is for the set of instrumental states to be defined such that for any two instrumental states, there exists some "experiment" that the agent can run in order to distinguish between the states in some finite time. The main point of this post is that we can establish a regret bound for MDPs (not POMDPs yet), assuming that there are no traps.

AI strategy and policy

Beijing AI Principles: These principles are a collaboration between Chinese academia and industry, and hit upon many of the problems surrounding AI discussed today, including fairness, accountability, transparency, diversity, job automation, responsibility, ethics, etc. Notably for long-termists, it specifically mentions control risks, AGI, superintelligence, and AI races, and calls for international collaboration in AI governance.

Other progress in AI

Deep learning

Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask (Hattie Zhou, Janice Lan, Rosanne Liu et al) (summarized by Cody): This paper runs a series of experimental ablation studies to better understand the limits of the Lottery Ticket Hypothesis, and investigate variants of the initial pruning and masking procedure under which its effects are more and less pronounced. It is first and foremost a list of interesting results, without any central theory tying them together. These results include the observation that keeping pruned weights the same sign as their "lottery ticket" initialization seems more important than keeping their exact initial magnitudes, that taking a mixed strategy of zeroing pruned weights or freezing them at initialization can get better results, and that applying a learned 0/1 mask to a re-initialized network can get surprisingly high accuracy even without re-training.

Cody's opinion: While it certainly would have been exciting to have a paper presenting a unified (and empirically supported) theoretical understanding of the LTH, I respect the fact that this is such a purely empirical work, that tries to do one thing - designing and running clean, clear experiments - and does it well, without trying to construct explanations just for the sake of having them. We still have a ways to go in understanding the optimization dynamics underlying lottery tickets, but these seem like important and valuable data points on the road to that understanding.

Read more: Cody's longer summary

Applications

Challenges of Real-World Reinforcement Learning (Gabriel Dulac-Arnold et al) (summarized by Cody): This paper is a fairly clear and well-done literature review focusing on the difficulties that will need to be overcome in order to train and deploy reinforcement learning on real-world problems. They describe each of these challenges - which range from slow simulation speeds, to the need to frequently learn off-policy, to the importance of safety in real world systems - and for each propose or refer to an existing metric to capture how well a given RL model addresses the challenge. Finally, they propose a modified version of a humanoid environment with some of these real-world-style challenges baked in, and encourage other researchers to test systems within this framework.

Cody's opinion: This is a great introduction and overview for people who want to better understand the gaps between current RL and practically deployable RL. I do wish the authors had spent more time explaining and clarifying the design of their proposed testbed system, since the descriptions of it are all fairly high level.

News

Offer of collaboration and/or mentorship (Vanessa Kosoy): This is exactly what it sounds like. You can find out more about Vanessa's research agenda from The Learning-Theoretic AI Alignment Research Agenda (AN #13), and I've summarized two of her recent posts in this newsletter.

Human-aligned AI Summer School (Jan Kulveit et al): The second Human-aligned AI Summer School will be held in Prague from July 25-28, with a focus on "optimization and decision-making". Applications are due June 15.

Open Phil AI Fellowship — 2019 Class: The Open Phil AI Fellows for this year have been announced! Congratulations to all of the fellows :)

TAISU - Technical AI Safety Unconference (Linda Linsefors)

Learning-by-doing AI Safety workshop (Linda Linsefors)

Hi Rohin, thank you for writing about my work! I want to address some issues you brought up regarding Delegative RL.

I worry about there being a Cartesian boundary between the agent and the environment, though perhaps even here as long as the advisor is aware of problems caused by such a boundary, they can be modeled as traps and thus avoided.

Yes. I think that the Cartesian boundary is part of the definition of the agent, and events that violate the Cartesian boundary should be thought of as destroying the agent. Destruction of the agent is a certainly a trap since from the POV of the agent it is irreversible. See also the "death of the agent" subsection of the imperceptible rewards essay.

One thing I wonder about is whether the focus on traps is necessary. With the presence of traps in the theoretical model, one of the main challenges is in preventing the agent from falling into a trap due to ignorance. However, it seems extremely unlikely that an AI system manages to take some irreversible catastrophic action by accident -- I'm much more worried about the case where the AI system is adversarially optimizing against us and intentionally takes an irreversible catastrophic action.

I think that most of the "intentional" catastrophic actions can be regarded as "due to ignorance" from an appropriate perspective (the main exception is probably non-Cartesian daemons). Consider two examples:

Example 1 is corrupt states, that I discussed here. These are states in which the specified reward function doesn't match the intended reward function (and also possibly the advisor becomes unreliable). We can equip the agent with a prior that accounts for the existence of such states. However, without further help, the agent doesn't have enough information to know when it could enter one. So, if the agent decides to e.g. hack its own reward channel, one perspective is that it is an intentional action against us, but another perspective is that it is due to the agent's ignorance of the true model of corruption. This problem is indeed fixed by Delegative RL (assuming that, in uncorrupt states, the advisor's actions don't lead to corruption).

Example 2 is malign hypotheses. The agent's prior may contain hypotheses that are agentic in themselves. Such a hypothesis can intentionally produce correct predictions up to a "traitorous turn" point, at which it produces predictions that manipulate the agent into an irreversible catastrophic action. From the perspective of the "outer" agent this is "ignorance", but from the perspective of the "inner" agent, this is intentional. Once again delegative RL fixes it: at the traitorous turn, a DRL agent detects the ambiguity in predictions and the critical need to take the right action, leading it delegate. Observing the advisor's action leads it to update away from the malign hypothesis.

I think that most of the "intentional" catastrophic actions can be regarded as "due to ignorance" from an appropriate perspective

This makes sense to me at a high level, but I'm struggling to connect it to the math. It seems like delegative RL as described in the post I read wouldn't solve reward hacking, because it has a specific reward function that it is trying to maximize, and it can't "learn" a new reward function that it should be optimizing. I suppose if the advisor never explores an area of state space then the agent will never go there, but it doesn't feel like much progress if our safety guarantees require the delegator to never explore anywhere that reward hacking could occur.

Example 1 is corrupt states, that I discussed here. These are states in which the specified reward function doesn't match the intended reward function (and also possibly the advisor becomes unreliable).

So to be clear, this setting requires a different algorithm, detailed in the other post, right? (I haven't read that post in detail.) Maybe that's the answer to my question above; that in fact delegative RL doesn't solve reward hacking, but this other post does.

Dealing with corrupt states requires a "different" algorithm, but the modification is rather trivial: for each hypothesis that includes dynamics and corruption, you need to replace the corrupt states by an inescapable state with reward zero and run the usual PSRL algorithm on this new prior. Indeed, the algorithm deals with corruption by never letting the agent go there. I am not sure I understand why you think this is not a good approach. Consider a corrupt state in which the human's brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored? Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of corruption but not all of it and not with 100% success probability. In this case, I agree, the current model is extremely simplified, but it still feels like progress. You can see this for a model of continuous corruption in DIRL, a simpler setting. More generally, I think that a better version of the formalism would build on ideas from quantilization and catastrophe mitigation to arrive at a setting where, you have a low rate of falling into traps or accumulating corruption as long as your policy remains "close enough" to the advisor policy w.r.t. some metric similar to infinity-Renyi divergence (and, as long as your corruption remains low).

Consider a corrupt state in which the human's brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored?

I agree that state shouldn't be explored.

Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of corruption but not all of it and not with 100% success probability.

That seems closer to my objection but not exactly it.

Indeed, the algorithm deals with corruption by never letting the agent go there.

For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?

Maybe I'm also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn't realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn't seem like DRL saves us here.

Maybe another way of putting it -- is there additional safety conferred by this approach that you couldn't get by having a human review all of the AI's actions? If so, should I think of this as "we want a human to review actions, but that's expensive, DRL is a way to make it more sample efficient"?

For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?

Ultimately, the set of hypotheses should be something like the universal prior. More precisely, it should be whatever we need to use instead of the universal prior to get "general" intelligence that is computationally efficient. In other words, I don't think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.

Maybe I'm also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn't realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn't seem like DRL saves us here.

The way I imagine it will work, the advisor will not do something weird and complicated that ey don't understand emself. So the AI also cannot do something weird and complicated that the advisor doesn't understand. In order for this not to be too constraining, I imagine the advisor having some kind of "diary" where ey write down eir thoughts and reasoning, which is considered a part of normal interaction with the environment. The advisor will only act on complicated plans after laying down the reasoning. The AI will then have to write down in this diary an explanation for its plans that will be understandable by and convincing for the advisor. This is partially protected from manipulations, because the explanation has to come from the space of explanations that the advisor could produce. That said, I think this defense from manipulation is insufficient in itself (because the AI can write down all arguments for a given position that the advisor could come up with, without writing down any arguments against it), and I have a research direction based on the "debate" approach about how to strengthen it.

Maybe another way of putting it -- is there additional safety conferred by this approach that you couldn't get by having a human review all of the AI's actions? If so, should I think of this as "we want a human to review actions, but that's expensive, DRL is a way to make it more sample efficient"?

The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nonorobot example). In the "soft" version it will become a limit on how much the AI policy deviates from the advisor policy, so it's not quite a review in the usual sense: there is no binary division between "legal" and "illegal" actions. I think of it more like, the AI should emulate an "improved" version of the advisor: do all the things the advisor would do on eir "best day".

In other words, I don't think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.

This seems wrong to me, but I'm having trouble articulating why. It feels like for the actual "prior" we use there will be many more hypotheses for capable behavior than for safe, capable behavior.

A background fact that's probably relevant: I don't expect that we'll be using an explicit prior, and to the extent that we have an implicit prior, I doubt it will look anything like the universal prior.

The way I imagine it will work, the advisor will not do something weird and complicated that ey don't understand emself. [...] I have a research direction based on the "debate" approach about how to strengthen it.

Yeah, this seems good to me!

The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nonorobot example).

Okay, that makes sense.

I focus mostly on formal properties algorithms can or cannot have, rather than the algorithms themselves. So, from my point of view, it doesn't matter whether the prior is "explicit" and I doubt it's even a well-defined question. What I mean by "prior" is, more or less, whatever probability measure has the best Bayesian regret bound for the given RL algorithm.

I think the prior will have to look somewhat like the universal prior. Occam's razor is a foundational principle of rationality, and any reasonable algorithm should have inductive bias towards simpler hypotheses. I think there's even some work trying to prove that deep learning already has such inductive bias. At the same time, the space of hypotheses has to be very rich (although still constrained by computational resources and some additional structural assumptions needed to make learning feasible).

I think that DRL doesn't require a prior (or, more generally, algorithmic building blocks) substantially different from what is needed for capabilities, since if your algorithm is superintelligent (in the sense that, it's relevant to either causing or mitigating X-risk) then it has to create sophisticated models of the world that include people, among other things, and therefore forcing it to model the advisor as well doesn't make the task substantially harder (well, it is harder in the sense that the regret bound is weaker, but that is not because of the prior).

I really like the analogy between programming and AI; a lot of my thoughts have been shaped by thinking about this analogy myself.

I'm interested to know what insights you gained from this analogy, aside from what Pushmeet talked about.

To be fair, Pushmeet acknowledges this and highlights specification learning as a key area of research

I'm curious to learn about specification learning but didn't see where in the transcript Pushmeet talked about it. Can you give a pointer to what Pushmeet said, or a blog post or paper?

The following quotes are from the interview transcript:

In some sense every machine learning practitioner should be thinking about the question of generalization. Does my system generalize? Is my system robust?

In the long run, it seems to me that generalization starts to look like solving metaphilosophy.

Science is a very broad area and it is one of the key topics which gives us a way to understand about the world that we live in and even who we are. In terms of the topics, we have no constraint on topics. We are looking for problems in the general area of science, whether it’s biology, whether it’s physics, whether it’s chemistry, where machine learning can help, and not just that machine learning can help, but a way of doing machine learning where you have a dedicated team which works with conviction towards a very challenging problem could help.

There are supposed to be regular meetings between FHI and DeepMind, but it seems like the leaders at DeepMind aren't familiar with or aren't convinced of Nick Bostrom's ideas about The Vulnerable World Hypothesis and Differential Technological Development.

RE meetings with FHI/DeepMind (etc.): I think "aren't familiar with or aren't convinced" is part of it, but there are also political elements to all of this.

In general, I think most everything that is said publicly about AI-Xrisk has some political element to it. And people's private views are inevitably shaped by their public views somewhat (in expectation) as well.

I find it pretty hard to account for the influence of politics, though. And I probably overestimate it somewhat.

I'm interested to know what insights you gained from this analogy, aside from what Pushmeet talked about.

I'm not sure there's any particular insights I can point to; it was more an analogy that helped see how we had tackled similar problems before. I don't think it's that useful for figuring out solutions -- there's a reason I haven't done any projects at the intersection of PL and AI, despite my huge absolute advantage at it.

An example of the analogy is in program synthesis, in which you give some specification of what a program should do, and then the computer figures out the program that meets that specification. When you specify a program via input-output examples, you need to have "capability control" to make sure that it doesn't come up with the program "if input == X: return Y elif ...". It also often "games" the specification; one of my favorite examples is the function for sorting an array -- often a new student will provide the specification as $\forall i, j : i < j ⟹ A [i] < A [j]$ , and then the outputted program is "return []" or "return range(len(A))".

There are supposed to be regular meetings between FHI and DeepMind, but it seems like the leaders at DeepMind aren't familiar with or aren't convinced of Nick Bostrom's ideas about The Vulnerable World Hypothesis and Differential Technological Development.

As context (not really disagreeing), afaik those meetings are between DeepMind's AGI safety team and FHI. Pushmeet is not on that team and so probably doesn't attend those meetings.

I'm curious to learn about specification learning but didn't see where in the transcript Pushmeet talked about it. Can you give a pointer to what Pushmeet said, or a blog post or paper?

Here's one section, I think there were others that also alluded to it. Reading it again it's maybe more accurate from this section to talk about the expressivity of our language of specifications, rather than specification learning:

Robert Wiblin: Yeah. That leads into my next question, which was, is it going to be possible to formally verify safety performance on the ML systems that we want to use?

Pushmeet Kohli: I think a more pertinent question is, would it be possible to specify what we want out of the system, because at the end of the day you can only verify what you can specify. I think technically there is nothing, of course this is a very hard problem, but fundamentally we have solved hard search problems and challenging optimization problems and so on. So it is something that we can work towards, but a more critical problem is specifying what do we want to verify? What do we want to formally verify? At the moment we verify, is my function consistent with the input-output examples, that I gave the machine learning system and that’s very easy. You can take all the inputs in the training set, you can compute the outputs and then check whether the outputs are the same or not. That’s a very simple thing. No rocket science needed.

Pushmeet Kohli: Now, you can have a more sophisticated specification saying, well, if I perturb the input in some way or transform the input and I expect the output to not change or change in a specific way, is it true? That’s a harder question and would be showing that we can try to make progress. But what other types of specifications or what other type of behavior or what kind of rich questions might people want to ask in the future? That is a more challenging problem to think about.

Robert Wiblin: Interesting. So then relative to other people you think it’s going to be figuring out what we want to verify that’s harder rather than the verification process itself?

Pushmeet Kohli: Yeah, like how do you specify what is the task? Like a task is not a data set..

Robert Wiblin: How do you? Do you have any thoughts on that?

Pushmeet Kohli: Yes. I think this is something that … It goes into like how this whole idea of, it’s a very philosophical thing, how do we specify tasks? When we talk about tasks, we talk about in human language. I can describe a task to you and because we share some notion of certain concepts, I can tell you, well, we should try to detect whether a car passes by and what is a car, a car has something which has four wheels and something, and can drive itself and so on. And a child with a scooter, which also has four wheels goes past and you say, “Oh that’s a car.” You say, “No, that’s not a car.” The car is slightly different, bigger, basically people can sit inside it and so on. I’m describing the task of detecting what is a car in these human concepts that I believe that you and I share a common understanding of.

Pushmeet Kohli: That’s a key assumption that I’ve made. Will I be able to also communicate with the machine in those same concepts? Does the machine understand those concepts? This is a key question that we have to try to think about. At the moment we’re just saying, oh input, this is the output, input this output that. This is a very poor form of teaching. If you’re trying to teach an intelligent system, just showing it examples is a very poor form of teaching. There’s a much more richer, like when we are talking about solving a task, we are talking in human language and human concepts.

Robert Wiblin: It seems like you might think that it would be reliability enhancing to have better natural language processing that, that’s going to be disproportionately useful?

Pushmeet Kohli: Natural processing would be useful, but the grounding problem of does a machine really understand-

Robert Wiblin: The concepts, or is it just pretending, or is it just aping it?

Pushmeet Kohli: Exactly.

Specification learning is also explicitly called out in Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (AN #52):

Learning specifications: Specifications that capture “correct” behavior in AI systems are often difficult to precisely state. Building systems that can use partial human specifications and learn further specifications from evaluative feedback would be required as we build increasingly intelligent agents capable of exhibiting complex behaviors and acting in unstructured environments.

there’s a reason I haven’t done any projects at the intersection of PL and AI, despite my huge comparative advantage at it.

What's PL? Programming languages?

As context (not really disagreeing), afaik those meetings are between DeepMind’s AGI safety team and FHI. Pushmeet is not on that team and so probably doesn’t attend those meetings.

I guess I was imagining that people in the AGI safety team must know about the "AI for science" project that Pushmeet is heading up, and Pushmeet also heads up the ML safety team, which he says collaborates "very, very closely" with the AGI safety team, so they should have a lot of chances to talk. Perhaps they just talk about technical safety issues, and not about strategy.

Specification learning is also explicitly called out in Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (AN #52)

Do you know if there are any further details about it somewhere, aside from just the bare idea of "maybe we can learn specifications from evaluative feedback"?

What's PL? Programming languages?

Yes, sorry for the jargon.

Do you know if there are any further details about it somewhere, aside from just the bare idea of "maybe we can learn specifications from evaluative feedback"?

Not to my knowledge.

@Rohin: Checking in, is the the alignment newsletter on summer break or something in that space? Just noticed that there hasn't been a newsletter in a bit.

For the last month or two, I've been too busy to get a newsletter out every week. It is still happening, just not on any consistent schedule at the moment.

Hi Rohin, thank you for writing about my work! I want to address some issues you brought up regarding Delegative RL.

I worry about there being a Cartesian boundary between the agent and the environment, though perhaps even here as long as the advisor is aware of problems caused by such a boundary, they can be modeled as traps and thus avoided.

One thing I wonder about is whether the focus on traps is necessary. With the presence of traps in the theoretical model, one of the main challenges is in preventing the agent from falling into a trap due to ignorance. However, it seems extremely unlikely that an AI system manages to take some irreversible catastrophic action by accident -- I'm much more worried about the case where the AI system is adversarially optimizing against us and intentionally takes an irreversible catastrophic action.

I think that most of the "intentional" catastrophic actions can be regarded as "due to ignorance" from an appropriate perspective

Example 1 is corrupt states, that I discussed here. These are states in which the specified reward function doesn't match the intended reward function (and also possibly the advisor becomes unreliable).

Consider a corrupt state in which the human's brain has been somehow scrambled to make em give high rewards. Do you think such a state should be explored?

I agree that state shouldn't be explored.

Maybe your complaint is that in the real world corruption is continuous rather than binary, and the advisor avoids most of corruption but not all of it and not with 100% success probability.

That seems closer to my objection but not exactly it.

Indeed, the algorithm deals with corruption by never letting the agent go there.

For states that cause existential catastrophes this seems obviously desirable. Maybe my objection is more that with this sort of algorithm you need to have the right set of hypotheses in the first place, and that seems like the main difficulty?

Maybe I'm also saying that this feels vulnerable to nearest unblocked strategies. Suppose the AI has learned that its reward function is to maximize paperclips, and the advisor doesn't realize that a complicated gadget the AI has built is a self-replicating nanorobot that will autonomously convert atoms into paperclips. It doesn't seem like DRL saves us here.

Maybe another way of putting it -- is there additional safety conferred by this approach that you couldn't get by having a human review all of the AI's actions? If so, should I think of this as "we want a human to review actions, but that's expensive, DRL is a way to make it more sample efficient"?

In other words, I don't think the choice of prior here is substantially different or more difficult from the choice of prior for AGI from a pure capability POV.

This seems wrong to me, but I'm having trouble articulating why. It feels like for the actual "prior" we use there will be many more hypotheses for capable behavior than for safe, capable behavior.

The way I imagine it will work, the advisor will not do something weird and complicated that ey don't understand emself. [...] I have a research direction based on the "debate" approach about how to strengthen it.

Yeah, this seems good to me!

The current version of the formalism is more or less the latter, but you should imagine the review to be rather conservative (like in the nonorobot example).

Okay, that makes sense.

I really like the analogy between programming and AI; a lot of my thoughts have been shaped by thinking about this analogy myself.

I'm interested to know what insights you gained from this analogy, aside from what Pushmeet talked about.

To be fair, Pushmeet acknowledges this and highlights specification learning as a key area of research

I'm curious to learn about specification learning but didn't see where in the transcript Pushmeet talked about it. Can you give a pointer to what Pushmeet said, or a blog post or paper?

The following quotes are from the interview transcript:

In some sense every machine learning practitioner should be thinking about the question of generalization. Does my system generalize? Is my system robust?

In the long run, it seems to me that generalization starts to look like solving metaphilosophy.

Science is a very broad area and it is one of the key topics which gives us a way to understand about the world that we live in and even who we are. In terms of the topics, we have no constraint on topics. We are looking for problems in the general area of science, whether it’s biology, whether it’s physics, whether it’s chemistry, where machine learning can help, and not just that machine learning can help, but a way of doing machine learning where you have a dedicated team which works with conviction towards a very challenging problem could help.

RE meetings with FHI/DeepMind (etc.): I think "aren't familiar with or aren't convinced" is part of it, but there are also political elements to all of this.

I find it pretty hard to account for the influence of politics, though. And I probably overestimate it somewhat.

I'm interested to know what insights you gained from this analogy, aside from what Pushmeet talked about.

There are supposed to be regular meetings between FHI and DeepMind, but it seems like the leaders at DeepMind aren't familiar with or aren't convinced of Nick Bostrom's ideas about The Vulnerable World Hypothesis and Differential Technological Development.

As context (not really disagreeing), afaik those meetings are between DeepMind's AGI safety team and FHI. Pushmeet is not on that team and so probably doesn't attend those meetings.

I'm curious to learn about specification learning but didn't see where in the transcript Pushmeet talked about it. Can you give a pointer to what Pushmeet said, or a blog post or paper?

Robert Wiblin: Yeah. That leads into my next question, which was, is it going to be possible to formally verify safety performance on the ML systems that we want to use?

Pushmeet Kohli: I think a more pertinent question is, would it be possible to specify what we want out of the system, because at the end of the day you can only verify what you can specify. I think technically there is nothing, of course this is a very hard problem, but fundamentally we have solved hard search problems and challenging optimization problems and so on. So it is something that we can work towards, but a more critical problem is specifying what do we want to verify? What do we want to formally verify? At the moment we verify, is my function consistent with the input-output examples, that I gave the machine learning system and that’s very easy. You can take all the inputs in the training set, you can compute the outputs and then check whether the outputs are the same or not. That’s a very simple thing. No rocket science needed.

Pushmeet Kohli: Now, you can have a more sophisticated specification saying, well, if I perturb the input in some way or transform the input and I expect the output to not change or change in a specific way, is it true? That’s a harder question and would be showing that we can try to make progress. But what other types of specifications or what other type of behavior or what kind of rich questions might people want to ask in the future? That is a more challenging problem to think about.

Robert Wiblin: Interesting. So then relative to other people you think it’s going to be figuring out what we want to verify that’s harder rather than the verification process itself?

Pushmeet Kohli: Yeah, like how do you specify what is the task? Like a task is not a data set..

Robert Wiblin: How do you? Do you have any thoughts on that?

Pushmeet Kohli: Yes. I think this is something that … It goes into like how this whole idea of, it’s a very philosophical thing, how do we specify tasks? When we talk about tasks, we talk about in human language. I can describe a task to you and because we share some notion of certain concepts, I can tell you, well, we should try to detect whether a car passes by and what is a car, a car has something which has four wheels and something, and can drive itself and so on. And a child with a scooter, which also has four wheels goes past and you say, “Oh that’s a car.” You say, “No, that’s not a car.” The car is slightly different, bigger, basically people can sit inside it and so on. I’m describing the task of detecting what is a car in these human concepts that I believe that you and I share a common understanding of.

Pushmeet Kohli: That’s a key assumption that I’ve made. Will I be able to also communicate with the machine in those same concepts? Does the machine understand those concepts? This is a key question that we have to try to think about. At the moment we’re just saying, oh input, this is the output, input this output that. This is a very poor form of teaching. If you’re trying to teach an intelligent system, just showing it examples is a very poor form of teaching. There’s a much more richer, like when we are talking about solving a task, we are talking in human language and human concepts.

Robert Wiblin: It seems like you might think that it would be reliability enhancing to have better natural language processing that, that’s going to be disproportionately useful?

Pushmeet Kohli: Natural processing would be useful, but the grounding problem of does a machine really understand-

Robert Wiblin: The concepts, or is it just pretending, or is it just aping it?

Pushmeet Kohli: Exactly.

Specification learning is also explicitly called out in Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (AN #52):

Learning specifications: Specifications that capture “correct” behavior in AI systems are often difficult to precisely state. Building systems that can use partial human specifications and learn further specifications from evaluative feedback would be required as we build increasingly intelligent agents capable of exhibiting complex behaviors and acting in unstructured environments.

there’s a reason I haven’t done any projects at the intersection of PL and AI, despite my huge comparative advantage at it.

What's PL? Programming languages?

As context (not really disagreeing), afaik those meetings are between DeepMind’s AGI safety team and FHI. Pushmeet is not on that team and so probably doesn’t attend those meetings.

Specification learning is also explicitly called out in Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (AN #52)

Do you know if there are any further details about it somewhere, aside from just the bare idea of "maybe we can learn specifications from evaluative feedback"?

What's PL? Programming languages?

Yes, sorry for the jargon.

Do you know if there are any further details about it somewhere, aside from just the bare idea of "maybe we can learn specifications from evaluative feedback"?

Not to my knowledge.

@Rohin: Checking in, is the the alignment newsletter on summer break or something in that space? Just noticed that there hasn't been a newsletter in a bit.

For the last month or two, I've been too busy to get a newsletter out every week. It is still happening, just not on any consistent schedule at the moment.