The idea that "Agents are systems that would adapt their policy if their actions influenced the world in a different way." works well on mechanised CIDs whose variables are neatly divided into object-level and mechanism nodes: we simply check for a path from a utility function F_U to a policy Pi_D. But to apply this to a physical system, we would need a way to obtain such a partition those variables. Specifically, we need to know (1) what counts as a policy, and (2) whether any of its antecedents count as representations of "influence" on the world (and after all, antecedents A of the policy can only be 'representations' of the influence, because in the real world, the agent's actions cannot influence themselves by some D->A->Pi->D loop). Does a spinal reflex count as a policy? Does an ant's decision to fight come from a representation of a desire to save its queen? How accurate does its belief about the forthcoming battle have to be before this representation counts? I'm not sure the paper answers these questions formally, nor am I sure that it's even possible to do so. These questions don't seem to have objectively right or wrong answers.
So we don't really have any full procedure for "identifying agents". I do think we gain some conceptual clarity. But on my reading, this clear definition serves to crystallise how hard it is to identify agents, moreso than it shows practically how it can be done.
(NB. I read this paper months ago, so apologies if I've got any of the details wrong.)
The title suggests (weakly perhaps) that the estimates themselves peer-reviewed. Would be clearer to write "building on" peer reviewed argument, or similar.
Some reports are not publicised in order not to speed up timelines. And ELK is a bit rambly - I wonder if it will get subsumed by much better content within 2yr. But I do largely agree.
It would be useful to have a more descriptive title, like "Chinchilla's implications for data bottlenecks" or something.
It's noteworthy that the safety guarantee relies on the "hidden cost" (:= proxy_utility - actual_utility) of each action being bounded above. If it's unbounded, then the theoretical guarantee disappears.
For past work on causal conceptions of corrigibility, you should check out this by Jessica Taylor. Quite similar.
Transformer models (like GPT-3) are generators of human-like text, so they can be modeled as quantilizers. However, any quantiliser guarantees are very weak, because they quantilise with very low q, equal to the likelihood that a human would generate that prompt.
I imagine you could catch useful work with i) models of AI safety, or ii) analysis of failure modes, or something, though I'm obviously biased here.
The implication seems to be that this RFP is for AIS work that is especially focused on DL systems. Is there likely to be a future RFP for AIS research that applies equally well to DL and non-DL systems? Regardless of where my research lands, I imagine a lot of useful and underfunded research fits in the latter category.
Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order:
3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment.
-> Agreed. We're starting to study "side-effect incentives" (improved name pending), which have this property. We're still figuring out whether we should just care about the union of SE incentives and control incentives, or whether SE or when, SE incentives should be considered less dangerous. Whether the causal style of incentive analysis captures much of what we care about, I think will be borne out by applying it and alternatives to a bunch of safety problems.
2) sometimes we need more specific quantities, than just D affects A.
-> Agreed. We've privately discussed directional quantities like "do(D=d) causes A=a" as being more safety-relevant, and are happy to hear other ideas.
1) eliminating all control-incentives seems unrealistic
-> Strongly agree it's infeasibile to remove CIs on all variables. My more modest goal would be to prove that for particular variables (or classes of variables) such as a shut down button, or a human's values, we can either: 1) prove how to remove control (+ side-effect) incentives, or 2) why this is impossible, given realistic assumptions. If (2), then that theoretical case could justify allocation of resources to learning-oriented approaches.
Overall, I concede that we haven't engaged much on safety issues in the last year. Partly, it's that the projects have had to fit within people's PhDs. Which will also be true this year. But having some of the framework stuff behind us, we should still be able to study safety more, and gain a sense of how addressable concerns like these are, and to what extent causal decision problems/games are a really useful ontology for AI safety.