Ryan Carey

Wiki Contributions


The title suggests (weakly, perhaps) that the estimates themselves were peer-reviewed. It would be clearer to write "building on" a peer-reviewed argument, or similar.

Some reports are not publicised in order not to speed up timelines. And ELK is a bit rambly - I wonder if it will get subsumed by much better content within 2yr. But I do largely agree.

It would be useful to have a more descriptive title, like "Chinchilla's implications for data bottlenecks" or something.

It's noteworthy that the safety guarantee relies on the "hidden cost" (:= proxy_utility - actual_utility) of each action being bounded above. If it's unbounded, then the theoretical guarantee disappears.
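To make the role of that bound concrete, here is a toy sketch (the distribution, the cost numbers, and the function name are all hypothetical, chosen by me for illustration): a q-quantilizer can place at most 1/q times the base probability on any action, so its expected hidden cost is at most the base distribution's expected hidden cost divided by q. That guarantee only means something if each action's hidden cost is finite; a single action with unbounded hidden cost blows up the expectation.

```python
def worst_case_expected_cost(base, cost, q):
    """Greedy adversary against a q-quantilizer: pile probability mass
    (capped at base[x] / q per action) on the costliest actions first.
    The result attains the (E_base[cost] / q) upper bound."""
    remaining, total = 1.0, 0.0
    for x in sorted(base, key=cost.get, reverse=True):
        p = min(base[x] / q, remaining)  # per-action probability cap
        total += p * cost[x]
        remaining -= p
        if remaining <= 0:
            break
    return total

# Hypothetical numbers, purely for illustration:
base = {"a": 0.5, "b": 0.3, "c": 0.2}          # base action distribution
cost = {"a": 0.0, "b": 1.0, "c": 4.0}          # hidden cost per action
q = 0.25

base_expected = sum(base[x] * cost[x] for x in base)   # 1.1
worst = worst_case_expected_cost(base, cost, q)        # 3.4 <= 1.1 / 0.25
```

If cost["c"] were infinite (an unbounded hidden cost), both the worst case and the bound become infinite, which is exactly the failure mode the comment points at.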

For past work on causal conceptions of corrigibility, you should check out this by Jessica Taylor. It's quite similar.

Transformer models (like GPT-3) are generators of human-like text, so they can be modeled as quantilizers. However, any quantilizer guarantees are very weak, because they quantilize with very low q, equal to the likelihood that a human would generate that prompt.
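For readers unfamiliar with the construction: a q-quantilizer samples from the top q fraction of a base distribution, ranked by a proxy utility. A minimal sketch (my own naming and parameters, not from any particular paper or codebase), which makes it visible why a tiny q gives a tiny effective action set:

```python
import random

def quantilize(base_sample, proxy_utility, q, n=10_000):
    """Sample an action uniformly from the top-q fraction of n draws
    from a base distribution, ranked by a proxy utility."""
    actions = [base_sample() for _ in range(n)]
    actions.sort(key=proxy_utility, reverse=True)
    top = actions[:max(1, int(q * n))]  # keep the top q fraction
    return random.choice(top)

# Toy usage: base distribution uniform over 0..99, proxy utility = value.
action = quantilize(lambda: random.randint(0, 99), lambda x: x, q=0.05)
```

With q on the order of the probability that a human would produce a given text, the "top fraction" is essentially the whole base distribution, so the guarantee adds almost nothing beyond "samples like the base model".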

I imagine you could catch useful work with i) models of AI safety, or ii) analysis of failure modes, or something, though I'm obviously biased here.

The implication seems to be that this RFP is for AIS work that is especially focused on DL systems. Is there likely to be a future RFP for AIS research that applies equally well to DL and non-DL systems? Regardless of where my research lands, I imagine a lot of useful and underfunded research fits in the latter category.

Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order:

3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment. 

-> Agreed. We're starting to study "side-effect incentives" (improved name pending), which have this property. We're still figuring out whether we should just care about the union of side-effect incentives and control incentives, or whether, or when, side-effect incentives should be considered less dangerous. Whether the causal style of incentive analysis captures much of what we care about will, I think, be borne out by applying it and alternatives to a bunch of safety problems.

2) Sometimes we need more specific quantities than just "D affects A".

-> Agreed. We've privately discussed directional quantities like "do(D=d) causes A=a" as being more safety-relevant, and are happy to hear other ideas.

1) Eliminating all control incentives seems unrealistic.

-> Strongly agree it's infeasible to remove CIs on all variables. My more modest goal would be, for particular variables (or classes of variables) such as a shutdown button or a human's values, to either: 1) prove how to remove control (+ side-effect) incentives, or 2) prove why this is impossible, given realistic assumptions. If (2), then that theoretical case could justify allocating resources to learning-oriented approaches.

Overall, I concede that we haven't engaged much with safety issues in the last year. Partly that's because the projects have had to fit within people's PhDs, which will also be true this year. But with some of the framework work behind us, we should be able to study safety more, and get a sense of how addressable concerns like these are, and of the extent to which causal decision problems/games are a really useful ontology for AI safety.

One alternative would be to try to raise funds (e.g. perhaps from the EA LTF fund) to pay reviewers to perform reviews.
