Ryan Carey

Wiki Contributions


RyanCarey's Shortform

Transformer models (like GPT-3) are generators of human-like text, so they can be modeled as quantilizers. However, any quantiliser guarantees are very weak, because they quantilise with very low q, equal to the likelihood that a human would generate that prompt.

Provide feedback on Open Philanthropy’s AI alignment RFP

I imagine you could catch useful work with i) models of AI safety, or ii) analysis of failure modes, or something, though I'm obviously biased here.

Provide feedback on Open Philanthropy’s AI alignment RFP

The implication seems to be that this RFP is for AIS work that is especially focused on DL systems. Is there likely to be a future RFP for AIS research that applies equally well to DL and non-DL systems? Regardless of where my research lands, I imagine a lot of useful and underfunded research fits in the latter category.

AMA: Paul Christiano, alignment researcher

Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order:

3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment. 

-> Agreed. We're starting to study "side-effect incentives" (improved name pending), which have this property. We're still figuring out whether we should just care about the union of SE incentives and control incentives, or whether SE or when, SE incentives should be considered less dangerous. Whether the causal style of incentive analysis captures much of what we care about, I think will be borne out by applying it and alternatives to a bunch of safety problems.

2) sometimes we need more specific quantities, than just D affects A.

-> Agreed. We've privately discussed directional quantities like "do(D=d) causes A=a" as being more safety-relevant, and are happy to hear other ideas.

1) eliminating all control-incentives seems unrealistic

-> Strongly agree it's infeasibile to remove CIs on all variables. My more modest goal would be to prove that for particular variables (or classes of variables) such as a shut down button, or a human's values, we can either: 1) prove how to remove control (+ side-effect) incentives, or 2) why this is impossible, given realistic assumptions. If (2), then that theoretical case could justify allocation of resources to learning-oriented approaches.

Overall, I concede that we haven't engaged much on safety issues in the last year. Partly, it's that the projects have had to fit within people's PhDs. Which will also be true this year. But having some of the framework stuff behind us, we should still be able to study safety more, and gain a sense of how addressable concerns like these are, and to what extent causal decision problems/games are a really useful ontology for AI safety.

The Case for a Journal of AI Alignment

One alternative would be to try to raise funds (e.g. perhaps from the EA LTF fund) to pay reviewers to perform reviews.

The Case for a Journal of AI Alignment

I don't (and perhaps shouldn't) have a guaranteed trigger - probably I will learn a lot more about what the trigger should be over the next couple years. But my current picture would be that the following are mostly true:

  • The AIS field is publishing 3-10x more papers per year as the causal inference field is now.
  • We have ~3 highly aligned tenured professors at top-10 schools, and ~3 mostly-aligned tenured professors with ~10k citations, who want to be editors of the journal
  • The number of great papers that can't get into other top AI journals is >20 per year. I figure it's currently like ~2.
  • The chance that some other group creates a similar (worse) journal for safety in the subsequent 3 years is >20%
The Case for a Journal of AI Alignment

This idea has been discussed before. Though it's an important one, so I don't think it's a bad thing for us to bring it up again. My perspective now and previously is that this would be fairly bad at the moment, but might be good in a couple of years time.

My background understanding is that the purpose of a conference or journal in this case (and in general) is primarily to certify the quality of some work (and to a lesser extent, the field of inquiry). This in-turn helps with growing the AIS field, and the careers of AIS researchers.

This is only effective if the conference or journal is sufficiently prestigious. Presently, publishing AI safety papers in Neurips, AAAI, JMLR, JAIR serves to certify the validity of the work, and boosts the field of AI safety whereas publishing in (for example) Futures or AGI doesn't. If you create a new publication venue, by default, its prestige would be comparable to, or less than Futures or AGI, and so wouldn't really help to serve the role of a journal.

Currently, the flow of AIS papers into the likes of Neurips and AAAI (and probably soon JMLR, JAIR) is rapidly improving. New keywords have been created there at several conferences, along the lines of "AI safety and trustworthiness" (I forget the exact wording) so that you can nowadays expect, on average, to receive reviewer who average out to neutral, or even vaguely sympathetic to AIS research. Ten or so papers were published in such journals in the last year, and all these authors will become reviewers under that keyword when the conference comes around next year. Yes, things like "Logical Inductors" or "AI safety via debate" are very hard to publish. There's some pressure to write research that's more "normie". All of that sucks, but it's an acceptable cost for being in a high-prestige field. And overall, things are getting easier, fairly quickly.

If you create a too low-prestige journal, you can generate blowback. For example, there was some criticism on Twitter about Pearl's "Journal of Causal Inference", even though his field is somewhat more advanced than hours.

In 1.5-3 years time, I think the risk-benefit calculus will probably change. The growth of AIS work (which has been fast) may outpace the virtuous cycle that's currently happening with AI conferences and journals, such that a lot of great papers are getting rejected. There could be enough tenure-track professors at top schools to make the journal decently high-status (moreso than Futures and AGI). We might even be nearing the point where some unilateral actor will go and make a worse journal if we don't make one. I'd say when a couple of those things are true, that's when we should pull the trigger and make this kind of conference/journal.

Comparing reward learning/reward tampering formalisms

It would be nice to draw out this distinction in more detail. One guess:

  • Uninfluencability seems similar to requiring zero individual treatment effect of D on R.
  • Riggability (from the paper) would then correspond to zero average treatment effect of D on R
Defining AI wireheading

Seems like the idea is that wireheading denotes specification gaming that is egregious in its focus on the measurement channel. I'm inclined to agree..

Load More