Ryan Carey

Comments

The Case for a Journal of AI Alignment

One alternative would be to try to raise funds (e.g. perhaps from the EA LTF fund) to pay reviewers to perform reviews.

The Case for a Journal of AI Alignment

I don't (and perhaps shouldn't) have a guaranteed trigger - probably I will learn a lot more about what the trigger should be over the next couple of years. But my current picture would be that the following are mostly true:

  • The AIS field is publishing 3-10x as many papers per year as the causal inference field does now.
  • We have ~3 highly aligned tenured professors at top-10 schools, and ~3 mostly-aligned tenured professors with ~10k citations, who want to be editors of the journal.
  • The number of great papers that can't get into other top AI journals is >20 per year. I figure it's currently ~2.
  • The chance that some other group creates a similar (worse) journal for safety in the subsequent 3 years is >20%.

The Case for a Journal of AI Alignment

This idea has been discussed before. It's an important one, though, so I don't think it's a bad thing for us to bring it up again. My perspective, now and previously, is that this would be fairly bad at the moment, but might be good in a couple of years' time.

My background understanding is that the purpose of a conference or journal in this case (and in general) is primarily to certify the quality of some work (and, to a lesser extent, the field of inquiry). This in turn helps grow the AIS field and the careers of AIS researchers.

This is only effective if the conference or journal is sufficiently prestigious. Presently, publishing AI safety papers in NeurIPS, AAAI, JMLR, or JAIR certifies the validity of the work and boosts the field of AI safety, whereas publishing in (for example) Futures or AGI doesn't. If you create a new publication venue, by default its prestige would be comparable to, or less than, that of Futures or AGI, and so it wouldn't really serve the role of a journal.

Currently, the flow of AIS papers into the likes of NeurIPS and AAAI (and probably soon JMLR and JAIR) is rapidly improving. New keywords have been created at several conferences, along the lines of "AI safety and trustworthiness" (I forget the exact wording), so that you can nowadays expect to receive reviewers who average out to neutral, or even vaguely sympathetic, toward AIS research. Ten or so papers were published in such venues in the last year, and all these authors will become reviewers under that keyword when the conference comes around next year. Yes, things like "Logical Inductors" or "AI safety via debate" are very hard to publish. There's some pressure to write research that's more "normie". All of that sucks, but it's an acceptable cost for being in a high-prestige field. And overall, things are getting easier, fairly quickly.

If you create a journal with too little prestige, you can generate blowback. For example, there was some criticism on Twitter of Pearl's "Journal of Causal Inference", even though his field is somewhat more advanced than ours.

In 1.5-3 years' time, I think the risk-benefit calculus will probably change. The growth of AIS work (which has been fast) may outpace the virtuous cycle that's currently happening with AI conferences and journals, such that a lot of great papers are getting rejected. There could be enough tenure-track professors at top schools to make the journal decently high-status (more so than Futures and AGI). We might even be nearing the point where some unilateral actor will go and make a worse journal if we don't make one. I'd say that when a couple of those things are true, that's when we should pull the trigger and make this kind of conference/journal.

Comparing reward learning/reward tampering formalisms

It would be nice to draw out this distinction in more detail. One guess:

  • Uninfluenceability seems similar to requiring zero individual treatment effect of D on R.
  • Unriggability (from the paper) would then correspond to zero average treatment effect of D on R.

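
To make the guessed distinction concrete, here is a minimal toy sketch (my own illustration, not taken from the paper). The two-unit population and the field names `r_d0` / `r_d1` are hypothetical: they stand for the value of R under D=0 and D=1 for each unit. It shows a case where the average treatment effect is zero while every individual treatment effect is nonzero, i.e. the second condition can hold while the first fails.

```python
# Hypothetical toy population: each unit records its potential outcomes,
# the value R would take under D=0 (r_d0) and under D=1 (r_d1).
units = [
    {"r_d0": 0.0, "r_d1": 1.0},  # D raises R for this unit
    {"r_d0": 1.0, "r_d1": 0.0},  # D lowers R for this unit
]


def individual_effects(population):
    """Individual treatment effect of D on R, one value per unit."""
    return [u["r_d1"] - u["r_d0"] for u in population]


def average_effect(population):
    """Average treatment effect of D on R across the population."""
    effects = individual_effects(population)
    return sum(effects) / len(effects)


ite = individual_effects(units)
ate = average_effect(units)

print("individual effects:", ite)
print("average effect:", ate)
```

Under this reading, the zero-individual-effect condition is strictly stronger: it implies zero average effect, but not the other way round.
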
Defining AI wireheading

Seems like the idea is that wireheading denotes specification gaming that is egregious in its focus on the measurement channel. I'm inclined to agree.

IRL in General Environments
Aside from yourself, the other CHAI grad students don't seem to have written up their perspectives of what needs to be done about AI risk. Are they content to just each work on their own version of the problem?

I think this is actually pretty strategically reasonable.

For CHAI students, writing papers has high returns in terms of their probability of attaining a top professorship, which is quite beneficial for later recruiting top talent to work on AI safety, and quite structurally beneficial for the establishment of AI safety as a field of research. The time they might spend writing up their research strategy does not help with this, nor with recruiting help for their line of work (because other nearby researchers face similar pressures, and because academia is not structured to have PhD students lead large teams).

Moreover, if they are pursuing academic success, they face strong incentives to work on particular problems, and so their research strategies may be somewhat distorted by these incentives, decreasing the quality of a research agenda written in that context.

When I look at CHAI research students, I see some pursuing IRL, some pursuing game theory, some pursuing the research areas of their supervisors (all of which could lead to professorships), and some pursuing projects of other research leaders like MIRI or Paul. This seems healthy to me.

Problems with Counterfactual Oracles
Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random.

The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...

Problems with Counterfactual Oracles
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

I don't see why this has to be true, given that we get to choose the AI's value function. Why can't we just make the agent act-based?

My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ϵ) and might stumble upon an escape message/action then.

If the agent is model-based, then you should be able to gather a dataset of (prediction, world_state, accuracy) tuples with random actions (as random actions will practically never make the bad prediction) and a random decision of whether to read the response. And then just ask the agent to maximize the natural direct effect of its prediction, treating world_state as the mediator, and a null prediction as the default action. (This equates to asking what the world would be like if a null action were output. I'll release my current work on direct effects in AI safety soon, and feel free to ask for it in the meantime.) I don't see how this has this particular bad consequence (actually making the bad self-confirming prediction) in either training or deployment...
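
For reference, the objective I have in mind can be written with the standard definition of the natural direct effect. The symbols here are my own shorthand rather than anything from the post: $A$ is the prediction, $a_0$ the null prediction, $M$ the world_state (the mediator), and $Y$ the accuracy.

```latex
\mathrm{NDE}(a) \;=\; \mathbb{E}\big[\,Y(a,\, M(a_0))\,\big] \;-\; \mathbb{E}\big[\,Y(a_0,\, M(a_0))\,\big]
```

Here $M(a_0)$ is the world_state that would arise had the null prediction been output, and $Y(a, M(a_0))$ is the accuracy of prediction $a$ evaluated against that world_state. Choosing $a$ to maximize $\mathrm{NDE}(a)$ gives the prediction no credit for any accuracy it gains by changing the world_state.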

The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.

In particular, shutting down the system is just a way of saying "only maximize reward in the current timestep", i.e. be an act-based agent. This can just be incorporated into the reward function.

Indeed, when reading the predictions of the counterfactual oracle we're not in the counterfactual world (=training distribution) anymore, so the predictions can get arbitrarily wrong (depending on how much the predictions are manipulative and how many people peek at it).

The hope is that, since the agent is not trying to find self-confirming prophecies, the accidental effects of self-confirmation are sufficiently small...

TAISU - Technical AI Safety Unconference

There is now, and it's this thread! I'll also go if a couple of other researchers do ;)
