Existential AI Safety is NOT separate from near-term applications

Non X-risks from AI are still intrinsically important AI safety issues.

I want to push back on this - I think it's true as stated, but that emphasising it can be misleading.

Concretely, I think that there can be important near-term, non-X-risk AI problems that meet the priority bar to work on. But the standard EA mindset of importance, tractability and neglectedness still applies. And I think often near-term problems are salient and politically charged, in a way that makes these harder to evaluate.

I think work on these is most justified for problems with products that are very widely used and where there is little corporate incentive to fix the issues (recommender system alignment is the most obvious example here).

I broadly agree with and appreciate the rest of this post though! And want to distinguish between "this is not a cause area that I think EAs should push on on the margin" and "this cause area does not matter" - I think work to make systems less deceptive, racist, and otherwise harmful seems pretty great.

[-]scasper3y23

No disagreements substance-wise. But I'd add that I think work to avoid scary autonomous weapons is likely at least as important as recommender systems. If this post's reason #1 were the only reason for working on neartermist AI stuff, then it would probably be like a lot of other very worthy but likely not top-tier impactful issues. But I see it as emphasis-worthy icing on the cake given #2 and #3.

[-]Neel Nanda3y10

Cool, agreed. Maybe my main objection is just that I'd have put it last, not first, but this is a nit-pick.

[-]paulfchristiano3y107

I think that some near-future applications of AI alignment are plausible altruistic top priorities. Moreover, even when people disagree with me about prioritization, I think that people who want to use AI to accomplish contemporary objectives are important users. It's good to help them, understand the difficulties they encounter, and so on, both to learn from their experiences and make friends.

So overall I think I agree with the most important claims in this post.

Despite that, I think it's important for me personally (and for ARC) to be clear about what I care about---both because having clear research goals is important for doing good research, and because there's a significant risk of inadvertently misleading people about my priorities and views.

As an example, I've recently been writing about mechanistic anomaly detection. But if people think that I mean the kind of anomaly detection that is most helpful for avoiding self-driving car crashes, I think that would lead to them evaluating my research inappropriately. Many of my technical decisions seem very weird on this perspective, and if I tried to satisfy a reviewer with that perspective I think it would push me in a very unhealthy direction. On the other side, this equivocation might lead some people to be more sympathetic to or excited about my work for reasons I regard as kind of dishonest.

This example is related to your post, but not really to my OP (since in my OP I'm mostly talking about contemporary applications of alignment, which are more closely technically connected to my day-to-day work). I have similar views about the other examples you give in the "valuable lessons" section:

I think that inferring subtleties of human values isn't a significant part of my alignment research. Almost all the action I care about concerns very simple aspects of our preferences. So if someone evaluates my research through that lens, they are likely to be confused, and if they find it exciting it might be for the wrong reasons.
I think that regulations appropriate for managing catastrophic AI risk are extremely different from those appropriate for autonomous weapons, or for managing harms from discrimination. Maybe some repurposing is possible, but at this point I am personally much more interested in directly contributing to the kind of governance that is more directly applicable.

This isn't to say that I don't think such work is meaningful or interesting, or that I don't want to be friends with people who do it. Just that it's not what I'm doing, and it's worth being clear about that so that my work is being evaluated in an appropriate way.

The other extreme would be deliberately muddying the waters about distinctions between different technical problems, and while I don't think anyone is necessarily advocating for that I do think it would be very unwise.

[-]scasper3y20

Hi Paul, thanks. Nice reading this reply. I like your points here.

Some of what I say here might reveal a lack of keeping up well with ARC. But as someone who works primarily on interpretability, the thought of mechanistic anomaly detection techniques that are not useful for use in today's vision or language models seems surprising to me. Is there anything you can point me to to help me understand why an interpretability/anomaly detection tool that's useful for ASI or something might not be useful for cars?

[-]paulfchristiano3y52

We are mostly thinking about interpretability and anomaly detection designed to resolve two problems (see here):

Maybe the AI thinks about the world in a wildly different way than humans and translates into human concepts by asking "what would a human say?" instead of "what is actually true?" This leads to bad generalization when we consider cases where the AI system plans to achieve a goal and has the option of permanently fooling humans. But that problem is very unlikely to be serious for self-driving cars, because we can acquire ground truth data for the relevant queries. On top of that it just doesn't seem they will think about physical reality in such an alien way.
Maybe an AI system explicitly understands that it is being evaluated, and then behaves differently if it later comes to believe that it is free to act arbitrarily in the world.

We do hope that these are just special cases and that our methods will resolve a broader set of problems, but these two special cases loom really large. On the other hand, it's much less clear whether realistic failures for self-driving cars will involve the kind of mechanism distinctions we are trying to detect.

Also: there are just way more pressing problems for self-driving cars. And on top of all that, we are taking a very theoretical approach precisely because we are worried it may be difficult to study these problems until a future time very close to when they become catastrophic.

Overall I think that if someone looked at our research agenda and viewed it as an attempt to respond to reliability failures in existing models, the correct reaction should be more like "Why are they doing all of this instead of just empirically investigating which failures are most important for self-driving cars and then thinking about how to address those?" There's still a case for doing more fundamental theoretical research even if you are interested in more prosaic reliability failures, but (i) it's qualitatively much worse and I don't really believe it, (ii) this isn't what such research should look like. So I think it's pretty bad if someone is evaluating us from that perspective.

(In contrast I think this is a more plausible framing for e.g. work on adversarial evaluation and training. It might still lead an evaluator astray, but at least it's a very plausible research direction to focus on for prosaic reliability as well as being something we might want to apply to future systems.)

[-]scasper3y10

Thanks!

We do hope that these are just special cases and that our methods will resolve a broader set of problems.

I hope so too. And I would expect this to be the case for good solutions.

Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug -- some set of environments or inputs that will make it do bad things. So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.

So I'm inclined to underline the key point of my original post. I want to emphasize the value of (1) engaging more with the rest of the community that doesn't identify themselves as "AI Safety" researchers and (2) being clear that we care about alignment for all of the right reasons. Admittedly, this should be discussed with the appropriate amount of clarity, which was your original point.

[-]RobertKirk3y10

So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.

One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without) doesn't seem like it would be useful for improving the reliability of self-driving cars, as likely self-driving cars aren't misaligned in the sense that they could drive perfectly safely but choose not to, but rather are just unable to drive perfectly safely because some of their internal (learned) systems aren't sufficiently robust.

[-]scasper3y10

Thanks for the comment. I'm inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing. And one could use a similar approach for significantly different applications.

[-]RobertKirk3y10

This feels kind of like a semantic disagreement to me. To ground it, it's probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I'm uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.

[-]scasper3y20

I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I'm all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed.

[-]RobertKirk3y10

Not Paul, but some possibilities why ARC's work wouldn't be relevant for self-driving cars:

The stuff Paul said about them aiming at understanding quite simple human values (don't kill us all, maintain our decision-making power) rather than subtle things. It's likely for self-driving cars we're more concerned with high reliability and hence would need to be quite specific. E.g., maybe ARC's approach could discern whether a car understands whether it's driving on the road or not (seems like a fairly simple concept), but not whether it's driving in a riskier way than humans in specific scenarios.
One of the problems that I think ARC is worried about is ontology identification, which seems like a meaningfully different problem for sub-human systems (whose ontologies are worse than ours, so in theory could be injected into ours) than for human-level or super-human systems (where that may not hold). Hence focusing on the super-human case would look weird and possibly not helpful for the subhuman case, although it would be great if they could solve all the cases in full generality.
Maybe once it works ARC's approach could inform empirical work which helps with self-driving cars, but if you were focused on actually doing the thing for cars you'd just aim directly at that, whereas ARC's approach would be a very roundabout and needlessly complex and theoretical way of solving the problem (this may or may not actually be the case, maybe solving this for self-driving cars is actually fundamentally difficult in the same way as for ASI, but it seems less likely).

Non X-risks from AI are still intrinsically important AI safety issues.

There are valuable lessons to learn from near-term applications.

Making allies and growing the AI safety field is useful