Some AI research areas and their relevance to existential safety

I'd believe the claim if I thought that alignment was easy enough that AI products that pass internal product review and don't immediately trigger lawsuits would be aligned enough to not end the world through alignment failure. But I don't think that's the case, unfortunately.

It seems like we'll have to put special effort into both single/single alignment and multi/single "alignment", because the free market might not give it to us.

Some AI research areas and their relevance to existential safety

I'd like more discussion of the claim that alignment research is unhelpful-at-best for existential safety because it accelerates deployment. It seems to me that alignment research has a couple of paths to positive impact which might balance the risk:

  1. Tech companies will be incentivized to deploy AI with slipshod alignment, which might then take actions that no one wants and which pose existential risk. (Concretely, I'm thinking of "out with a whimper" and "out with a bang" scenarios.) But the existence of better alignment techniques might legitimize governance demands, i.e. demands that tech companies don't make products that do things that literally no one wants.

  2. Single/single alignment might be a prerequisite to certain computational social choice solutions. E.g., once we know how to build an agent that "does what [human] wants", we can then build an agent that "helps [human 1] and [human 2] draw up incomplete contracts for mutual benefit subject to the constraints in the [policy] written by [human 3]". And slipshod alignment might not be enough for this application.

Learning the prior

In this case humans are doing the job of transferring from to , and the training algorithm just has to generalize from a representative sample of to the test set.

Book report: Theory of Games and Economic Behavior (von Neumann & Morgenstern)

Thanks for the references! I now know that I'm interested specifically in cooperative game theory, and I see that Shoham & Leyton-Brown has a chapter on "coalitional game theory", so I'll take a look.

Why you should minimax in two-player zero-sum games

A proof of the lemma :

Multi-agent safety

Ah, ok. When you said "obedience" I imagined too little agency — an agent that wouldn't stop to ask clarifying questions. But I think we're on the same page regarding the flavor of the objective.

Multi-agent safety

Might not intent alignment (doing what a human wants it to do, being helpful) be a better target than obedience (doing what a human told it to do)?

Vanessa Kosoy's Shortform

My takeaway from this is that if we're doing policy selection in an environment that contains predictors, instead of applying the counterfactual belief that the predictor is always right, we can assume that we get rewarded if the predictor is wrong, and then take maximin.

How would you handle Agent Simulates Predictor? Is that what TRL is for?
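The maximin idea in this comment can be illustrated with a toy example. This is a minimal sketch, assuming Newcomb's problem as the environment; the payoff numbers, the use of an infinite reward for "predictor is wrong", and all function names are illustrative assumptions, not anything taken from the original shortform.

```python
import math

# Hypothetical action set for a Newcomb-style problem.
ACTIONS = ("one-box", "two-box")


def newcomb_payoff(action, prediction):
    """Ordinary Newcomb payoff (illustrative numbers)."""
    # The opaque box is filled only if the predictor predicted one-boxing.
    opaque_box = 1_000_000 if prediction == "one-box" else 0
    # The transparent box is collected only by a two-boxer.
    transparent_box = 1_000 if action == "two-box" else 0
    return opaque_box + transparent_box


def adjusted_payoff(action, prediction):
    """Assumption: the agent is maximally rewarded whenever the predictor errs."""
    if prediction != action:
        return math.inf
    return newcomb_payoff(action, prediction)


def worst_case(action):
    """Minimum adjusted payoff over all predictions the environment might make."""
    return min(adjusted_payoff(action, p) for p in ACTIONS)


def maximin_policy():
    """Pick the action whose worst-case adjusted payoff is largest."""
    return max(ACTIONS, key=worst_case)


if __name__ == "__main__":
    for a in ACTIONS:
        print(a, worst_case(a))
    print("maximin choice:", maximin_policy())
```

Because a wrong prediction is never the worst case, the minimum for each action is attained exactly when the predictor is right, so maximin ends up comparing the same outcomes as the "predictor is always right" assumption would, without ever asserting that counterfactual.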

An environment for studying counterfactuals

The observation can provide all sorts of information about the universe, including whether exploration occurs. The exact set of possible observations depends on the decision problem.

The observation and the occurrence of exploration can have any relationship, but the most interesting case is when one can infer whether exploration occurs from the observation with certainty.

Beliefs at different timescales

Thanks, I made this change to the post.
