Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.

Rohin Shah's Comments

Rohin Shah on reasons for AI optimism
So it seems like, unless we expect the relevant actors to act in accordance with something close to impartial altruism, we should expect them to avoid risks somewhat to avoid existential risks (or extinction specifically), but far less than they really should. (Roughly this argument is made in The Precipice, and I believe by 80k.)

I agree that actors will focus on x-risk far less than they "should" -- that's exactly why I work on AI alignment! This doesn't mean that x-risk is high in an absolute sense, just higher than it "should" be from an altruistic perspective. Presumably from an altruistic perspective x-risk should be very low (certainly below 1%), so my 10% estimate is orders of magnitude higher than what it "should" be.

Also, re: Precipice, it's worth noting that Toby and I don't disagree much -- I estimate 1 in 10 conditioned on no action from longtermists; he estimates 1 in 5 conditioned on AGI being developed this century. Let's say that action from longtermists can halve the risk; then my unconditional estimate would be 1 in 20, and would be very slightly higher if we condition on AGI being developed this century (because we'd have less time to prepare), so overall there's a 4x difference, which given the huge uncertainty is really not very much.

Rohin Shah on reasons for AI optimism

MAD-style strategies happen when:

1. There are two (or more) actors that are in competition with each other

2. There is a technology such that if one actor deploys it and the other actor doesn't, the first actor remains the same and the second actor is "destroyed".

3. If both actors deploy the technology, then both actors are "destroyed".

(I just made these up right now; you could probably get better versions from papers about MAD.)

Condition 2 doesn't hold for accident risk from AI: if any actor deploys an unaligned AI, then both actors are destroyed.

I agree I didn't explain this well in the interview -- when I said

if the destruction happens, that affects you too

I should have said something like

if you deploy a dangerous AI system, that affects you too

which is not true for nuclear weapons (deploying a nuke doesn't affect you in and of itself).

Thinking About Filtered Evidence Is (Very!) Hard
the Bayesian notion of belief doesn't allow us to make the distinction you are pointing to

Sure, that seems reasonable. I guess I saw this as the point of a lot of MIRI's past work, and was expecting this to be about honesty / filtered evidence somehow.

I also think this result has nothing to do with "you can't have a perfect model of Carol". Part of the point of my assumptions is that they are, individually, quite compatible with having a perfect model of Carol amongst the hypotheses.

I think we mean different things by "perfect model". What if I instead say "you can't perfectly update on X and Carol-said-X , because you can't know why Carol said X, because that could in the worst case require you to know everything that Carol will say in the future"?

Thinking About Filtered Evidence Is (Very!) Hard

Yeah, I feel like while honesty is needed to prove the impossibility result, the problem arose with the assumption that the agent could effectively reason now about all the outputs of a recursively enumerable process (regardless of honesty). Like, the way I would phrase this point is "you can't perfectly update on X and Carol-said-X , because you can't have a perfect model of Carol"; this applies whether or not Carol is honest. (See also this comment.)

Thinking About Filtered Evidence Is (Very!) Hard

I still don't get it but probably not worth digging further. My current confusion is that even under the behaviorist interpretation, it seems like just believing condition 2 implies knowing all the things Carol would ever say (or Alice has a mistaken belief). Probably this is a confusion that would go away with enough formalization / math, but it doesn't seem worth doing that.

Deconfusing Human Values Research Agenda v1
Some examples of actions taken by dictators that I think were well intentioned and meant to further goals that seemed laudable and not about power grabbing to the dictator but had net negative outcomes for the people involved and the world:

What's your model for why those actions weren't undone?

To pop back up to the original question -- if you think making your friend 10x more intelligent would be net negative, would you make them 10x dumber? Or perhaps it's only good to make them 2x smarter, but after that more marginal intelligence is bad?

It would be really shocking if we were at the optimal absolute level of intelligence, so I assume that you think we're at the optimal relative level of intelligence, that is, the best situation is when your friends are about as intelligent as you are. In that case, let's suppose that we increase/decrease all of your friends and your intelligence by a factor of X. For what range of X would you expect this intervention is net positive?

(I'm aware that intelligence is not one-dimensional, but I feel like this is still a mostly meaningful question.)

Just to be clear about my own position, a well intentioned superintelligent AI system totally could make mistakes. However, it seems pretty unlikely that they'd be of the existentially-catastrophic kind. Also, the mistake could be net negative, but the AI system overall should be net positive.

An Analytic Perspective on AI Alignment
Do you think I'm wrong?

No, which is why I want to stop using the example.

(The counterfactual I was thinking of was more like "imagine we handed a laptop to 19th-century scientists, can they mechanistically understand it?" But even that isn't a good analogy, it overstates the difficulty.)

Alignment as Translation
Let me know if this analogy sounds representative of the strategies you imagine.

Yeah, it does. I definitely agree that this doesn't get around the chicken-and-egg problem, and so shouldn't be expected to succeed on the first try. It's more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.

the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going reduce that error any further.

I think you get "ground truth data" by trying stuff and seeing whether or not the AI system did what you wanted it to do.

(This does suggest that you wouldn't ever be able to ask your AI system to do something completely novel without having a human along to ensure it's what we actually meant, which seems wrong to me, but I can't articulate why.)

Alignment as Translation

Yeah, this could be a way that things are. My intuition is that it wouldn't be this way, but I don't have any good arguments for it.

An Analytic Perspective on AI Alignment

Yup, that seems like a pretty reasonable estimate to me.

Note that my default model for "what should be the input to estimate difficulty of mechanistic transparency" would be the number of parameters, not number of neurons. If a neuron works over a much larger input (leading to more parameters), wouldn't that make it harder to mechanistically understand?

Load More