Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.


Three mental images from thinking about AGI debate & corrigibility
1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us 'finish out' the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it's trying to do what we want.

Good point on distinguishing these two arguments. It sounds like we agree on 1. I also thought the OP was talking about 1.

For 2, I don't think we can make a dimensionality argument (as in the OP), because we're talking about edits that the AI chooses for itself. You can't apply dimensionality arguments to choices made by intelligent agents (e.g. presumably you wouldn't argue that every glass in my house must be broken because the vast majority of ways of interacting with glasses break them). Or put another way, the structural similarity is just "the AI wouldn't choose to do <bad thing #N>", in all cases because it's intelligent and understands what it's doing.

Now the question of "how right do we need to get the initial definition of corrigibility" is much less obvious. If you told me we got the definition wrong in a million different ways, I would indeed be worried and probably wouldn't expect it to self-correct (depending on the meaning of "different"). But like... really? We get it wrong a million different ways? I don't see why we'd expect that.

Three mental images from thinking about AGI debate & corrigibility

Right, so it's basically goal drift from corrigibility to something else, in this case caused by an incorrect belief that S's preferences about B are not going to change. I think this is a reasonable thing to be worried about but I don't see why it's specific to corrigibility -- for any objective, an incorrect belief can prevent you from successfully pursuing that objective.

Like, even if we trained an AI system on the loss function of "make money", I would still expect it to possibly stop making money if it e.g. decides that it would be more effective at making money if it experienced intrinsic joy in its work, then self-modifies to do that, and then ends up working constantly for no pay.

I'd definitely support the goal of "figure out how to prevent goal drift", but it doesn't seem to me to be a reason to be (differentially) pessimistic about corrigibility.

Three mental images from thinking about AGI debate & corrigibility
I also tried to give specific examples, see the "friends" example in my other comment

Ah, I hadn't seen that. I don't feel convinced, because it assumes that the AI system has a "goal" that isn't "be corrigible". Or perhaps the argument is that the goal moves from "be corrigible" to "care for the operator's friends"? Or maybe that the goal stays as "be corrigible / help the user" but the AI system has a firm unshakeable belief that the user wants her friends to be cared for?

we'll make powerful systems

But... why can't I apply the argument to "powerful", and say that it is extremely unlikely for an AI system to be powerful? Predictive, sure, but powerful?

My model of you responds "powerful is upstream of goal-accomplishing" or "powerful is downstream of goal-directedness which is upstream of goal-accomplishing", but it seems like you could say that for corrigibility too: "corrigibility is upstream of effectively helping the user".

As for "why can we hope to solve it", I can imagine lots of possible solution directions

Thanks, that was convincing (that even under radical uncertainty there are still avenues to pursue).

Three mental images from thinking about AGI debate & corrigibility
There's a huge structural similarity between the proof that '1 + 1 != 3' and '1 + 1 != 4'; like, both are generic instances of the class '1 + 1 != n \forall n != 2'. We can increase the number of numbers without decreasing the plausibility of this claim (like, consider it in Z/4, then Z/8, then Z/16, then...).

I feel like that's exactly my point? Showing that something is a conjunction of a bunch of claims should not always make you think that claim is low probability, because there could be structural similarity between those claims such that a single argument is enough to argue for all of them.

(The claims "If X drifts away from corrigibility along dimension {N}, it will get pulled back" are clearly structurally similar, and the broad basin of corrigibility argument is meant to be an argument that argues for all of them.)

Similarly, if we make an argument that something is an attractor in N-dimensional space, that does actually grow less plausible the more dimensions there are, since there are more ways for the thing to have a derivative that points away from the 'attractor,' if we think the dimensions aren't all symmetric.

1. Why aren't the dimensions symmetric?

2. I somewhat buy the differential argument (more dimensions => less plausible) but not the absolute argument (therefore not plausible); this post is arguing for the absolute version:

it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all

3. I'm not sure where the idea of a "derivative" is coming from -- I thought we were talking about small random edits to the weights of a neural network. If we're training the network on some objective that doesn't incentivize corrigibility then certainly it won't stay corrigible.
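The crux between the differential reading (more dimensions => less plausible) and the structural-similarity counterargument can be made concrete with a small Monte Carlo sketch. The probability p and the dimension counts here are illustrative assumptions, not numbers from the thread: if each dimension of corrigibility independently needs its own argument to hold, the chance that all N hold decays exponentially in N; if one shared argument covers every dimension at once, it doesn't decay at all.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 100_000
p = 0.95  # illustrative: chance that a single "pull back toward corrigibility" argument holds

for n in (1, 5, 50):
    # Independent dimensions: each dimension needs its own argument to hold.
    indep = rng.random((trials, n)) < p
    # Structurally similar dimensions: one shared argument decides all of them at once.
    shared = rng.random(trials) < p
    # Fraction of trials where *every* dimension pulls back, under each model.
    print(n, indep.all(axis=1).mean(), shared.mean())
```

Under independence the all-dimensions probability falls from p toward zero as n grows, while under the shared-argument model it stays at p regardless of n, which is the sense in which a single argument can cover a conjunction of a million structurally similar claims.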

Three mental images from thinking about AGI debate & corrigibility

I disagree with this position but it does seem consistent. I don't really know what to say other than "this is a conjunction of a million things" type arguments are not automatically persuasive, e.g. I could argue against "1 + 1 = 2" by saying that it's an infinite conjunction of "1 + 1 != 3" AND "1 + 1 != 4" AND ... and so it can't possibly be true.

I'm curious why you think AI risk is worth working on given this extreme cluelessness (both "why is there any risk" and "why can we hope to solve it").

"Go west, young man!" - Preferences in (imperfect) maps

Planned summary for the Alignment Newsletter:

This post argues that by default, human preferences are strong views built upon poorly defined concepts that may not have any coherent extrapolation to new situations. To put it another way, humans build mental maps of the world, and their preferences are defined on those maps, so in new situations where the map no longer reflects the world accurately, it is unclear how preferences should be extended. As a result, anyone interested in preference learning should find some incoherent moral intuition that other people hold, and figure out how to make it coherent, as practice for the case we will face where our own values will be incoherent in the face of new situations.

Planned opinion:

This seems right to me -- we can also see this by looking at the various paradoxes found in the philosophy of ethics, which involve taking everyday moral intuitions and finding extreme situations in which they conflict, and it is unclear which moral intuition should “win”.
Inner Alignment: Explain like I'm 12 Edition

Planned summary for the Alignment Newsletter:

This post summarizes and makes accessible the paper <@Risks from Learned Optimization in Advanced Machine Learning Systems@>.
Three mental images from thinking about AGI debate & corrigibility

Fair enough for better predictive algorithm, and plausibly we can say intelligence correlates strongly enough with better prediction, but why can't I apply your argument to "riskiness", or "incorrigibility", or "goal-directed"?

Learning the prior and generalization
I think it's interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it's an argument that we shouldn't be that worried about whether the training data is i.i.d. relative to the validation/deployment data.

Fair enough. In practice you still want training to also be from the same distribution because that's what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)
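The capabilities/safety split can be illustrated with a minimal numpy sketch (the distributions and the misspecified model are my own illustrative choices, not from the post): train on one distribution, then draw validation and deployment data i.i.d. from a different, shifted distribution. Validation error then tracks deployment error well (the "safety" property), even though both are poor because training was off-distribution (the "capabilities" property).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(rng, n, x_lo, x_hi):
    # True relationship is quadratic; we will fit a misspecified linear model.
    x = rng.uniform(x_lo, x_hi, n)
    y = x**2 + rng.normal(0, 0.1, n)
    return x, y

# Training distribution: x in [0, 1].
x_tr, y_tr = make_data(rng, 1000, 0.0, 1.0)
# Validation and deployment are i.i.d. from a *shifted* distribution: x in [1, 2].
x_val, y_val = make_data(rng, 1000, 1.0, 2.0)
x_dep, y_dep = make_data(rng, 1000, 1.0, 2.0)

# Fit a linear model on the training data only.
coefs = np.polyfit(x_tr, y_tr, 1)
mse = lambda x, y: float(np.mean((np.polyval(coefs, x) - y) ** 2))

mse_tr, mse_val, mse_dep = mse(x_tr, y_tr), mse(x_val, y_val), mse(x_dep, y_dep)

# Validation error closely predicts deployment error (val/deploy i.i.d. -> "safety"),
# but both are much worse than training error (train/val not i.i.d. -> low "capabilities").
print(mse_tr, mse_val, mse_dep)
```

Here validation honestly forecasts what happens at deployment, which is the safety-relevant guarantee; making that forecast *good* additionally requires the training distribution to match, which is the capabilities side.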

That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that's just as good as actually checking against the ground truth.

This seems to rely on an assumption that "human is convinced of X" implies "X"? Which might be fine, but I'm surprised you want to rely on it.

I'm curious what an algorithm might be that leverages this relaxation.

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

All of these seem like good reasons to be optimistic, though it was a bit hard for me to update on it given that these were already part of my model. (EDIT: Actually, not the younger researchers part. That was a new-to-me consideration.)
