2y70

my current best guess is that gradient descent is going to want to make our models deceptive

Can you quantify your credence in this claim?

Also, how much optimization pressure do you think that we will need to make models not deceptive? More specifically, how would your credence in the above change if we trained with a system that exerted 2x, 4x, ... optimization pressure against deception?

If you don't like these or want a more specific operationalization of this question, I'm happy with whatever you think is likely or filling out more details.

2y1811

I think it really depends on the specific training setup. Some are much more likely than others to lead to deceptive alignment, in my opinion. Here are some numbers off the top of my head, though please don't take these too seriously:

- ~90%: if you keep scaling up RL in complex environments ad infinitum, eventually you get deceptive alignment.
- ~80%: conditional on RL in complex environments being the first path to transformative AI, there will be deceptively aligned RL models.
- ~70%: if you keep scaling up GPT-style language modeling ad infinitum, eventuall

Thanks you for this thoughtful response, I didn't know about most of these projects. I've linked this comment in the DeepMind section, as well as done some modifications for both clarity and including a bit more.

I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn't really one "unified agenda".

This is useful to know.

1y54

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more".

Some corrections for your overall description of the DM alignment team:

- I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
- I would put DM alignment in the "fairly hard"

There is also the ontology identification problem. The two biggest things are: we don't know how to specify exactly what a diamond is because we don't know the true base level ontology of the universe. We also don't know how diamonds will be represented in the AI's model of the world.

I personally don't expect coding a diamond maximizing AGI to be hard, because I think that diamonds is a sufficiently natural concept that doing normal gradient descent will extrapolate in the desired way, without inner alignment failures. If the agent discovers more bas...

42y

(Unsure whether to mark "agree" for the first two paragraphs, or "disagree" for the last line. Leaving this comment instead.)

Thanks for your response! I'm not sure I communicated what I meant well, so let me be a bit more concrete. Suppose our loss is parabolic , where . This is like a 2d parabola (but it's convex hull / volume below a certain threshold is 3D). In 4D space, which is where the graph of this function lives and hence where I believe we are talking about basin volume, this has 0 volume. The hessian is the matrix:

This is conveniently already diagonal, and the 0 eigenvalue comes from the component , which...

42y

The loss is defined over all dimensions of parameter space, so L(x)=x21+x22 is still a function of all 3 x's. You should think of it as L(x)=x21+x22+0x23. It's thickness in the x3 direction is infinite, not zero.
Here's what a zero-determinant Hessian corresponds to:
The basin here is not lower dimensional; it is just infinite in some dimension. The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:
1. Regularization / weight decay provides actual curvature, which should be added in to the loss, and doing this is the same as adding λIn to the Hessian.
2. The scale of the initialization distribution provides a natural scale for how much volume an infinite sweep should count as (very roughly, the volume only matters if it overlaps with the initialization distribution, and the distance of sweep for which this is true is on the order of σ, the standard deviation of the initialization).
So the (λ+kσ2)In is a fairly principled correction, and much better than just "throwing out" the other dimensions. "Throwing out" dimensions is unprincipled, dimensionally incorrect, numerically problematic, and should give worse results.

02y

Note that this is equivalent to replacing "size 1/0" with "size 1". Issues with this become apparent if the scale of your system is much smaller or larger than 1. A better try might be to replace 0 with the average of the other eigenvalues, times a fudge factor. But still quite unprincipled - maybe better is to try to look at higher derivatives first or do nonlocal numerical estimation like described in the post.

2y20

I am a bit confused how you deal with the problem of 0 eigenvalues in the Hessian. It seems like the reason that these 0 eigenvalues exist is because the basin volume is 0 as a subset of parameter space. My understanding right now of your fix is that you are adding along the diagonal to make the matrix full rank (and this quantity is coming from the regularization plus a small quantity). Geometrically, this seems like drawing a narrow ellipse around the subspace of which we are trying to estimate the volume.

But this doesn't seem na...

12y

The hessian is just a multi-dimensional second derivative, basically. So a zero eigenvalue is a direction along which the second derivative is zero (flatter-bottomed than a parabola).
So the problem is that estimating basin size this way will return spurious infinities, not zeros.

Thank you so much for your detailed reply. I'm still thinking this through, but this is awesome. A couple things:

- I don't see the problem at the bottom. I thought we were operating in the setting where Nirvana meant infinite reward? It seems like of course if N is small, we will get weird behavior because the agent will sometimes reason over logically impossible worlds.
- Is Parfit's Hitchiker with a perfect predictor unsalvageable because it violates this fairness criteria?
- The fairness criterion in your comment is the pseudocausality co

12y

So, if you make Nirvana infinite utility, yes, the fairness criterion becomes "if you're mispredicted, you have any probability at all of entering the situation where you're mispredicted" instead of "have a significant probability of entering the situation where you're mispredicted", so a lot more decision-theory problems can be captured if you take Nirvana as infinite utility. But, I talk in another post in this sequence (I think it was "the many faces of infra-beliefs") about why you want to do Nirvana as 1 utility instead of infinite utility.
Parfit's Hitchiker with a perfect predictor is a perfectly fine acausal decision problem, we can still represent it, it just cannot be represented as an infra-POMDP/causal decision problem.
Yes, the fairness criterion is tightly linked to the pseudocausality condition. Basically, the acausal->pseudocausal translation is the part where the accuracy of the translation might break down, and once you've got something in pseudocausal form, translating it to causal form from there by adding in Nirvana won't change the utilities much.

2y0

Half baked confusion:

How does Parfit's Hitchiker fit into the Infra-Bayes formalism? I was hoping that disutility the agent receives from getting stuck in the desert would be easily representable as negative off-branch utility. I am stuck trying to reconcile that with the actual update rule:

Here, I interpret as our utility function. Thus: gives us the expected utility tracked from the offbranch event. The probability and the expectation are just a scale and shift. This update is appli...

12y

So, the flaw in your reasoning is after updating we're in the city, e2 doesn't go "logically impossible, infinite utility". We just go "alright, off-history measure gets converted to 0 utility", a perfectly standard update. So e2 updates to (0,0) (ie, there's 0 probability I'm in this situation in the first place, and my expected utility for not getting into this situation in the first place is 0, because of probably dying in the desert)
As for the proper way to do this analysis, it's a bit finicky. There's something called "acausal form", which is the fully general way of representing decision-theory problems. Basically, you just give an infrakernel Θ:Π→□(A×O)ω that tells you your uncertainty over which history will result, for each of your policies.
So, you'd have
Θ(pay)=(0.99δalive,poor+0.01δdead,0)
Θ(nopay)=(0.99δdead+0.01δalive,rich,0)
Ie, if you pay, 99 percent chance of ending up alive but paying and 1 percent chance of dying in the desert, if you don't pay, 99 percent chance of dying in the desert and 1 percent chance of cheating them, no extra utility juice on either one.
You update on the event "I'm alive". The off-event utility function is like "being dead would suck, 0 utility". So, your infrakernel updates to (leaving off the scale-and-shift factors, which doesn't affect anything)
Θ(pay)=(0.99δalive,poor,0)
Θ(nopay)=(0.01δalive,rich,0)
Because, the probability mass on "die in desert" got burned and turned into utility juice, 0 of it since it's the worst thing. Let's say your utility function assigns 0.5 utility to being alive and rich, and 0.4 utility to being alive and poor. So the utility of the first policy is 0.99⋅0.4=0.396, and the utility of the second policy is 0.01⋅0.5=0.005, so it returns the same answer of paying up. It's basically thinking "if I don't pay, I'm probably not in this situation in the first place, and the utility of "I'm not in this situation in the first place" is also about as low as possible."
BUT
There's a very mathemat

Agree with both aogara and Eli's comment.

For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday and while I could quite quickly summarize the work itself, it was quite hard to me to figure out the motivations.

There's a lot of work that could be relevant for x-risk but is not

motivatedby it. Some of it is more relevant than work thatismotivated by it. An important challenge for this community (to facilitate scaling of research funding, etc.) is to move away from evaluating work based on motivations, and towards evaluating work based on technical content.