# All of Vivek Hebbar's Comments + Replies

As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If food nearby and hunger>15, then be more likely to go to food") and then also a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.[10]

This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused.  I now think you mean "My take on Vivek's is that value s...

2Alex Turner7d
Reworded, thanks.

Makes perfect sense, thanks!

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable?  Should it be "part" instead?

2Ramana Kumar9d
Partitions (of some underlying set) can be thought of as variables like this:

• The number of values the variable can take on is the number of parts in the partition.
• Every element of the underlying set has some value for the variable, namely, the part that that element is in.

Another way of looking at it: say we're thinking of a variable $v : S \to D$ as a function from the underlying set $S$ to $v$'s domain $D$. Then we can equivalently think of $v$ as the partition $\{\{s \in S \mid v(s) = d\} \mid d \in D\} \setminus \{\emptyset\}$ of $S$ with (up to) $|D|$ parts.

In what you quoted, we construct the underlying set by taking all possible combinations of values for the "original" variables. Then we take all partitions of that to produce all "possible" variables on that set, which will include the original ones and many more.
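The partition view described above can be made concrete with a small sketch. It assumes two binary "original" variables X and Y (a made-up example, not one from the thread); the helper names `variable_to_partition` and `set_partitions` are hypothetical.

```python
from itertools import product


def variable_to_partition(underlying_set, v):
    """Group elements of the underlying set by the value v assigns them."""
    parts = {}
    for s in underlying_set:
        parts.setdefault(v(s), set()).add(s)
    # Values of D that no element takes never create a part, so there is
    # no empty part to remove afterwards.
    return {frozenset(part) for part in parts.values()}


def set_partitions(items):
    """Yield every partition of a list, each as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition


# Underlying set: all combinations of values of two binary variables X and Y.
S = list(product([0, 1], repeat=2))

x_partition = variable_to_partition(S, lambda s: s[0])
all_variables = [{frozenset(block) for block in p} for p in set_partitions(S)]

print(len(all_variables))            # 15 partitions of a 4-element set
print(x_partition in all_variables)  # True: the original X is among them
```

The 15 partitions include X, Y, and derived variables such as "X xor Y", which is the sense in which the construction yields "the original ones and many more".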

ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation.

Update: I started reading your paper "Corrigibility with Utility Preservation".[1]  My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6.  AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,[2] this is a mathematica...

2Koen Holtman9d
Corrigibility with Utility Preservation is not the paper I would recommend you read first; see my comments included in the list I just posted. To comment on your quick thoughts:

• My later papers spell out the ML analog of the solution in 'Corrigibility with' more clearly.
• On your question of "Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?": Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky's arguments for pessimism.
• On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some opt

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here?  No need to write anything, just links.

4Koen Holtman9d
1Vivek Hebbar10d
1. Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

• The agent's allegiance is to some idealized utility function $U_{\text{ideal}}$ (like CEV). The agent's internal evaluator Eval is "trying" to approximate $U_{\text{ideal}}$ by reasoning heuristically. So now we ask Eval to evaluate the plan "do argmax w.r.t
...
3Alex Turner8d
Vivek -- I replied to your comment in appendix C of today's follow-up post, Alignment allows imperfect decision-influences and doesn't require robust grading [https://www.lesswrong.com/posts/rauMEna2ddf26BqiE/alignment-allows-imperfect-decision-influences-and-doesn-t].
The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:

• Do you think that you are internally trying to approximate your own $U_{\text{ideal}}$?
• Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues, for example)?
• Can you think of concrete instances where you improved your own Eval?
• Can you think of concrete instances where you thought you improved your own Eval but then regretted it later?
• Do you think that your own changes to your Eval have been moving in the direction of your $U_{\text{ideal}}$?
8Wei Dai9d

Yeah, the right column should obviously be all 20s.  There must be a bug in my code[1] :/

I like to think of the argmax function as something that takes in a distribution on probability distributions on $W$ with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis :

If I add this into  with weight , then the middle column is still near...

Now, let's consider the following modification: Each hypothesis is no longer a distribution on , but instead a distribution on some coarser partition of . Now  is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what d...

2Scott Garrabrant11d
I think your numbers are wrong, and the right column on the output should say 20% 20% 20%.

The output actually agrees with each of the components on every event in that component's sigma algebra. The input distributions don't actually have any conflicting beliefs, and so of course the output chooses a distribution that doesn't disagree with either.

I agree that the 0s are a bit unfortunate.

I think the best way to think of the type of the object you get out is not a probability distribution on $W$, but what I am calling a partial probability distribution on $W$. A partial probability distribution is a partial function from $2^W \to [0,1]$ that can be completed to a full probability distribution on $W$ (with some sigma algebra that is a superset of the domain of the partial probability distribution).

I like to think of the argmax function as something that takes in a distribution on probability distributions on $W$ with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

One nice thing about this definition is that it makes it so the argmax always takes on a unique value. (Proof omitted.)

This doesn't really make it that much better, but the point here is that this framework admits that it doesn't really make much sense to ask about the probability of the middle column. You can ask about any of the events in the original pair of sigma algebras, and indeed, the two inputs don't disagree with the output at all on any of these sets.
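To make the notion of a partial probability distribution concrete, here is a minimal sketch. It does not implement the argmax itself. It assumes a simplified world with one outcome per column (the rows of the original grid are ignored) and uses only the two figures mentioned in the thread: hypothesis A puts 60% on the rightmost column and hypothesis B puts 40% on the leftmost. All function names are hypothetical.

```python
import math

W = ("left", "middle", "right")  # simplified world: one outcome per column


def component(events_to_probs):
    """A distribution stated only on the events of its own coarse sigma algebra."""
    return {frozenset(event): p for event, p in events_to_probs.items()}


# Hypothesis A only distinguishes "right" from "not right";
# hypothesis B only distinguishes "left" from "not left".
hypothesis_a = component({("right",): 0.6, ("left", "middle"): 0.4})
hypothesis_b = component({("left",): 0.4, ("middle", "right"): 0.6})


def partial_distribution(*components):
    """Collect the positive-probability events of every component.

    Assumes the components never assign different values to the same event.
    """
    partial = {}
    for comp in components:
        for event, p in comp.items():
            if p > 0:
                partial[event] = p
    return partial


def completes(full, partial):
    """True if `full` is a distribution on W that agrees with `partial` on its domain."""
    if not math.isclose(sum(full[w] for w in W), 1.0):
        return False
    return all(
        math.isclose(sum(full[w] for w in event), p)
        for event, p in partial.items()
    )


partial = partial_distribution(hypothesis_a, hypothesis_b)
candidate = {"left": 0.4, "middle": 0.0, "right": 0.6}

print(completes(candidate, partial))     # True: agrees with both inputs
print(frozenset({"middle"}) in partial)  # False: middle column is outside the domain
```

In this toy setup the event "middle" is never in the partial distribution's domain, yet any completion whose sigma algebra contains both input algebras is forced to give it probability zero. That is one way to read the "unfortunate 0s".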

A framing I wrote up for a debate about "alignment tax":

1. "Alignment isn't solved" regimes:
   1. Nobody knows how to make an AI which is {safe, general, and broadly superhuman}, with any non-astronomical amount of compute.
   2. We know how to make an aligned AGI with 2 to 25 OOMs more compute than making an unaligned one.
2. "Alignment tax" regimes:
   1. We can make an aligned AGI, but it requires a compute overhead in the range 1% - 100x. Furthermore, the situation remains multipolar and competitive for a while.
   2. The alignment tax is <0.001%, so it's not a concern.
...

If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.

What kind of regularization could this be?  And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?

• $\Pr_{\pi \in \xi}\left[U(\lceil G \rceil, \pi) \ge U(\lceil G \rceil, G^*)\right]$ is the probability that, for a random policy $\pi \in \xi$, that policy has worse utility than the policy $G^*$ its program dictates; in essence, how good $G$'s policies are compared to random policy selection

What prior over policies?

given $g(G|U)$, we can infer the probability that an agent $G$ has a given utility function $U$, as $\Pr[U] \propto 2^{-K(U)} / \Pr_{\pi \in \xi}\left[U(\lceil G \rceil, \pi) \ge U(\lceil G \rceil, G^*)\right]$, where $\propto$ means "is proportional to" and $K(U)$ is the Kolmogorov complexity of utility function $U$.

Suppose the prior over policies is max-entropy (uniform over all action seq...
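As a rough illustration of how the quoted formula behaves under a uniform prior over action sequences, here is a toy sketch. Everything specific in it is an assumption made for illustration: the horizon, the three candidate utility functions, and the bit counts standing in for the uncomputable $K(U)$. It also drops the embedding argument $\lceil G \rceil$ and follows the inequality exactly as written ($\ge$).

```python
from itertools import product

HORIZON = 4
# Uniform (max-entropy) prior: every binary action sequence is equally likely.
POLICIES = list(product([0, 1], repeat=HORIZON))

# Hypothetical candidate utility functions, each paired with a made-up
# description length in bits that stands in for K(U).
CANDIDATES = {
    "count_ones": (lambda policy: sum(policy), 3),
    "count_zeros": (lambda policy: HORIZON - sum(policy), 3),
    "constant": (lambda policy: 0, 1),
}


def fraction_at_least_as_good(utility, agent_policy):
    """Pr over uniformly random policies that U(policy) >= U(agent's policy)."""
    target = utility(agent_policy)
    hits = sum(1 for policy in POLICIES if utility(policy) >= target)
    return hits / len(POLICIES)


def posterior_over_utilities(agent_policy):
    """Normalise 2^-K(U) / Pr[U(pi) >= U(G*)] over the candidate set."""
    scores = {
        name: 2.0 ** -bits / fraction_at_least_as_good(utility, agent_policy)
        for name, (utility, bits) in CANDIDATES.items()
    }
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


posterior = posterior_over_utilities(agent_policy=(1, 1, 1, 1))
for name, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {prob:.3f}")
```

With the agent always choosing action 1, only 1 of 16 random policies does as well under "count_ones", so that candidate's score is boosted by a factor of 16 and it outweighs the simpler constant utility.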

I have seen one person be surprised (I think twice in the same convo) about what progress had been made.

ETA: Our observations are compatible.  It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.

All in all, I don't think my original post held up well.  I guess I was excited to pump out the concept quickly, before the dust settled.  Maybe this was a mistake?  Usually I make the ~opposite error of never getting around to posting things.

The perspective and the computations that are presented here (which in my opinion are representative of the mathematical parts of the linked posts and of various other unnamed posts) do not use any significant facts about neural networks or their architecture.

You're correct that the written portion of the Information Loss --> Basin flatness post doesn't use any non-trivial facts about NNs.  The purpose of the written portion was to explain some mathematical groundwork, which is then used for the non-trivial claim.  (I did not know at the time ...

Note that, for rational *altruists* (with nothing vastly better to do like alignment), voting can be huge on CDT grounds -- if you actually do the math for a swing state, the leverage per voter is really high.  In fact, I think the logically counterfactual impact-per-voter tends to be lower than the impact calculated by CDT, if the election is very close.

Exciting work!  A minor question about the paper:

Does this mean that it writes a projection of S1's positional embedding to S2's residual stream?  Or is it meant to say "writing to the position [residual stream] of [S2]"?  Or something else?

falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix

What is your list of problems by urgency, btw?  Would be curious to know.

From this paper, "Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)"

So for overparameterized nets, the answer is probably:

• There is only one solution manifold, so there are no separate basins.  Every solution is connected.
• We can salvage the idea of "basin volume" as follows:
  • In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
  • In the dimensions parallel to the manifol
...

The loss is defined over all dimensions of parameter space, so   is still a function of all 3 x's.  You should think of it as .  Its thickness in the  direction is infinite, not zero.

Here's what a zero-determinant Hessian corresponds to:

The basin here is not lower dimensional; it is just infinite in some dimension.  The simplest way to fix this is to replace the infinity with some large value.  Luckily, there is a fairly principled way to do this:

1. Regularization / weight decay prov
...
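The comment above is cut off before its list finishes, so the following is only a sketch of one possible reading, not the author's method. It assumes a quadratic approximation around a minimum, where the region with loss at most `threshold` above the minimum is an ellipsoid with semi-axes $\sqrt{2 \cdot \text{threshold} / \lambda_i}$. It also assumes that an L2 penalty with coefficient `weight_decay` simply adds that coefficient to every Hessian eigenvalue. The function name and parameters are hypothetical.

```python
import math


def log_basin_volume(eigenvalues, threshold=1.0, weight_decay=0.0):
    """Log-volume of the quadratic-approximation basin.

    The result omits an additive constant that depends only on the dimension.
    """
    total = 0.0
    for lam in eigenvalues:
        curvature = lam + weight_decay  # assumed effect of an L2 penalty
        if curvature <= 0.0:
            # A flat (or downhill) direction: the basin is unbounded there.
            return math.inf
        total += 0.5 * math.log(2.0 * threshold / curvature)
    return total


# A zero eigenvalue (zero-determinant Hessian) gives an infinite basin ...
print(log_basin_volume([2.0, 0.0]))
# ... while a small weight-decay term replaces the infinity with a large value.
print(log_basin_volume([2.0, 0.0], weight_decay=0.01))
```

Under these assumptions a flatter direction still counts for more volume than a sharp one, but no single direction can make the total infinite.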

I will split this into a math reply, and a reply about the big picture / info loss interpretation.

Thanks for fleshing out the calculus rigorously; admittedly, I had not done this.  Rather, I simply assumed MSE loss and proceeded largely through visual intuition.

I agree that assuming MSE, and looking at a local minimum, you have

This is still false!  Edit: I am now confused, I don't know if it is false or not.

You are conflating  and .  Adding disa...

1Spencer Becker-Kahn6mo
Thanks again for the reply. In my notation, something like $\nabla l$ or $Jf$ are functions in and of themselves. The function $\nabla l$ evaluates to zero at local minima of $l$. In my notation, there isn't any such thing as $\nabla_f l$.

But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you.

I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

Ah nice, didn't know what it was called / what field it's from.  I should clarify that "key result" here just meant "key result of the math so far -- pay attention", not "key result of the whole post" or "profound/original".

The Jacobian matrix is what you call

...
2Spencer Becker-Kahn6mo
Thanks for the substantive reply.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out. The loss of a model with parameters $\theta \in \Theta$ can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function $l : O^k \to \mathbb{R}$ we have:

$$L : \Theta \xrightarrow{\;f\;} O^k \xrightarrow{\;l\;} \mathbb{R}$$

i.e. it's $l$ that might be something like MSE, but the function 'L' is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like

$$L(\theta) = l(f(\theta))$$

We want to differentiate this twice with respect to $\theta$ essentially. Firstly, we have

$$\nabla L(\theta) = \nabla l(f(\theta))\, Jf(\theta)$$

where - just to keep track of this - we've got:

$$(1 \times N)\ \text{vector} = [(1 \times k)\ \text{vector}]\,[(k \times N)\ \text{matrix}]$$

Or, using 'coordinates' to make it explicit:

$$\frac{\partial}{\partial \theta_i} L(\theta) = \nabla l(f(\theta)) \cdot \frac{\partial f}{\partial \theta_i} = \sum_{p=1}^{k} \nabla_p l(f(\theta)) \cdot \frac{\partial f_p}{\partial \theta_i}$$

for $i = 1, \ldots, N$. Then for $j = 1, \ldots, N$ we differentiate again:

$$\frac{\partial^2}{\partial \theta_j \partial \theta_i} L(\theta) = \sum_{p=1}^{k} \sum_{q=1}^{k} \nabla_q \nabla_p l(f(\theta)) \frac{\partial f_q}{\partial \theta_j} \frac{\partial f_p}{\partial \theta_i} + \sum_{p=1}^{k} \nabla_p l(f(\theta)) \frac{\partial^2 f_p}{\partial \theta_j \partial \theta_i}$$

Or,

$$\mathrm{Hess}(L)(\theta) = Jf(\theta)^T \left[\mathrm{Hess}(l)(f(\theta))\right] Jf(\theta) + \nabla l(f(\theta))\, D^2 f(\theta)$$

This is now at the level of $(N \times N)$ matrices. Avoiding getting into any depth about tensors and indices, the $D^2 f$ term is basically a $(N \times N \times k)$ tensor-type object and it's paired with $\nabla l$ which is a $(1 \times k)$ vector to give something that is $(N \times N)$.

So what I think you are saying now is that if we are at a local minimum for $l$, then the second term on the right-hand side vanishes (because the term includes the first derivatives of $l$, which are zero at a minimum). You can see however that if the Hessian of $l$ is not a multiple of the identity (like it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minimum of $l$, the Hessian of the loss is equal to a constant times $(Jf)^T Jf$. So maybe you really do want to explicitly assume something like MSE. I ag
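The chain-rule formula for $\mathrm{Hess}(L)$ in the comment above can be sanity-checked numerically. The sketch below uses a made-up toy problem: the map `f`, the weighted squared-error `l` (chosen so that its Hessian is not a multiple of the identity), and the evaluation points are all hypothetical. It compares the analytic expression against central finite differences.

```python
def f(theta):
    """Toy parameter-to-behaviour map from R^2 to R^2 (hypothetical)."""
    a, b = theta
    return [a * b, a + b * b]


def l(o):
    """Weighted squared error with targets (1, 2); its Hessian is diag(2, 6)."""
    return (o[0] - 1.0) ** 2 + 3.0 * (o[1] - 2.0) ** 2


def L(theta):
    return l(f(theta))


def analytic_hessian(theta):
    """Hess(L) = Jf^T Hess(l) Jf + sum_p grad_p l * D2f_p, written out by hand."""
    a, b = theta
    o = f(theta)
    J = [[b, a], [1.0, 2.0 * b]]  # J[p][i] = d f_p / d theta_i
    H_l = [[2.0, 0.0], [0.0, 6.0]]
    grad_l = [2.0 * (o[0] - 1.0), 6.0 * (o[1] - 2.0)]
    D2f = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]  # D2f[p][i][j]
    hess = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            first_term = sum(J[q][j] * H_l[q][p] * J[p][i]
                             for p in range(2) for q in range(2))
            second_term = sum(grad_l[p] * D2f[p][i][j] for p in range(2))
            hess[i][j] = first_term + second_term
    return hess


def numeric_hessian(theta, h=1e-4):
    """Central finite-difference Hessian of L."""
    n = len(theta)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                point = list(theta)
                point[i] += di * h
                point[j] += dj * h
                return L(point)
            hess[i][j] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return hess


print(analytic_hessian([0.7, -1.3]))
print(numeric_hessian([0.7, -1.3]))
# At theta = (1, 1) the targets are hit exactly, so grad l = 0 there.
print(analytic_hessian([1.0, 1.0]))
```

At $\theta = (1, 1)$ the second term vanishes and the Hessian is $[[8, 14], [14, 26]]$, while $(Jf)^T Jf = [[2, 3], [3, 5]]$. The entrywise ratios differ (4, about 4.67, 5.2), which illustrates the point that the proportionality to $(Jf)^T Jf$ needs $\mathrm{Hess}(l)$ to be a multiple of the identity.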

I'm pretty sure my framework doesn't apply to grokking.  I usually think about training as ending once we hit zero training loss, whereas grokking happens much later.

I'll reply to the rest of your comment later today when I have some time

About the contours:  While the graphic shows a finite number of contours with some spacing, in reality there are infinite contour planes and they completely fill space (as densely as the reals, if we ignore float precision).  So at literally every point in space there is a blue contour, and a red one which exactly coincides with it.

Yeah, this seems roughly correct, and similar to what I was thinking.  There is probably even a direct connection to the "asymptotic equipartitioning" math, via manifold counts containing terms like "A choose B" from permutations of neurons.

Yep, I am assuming MSE loss generally, but as you point out, any smooth and convex loss function will be locally approximately quadratic.  "Saddle points all the way down" isn't possible if a global min exists, since a saddle point implies the existence of an adjacent lower point.  As for asymptotes, this is indeed possible, especially in classification tasks.  I have basically ignored this and stuck to regression here.

I might return to the issue of classification / solutions at infinity in a later post, but for now I will say this: ...

Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)?  Or do you just mean that they would look promising in a shorter evaluation done for training purposes?

Oh, this is definitely not what I meant.

"Betting odds" == Your actual belief after factoring in other people's opinions

"Inside view" == What your models predict, before factoring in other opinions or the possibility of being completely wrong

Though I understood what you meant, perhaps a clearer terminology is all-things-considered beliefs vs. independent impressions.

Checked; the answer is no: https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models?commentId=ET24eiKK6FSJNef7G

Nice!  Do you know if the author of that post was involved in RASP?

2gwern9mo
They apparently reinvented RASP independently.

What probability do you assign to the proposition "Prosaic alignment will fail"?

1. Purely based on your inside view model
2. After updating on everyone else's views

Same question for:

"More than 50% of the prosaic alignment work done by the top 7 researchers is nearly useless"

Is a metal bar an optimizer?  Looking at the temperature distribution, there is a clear set of target states (states of uniform temperature) with a much larger basin of attraction (all temperature distributions that don't vaporize the bar).

I suppose we could consider the second law of thermodynamics to be the true optimizer in this case.  The consequence is that any* closed physical system is trivially an optimizing system towards higher entropy.

In general, it seems like this optimization criterion is very easy to satisfy if we don't specify what...
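The metal-bar example can be simulated to show the "large basin of attraction, small target set" structure. This is a minimal sketch under stated assumptions: a 1-D bar split into 10 cells, insulated ends, an explicit finite-difference update, and made-up starting temperatures. It is not meant to settle whether the bar should count as an optimizer.

```python
def diffuse(temps, alpha=0.25, steps=1):
    """Explicit finite-difference heat diffusion on a bar with insulated ends.

    alpha is the (assumed) diffusion number; values up to 0.5 keep the scheme stable.
    """
    temps = list(temps)
    last = len(temps) - 1
    for _ in range(steps):
        updated = []
        for i, t in enumerate(temps):
            left = temps[i - 1] if i > 0 else t      # mirrored neighbour: no heat flux
            right = temps[i + 1] if i < last else t
            updated.append(t + alpha * (left - 2 * t + right))
        temps = updated
    return temps


def spread(temps):
    """Gap between the hottest and coldest cell; zero means uniform temperature."""
    return max(temps) - min(temps)


# Two very different starting distributions with the same total heat.
hot_end = [100.0] + [0.0] * 9
warm_half = [20.0] * 5 + [0.0] * 5

for start in (hot_end, warm_half):
    final = diffuse(start, steps=2000)
    mean = sum(final) / len(final)
    print(round(spread(start), 3), "->", round(spread(final), 6), "mean", round(mean, 6))
```

Both starting states end up at the same uniform temperature, fixed by the conserved total heat, so the dynamics do funnel a wide set of initial conditions into a narrow set of final states.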