## AI ALIGNMENT FORUMAF

Spencer Becker-Kahn

Sorted by New

# Wiki Contributions

I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

I agree that the space  may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space  may well be a more natural one. (It's of course the space of functions , and so a space in which 'model space' naturally sits in some sense. )

I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because  for me the sum over  was included in the function I called "", but it doesn't affect my main point).

I think the most concrete thing is that the function  - i.e. the input-output' function of a neural network - should in general have a vector output, but you write things like

without any further explanation or indices. In your main computation it seems like it's being treated as a scalar.

Since we never change the labels or the dataset, on can drop the explicit dependence on  from our notation for . Then if the network has  neurons in the final layer, the codomain of the function  is  (unless I've misunderstood what you are doing?). So to my mind we have:

Going through the computation in full using the chain rule (and a local minimum of the loss function ) one would get something like:

Vivek wanted to suppose that  were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss. But without such an assumption, I don't think that the term

appears (this is the matrix you describe as containing "the  inner products of the features over the training data set.")

Another (probably more important but higher-level) issue is basically: What is your definition of 'feature'? I could say: Have you not essentially just defined feature' to be something like `an entry of '? Is the example not too contrived in that sense it clearly supposes that  has a very special form (in particular it is linear in the  variables so that the derivatives are not functions of .)

Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.

In my notation, something like   or  are functions in and of themselves. The function  evaluates to zero at local minima of

In my notation, there isn't any such thing as .

But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out.  The loss of a model with parameters  can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function  we have:

i.e. it's  that might be something like MSE, but the function '' is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like

We want to differentiate this twice with respect to  essentially. Firstly, we have

where - just to keep track of this - we've got:

Or, using 'coordinates' to make it explicit:

for . Then for  we differentiate again:

Or,

This is now at the level of  matrices. Avoiding getting into any depth about tensors and indices, the  term is basically a  tensor-type object and it's paired with  which is a  vector to give something that is .

So what I think you are saying now is that if we are at a local minimum for , then the second term on the right-hand side vanishes (because the term includes the first derivatives of , which are zero at a minimum). You can see however that if the Hessian of  is not a multiple of the identity (like it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minima of , the Hessian of the loss is equal to a constant times . So maybe you really do want to explicitly assume something like MSE.

I agree that assuming MSE, and looking at a local minimum, you have  .

(In case it's of interest to anyone, googling turned up this recent paper https://openreview.net/forum?id=otDgw7LM7Nn which studies pretty much exactly the problem of bounding the rank of the Hessian of the loss. They say: "Flatness: A growing number of works [59–61] correlate the choice of regularizers, optimizers, or hyperparameters, with the additional flatness brought about by them at the minimum. However, the significant rank degeneracy of the Hessian, which we have provably established, also points to another source of flatness — that exists as a virtue of the compositional model structure —from the initialization itself. Thus, a prospective avenue of future work would be to compare different architectures based on this inherent kind of flatness.")

Some broader remarks: I think these are nice observations but unfortunately I think generally I'm a bit confused/unclear about what else you might get out of going along these lines. I don't want to sound harsh but just trying to be plain: This is mostly because, as we can see, the mathematical part of what you have said is all very simple, well-established facts about smooth functions and so it would be surprising (to me at least) if some non-trivial observation about deep learning came out from it. In a similar vein, regarding the "cause" of low-rank G, I do think that one could try to bring in a notion of "information loss" in neural networks, but for it to be substantive one needs to be careful that it's not simply a rephrasing of what it means for the Jacobian to have less-than-full rank. Being a bit loose/informal now: To illustrate, just imagine for a moment a real-valued function on an interval. I could say it 'loses information' where its values cannot distinguish between a subset of points. But this is almost the same as just saying: It is constant on some subset...which is of course very close to just saying the derivative vanishes on some subset.  Here, if you describe the phenomena of information loss as concretely as being the situation where some inputs can't be distinguished, then (particularly given that you have to assume these spaces are actually some kind of smooth/differentiable spaces to do the theoretical analysis), you've more or less just built into your description of information loss something that looks a lot like the function being constant along some directions, which means there is a vector in the kernel of the Jacobian. I don't think it's somehow incorrect to point to this but it becomes more like just saying 'perhaps one useful definition of information loss is low rank G' as opposed to linking one phenomenon to the other.

Sorry for the very long remarks. Of course this is actually because I found it well worth engaging with. And I have a longer-standing personal interest in zero sets of smooth functions!

This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

I think that your setup is essentially that there is an -dimensional parameter space, let's call it  say, and then for each element  of the training set, we can consider the function  which takes in a set of parameters (i.e. a model) and outputs whatever the model does on training data point . We are thinking of both  and  as smooth (or at least sufficiently differentiable) spaces (I take it).

A contour plane is a level set of one of the , i.e. a set of the form

for some  and . A behavior manifold is a set of the form

for some .

A more concise way of viewing this is to define a single function  and then a behavior manifold is simply a level set of this function. The map  is a submersion at  if the Jacobian matrix at  is a surjective linear map. The Jacobian matrix is what you call  I think (because the Jacobian is formed with each row equal to a gradient vector with respect to one of the output coordinates). It doesn't matter much because what matters to check the surjectivity is the rank. Then the standard result implies that given , if  is a submersion in a neighbourhood of a point , then  is a smooth -dimensional submanifold in a neighbourhood of  .

Essentially, in a neighbourhood of a point at which the Jacobian of  has full rank, the level set through that point is an -dimensional smooth submanifold.

Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank. But in my opinion I think you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: Behaviour can change while loss could remain constant, right? And more generally, in exactly which directions do the implications go? Depending on exactly what you are trying to establish, this could actually be a bit of a 'tip of the iceberg' situation though. (The study of this sort of thing goes rather deep; Vladimir Arnold et al. wrote in their 1998 book: "The theory of singularities of smooth maps is an apparatus for the study of abrupt, jump-like phenomena - bifurcations, perestroikas (restructurings), catastrophes, metamorphoses - which occur in systems depending on parameters when the parameters vary in a smooth manner".)

Similarly when you say things like "Low rank  indicates information loss", I think some care is needed because the paragraphs that follow seem to be getting at something more like: If there is a certain kind of information loss in the early layers of the network, then this leads to low rank . It doesn't seem clear that low rank  is necessarily indicative of information loss?

I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.

I agree i.e. I also (fairly weakly) disagree with the value of thinking of 'distilling'  as a separate thing. Part of me wants to conjecture that it's comes from thinking of alignment work predominantly as mathematics or a hard science in which the standard 'unit' is a an original theorem or original result which might be poorly written up but can't really be argued against much. But if we think of the area (I'm thinking predominantly about more conceptual/theoretical alignment) as a 'softer', messier, ongoing discourse full of different arguments from different viewpoints and under different assumptions, with counter-arguments, rejoinders, clarifications, retractions etc. that takes place across blogs, papers, talks, theorems, experiments etc that all somehow slowly works to produce progress, then it starts to be less clear what this special activity called 'distilling' really is.

Another relevant point, but one which I won't bother trying to expand on much here, is that a research community assimilating - and then eventually building on - complex ideas can take a really long time.

[At risk of extending into a rant, I also just think the term is a bit off-putting. Sure, I can get the sense of what it means from the word and the way it is used - it's not completely opaque or anything - but I'd not heard it used regularly in this way until I started looking at the alignment forum. What's really so special about alignment that we need to use this word? Do we think we have figured out some new secret activity that is useful for intellectual progress that other fields haven't figured out? Can we not get by using words like "writing" and "teaching" and "explaining"?]

It could also work here. But I do feel like pointing out that the bounty format has other drawbacks. Maybe it works better when you want a variety of bitesize contributions, like various different proposals? I probably wouldn't do work like Abram proposes - quite a long and difficult project, I expect - for the chance of winning a prize, particularly if the winner(s) were decided by someone's subjective judgement.