carboniferous_umbraculum — AI Alignment Forum

I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work). But I decided maybe it's best to comment in a way that gives a better signal than silence.

I've generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff about that.

Disclaimers: (a) I have not spent a lot of time trying to understand everything in the paper, and (b) as is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that, if they were addressed, might cause me to update to the point where I would more strongly recommend that people apply, etc.

I suppose the tldr is that the main contribution of the paper claims to be the framing of a set of open problems, but I did not find the paper able to convince me that the problems are useful ones or that they would be interesting to answer.

I can try to explain a little more: It seemed odd that the "potential" applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered - before writing the paper - questions like "OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?" Some of what was said in the paper was fairly vague stuff like:

If successful, it may also help improve our ability to verify reasoning about complex questions, like those emerging in modern machine learning, for which we expect formal proof to be impossible.

In my opinion, it's also important to bear in mind that the criterion of a problem being 'open' is a poor proxy for things like usefulness/interestingness (obviously those famous number theory problems are open, but so are loads of random mathematical statements). The usefulness/interestingness of course comes because people recognize various other valuable things too, like: that the solution would seem to require new insights into X and therefore a proof would 'have to be' deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion; etc. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC's stating of these open problems about heuristic estimators is an interesting contribution in itself?

To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I'm saying:

Neither of these applications [to avoiding catastrophic failures or to ELK] is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal.

But practically it means that when I ask myself something like: 'Why would I drop whatever else I'm working on and work on this stuff?' I find it quite hard to answer in a way that's not basically just all deference to some 'vision' that is currently undeclared (or as the paper says "mostly defer[red]" to "future articles").

Having said all this, I'll reiterate that there are lots of clear pros to a job like this, and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the presumption of Independence and in this post.

"Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability

carboniferous_umbraculum3y22

I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

Behaviour Manifolds and the Hessian of the Total Loss - Notes and Criticism

carboniferous_umbraculum3y21

I agree that the space may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space $O^{R^{n}}$ may well be a more natural one. (It's of course the space of functions $R^{n} \to O$ , and so a space in which 'model space' naturally sits in some sense. )

Basin broadness depends on the size and number of orthogonal features

carboniferous_umbraculum3y20

I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because for me the sum over $x$ was included in the function I called "$l$", but it doesn't affect my main point).

I think the most concrete thing is that the function $f$ - i.e. the 'input-output' function of a neural network - should in general have a vector output, but you write things like

\frac{d^{2} l}{d f^{2}}

without any further explanation or indices. In your main computation it seems like it's being treated as a scalar.

Since we never change the labels or the dataset, one can drop the explicit dependence on $y (x)$ from our notation for $l$. Then if the network has $p$ neurons in the final layer, the codomain of the function $f$ is $R^{p}$ (unless I've misunderstood what you are doing?). So to my mind we have:

l (f (x)) = l (f^{1} (x), f^{2} (x), \dots, f^{p} (x)) .

Going through the computation in full using the chain rule (and assuming we are at a local minimum of the loss function $l$), one would get something like:

\mathrm{Hess}(L)(θ) = \sum_{x} J f(θ)^{T} [\mathrm{Hess}(l)(f(θ))] J f(θ)

Vivek wanted to suppose that $\mathrm{Hess}(l)$ were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss. But without such an assumption, I don't think that the term

\sum_{x} J f(θ)^{T} J f(θ)

appears (this is the matrix you describe as containing "the $L_{2}$ inner products of the features over the training data set.")
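To make the point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: a single data point (so the sum over $x$ has one term), a made-up $2 \times 2$ Jacobian `J`, and two made-up choices for $\mathrm{Hess}(l)$. It only shows that $J f^{T} [\mathrm{Hess}(l)] J f$ is a multiple of $J f^{T} J f$ when $\mathrm{Hess}(l)$ is a multiple of the identity, and need not be otherwise.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    # Plain nested-list matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def pullback(J, H):
    """Return J^T H J for nested-list matrices."""
    return matmul(transpose(J), matmul(H, J))

def is_scalar_multiple(A, B, tol=1e-12):
    """True if A = c * B for one scalar c (B is assumed to be nonzero)."""
    pairs = [(a, b) for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b)]
    c = next(a / b for a, b in pairs if abs(b) > tol)
    return all(abs(a - c * b) <= tol for a, b in pairs)

# Hypothetical Jacobian of f at the minimum (k = 2 outputs, N = 2 parameters).
J = [[1.0, 2.0],
     [0.0, 1.0]]

identity = [[1.0, 0.0], [0.0, 1.0]]
JtJ = pullback(J, identity)            # the "inner products of features" matrix

hess_l_mse = [[2.0, 0.0], [0.0, 2.0]]    # multiple of the identity, as for squared error
hess_l_other = [[2.0, 1.0], [1.0, 3.0]]  # some other symmetric Hessian of l

print(is_scalar_multiple(pullback(J, hess_l_mse), JtJ))
print(is_scalar_multiple(pullback(J, hess_l_other), JtJ))
```

With the second choice of $\mathrm{Hess}(l)$ the result still has the form $J f^{T} [\cdot] J f$, but it is no longer a constant times $J f^{T} J f$.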

Another (probably more important but higher-level) issue is basically: What is your definition of 'feature'? I could say: Have you not essentially just defined 'feature' to be something like 'an entry of $J f (θ)$'? Is the example not too contrived, in the sense that it clearly supposes that $f$ has a very special form (in particular, it is linear in the $Θ$ variables, so that the derivatives are not functions of $Θ$)?

Notes on Learning the Prior

carboniferous_umbraculum3y42

Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.

Information Loss --> Basin flatness

carboniferous_umbraculum4y10

Thanks again for the reply.

In my notation, something like $\nabla l$ or $J f$ are functions in and of themselves. The function $\nabla l$ evaluates to zero at local minima of $l$.

In my notation, there isn't any such thing as $\nabla_{f} l$ .

But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you.

Information Loss --> Basin flatness

carboniferous_umbraculum4y20

Thanks for the substantive reply.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out. The loss of a model with parameters $θ \in Θ$ can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function $l : O^{k} \to R$ we have:

L : Θ \xrightarrow{f} O^{k} \xrightarrow{l} R

i.e. it's $l$ that might be something like MSE, but the function ' $L$ ' is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like

L (θ) = l (f (θ))

We want to differentiate this twice with respect to $θ$ essentially. Firstly, we have

\nabla L (θ) = \nabla l (f (θ)) J f (θ)

where - just to keep track of this - we've got:

(1 \times N) vector = [(1 \times k) vector] [(k \times N) matrix]

Or, using 'coordinates' to make it explicit:

\frac{\partial}{\partial θ_{i}} L (θ) = \nabla l (f (θ)) \cdot \frac{\partial f}{\partial θ_{i}} = \sum_{p=1}^{k} \nabla^{p} l (f (θ)) \cdot \frac{\partial f^{p}}{\partial θ_{i}}

for $i = 1, \dots, N$ . Then for $j = 1, \dots, N$ we differentiate again:

\frac{\partial^{2}}{\partial θ_{j} \partial θ_{i}} L (θ) = \sum_{p=1}^{k} \sum_{q=1}^{k} \nabla^{q} \nabla^{p} l (f (θ)) \frac{\partial f^{q}}{\partial θ_{j}} \frac{\partial f^{p}}{\partial θ_{i}} + \sum_{p=1}^{k} \nabla^{p} l (f (θ)) \frac{\partial^{2} f^{p}}{\partial θ_{j} \partial θ_{i}}

Or,

\mathrm{Hess}(L)(θ) = J f (θ)^{T} [\mathrm{Hess}(l)(f (θ))] J f (θ) + \nabla l (f (θ)) D^{2} f (θ)

This is now at the level of $(N \times N)$ matrices. Avoiding getting into any depth about tensors and indices, the $D^{2} f$ term is basically a $(N \times N \times k)$ tensor-type object and it's paired with $\nabla l$ which is a $(1 \times k)$ vector to give something that is $(N \times N)$ .
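As a sanity check on the two-term formula, here is a small sketch in plain Python that compares it against a finite-difference Hessian. The specific functions are assumptions made up for the check ($N = 2$, $k = 2$, with $f(a, b) = (ab, \, a + b^{2})$ and a non-MSE quadratic $l$), and the evaluation point is deliberately not a minimum of $l$, so that the $\nabla l \, D^{2} f$ term is visibly needed.

```python
def f(theta):
    a, b = theta
    return [a * b, a + b * b]

def l(out):
    u, v = out
    return u * u + 3 * u * v + 2 * v * v + u

def L(theta):
    return l(f(theta))

def hessian_terms(theta):
    """Return (J^T Hess(l) J, grad(l) paired with D^2 f) as N x N nested lists."""
    a, b = theta
    u, v = f(theta)
    J = [[b, a], [1.0, 2 * b]]                    # J[p][i] = d f^p / d theta_i
    grad_l = [2 * u + 3 * v + 1, 3 * u + 4 * v]   # first derivatives of l at f(theta)
    hess_l = [[2.0, 3.0], [3.0, 4.0]]             # second derivatives of l (constant here)
    D2f = [[[0.0, 1.0], [1.0, 0.0]],              # D2f[p][i][j] = d^2 f^p / d theta_i d theta_j
           [[0.0, 0.0], [0.0, 2.0]]]
    idx = range(2)
    first = [[sum(hess_l[q][p] * J[q][j] * J[p][i] for p in idx for q in idx)
              for j in idx] for i in idx]
    second = [[sum(grad_l[p] * D2f[p][i][j] for p in idx) for j in idx] for i in idx]
    return first, second

def numeric_hessian(func, theta, h=1e-4):
    """Central finite-difference approximation of the Hessian of func at theta."""
    n = len(theta)
    def shifted(i, j, si, sj):
        t = list(theta)
        t[i] += si * h
        t[j] += sj * h
        return func(t)
    return [[(shifted(i, j, 1, 1) - shifted(i, j, 1, -1)
              - shifted(i, j, -1, 1) + shifted(i, j, -1, -1)) / (4 * h * h)
             for j in range(n)] for i in range(n)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

theta0 = [0.7, -1.3]                 # arbitrary point, not a minimum of l
first_term, second_term = hessian_terms(theta0)
full = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(first_term, second_term)]
numeric = numeric_hessian(L, theta0)

print("both terms vs numeric:", max_abs_diff(full, numeric))
print("first term only vs numeric:", max_abs_diff(first_term, numeric))
```

The first printed difference should be at the level of finite-difference error, while the second should be clearly nonzero, since $\nabla l$ does not vanish at this point.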

So what I think you are saying now is that if we are at a local minimum for $l$, then the second term on the right-hand side vanishes (because the term includes the first derivatives of $l$, which are zero at a minimum). You can see however that if the Hessian of $l$ is not a multiple of the identity (like it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minimum of $l$, the Hessian of the loss is equal to a constant times $(J f)^{T} J f$. So maybe you really do want to explicitly assume something like MSE.

I agree that assuming MSE, and looking at a local minimum, you have $\mathrm{rank}(\mathrm{Hess}(L)) = \mathrm{rank}(J f)$.

(In case it's of interest to anyone, googling turned up this recent paper https://openreview.net/forum?id=otDgw7LM7Nn which studies pretty much exactly the problem of bounding the rank of the Hessian of the loss. They say: "Flatness: A growing number of works [59–61] correlate the choice of regularizers, optimizers, or hyperparameters, with the additional flatness brought about by them at the minimum. However, the significant rank degeneracy of the Hessian, which we have provably established, also points to another source of flatness — that exists as a virtue of the compositional model structure —from the initialization itself. Thus, a prospective avenue of future work would be to compare different architectures based on this inherent kind of flatness.")

Some broader remarks: I think these are nice observations but unfortunately I think generally I'm a bit confused/unclear about what else you might get out of going along these lines. I don't want to sound harsh but just trying to be plain: This is mostly because, as we can see, the mathematical part of what you have said is all very simple, well-established facts about smooth functions and so it would be surprising (to me at least) if some non-trivial observation about deep learning came out from it.

In a similar vein, regarding the "cause" of low-rank G, I do think that one could try to bring in a notion of "information loss" in neural networks, but for it to be substantive one needs to be careful that it's not simply a rephrasing of what it means for the Jacobian to have less-than-full rank.

Being a bit loose/informal now: To illustrate, just imagine for a moment a real-valued function on an interval. I could say it 'loses information' where its values cannot distinguish between a subset of points. But this is almost the same as just saying: It is constant on some subset...which is of course very close to just saying the derivative vanishes on some subset. Here, if you describe the phenomenon of information loss as concretely as being the situation where some inputs can't be distinguished, then (particularly given that you have to assume these spaces are actually some kind of smooth/differentiable spaces to do the theoretical analysis), you've more or less just built into your description of information loss something that looks a lot like the function being constant along some directions, which means there is a vector in the kernel of the Jacobian. I don't think it's somehow incorrect to point to this but it becomes more like just saying 'perhaps one useful definition of information loss is low rank G' as opposed to linking one phenomenon to the other.
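A tiny sketch of the one-dimensional picture, in plain Python. The choice of a ReLU-like function is my own assumption for illustration; any function that is constant on a subinterval would do. It just shows that 'cannot distinguish these inputs' and 'the derivative vanishes there' describe the same region.

```python
def relu(x):
    # Constant (zero) on the negative half-line, identity on the positive half-line.
    return x if x > 0 else 0.0

def derivative(func, x, h=1e-6):
    # Central finite-difference estimate of the derivative at x.
    return (func(x + h) - func(x - h)) / (2 * h)

negative_inputs = [-2.0, -1.0, -0.5]

# 'Information loss': three different inputs, one output value.
outputs = {relu(x) for x in negative_inputs}

# The same region is exactly where the derivative is zero.
slopes = [derivative(relu, x) for x in negative_inputs]

print(outputs)
print(slopes)
print(derivative(relu, 1.0))  # away from the flat region the derivative is nonzero
```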

Sorry for the very long remarks. Of course this is actually because I found it well worth engaging with. And I have a longer-standing personal interest in zero sets of smooth functions!

Information Loss --> Basin flatness

carboniferous_umbraculum4y50

This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

I think that your setup is essentially that there is an $N$-dimensional parameter space, let's call it $Θ$ say, and then for each element $x_{i}$ of the training set, we can consider the function $f_{i} : Θ ⟶ \text{Output Space} =: O$ which takes in a set of parameters (i.e. a model) and outputs whatever the model does on training data point $x_{i}$. We are thinking of both $Θ$ and $O^{k}$ as smooth (or at least sufficiently differentiable) spaces (I take it).

A contour plane is a level set of one of the $f_{i}$ , i.e. a set of the form

\{θ \in Θ : f_{i} (θ) = o\},

for some $o \in O$ and $i \in \{1, \dots, k\}$. A behavior manifold is a set of the form

\bigcap_{i=1}^{k} \{θ \in Θ : f_{i} (θ) = o\}

for some $o \in O$ .

A more concise way of viewing this is to define a single function $f : Θ ⟶ O^{k}$ and then a behavior manifold is simply a level set of this function. The map $f$ is a submersion at $θ \in Θ$ if the Jacobian matrix at $θ$ is a surjective linear map. The Jacobian matrix is what you call $G^{T}$ I think (because the Jacobian is formed with each row equal to a gradient vector with respect to one of the output coordinates). It doesn't matter much because what matters to check the surjectivity is the rank. Then the standard result implies that given $o \in O$ , if $f$ is a submersion in a neighbourhood of a point $θ_{0} \in f^{- 1} (o)$ , then $f^{- 1} (o)$ is a smooth $(N - k)$ -dimensional submanifold in a neighbourhood of $θ_{0}$ .

Essentially, in a neighbourhood of a point at which the Jacobian of $f$ has full rank, the level set through that point is an $(N - k)$ -dimensional smooth submanifold.
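A standard toy example may help here. It is not taken from the post; I am assuming $N = 2$, $k = 1$ and $O = R$ purely for illustration:

```latex
% Assumed toy example: N = 2, k = 1, O = R.
f(θ) = θ_{1}^{2} + θ_{2}^{2}, \qquad J f(θ) = \begin{pmatrix} 2 θ_{1} & 2 θ_{2} \end{pmatrix}

% Full rank case: J f(θ) has rank 1 whenever θ \neq 0. For o > 0 the level set
f^{-1}(o) = \{θ \in R^{2} : θ_{1}^{2} + θ_{2}^{2} = o\}
% is a circle of radius \sqrt{o}, a smooth submanifold of dimension N - k = 1.

% Degenerate case: J f(0) = 0 has rank 0, and the level set through θ = 0 is
f^{-1}(0) = \{0\},
% a single point, of dimension 0 rather than N - k = 1.
```

So the dimension count $N - k$ is only guaranteed near points where the Jacobian has full rank; where the rank drops, the level set can look quite different.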

Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank. But I think you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: Behaviour can change while loss could remain constant, right? And more generally, in exactly which directions do the implications go? Depending on exactly what you are trying to establish, this could actually be a bit of a 'tip of the iceberg' situation though. (The study of this sort of thing goes rather deep; Vladimir Arnold et al. wrote in their 1998 book: "The theory of singularities of smooth maps is an apparatus for the study of abrupt, jump-like phenomena - bifurcations, perestroikas (restructurings), catastrophes, metamorphoses - which occur in systems depending on parameters when the parameters vary in a smooth manner".)

Similarly when you say things like "Low rank $G$ indicates information loss", I think some care is needed because the paragraphs that follow seem to be getting at something more like: If there is a certain kind of information loss in the early layers of the network, then this leads to low rank $G$ . It doesn't seem clear that low rank $G$ is necessarily indicative of information loss?

Intuitions about solving hard problems

carboniferous_umbraculum4y00

I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.

Call For Distillers

carboniferous_umbraculum4y40

I agree, i.e. I also (fairly weakly) disagree with the value of thinking of 'distilling' as a separate thing. Part of me wants to conjecture that it comes from thinking of alignment work predominantly as mathematics or a hard science in which the standard 'unit' is an original theorem or original result which might be poorly written up but can't really be argued against much. But if we think of the area (I'm thinking predominantly about more conceptual/theoretical alignment) as a 'softer', messier, ongoing discourse full of different arguments from different viewpoints and under different assumptions, with counter-arguments, rejoinders, clarifications, retractions etc. that takes place across blogs, papers, talks, theorems, experiments etc. that all somehow slowly works to produce progress, then it starts to be less clear what this special activity called 'distilling' really is.

Another relevant point, but one which I won't bother trying to expand on much here, is that a research community assimilating - and then eventually building on - complex ideas can take a really long time.

[At risk of extending into a rant, I also just think the term is a bit off-putting. Sure, I can get the sense of what it means from the word and the way it is used - it's not completely opaque or anything - but I'd not heard it used regularly in this way until I started looking at the alignment forum. What's really so special about alignment that we need to use this word? Do we think we have figured out some new secret activity that is useful for intellectual progress that other fields haven't figured out? Can we not get by using words like "writing" and "teaching" and "explaining"?]
