AI ALIGNMENT FORUM

carboniferous_umbraculum

Comments

ARC is hiring theoretical researchers
carboniferous_umbraculum · 2y

I think that, perhaps as a result of a balance of pros and cons, I initially wasn't very motivated to comment (and haven't been very motivated to engage much with ARC's recent work). But I decided it's probably best to comment in a way that gives a better signal than silence.

I've generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, that paper is currently ARC's main advert for the type of conceptual work they are doing, so most of what I have to say is meta-level stuff about it.

Disclaimers: a) I have not spent a lot of time trying to understand everything in the paper; and b) as is often the case, this comment may come across as overly critical, but it seems highest-leverage to discuss my biggest criticisms, i.e. the things which, if addressed, might cause me to update to the point where I would more strongly recommend that people apply, etc.

I suppose the tl;dr is that the paper's main claimed contribution is the framing of a set of open problems, but the paper did not convince me that these problems are useful or that their answers would be interesting.

I can try to explain a little more: It seemed odd that the "potential" applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered - before writing the paper - questions like "OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?" Some of what was said in the paper was fairly vague stuff like: 

If successful, it may also help improve our ability to verify reasoning about complex questions, like those emerging in modern machine learning, for which we expect formal proof to be impossible. 

In my opinion, it's also important to bear in mind that the criterion of a problem being 'open' is a poor proxy for things like usefulness/interestingness (obviously those famous number theory problems are open, but so are loads of random mathematical statements). The usefulness/interestingness of course comes because people recognize various other valuable things too, like: that the solution would seem to require new insights into X, and therefore a proof would 'have to be' deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion; and so on. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC's stating of these open problems about heuristic estimators is an interesting contribution in itself?

To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I'm saying:

Neither of these applications [to avoiding catastrophic failures or to ELK] is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal.

But practically it means that when I ask myself something like: 'Why would I drop whatever else I'm working on and work on this stuff?' I find it quite hard to answer in a way that's not basically just all deference to some 'vision' that is currently undeclared (or as the paper says "mostly defer[red]" to "future articles").

Having said all this, I'll reiterate that there are lots of clear pros to a job like this, and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the presumption of Independence and in this post.

 

"Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability
carboniferous_umbraculum · 3y

I may come back to comment more or incorporate this post into something else I write, but I wanted to record my initial reaction, which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

Behaviour Manifolds and the Hessian of the Total Loss - Notes and Criticism
carboniferous_umbraculum · 3y

I agree that the space $O^D$ may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space $O^{\mathbb{R}^n}$ may well be a more natural one. (It's of course the space of functions $\mathbb{R}^n \to O$, and so a space in which 'model space' naturally sits in some sense.)

Basin broadness depends on the size and number of orthogonal features
carboniferous_umbraculum · 3y

I wrote out the Hessian computation in a comment on one of Vivek's posts. I actually had a few concerns with his version, and, though I could be wrong, I also think that there are some issues here. (My notation is slightly different because for me the sum over $x$ was included in the function I called "$l$", but it doesn't affect my main point.)

I think the most concrete thing is that the function $f$, i.e. the 'input-output' function of a neural network, should in general have a vector output, but you write things like

$$\frac{d^2 l}{d f^2}$$

without any further explanation or indices. In your main computation it seems like it's being treated as a scalar.

Since we never change the labels or the dataset, one can drop the explicit dependence on $y(x)$ from our notation for $l$. Then if the network has $p$ neurons in the final layer, the codomain of the function $f$ is $\mathbb{R}^p$ (unless I've misunderstood what you are doing?). So to my mind we have:

$$l(f(x)) = l\big(f_1(x), f_2(x), \ldots, f_p(x)\big).$$

Going through the computation in full using the chain rule (and evaluating at a local minimum of the loss function $l$), one would get something like:

$$\mathrm{Hess}(L)(\theta) = \sum_x J_f(\theta)^T\,\big[\mathrm{Hess}(l)(f(\theta))\big]\,J_f(\theta)$$

Vivek wanted to suppose that $\mathrm{Hess}(l)$ were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss. But without such an assumption, I don't think that the term

$$\sum_x J_f(\theta)^T J_f(\theta)$$

appears (this is the matrix you describe as containing "the L2 inner products of the features over the training data set.")
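To make the comparison concrete, here is a rough numerical sketch of the point (entirely my own toy example, with made-up functions, and with the sum over $x$ folded into $l$ as per my notation): at an exact minimum of $l$, a finite-difference Hessian of $L$ matches $J_f^T\,[\mathrm{Hess}(l)]\,J_f$, but it does not match a constant times $J_f^T J_f$ once $\mathrm{Hess}(l)$ is not a multiple of the identity.

```python
import numpy as np

# Toy check of Hess(L)(θ) = J_f(θ)^T [Hess(l)(f(θ))] J_f(θ) at a minimum of l.
# Everything here (the map f, the weighted loss, the dimensions) is made up for illustration.

rng = np.random.default_rng(0)
N, p = 4, 3                        # number of parameters, number of network outputs

A = rng.normal(size=(p, N))
B = rng.normal(size=(p, N))

def f(theta):
    # a small smooth nonlinear 'input-output' map R^N -> R^p
    return np.tanh(A @ theta) + 0.5 * (B @ theta) ** 2

theta_star = rng.normal(size=N)
y = f(theta_star)                  # choose labels so that theta_star is an exact minimum
w = np.array([1.0, 2.0, 5.0])      # weighted squared error => Hess(l) = 2*diag(w), not a multiple of I

def l(z):
    return np.sum(w * (z - y) ** 2)

def L(theta):
    return l(f(theta))

def hessian_fd(func, x, eps=1e-5):
    # central finite differences for the Hessian of a scalar function
    n = x.size
    I, H = np.eye(n), np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (func(x + eps*I[i] + eps*I[j]) - func(x + eps*I[i] - eps*I[j])
                       - func(x - eps*I[i] + eps*I[j]) + func(x - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

def jacobian_fd(func, x, eps=1e-6):
    # central finite differences for the Jacobian of a vector-valued function
    return np.stack([(func(x + eps*e) - func(x - eps*e)) / (2 * eps) for e in np.eye(x.size)], axis=1)

H_L = hessian_fd(L, theta_star)
J = jacobian_fd(f, theta_star)
H_l = 2 * np.diag(w)               # Hessian of the weighted squared loss at the minimum

print(np.allclose(H_L, J.T @ H_l @ J, atol=1e-3))   # True: the chain-rule identity holds here
print(np.allclose(H_L, 2 * J.T @ J, atol=1e-3))     # False: J^T J alone is not the Hessian of the loss
```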

Another (probably more important but higher-level) issue is basically: what is your definition of 'feature'? I could say: have you not essentially just defined 'feature' to be something like 'an entry of $J_f(\theta)$'? Is the example not too contrived, in the sense that it clearly supposes that $f$ has a very special form (in particular it is linear in the $\theta$ variables, so that the derivatives are not functions of $\theta$)?

Notes on Learning the Prior
carboniferous_umbraculum · 3y

Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.

Information Loss --> Basin flatness
carboniferous_umbraculum · 3y

Thanks again for the reply.

In my notation, something like $\nabla l$ or $J_f$ is a function in and of itself. The function $\nabla l$ evaluates to zero at local minima of $l$.

In my notation, there isn't any such thing as $\nabla_f l$.

But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you. 

Information Loss --> Basin flatness
carboniferous_umbraculum · 3y

Thanks for the substantive reply.

First some more specific/detailed comments: regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different, and so I think it deserves to be spelled out. The loss of a model with parameters $\theta \in \Theta$ can be described by introducing the actual function that maps the behavior to the real numbers, right? I.e. given some actual function $l : O^k \to \mathbb{R}$ we have:

$$L : \Theta \xrightarrow{\;f\;} O^k \xrightarrow{\;l\;} \mathbb{R}$$

i.e. it's $l$ that might be something like MSE, but the function $L$ is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like

$$L(\theta) = l(f(\theta)).$$

We want to differentiate this twice with respect to $\theta$, essentially. Firstly, we have

$$\nabla L(\theta) = \nabla l(f(\theta))\, J_f(\theta)$$

where - just to keep track of this - we've got: 

$$(1 \times N)\ \text{vector} = \big[(1 \times k)\ \text{vector}\big]\,\big[(k \times N)\ \text{matrix}\big]$$

Or, using 'coordinates' to make it explicit: 

$$\frac{\partial}{\partial \theta_i} L(\theta) = \nabla l(f(\theta)) \cdot \frac{\partial f}{\partial \theta_i} = \sum_{p=1}^{k} \nabla_p l(f(\theta)) \cdot \frac{\partial f_p}{\partial \theta_i}$$

for $i = 1, \ldots, N$. Then for $j = 1, \ldots, N$ we differentiate again:

$$\frac{\partial^2}{\partial \theta_j \partial \theta_i} L(\theta) = \sum_{p=1}^{k} \sum_{q=1}^{k} \nabla_q \nabla_p l(f(\theta))\, \frac{\partial f_q}{\partial \theta_j} \frac{\partial f_p}{\partial \theta_i} + \sum_{p=1}^{k} \nabla_p l(f(\theta))\, \frac{\partial^2 f_p}{\partial \theta_j\, \partial \theta_i}$$

Or, 

$$\mathrm{Hess}(L)(\theta) = J_f(\theta)^T\,\big[\mathrm{Hess}(l)(f(\theta))\big]\,J_f(\theta) + \nabla l(f(\theta))\, D^2 f(\theta)$$

This is now at the level of $(N \times N)$ matrices. Avoiding getting into any depth about tensors and indices, the $D^2 f$ term is basically an $(N \times N \times k)$ tensor-type object, and it's paired with $\nabla l$, which is a $(1 \times k)$ vector, to give something that is $(N \times N)$.

So what I think you are saying now is that if we are at a local minimum for $l$, then the second term on the right-hand side vanishes (because the term includes the first derivatives of $l$, which are zero at a minimum). You can see, however, that if the Hessian of $l$ is not a multiple of the identity (like it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minimum of $l$, the Hessian of the loss is equal to a constant times $(J_f)^T J_f$. So maybe you really do want to explicitly assume something like MSE.

I agree that assuming MSE, and looking at a local minimum, you have $\mathrm{rank}(\mathrm{Hess}(L)) = \mathrm{rank}(J_f)$.
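(As a sanity check on this, here is another quick toy computation of my own, again with made-up functions, for the MSE case: at an exact minimum, the numerical rank of a finite-difference $\mathrm{Hess}(L)$ agrees with $\mathrm{rank}(J_f)$, and the $\nabla l(f(\theta))\,D^2 f(\theta)$ term has dropped out, so that $\mathrm{Hess}(L) = 2\,J_f^T J_f$.)

```python
import numpy as np

# Toy MSE check (my own example, not code from the post): rank(Hess(L)) = rank(J_f) at a minimum.

rng = np.random.default_rng(1)
N, k = 6, 3                             # N parameters, k 'behaviour' outputs, k < N

A = rng.normal(size=(k, N))

def f(theta):
    # a smooth nonlinear stand-in for 'behaviour of the model on the k training points'
    z = A @ theta
    return np.sin(z) + z ** 3

theta_star = rng.normal(size=N)
y = f(theta_star)                       # choose the labels so theta_star is an exact minimum

def L(theta):
    return np.sum((f(theta) - y) ** 2)  # MSE-type loss, so Hess(l) = 2I

def hessian_fd(func, x, eps=1e-5):
    n = x.size
    I, H = np.eye(n), np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (func(x + eps*I[i] + eps*I[j]) - func(x + eps*I[i] - eps*I[j])
                       - func(x - eps*I[i] + eps*I[j]) + func(x - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

def jacobian_fd(func, x, eps=1e-6):
    return np.stack([(func(x + eps*e) - func(x - eps*e)) / (2 * eps) for e in np.eye(x.size)], axis=1)

H_L = hessian_fd(L, theta_star)
J = jacobian_fd(f, theta_star)

# pass an explicit tolerance so finite-difference noise is not counted as extra rank
print(np.linalg.matrix_rank(H_L, tol=1e-4), np.linalg.matrix_rank(J, tol=1e-4))  # both print k = 3
print(np.allclose(H_L, 2 * J.T @ J, atol=1e-3))   # True: the ∇l·D²f term vanishes at the minimum
```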

(In case it's of interest to anyone, googling turned up this recent paper https://openreview.net/forum?id=otDgw7LM7Nn which studies pretty much exactly the problem of bounding the rank of the Hessian of the loss. They say: "Flatness: A growing number of works [59–61] correlate the choice of regularizers, optimizers, or hyperparameters, with the additional flatness brought about by them at the minimum. However, the significant rank degeneracy of the Hessian, which we have provably established, also points to another source of flatness — that exists as a virtue of the compositional model structure —from the initialization itself. Thus, a prospective avenue of future work would be to compare different architectures based on this inherent kind of flatness.")

Some broader remarks: I think these are nice observations, but unfortunately I'm generally a bit confused/unclear about what else you might get out of going along these lines. I don't want to sound harsh, but just to be plain: this is mostly because, as we can see, the mathematical part of what you have said consists of very simple, well-established facts about smooth functions, and so it would be surprising (to me at least) if some non-trivial observation about deep learning came out of it. In a similar vein, regarding the "cause" of low-rank $G$, I do think that one could try to bring in a notion of "information loss" in neural networks, but for it to be substantive one needs to be careful that it's not simply a rephrasing of what it means for the Jacobian to have less-than-full rank.

Being a bit loose/informal now, to illustrate: just imagine for a moment a real-valued function on an interval. I could say it 'loses information' where its values cannot distinguish between a subset of points. But this is almost the same as just saying it is constant on some subset... which is of course very close to just saying the derivative vanishes on some subset. Here, if you describe the phenomenon of information loss as concretely as being the situation where some inputs can't be distinguished, then (particularly given that you have to assume these spaces are actually some kind of smooth/differentiable spaces to do the theoretical analysis), you've more or less built into your description of information loss something that looks a lot like the function being constant along some directions, which means there is a vector in the kernel of the Jacobian. I don't think it's somehow incorrect to point to this, but it becomes more like saying 'perhaps one useful definition of information loss is low-rank $G$' as opposed to linking one phenomenon to the other.

Sorry for the very long remarks. Of course this is actually because I found it well worth engaging with. And I have a longer-standing personal interest in zero sets of smooth functions!  

Information Loss --> Basin flatness
carboniferous_umbraculum · 3y

This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

I think that your setup is essentially that there is an $N$-dimensional parameter space, let's call it $\Theta$ say, and then for each element $x_i$ of the training set, we can consider the function $f_i : \Theta \longrightarrow \text{Output Space} =: O$ which takes in a set of parameters (i.e. a model) and outputs whatever the model does on training data point $x_i$. We are thinking of both $\Theta$ and $O^k$ as smooth (or at least sufficiently differentiable) spaces (I take it).

A contour plane is a level set of one of the $f_i$, i.e. a set of the form

$$\{\theta \in \Theta : f_i(\theta) = o\},$$

for some $o \in O$ and $i \in \{1, \ldots, k\}$. A behavior manifold is a set of the form

$$\bigcap_{i=1}^{k} \{\theta \in \Theta : f_i(\theta) = o_i\}$$

for some $o = (o_1, \ldots, o_k) \in O^k$.

A more concise way of viewing this is to define a single function $f : \Theta \longrightarrow O^k$; then a behavior manifold is simply a level set of this function. The map $f$ is a submersion at $\theta \in \Theta$ if the Jacobian matrix at $\theta$ is a surjective linear map. The Jacobian matrix is what you call $G^T$, I think (because the Jacobian is formed with each row equal to a gradient vector with respect to one of the output coordinates). It doesn't matter much, because what matters for checking surjectivity is the rank. Then the standard result implies that given $o \in O^k$, if $f$ is a submersion in a neighbourhood of a point $\theta_0 \in f^{-1}(o)$, then $f^{-1}(o)$ is a smooth $(N-k)$-dimensional submanifold in a neighbourhood of $\theta_0$.

Essentially, in a neighbourhood of a point at which the Jacobian of $f$ has full rank, the level set through that point is an $(N-k)$-dimensional smooth submanifold.
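To illustrate with a completely standard toy example (mine, not from the post): take $\Theta = \mathbb{R}^2$, $k = 1$ and $f(\theta) = \theta_1 \theta_2$. The Jacobian is $(\theta_2, \theta_1)$, which has full rank everywhere except the origin. So every level set $\{\theta_1\theta_2 = c\}$ with $c \neq 0$ is a smooth $(2-1)$-dimensional submanifold (a hyperbola), whereas the level set through the rank-deficient point, $\{\theta_1\theta_2 = 0\}$, is the union of the two coordinate axes, which is not a manifold at the origin. The clean dimension count is exactly the full-rank statement, and the degenerate points are where the harder (and more interesting) analysis would have to start.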

Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank. But in my opinion you would need to be careful when you get to claim 3. I think the connection between loss and behavior is not spelled out in enough detail: behaviour can change while loss remains constant, right? And more generally, in exactly which directions do the implications go? Depending on exactly what you are trying to establish, this could actually be a bit of a 'tip of the iceberg' situation. (The study of this sort of thing goes rather deep; Vladimir Arnold et al. wrote in their 1998 book: "The theory of singularities of smooth maps is an apparatus for the study of abrupt, jump-like phenomena - bifurcations, perestroikas (restructurings), catastrophes, metamorphoses - which occur in systems depending on parameters when the parameters vary in a smooth manner".)

Similarly, when you say things like "Low rank G indicates information loss", I think some care is needed, because the paragraphs that follow seem to be getting at something more like: if there is a certain kind of information loss in the early layers of the network, then this leads to low rank $G$. It doesn't seem clear that low rank $G$ is necessarily indicative of information loss?
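Just to illustrate the direction that I do think your paragraphs support (a toy sketch of my own, with hypothetical variable names, reading $G$ as the matrix of gradients of the behaviour with respect to the parameters): if a hidden ReLU unit is dead on every training input, so that information about the input is certainly lost at that unit, then the columns of $G$ corresponding to that unit's parameters are exactly zero, i.e. those parameter directions lie in $\ker G$ and the behaviour (and hence the loss) is flat along them. The converse direction is the one that seems unclear to me.

```python
import numpy as np

# Toy sketch (my own, made-up names): early-layer information loss (a ReLU unit that is
# dead on every training input) forces zero columns in G = ∂(behaviour)/∂(parameters).

rng = np.random.default_rng(2)
d, h = 2, 4                              # input dimension, hidden width
X = rng.normal(size=(5, d))              # 5 training inputs

W1 = rng.normal(size=(h, d))
b1 = rng.normal(size=h)
W2 = rng.normal(size=h)
b2 = 0.0
b1[0] = -20.0                            # hidden unit 0 is dead on all of X

def pack(W1, b1, W2, b2):
    return np.concatenate([W1.ravel(), b1, W2, [b2]])

def behaviour(params):
    # the network's outputs on the whole training set, as a vector in R^5
    W1 = params[:h*d].reshape(h, d)
    b1 = params[h*d:h*d + h]
    W2 = params[h*d + h:h*d + 2*h]
    b2 = params[-1]
    return np.maximum(X @ W1.T + b1, 0.0) @ W2 + b2

theta = pack(W1, b1, W2, b2)

def jacobian_fd(func, x, eps=1e-6):
    return np.stack([(func(x + eps*e) - func(x - eps*e)) / (2 * eps) for e in np.eye(x.size)], axis=1)

G = jacobian_fd(behaviour, theta)          # shape (5 training points, 17 parameters)

dead_cols = [0, 1, h*d + 0, h*d + h + 0]   # W1[0, :], b1[0], W2[0]
print(np.abs(G[:, dead_cols]).max())       # ~0: these parameter directions are in ker(G)

# The flatness is exact, not just infinitesimal: a finite move along a dead-unit
# direction leaves the behaviour (and hence any loss computed from it) unchanged.
delta = np.zeros_like(theta)
delta[dead_cols[0]] = 0.5
print(np.allclose(behaviour(theta), behaviour(theta + delta)))  # True
```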

Intuitions about solving hard problems
carboniferous_umbraculum · 3y

I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate. 

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.

Call For Distillers
carboniferous_umbraculum · 3y

I agree, i.e. I also (fairly weakly) disagree with the value of thinking of 'distilling' as a separate thing. Part of me wants to conjecture that it comes from thinking of alignment work predominantly as mathematics or a hard science, in which the standard 'unit' is an original theorem or original result which might be poorly written up but can't really be argued against much. But if we think of the area (I'm thinking predominantly about more conceptual/theoretical alignment) as a 'softer', messier, ongoing discourse full of different arguments from different viewpoints and under different assumptions, with counter-arguments, rejoinders, clarifications, retractions etc. that takes place across blogs, papers, talks, theorems, experiments etc., all of which somehow slowly works to produce progress, then it starts to be less clear what this special activity called 'distilling' really is.

Another relevant point, but one which I won't bother trying to expand on much here, is that a research community assimilating - and then eventually building on - complex ideas can take a really long time. 

[At risk of extending into a rant, I also just think the term is a bit off-putting. Sure, I can get the sense of what it means from the word and the way it is used - it's not completely opaque or anything - but I'd not heard it used regularly in this way until I started looking at the alignment forum. What's really so special about alignment that we need to use this word? Do we think we have figured out some new secret activity that is useful for intellectual progress that other fields haven't figured out? Can we not get by using words like "writing" and "teaching" and "explaining"?]

Posts

[Linkpost] Remarks on the Convergence in Distribution of Random Neural Networks to Gaussian Processes in the Infinite Width Limit · 8 points · 2y · 0 comments
Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm · 54 points · 2y · 1 comment
A Neural Network undergoing Gradient-based Training as a Complex System · 11 points · 2y · 0 comments
Notes on the Mathematics of LLM Architectures · 6 points · 2y · 0 comments
On Developing a Mathematical Theory of Interpretability · 31 points · 2y · 0 comments
Some Notes on the mathematics of Toy Autoencoding Problems · 13 points · 3y · 0 comments
Behaviour Manifolds and the Hessian of the Total Loss - Notes and Criticism · 23 points · 3y · 5 comments
A brief note on Simplicity Bias · 11 points · 3y · 0 comments
Notes on Learning the Prior · 15 points · 3y · 2 comments
An observation about Hubinger et al.'s framework for learned optimization · 21 points · 3y · 5 comments