If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type m... (read more)
I'm not sure, but I think this example is pathological.
Yes, it's artificial and cherry-picked to make a certain rhetorical point as simply as possible.
This is the more relevant and interesting kind of symmetry, and it's easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies.¨
This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
The assumption that small neural networks are a good match for the actual data generating process of the world, is equivalent to the assumption that neural networks have an inductive bias that gives large weight to the actual data generating process of the world, if we also append the claim that neural networks have an inductive bias that gives large weight to functions which can be described by small neural networks (and this latter claim is not too difficult to justify, I think).
Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?
This is basically a justification of something like your point 1, but AFAICT it's closer to a proof in the SLT setting than in your setting.
I think it could probably be turned into a proof in either setting, at least if we are allowed to help ourselves to assumptions like "the ground truth function is generated by a small neural net" and "learning is done in a Bayesian way", etc.
That's interesting, thank you for this!
I think the broad strokes are mostly similar, but that a bunch of relevant details are different.
Yes, a large collective of near-human AI that is allowed to interact freely over a (subjectively) long period of time is presumably roughly as hard to understand and control as a Bostrom/Yudkowsky-esque God in a box. However, in this scenario, we have the option to not allow free interaction between multiple instances, while still being able to extract useful work from them. It is also probably much easier to align a system that is not of overwhelming intellige... (read more)
Yes, I agree with this. I mean, even if we assume that the AIs are basically equivalent to human simulations, they still get obvious advantages from the ability to be copy-pasted, the ability to be restored to a checkpoint, the ability to be run at higher clock speeds, and the ability to make credible pre-commitments, etc etc. I therefore certainly don't think there is any plausible scenario in which unchecked AI systems wouldn't end up with most of the power on earth. However, there is a meaningful difference between the scenario where their advantages ma... (read more)
To clarify, the proposal is not (necessarily) to use an LLM to create an interpretable AI system that is isomorphic to the LLM -- their internal structure could be completely different. The key points are that the generated program is interpretable and trustworthy, and that it can solve some problem we are interested in.
""The kinds of humans that we are worried about are the kinds of humans that can do original scientific research and autonomously form plans for taking over the world. Human brains learn to take actions and plans that previously led to high rewards (outcomes like eating food when hungry, having sex, etc)*. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latter necessarily would be able to do the former?"""This feels like a bit of a digression, but we do have concrete examples of systems... (read more)
The general rule I'm following is "if the argument would say false things about humans, then don't update on it".
Yes, this is of course very sensible. However, I don't see why these arguments would apply to humans, unless you make some additional assumption or connection that I am not making. Considering the rest of the conversation, I assume the difference is that you draw a stronger analogy between brains and deep learning systems than I do?
I want to ask a question that goes something like "how correlated is your credence that arguments 5-10 apply to hum... (read more)
But for all of them except argument 6, it seems like the same argument would imply that humans would not be generally intelligent.
Why is that?
Because text on the Internet sometimes involves people using logic, reasoning, hypothesis generation, analyzing experimental evidence, etc, and so plausibly the simplest program that successfully predicts that text would do so by replicating that logic, reasoning etc, which you could then chain together to make scientific progress.What does the argument say in response?
Because text on the Internet sometimes involves people using logic, reasoning, hypothesis generation, analyzing experimental evidence, etc, and so plausibly the simplest program that successfully predicts that text would do so by replicating that logic, reasoning etc, which you could then chain together to make scientific progress.
What does the argument say in response?
There are a few ways to respond.
First of all, wh... (read more)
What I'm suggesting is that volume in high-dimensions can concentrate on the boundary.
Yes. I imagine this is why overtraining doesn't make a huge difference.
Falsifiable Hypothesis: Compare SGD with overtaining to the random sampling algorithm. You will see that functions that are unlikely to be generated by random sampling will be more likely under SGD with overtraining. Moreover, functions that are more likely with random sampling will be become less likely under SGD with overtraining.
See e.g., page 47 in the main paper.
(Maybe you weren't disagreeing with Zach and were just saying the same thing a different way?)
I'm honestly not sure, I just wasn't really sure what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were "distractions from the main point".
This feels similar to:Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural ne
This feels similar to:
Saying that MLK was a "criminal" is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
(This is an exaggeration but I think it is directionally correct. Certainly when I read the title "neural ne
I agree with your summary. I'm mainly just clarifying what my view is of the strength and overall role of the Algorithmic Information Theory arguments, since you said you found them unconvincing.
I do however disagree that those arguments can be applied to "literally any machine learning algorithm", although they certainly do apply to a much larger class of ML algorithms than just neural networks. However, I also don't think this is necessarily a bad thing. The picture that the AIT arguments give makes it reasonably unsurprising that you would get the... (read more)
I have a few comments on this:
Fundamentally the point here is that generalization performance is explained much more by the neural network architecture, rather than the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point.
The main point, as I see it, is essentially tha... (read more)
This part of the argument is indeed quite general, but it’s not vacuous. For the argument to apply you need it to be the case that the function space is overparameterised, that the parameter-function map has low complexity relative to the functions in the function space, and that parameter-function map is biased in some direction. This will not be the case for all learning algorithms.
But, I do agree that the Levin bound argument doesn’t really capture the “mechanism” of what’s going on here. I can of course only speak for myself (and not the other people i... (read more)
Like Rohin, I'm not impressed with the information theoretic side of this work.Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don't think this captures the kinds of function complexity we care about for NNs.
Like Rohin, I'm not impressed with the information theoretic side of this work.
Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.
Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don't think this captures the kinds of function complexity we care about for NNs.
We’re not saying that discr... (read more)
Yes, it does of course apply in that sense.
I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what's going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again).
I'm not sure if accounting for the floating-point implementation is informative or not in this case.
There is a proof of it in "An Introduction to Kolmogorov Complexity and Its Applications" by Ming Li & Paul Vitanyi.
Ah, I certainly agree with this.
I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it's very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output is represented in binary form by a bit string). If you try to learn this function by training... (read more)
I'm not sure I agree with this -- I think Kolmogorov complexity is a relevant notion of complexity in this context.
Yes, I agree with this.
I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the Towards Data Science post, you can see that different versions of SGD give a slightly different inductive bias. It's also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and bat... (read more)
I believe Chris has now updated the Towards Data Science blog post to be more clear, based on the conversation you had, but I'll make some clarifications here as well, for the benefit of others:
The key claim, that Pβ(f∣S)≈Popt(f∣S), is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granuralise the space of possible functions if we want to compare these two quantities empirically. If we take "the space of functions" to be all the functions that a given neural network can express... (read more)
We do have empirical data which shows that the neural network "prior" is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case -- see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural netw... (read more)
We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.
It's true that we don't have a rigorous mathematical proof for "why" simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.
First of all, we can look at the claim "backwards"; suppose a function has a large volume in parameter-space. This means tha... (read more)
Yes, as Adam Shimi suggested below, I'm here talking about "input-output relations" when I say "functions".
The claim can be made formal in a few different ways, but one way is this: let's say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined in this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast vast majority will have very poor accura... (read more)
As you may know, CDT has a lot of fans in academia. It might be interesting to consider what they have to say about Newcomb's Problem (and other supposed counter-examples to CDT).In "The Foundations of Causal Decision Theory", James Joyce argues that Newcomb's Problem is "unfair" on the grounds that it treats EDT and CDT agents differently. An EDT agent is given two good choices ($1,000,000 and $1,001,000) whereas a CDT agent is given two bad choices ($0 and $1,000). If you wanted to represent Newcomb's Problem as a Marko... (read more)
I have already (sort of) addressed this point at the bottom of the post. There is a perspective from which any optimizer_1 can (kind of) be thought of as an optimizer_2, but its unclear how informative this is. It is certainly at least misleading in many cases. Whether or not the distinction is "leaky" in a given case is something that should be carefully examined, not something that should be glossed over.
I also agree with what ofer said.
"even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 b... (read more)
I agree with everything you have said.
I don't think that I'm assuming the existence of some sort of Cartesian boundary, and the distinction between these two senses of "optimizer" does not seem to disappear if you think of a computer as an embedded, causal structure. Could you state more precisely why you think that this is a Cartesian distinction?