A neural net using rectified linear unit (ReLU) activation functions, no matter its size, is unable to approximate the function sin(x) outside a compact interval.

I am reasonably confident that I can prove that any NN with ReLU activations computes a piecewise linear function. I believe the number of linear pieces that can be achieved is bounded above by 2^(L*D), where L is the number of nodes per layer and D is the number of layers.
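To make the piecewise-linearity concrete, here is a rough numerical sketch (random, arbitrarily chosen weights; counting distinct ReLU on/off patterns along a fine 1-D grid only lower-bounds the true number of pieces):

```python
import numpy as np

# A random fully connected ReLU net of width L and depth D, restricted to a
# scalar input, computes a piecewise linear function; each maximal interval
# on which the pattern of active ReLU units is constant is one linear piece.
rng = np.random.default_rng(0)
L, D = 10, 4  # width and depth, chosen arbitrarily

weights = [rng.standard_normal((1, L))] + [rng.standard_normal((L, L)) for _ in range(D - 1)]
biases = [rng.standard_normal(L) for _ in range(D)]

def activation_pattern(x):
    h = np.array([x])
    pattern = []
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)
        pattern.append(tuple(h > 0))
    return tuple(pattern)

xs = np.linspace(-5.0, 5.0, 20_000)
patterns = [activation_pattern(x) for x in xs]
pieces = 1 + sum(p != q for p, q in zip(patterns, patterns[1:]))
print(f"observed >= {pieces} linear pieces on [-5, 5]; trivial bound 2^(L*D) = {2 ** (L * D)}")
```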

This leads me to several questions:

  1. Is the inability to approximate periodic functions of a single variable important?
    1. If not, why not?
    2. If so, is there practical data augmentation that can be used to improve performance at reasonable compute cost?
      1. E.g., naively, augment the input vector {x_i} with {sin(x_i)} for each scalar feature x_i (see the sketch after this list).
  2. Since the number of parameters of a NN scales as D*L^2 while the exponent in the trivial bound on the number of linear pieces scales as L*D, is this why neural nets go deep rather than going "wide"?
    1. Are there established scaling hypotheses for the growth of depth vs. layer size?
  3. Are there better (probabilistic) analytic or empirical bounds on the number of linear sections achieved by NNs of given size?
  4. Are there activation functions that would avoid this constraint? I imagine a similar analytic constraint replacing "piecewise linear" with "piecewise strictly increasing" for classic activations like sigmoid or arctan.
  5. Something something Fourier transform something something?
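Regarding the naive augmentation mentioned under question 1, here is a minimal sketch of what I have in mind (the unit angular frequency is an arbitrary choice, not a tested recipe):

```python
import numpy as np

def augment_with_sines(X):
    """Append sin(x_i) columns to a feature matrix X.

    Naive sketch: assumes every column of X is a continuous scalar and that
    a unit angular frequency is meaningful for the data; cos(x_i) columns
    (or several frequencies) could be appended the same way.
    """
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.sin(X)])

# e.g. fit any regressor on augment_with_sines(X_train) instead of X_train
```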

Regarding (2a): empirically, while approximating sin(x) with small NNs in scikit-learn, I found that increasing the width of the network caused catastrophic failure of learning (starting at approximately L=10 with D=4, at L=30 with D=8, and at L=50 with D=50).
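A minimal sketch of that experiment (the training interval, sample size, and MLPRegressor settings here are illustrative stand-ins, not the exact configuration I used):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Fit sin(x) on a compact interval with fully connected ReLU nets of
# width L and depth D, and report how well each net fits the training data.
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(2000, 1))
y = np.sin(X).ravel()

for L, D in [(5, 4), (10, 4), (30, 8), (50, 50)]:
    model = MLPRegressor(hidden_layer_sizes=(L,) * D, activation="relu",
                         max_iter=5000, random_state=0)
    model.fit(X, y)
    print(f"L={L:3d} D={D:3d}  train R^2 = {model.score(X, y):.3f}")
```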

Regarding (1), naively this seems relevant to questions of out-of-distribution performance and especially the problem of what it means for an input to be out-of-distribution in large input spaces.

1 comment

Is the inability to approximate periodic functions of a single variable important?

Periodic functions are already used as an important encoding in SOTA ANNs, from transformer LLMs to NeRFs in graphics. From the instant-ngp paper:

For neural networks, input encodings have proven useful in the attention components of recurrent architectures [Gehring et al. 2017] and, subsequently, transformers [Vaswani et al. 2017], where they help the neural network to identify the location it is currently processing. Vaswani et al. [2017] encode scalar positions x ∈ R as a multiresolution sequence of L ∈ N sine and cosine functions enc(x) = (sin(2^0 x), sin(2^1 x), ..., sin(2^(L-1) x), cos(2^0 x), cos(2^1 x), ..., cos(2^(L-1) x)). This has been adopted in computer graphics to encode the spatiodirectionally varying light field and volume density in the NeRF algorithm [Mildenhall et al. 2020].
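A minimal sketch of that encoding (the default choice of L and the NumPy layout are mine, following the formula quoted above):

```python
import numpy as np

def positional_encoding(x, L=8):
    """Multiresolution sine/cosine encoding of scalar positions x, following
    the formula quoted above: sin(2^0 x), ..., sin(2^(L-1) x), followed by
    the corresponding cosines.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    freqs = 2.0 ** np.arange(L)            # 2^0, 2^1, ..., 2^(L-1)
    angles = x[:, None] * freqs[None, :]   # shape (len(x), L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# e.g. positional_encoding(np.linspace(0.0, 1.0, 5), L=4) has shape (5, 8)
```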