arxiv:2406.11463

Just How Flexible are Neural Networks in Practice?

Published on Jun 17 · Submitted by ravid on Jun 18


Authors: Ravid Shwartz-Ziv, Yann LeCun

Abstract

It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.
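One way to make "how many samples can a model fit in practice" concrete is to grow the training set until the training procedure no longer reaches 100% training accuracy. The sketch below is a toy illustration of that idea under stated assumptions, not the paper's procedure: the helper names (`train_perceptron`, `largest_fittable_size`, `random_label_task`), the choice of a perceptron, the Gaussian inputs, and the epoch budget are all hypothetical and chosen only so the example runs with the standard library.

```python
import random


def train_perceptron(xs, ys, epochs=200):
    """Train a perceptron; return True if it ends up classifying every sample correctly."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: the set is fit
            return True
    return False  # budget exhausted without fitting every sample


def largest_fittable_size(can_fit, max_n):
    """Grow n from 1 and return the last n for which can_fit(n) succeeded."""
    best = 0
    for n in range(1, max_n + 1):
        if not can_fit(n):
            break
        best = n
    return best


def random_label_task(dim, seed=0):
    """Build can_fit(n): draw n Gaussian inputs with random +/-1 labels and try to fit them."""
    rng = random.Random(seed)

    def can_fit(n):
        xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
        ys = [rng.choice([-1, 1]) for _ in range(n)]
        return train_perceptron(xs, ys)

    return can_fit


# Assumed toy setting: 5 inputs plus a bias, so 6 trainable parameters.
capacity = largest_fittable_size(random_label_task(dim=5), max_n=50)
n_params = 5 + 1
print(f"fit up to {capacity} randomly labeled samples with {n_params} parameters")
```

The measured number depends on the optimizer, the training budget, and the data draw, which is the abstract's point: what a model fits in practice is a property of the whole training procedure and not of the parameter count alone.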


Community

ravid (paper author, paper submitter) · Jun 18

The paper investigates how flexible neural networks are in practice. It finds that convolutional networks are more parameter-efficient than MLPs and Vision Transformers, even on randomly labeled data. It shows that SGD, although often thought to act as a regularizer, finds minima that fit significantly more training samples than full-batch gradient descent. It also shows that ReLU activation functions substantially increase the network's capacity to fit data, even though they were designed to address vanishing and exploding gradients.

