The purpose of machine learning is to design and study learning models with a multilayers representation, usually a connectionist system of variable dimension, the architecture of which consists of layers of nodes connected in an organized and structured way.
The idea of representing complex data through an organized and hierarchical structure has been advocated by Herbert Simon (1962), a Turing Award and Nobel laureate, and a pioneer of artificial intelligence.
The design of these learning systems is influenced by inductive biases, thoroughly
presented in Goyal and Bengio (2022).
These connectionnist systems are universal approximators, i.e., they can approximate any function $f(X)$:
$$
X \in \mathbb{R}^m\mapsto Y=f(X) \in \mathbb{R}^n, \: m, n \geqslant 1,
$$
and its derivatives.
The classical statistical theory of estimation is organized around the bias-variance trade-off. The overfitting of the function $f$, a small bias and a large variance, is avoided by using regularization methods, i.e., adding to the loss function $\mathcal{L}(Y,X;\theta) $ a weighted penalty term $\lambda R(f)$, so that the estimation problem becomes:
$$
\min_{f \in \mathcal{H}} \mathcal{L}(Y,X;\theta) + \lambda R(f),
$$
where $\mathcal{D}= \{X,Y\}$ is the dataset, $\mathcal{H}$ is the space of candidate functions, $\theta \in \Theta$ the set of hyperparameters, $R(f)$ is a penalty functional with weight $\lambda >0$. There is a wide literature on the selection of $\lambda$, using either the Akaike or the Schwartz criteria, or adaptive methods.
With the recent advent of deep learning and over-paramaterized systems, with up to a trillion of hyperparameters for the recent large language models (LLM), it empirically appears that regularizing the loss function $\mathcal{L}(Y,X;\theta) $ is no longer an issue. This is the double-descent phenomenon, studied by Belkin et al. (2019,2020) and Belkin (2021), and illustrated by the following generalization curve, which plots the risk versus the capacity of the space of candidate functions $\mathcal{H}$, such as its Vapnik-Chervonenkis dimension.
The trade-off between bias and variance in the classical case is graphically represented by the $U$-shaped curve in the left part of the graph, before the interpolation threshold. However, once the interpolation threshold is bypassed, the risk function decreases monotonically with the capacity of the over-parameterized system. The solution minimizing the loss function $\mathcal{L}(Y,X;\theta) $ maximizes the smoothness of the function subject to interpolating the data. The regularization is no longer necessary, one of the inductive biases is then removed.
Note that this empirical double-descent property has been formally demonstrated for the particular case of the
random Fourier features by Rahimi and Recht (2007): in the over-parameterized case, increasing the size number of parameters, i.e., the size of the neural network, entails that it converges to a kernel machine which minimizes the norm of the solution.
The landscape of the over-parameterized systems is neither convex, nor locally convex in a small neighborhood of a local minima.
However, topological properties, independently discovered by Łojasiewicz (1962) and Poljak (1963), and further studied by Liu, Zhu and Belkin (2022) entails both the existence of solutions and fast exponential convergence of the gradient descent (GD) algorithms, including the stochastic gradient descent (SGD).
The PL$^\star$ condition is defined as follows: a non-negative function $\mathcal{L}$ satisfies $\mu-\mbox{PL}^\star$ on a set $\mathcal{S} \subset \mathbb{R}^m$ for $\mu >0$ if
$$
\big \|\bigtriangledown\mathcal{L}(\theta) \big \|^2 \geqslant \mu \mathcal{L}(\theta), \quad \forall \theta \in \mathcal{S}.
$$
If $\mathcal{L}$ satisﬁes the $\mu-\mbox{PL}^\star$ condition in a ball of radius $O(1/\mu)$, then $\mathcal{L}$ has a global minimum in that ball. Furthermore, the (S)GD algorithm initialized at the center of such a ball converges exponentially to a global minimum of $\mathcal{L}$.
The condition PL$^\star$ has an analytic form via the spectrum of the tangent kernel, see Jacot et al. (2021). Over-parameterized systems, and then large neural networks satisfy PL$^\star$, while under-parameterized systems, i.e., the classical regime, do not verify PL$^\star$.
Belkin (2021) provides an exhaustive survey of the properties of the over-parameterized systems.