AI & Math

Introduction

The purpose of machine learning is to design and study learning models with a multilayer representation, usually a connectionist system of variable dimension whose architecture consists of layers of nodes connected in an organized and structured way. The idea of representing complex data through an organized and hierarchical structure was advocated by Herbert Simon (1962), a Turing Award and Nobel laureate and a pioneer of artificial intelligence. The design of these learning systems is influenced by inductive biases, thoroughly presented in Goyal and Bengio (2022).

These connectionist systems are universal approximators, i.e., they can approximate any function $f(X)$: $$ X \in \mathbb{R}^m\mapsto Y=f(X) \in \mathbb{R}^n, \: m, n \geqslant 1, $$ and its derivatives. The classical statistical theory of estimation is organized around the bias-variance trade-off. Overfitting of the function $f$, i.e., a small bias and a large variance, is avoided by regularization methods, i.e., by adding to the loss function $\mathcal{L}(Y,X;\theta)$ a weighted penalty term $\lambda R(f)$, so that the estimation problem becomes: $$ \min_{f \in \mathcal{H}} \mathcal{L}(Y,X;\theta) + \lambda R(f), $$ where $\mathcal{D}= \{X,Y\}$ is the dataset, $\mathcal{H}$ is the space of candidate functions, $\theta \in \Theta$ is the vector of parameters, and $R(f)$ is a penalty functional with weight $\lambda >0$. There is a wide literature on the selection of $\lambda$, using either the Akaike or the Schwarz information criteria, or adaptive methods.
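As a concrete illustration of this classical regularized regime, here is a minimal sketch in Python, assuming a linear model with a quadratic (ridge) penalty $R(f)=\|w\|^2$; the synthetic dataset, the grid of $\lambda$ values, and the hold-out validation rule are illustrative choices standing in for the information criteria or adaptive methods mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy cubic, split into training and validation sets.
n, p = 60, 8
x = rng.uniform(-1, 1, size=n)
y = x**3 - 0.5 * x + 0.1 * rng.normal(size=n)
X = np.vander(x, p, increasing=True)          # polynomial features
X_tr, y_tr, X_va, y_va = X[:40], y[:40], X[40:], y[40:]

def ridge_fit(X, y, lam):
    """Minimize ||y - X w||^2 + lam * ||w||^2 (closed-form ridge estimator)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Sweep the penalty weight lambda and keep the value with the best validation error.
lambdas = 10.0 ** np.arange(-6, 2)
errors = []
for lam in lambdas:
    w = ridge_fit(X_tr, y_tr, lam)
    errors.append(np.mean((X_va @ w - y_va) ** 2))
best = lambdas[int(np.argmin(errors))]
print(f"selected lambda = {best:g}, validation MSE = {min(errors):.4f}")
```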



The double descent: Interpolation properties of deep networks

With the recent advent of deep learning and over-parameterized systems, with up to a trillion parameters for the recent large language models (LLMs), it appears empirically that regularizing the loss function $\mathcal{L}(Y,X;\theta)$ is no longer an issue. This is the double-descent phenomenon, studied by Belkin et al. (2019, 2020) and Belkin (2021), and illustrated by the following generalization curve, which plots the risk against the capacity of the space of candidate functions $\mathcal{H}$, measured for instance by its Vapnik-Chervonenkis dimension.

Figure: the double-descent generalization curve, plotting the risk against the capacity of $\mathcal{H}$.

The trade-off between bias and variance in the classical case is graphically represented by the $U$-shaped curve in the left part of the graph, before the interpolation threshold. Once the interpolation threshold is passed, however, the risk decreases monotonically with the capacity of the over-parameterized system. The solution minimizing the loss function $\mathcal{L}(Y,X;\theta)$ maximizes the smoothness of the function subject to interpolating the data. Explicit regularization is no longer necessary; one of the inductive biases is thus removed.
Note that this empirical double-descent property has been formally demonstrated by Belkin et al. (2019) for the particular case of random Fourier features (Rahimi and Recht, 2007): in the over-parameterized case, increasing the number of parameters, i.e., the size of the neural network, entails that it converges to a kernel machine which minimizes the norm of the solution.
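A minimal sketch of this setting, under illustrative assumptions (a one-dimensional noisy target, cosine random Fourier features, and the minimum-norm least-squares solution returned by numpy.linalg.lstsq): sweeping the number of random features $N$ across the interpolation threshold $N = n$ typically reproduces the $U$-shaped curve, the peak near the threshold, and the second descent beyond it; the exact numbers depend on the data and the noise level.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a smooth target function on [-1, 1].
n_train, n_test = 40, 500
def target(x):
    return np.sin(2 * np.pi * x)
x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + 0.2 * rng.normal(size=n_train)
x_te = np.linspace(-1, 1, n_test)
y_te = target(x_te)

def rff(x, W, b):
    """Random Fourier features phi_k(x) = cos(w_k * x + b_k)."""
    return np.cos(np.outer(x, W) + b)

# Sweep the number of random features N across the interpolation threshold N = n_train.
for N in [5, 10, 20, 40, 80, 160, 640, 2560]:
    W = rng.normal(scale=5.0, size=N)          # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=N)      # random phases
    Phi_tr, Phi_te = rff(x_tr, W, b), rff(x_te, W, b)
    # lstsq returns the minimum-norm solution when the system is under-determined (N > n_train).
    theta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    rmse = np.sqrt(np.mean((Phi_te @ theta - y_te) ** 2))
    print(f"N = {N:5d}  test RMSE = {rmse:.3f}")
```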


Topological properties of the landscape of over-parameterized systems

The loss landscape of over-parameterized systems is neither convex nor locally convex in a small neighborhood of a local minimum. However, a topological property, independently discovered by Łojasiewicz (1962) and Poljak (1963), and further studied by Liu, Zhu and Belkin (2022), entails both the existence of solutions and the exponential convergence of gradient descent (GD) algorithms, including stochastic gradient descent (SGD).
The PL$^\star$ condition is defined as follows: a non-negative function $\mathcal{L}$ satisfies $\mu$-PL$^\star$ on a set $\mathcal{S} \subset \mathbb{R}^m$ for $\mu >0$ if $$ \big\| \nabla \mathcal{L}(\theta) \big\|^2 \geqslant \mu \, \mathcal{L}(\theta), \quad \forall \theta \in \mathcal{S}. $$ If $\mathcal{L}$ satisfies the $\mu$-PL$^\star$ condition in a ball of radius $O(1/\mu)$, then $\mathcal{L}$ has a global minimum in that ball. Furthermore, the (S)GD algorithm initialized at the center of such a ball converges exponentially to a global minimum of $\mathcal{L}$.
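A minimal numerical sketch of this convergence behaviour, on a deliberately simple non-convex toy loss (a hypothetical two-parameter example chosen for illustration, not taken from the references): $\mathcal{L}(\theta_1,\theta_2) = \tfrac{1}{2}(\theta_1 \theta_2 - 1)^2$ satisfies $\|\nabla\mathcal{L}\|^2 = 2(\theta_1^2 + \theta_2^2)\,\mathcal{L}$, hence the $\mu$-PL$^\star$ condition with $\mu = 2(\theta_1^2+\theta_2^2)$ bounded away from zero on a ball excluding the origin, and GD started in such a ball exhibits the expected exponential decay of the loss.

```python
import numpy as np

def loss(theta):
    """Non-convex toy loss L(t1, t2) = 0.5 * (t1*t2 - 1)^2."""
    t1, t2 = theta
    return 0.5 * (t1 * t2 - 1.0) ** 2

def grad(theta):
    t1, t2 = theta
    r = t1 * t2 - 1.0
    return np.array([r * t2, r * t1])

theta = np.array([1.5, 1.5])   # initial point; on this region mu = 2*||theta||^2 >= 4
lr = 0.05
for step in range(51):
    g, L = grad(theta), loss(theta)
    # Verify the PL* inequality ||grad L||^2 >= mu * L along the GD trajectory.
    mu = 2.0 * float(theta @ theta)
    assert g @ g >= mu * L - 1e-12
    if step % 10 == 0:
        print(f"step {step:2d}  loss = {L:.3e}")
    theta = theta - lr * g
```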
The PL$^\star$ condition has an analytic characterization via the spectrum of the tangent kernel, see Jacot et al. (2021). Over-parameterized systems, and hence large neural networks, satisfy PL$^\star$, while under-parameterized systems, i.e., the classical regime, do not.
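A minimal sketch of this spectral characterization, assuming the square loss and a model that is linear in its parameters (a random-features model, so that the tangent kernel $K = J J^{\top}$ is constant; for a genuine neural network, $J$ would be the Jacobian of the outputs with respect to the weights at the current $\theta$): with residual vector $r$, one gets $\|\nabla\mathcal{L}\|^2 = r^{\top} K r \geqslant 2\,\lambda_{\min}(K)\,\mathcal{L}$, so $\mu$ can be taken as $2\lambda_{\min}(K)$, positive in the over-parameterized case ($p > n$) and zero in the under-parameterized case ($p < n$, where $K$ is rank-deficient).

```python
import numpy as np

rng = np.random.default_rng(2)

def tangent_kernel_mu(n, p):
    """Square loss L = 0.5 * ||Phi @ theta - y||^2 for a random-features model.
    The tangent kernel is K = Phi @ Phi.T (constant, since the model is linear in theta);
    the PL* constant can be taken as mu = 2 * lambda_min(K)."""
    Phi = rng.normal(size=(n, p)) / np.sqrt(p)   # n samples, p random features
    y = rng.normal(size=n)
    K = Phi @ Phi.T
    lam_min = np.linalg.eigvalsh(K)[0]           # eigenvalues in ascending order
    mu = 2.0 * lam_min

    # Check the PL* inequality ||grad L||^2 >= mu * L at a random theta.
    theta = rng.normal(size=p)
    r = Phi @ theta - y
    L = 0.5 * r @ r
    g = Phi.T @ r
    print(f"n={n:3d}  p={p:4d}  mu = 2*lambda_min(K) = {mu:9.2e}  "
          f"PL* inequality holds: {bool(g @ g >= mu * L - 1e-9)}")

tangent_kernel_mu(n=50, p=500)   # over-parameterized: lambda_min(K) > 0, mu > 0
tangent_kernel_mu(n=50, p=10)    # under-parameterized: K is singular, mu ~ 0, PL* degenerates
```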
Belkin (2021) provides an exhaustive survey of the properties of the over-parameterized systems.

References

  1. Belkin, M., 2021. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numer. 30, 203-248. MR4298218
  2. Belkin, M., Hsu, D., Ma, S. and Mandal, S., 2019. Reconciling modern machine-learning practice and the classical bias-variance trade-off, Proc. Natl. Acad. Sci. USA 116, no. 32, 15849-15854. MR3997901
  3. Belkin, M., Hsu, D. and Xu, J., 2020. Two models of double descent for weak features, SIAM J. Math. Data Sci. 2, no. 4, 1167-1180. MR4186534
  4. Goyal, A. and Bengio, Y., 2022. Inductive biases for deep learning of higher-level cognition. Proc. R. Soc. A.478 no.2266, Paper No. 20210068, 35 pp. MR4505795
  5. Hinton, G., 2023. How to represent part-whole hierarchies in a neural network. Neural Computation, 35(3), pp.413-452. MR4555492
  6. Jacot, A., Gabriel, F. and Hongler, C., 2021. Neural tangent kernel: convergence and generalization in neural networks (invited paper). STOC '21-Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 6. Association for Computing Machinery (ACM), New York, MR43998812
  7. Liu, C., Zhu, L. and Belkin, M., 2022. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks, Appl. Comput. Harmon. Anal. 59, 85-116. MR4412181
  8. Łojasiewicz, S., 1962. Une propriété topologique des sous-ensembles analytiques réels, in Les Équations aux Dérivées Partielles, 87-89. MR0160856
  9. Poljak, B. T., 1963. Gradient methods for minimizing functionals, Ž. Vyčisl. Mat i Mat. Fiz. 3, 643-653. MR0158568
  10. Rahimi, A. and Recht, B., 2007. Random features for large-scale kernel machines. Advances in neural information processing systems (NIPS 2007) 20 (J. C. Platt et al., eds), Curran Associates, pp. 1177-1184.
  11. Simon, H. A., 1962. The architecture of complexity, Proceedings of the American Philosophical Society, 106(6), 467-482. JSTOR
  12. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp.1929-1958. MR3231592
  13. Teyssière, G., 2022. Review of "Belkin, M. (2021). Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation". Mathematical Reviews, American Mathematical Society.