Introduction

The current technology behind machine learning (ML) and artificial intelligence (AI) relies on the use of advanced algorithms and computational power for handling very large datasets to train models, based on very large neural networks, with up to trillions of hyperparameters, to enable intelligent behavior.

The empirical astonishing performance of these massive neural networks is not fortuitous, but has strong mathematical foundations: these large learning systems verify two very useful and powerful properties, demonstrated by Belkin and its co-authors and exposed in the AI & Math Blog.

The set of learning systems used in Cat.IA-AI.Innovations draws upon three classes of learning models:

The supervised, semi-supervised, unsupervised, and self-supervised learning models. These models leverage various levels of labeled data to learn patterns and make predictions.
The reinforcement learning (RL) models. RL models excel in decision-making tasks, learning through trial and error and optimizing their actions based on rewards. They are backed by a robust mathematical framework, providing strong guarantees on their performance.
The generative large language models (LLMs). LLMs, known for their impressive text generation capabilities, are further enhanced by incorporating reasoning abilities. This empowers them to not only generate text but also to reason logically and solve complex problems.

Supervised, unsupervised and self-supervised learning

Supervised Learning

Supervised learning is the simplest learning method. We have a dataset $ \mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} $, which consists of a collection of $N$ pairs of inputs, $x_i$ or features, and outputs $ y_i$, written in matrix form as $ \mathcal{D} =(X,Y)$. The output $Y$ might be discrete, continuous, a set of labels in the case of the classification problem, etc.
The goal of supervised learning is to approximate the function $f$ that maps $X$ to $Y$, i.e., $Y=f(X)$, by learning this function from the dataset $\mathcal{D}$.

The estimation of the function $f$ is obtained by minimizing a loss function: $$ \min_{f \in \mathcal{H}} \mathcal{L}(Y,X;\theta). $$ where $\mathcal{H}$ is the space of candidate functions, $\theta \in \Theta$ the set of hyperparameters to be estimated. The function $f$ is approximated by using neural networks, which are universal approximators. The hyperparameters are adjusted on a training set, the model is then tested and validated on different datasets. Further details on the minimization of the loss function, and the recent advances in this field for very large systems are given in the AI & Math Blog.

Unsupervised & Semi-supervised Learning

However, labels are either costly to obtain, or at least partially available. Furthermore, errors in the labels affect the result of supervised learning and require either to use specific corrective tools (Northcutt et al., 2021), or to design alternative methods for estimating the model without depending on labels.
Unsupervised learning takes into account the internal structure/organisation of the data, and does not need the set of labels. This approach is similar to the non-parametric approach in statistics which 'let the data speak for themselves'.
Semi-supervised learning is the intermediate case between supervised and unsupervised learning: the learning procedure uses first the labelled data while leveraging a larger amount of unlabeled data, which strongly enhance the accuracy of the model.

Self-supervised Learning

In self-supervised learning, the model generates its own supervisory signals from the underlying structure/organisation of the data. The model learns meaningful representation or features of the data using e.g., pretext tasks, contrastive learning, and redundancy reduction (Zbontar et al., 2021).

Reinforcement learning

Reinforcement learning (RL) is a fundamental and specific learning method. DeepSeek-R1's efficiency, a hybrid model combining RL and LLMs, will likely accelerate the adoption of RL methods.
In the standard supervised, unsupervised, self-supervised frameworks, the learner receives the data and learns a model from them.
In the RL framework, the learner is more active as he/she is an agent interacting in an unsupervised manner with its environment through a sequence of actions. In response to the action just taken, the agent receives two types of information through a standard feedback mechanism:

Its current state in the environment,
A real-valued immediate reward for its actions. In more advanced frameworks, future or long-term reward are provided. The trade-off between immmediate and future reward is controled by the discount factor $\gamma \in (0,1)$.

The agent takes into account this feedback for its future actions. The objective of the agent is to maximize its reward and thus to select the best course of actions, the policy, for maximizing the cumulative reward over time. If the environment is known, this is a planning problem, whilst if the environment is unknow, this is a learning problem: the agent learns a sequence of decisions, i.e., a policy.
In the latter case, the agent balances the trade-off between the exploration, i.e., obtaining more information on the environmemt and rewards, and the exploitation of the information already collected for optimizing the rewards.

Markov Decision Process

The standard framework for formalizing the reinforcement learning is the Markov decision process (MDP), defined by $M=(\mathcal{S,A,P},r,\gamma )$ where $\mathcal{S,A}$ are respectively the state space and the action space, which might be infinite, $r : \mathcal{S\times A} \rightarrow [0,1]$ is the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing the action $a \in \mathcal{A}$ while in the state $s \in \mathcal{S}.$
The probability transition kernel is defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(S)$, where $\mathcal{P}(s^{'}|s,a)$ is the probability of transiting from state $s$ to state $s^{'}$ when $a$ is selected; $\Delta(S)$ is the probability simplex over $\mathcal{S}$. In the generative case, the probability transition kernel $\mathcal{P}$ is empirically estimated from the data.
The model is Markovian since the transition and reward probabilities depend only on the current state $s$ and not the entire history of states and actions. In the standard tabular case, the cardinalities $|\mathcal{S}|$ and $|\mathcal{A}|$, i.e., the number of states and actions respectively, are finite.

Optimal Policy

A deterministic policy $\pi$ is a mapping $\pi: \mathcal{S}\rightarrow \mathcal{A}$. The objective of the agent is to find the optimal policy $\pi^\star$ maximizing the value function $V^\pi(s)$, which is the expected discounted total reward starting from the initial state $s^0=s$ , i.e., \begin{eqnarray*} \pi^\star &:= &\arg \max_\pi V^\pi(s),\\ V^\pi(s)& := &E \Big[ \sum_{t=0}^\infty \gamma^t r(s^t, a^t)\big | s^0 = s \Big ]. \end{eqnarray*} The corresponding action value function $Q^\pi: \mathcal{S\times A} \rightarrow \mathbb{R}$ of a policy $\pi$ is defined as $$ Q^\pi(s,a) := E \Big [ \sum_{t=0}^\infty \gamma^t r(s^t, a^t) \big | (s^0, a^0)=(s, a) \Big ], $$ $\forall (s,a) \in \mathcal{S \times A}$. There exists an optimal policy $\pi^\star$ that simultaneoulsy maximizes $V^\pi(s) \; \forall s \in \mathcal{S}$, and $Q^\pi(s,a) \; \forall (s,a) \in \mathcal {S \times A}$.

Policy gradient (PG) methods offer an alternative approach to estimating optimal policies compared to methods based on action-value functions $Q^\pi(s, a)$. PG methods assume that the policies $\pi_\theta$ are parameterized by a vector $\theta \in \Theta$, and search and find the optimal policy over a class of parameterized policies by maximizing a differentiable performance measure $J(\theta)$ with gradient ascent methods. PG methods often exhibit better convergence properties compared to argmax-based algorithms. However, the convergence behavior within the class of PG methods varies significantly. For instance, while the convergence time of softmax-based PG methods can be exponential in some scenarios, natural policy gradient (NPG), the iteration complexity of which is dimension-free, appears to be more favorable in terms of convergence speed. The determination of the optimal policy $\pi^\star_\theta$ by these PG methods and the recent advances in this field are presented in the AI & Math Blog.

Flexibility and Reliability of the MDP Framework

This MDP framework is flexible enough to deal with several configurations, e.g., off-policy RL where data logged from a different policy are used in the exploration step, but rigorous enough as it allows to assess the reliability of the results, e.g., we obtain confidence intervals for the rewards. Furthermore, this setup allows to provides guarantees on the reliability of the algorithms used for estimating the optimal value function $V^\star := V^{\pi^\star}$, or optimal action value function $Q^\star := Q^{\pi^\star}$.
The MDP framework is particularly valuable for analyzing the sample complexity of reinforcement learning algorithms. This means we can study how efficiently these algorithms learn from potentially limited data. New results on the sample complexity and the minimum sample size for achieving the minimax sample complexity are given in the AI & Math Blog.
This MDP framework has been extended to the non-tabular case, with a very large number of states, and/or continuous states, see e.g., Long and Han (2022, 2023).

Generative AI

Generative AI is a part of automation technologies: it generates text, programming code and images using as database quite the exhaustive content of the web, using attention mechanisms, transformer architectures and autoregressive probabilistic models.

Reasoning Properties of LLMs

These combinatorial generative models, based on the linguistic representation of the physical world, are working inside that representation, and then are directly unable to plan and produce any content outside that box. From a mathematical perspective, LLMs often lack complete deductive closure, as they may not always derive every possible conclusion from the information set they have been trained on. The output of LLMs is akin to Kahneman's (2011) Type 1 system.
As a consequence, the output of these generative models must be checked, e.g., the generated code must be verified by an interpreter/compiler, the multi-step reasonning of a maths solver must be verified (Cobbe et al., 2021).

Recent research suggests that enhancing LLMs with reasoning and planning capabilities akin to Kahneman's (2011) Type 2 system can significantly improve their reasoning abilities. In particular, Chervonyi et al. (2025) developed a new fast algorithm called the deductive database arithmetic reasoning (DDAR 2). The DDAR employs a fixed set of deduction rules to iteratively add new elements to the initial information set until its deduction closure is reached. The DDAR 2 algorithm successively solves gold medal-level olympiad problems in geometry, and improves the results obtained with the previous version (DDAR 1) presented by Trinh et al. (2024).

Chain-of-Thought Prompting: A Reasoning Revolution

Alternatively, the reasoning performance of LLMs cand be extended and improved with the Chain-of-Thought (CoT) prompting (Wei et al., 2022). CoT is an extension of the few-shots prompting properties of LLMs (Brown et al., 2020): for the few-shot prompting, the LLM is provided with a small number of in-context (input, output) pairs of examples, or 'shots'. In contrast, for the CoT prompting, the LLM is provided with triples (input, chain-of-thought, output), where the chain-of-thought consists of a series of intermediate reasoning steps that lead to the solution of the studied problem.

Decomposing Complexity: The Power of Intermediate Steps

The CoT consists of breaking down complex tasks into smaller, intermediate and more manageable steps. These intermediate steps, which make the decision-making process more transparent, can be particularly useful in tasks that require multiple steps of reasoning, where understanding the model's reasoning is crucial.
Furthermore, this decomposition in elementary steps can reduce the likelihood of generating irrelevant or incorrect information, known as hallucinations. CoT reduces the cognitive load of processing complex information all at once by distributing the problem-solving process over multiple steps, allowing the model to handle each step more effectively and carrying intermediate verification of each step in the reasoning process. This means that errors can be identified and corrected earlier, preventing them from propagating and affecting the final conclusion. RL can be combined with CoT for reducing the hallucinations.

Prompt Engineering: Guiding the Model's Thought Process

Prompting the model consists of generating a series of thoughts or steps that lead to the final answer, rather than directly producing the solution. This approach helps the model to handle more complex, multi-step problems by mimicking human-like logical step-by-step reasoning leading to the final answer. This requires a careful prompt engineering, i.e., carefully designing and refining specific instructions or questions that guide the model's responses in a way that aligns with the user's goals or requirements. Poorly designed prompts can lead to suboptimal performance. Furthermore, CoT prompting does not require extensive retraining.

Expanding the Path: Tree of Thoughts (ToT) Prompting

The single pathway CoT prompting has been further generalized by Tree of Thoughts (ToT) prompting (Yao et al., 2024), which is characterized by multiple competing reasoning paths modeled as a tree. The formalization of a human problem solving as a search problem along a tree has been advocated by the Turing Award laureates Newell and Simon (1972).

Amazon Bedrock is the standard framework used for our Generative AI services.

Digital Business Card

References

Bahdanau, D., Cho, K., and Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Belkin, M., 2021. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica 3, 203-248. MR4298218
Bishop, C. M., 2006. Pattern recognition and machine learning. Inf. Sci. Stat. Springer, New York. MR2247587
Bishop, C.M. and Bishop, H., 2023. Deep learning: Foundations and concepts. Springer Nature. MR4719738
Brown, T. B. et al., 2020. Language models are few-shot learners. arXiv preprint ArXiv:2005.14165.
Cheng, K., Yang, J., Jiang, H., Wang, Z., Huang, B., Li, R., Li, S., Li, Z., Gao, Y., Li, X. and Yin, B., 2024. Inductive or deductive? Rethinking the fundamental reasoning abilities of LLMs. arXiv preprint arXiv:2408.00114.
Chervonyi, Y., Trinh, T.H., Olšák, M., Yang, X., Nguyen, H., Menegali, M., Jung, J., Verma, V., Le, Q.V. and Luong, T., 2025. Gold-medalist performance in solving olympiad geometry with AlphaGeometry2. arXiv preprint arXiv:2502.03544.
Chou, S.C., Gao, X.S. and Zhang, J.Z., 2000. A deductive database approach to automated geometry theorem proving and discovering. Journal of Automated Reasoning, 25(3), pp.219-246. MR1774686
Chou, S.C., Gao, X.S. and Zhang, J.Z., 1996. Automated generation of readable proofs with geometric invariants: I. Multiple and shortest proof generation. Journal of Automated Reasoning, 17(3), pp.325-347. MR1428346
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R. and Hesse, C., 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
DeepSeek-AI, 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948,
Geiping, J., Garrido, Q., Fernandez, P., Bar, A., Pirsiavash, H., LeCun, Y. and Goldblum, M., 2023. A Cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep learning. Adapt. Comput. Mach. Learn. MIT Press, Cambridge, MA. MR3617773
Goyal, A. and Bengio, Y., 2022. Inductive biases for deep learning of higher-level cognition. Proc. R. Soc. A.478 no.2266, Paper No. 20210068, 35 pp. MR4505795
Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning. Data mining, inference, and prediction. Second edition Springer Ser. Statist. Springer, New York. MR2722294
Hinton, G.E. and Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786), pp.504-507 MR2242509
Hinton, G., 2023. How to represent part-whole hierarchies in a neural network. Neural Computation, 35(3), pp.413-452. MR4555492
Kahneman, D., 2011. Thinking fast and slow. Macmillan.
Kambhampati, S., 2024. Can large language models reason and plan? Annals of the New York Academy of Sciences, 1534(1), pp.15-18.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. and Iwasawa, Y., 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 22199-22213.
Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming 198, no.1, Ser. A, 1059-1106. MR4550970
Li, G., Wei, Y., Chi, Y. and Chen, Y., 2024. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Operation Research, 72, no.1, 203–221. MR4705834
Long, J. and Han, J., 2022. Perturbational complexity by distribution mismatch: a systematic analysis of reinforcement learning in reproducing kernel Hilbert space.Journal of Machine Learning 1, no. 1, 1–37. MR4716372
Long, J. and Han, J. 2023. Reinforcement learning with function approximation: from linear to nonlinear. Journal of Machine Learning 2, no. 3, 161-193. MR4734710
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), pp.529-533.
Mohri, M., Rostamizadeh, A. and Talwalkar, A., 2018. Foundations of machine learning, second edition. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA. MR3931734
Newell, A., Shaw, J.C. and Simon, H.A., 1959, June. Report on a general problem solving program. In IFIP congress, Vol. 256, p. 64.
Newell, A. and Simon, H., 1972. Human problem solving, ACS symposium series, Prentice Hall.
Northcutt, C., Jiang, L. and Chuang, I., 2021. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research 70, pp.1373-1411. MR4306614
Rae, J.W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S. and Rutherford, E., 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.
Russell, S.J. and Norvig, P., 2016. Artificial intelligence: a modern approach. Pearson.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. and Chen, Y., 2017. Mastering the game of go without human knowledge. Nature, 550(7676), pp.354-359.
Szepesvàri, C., 2022. Algorithms for reinforcement learning, reprint of the 2010 original, Synthesis Lectures on Artificial Intelligence and Machine Learning, 9, Springer, Cham, MR4647095
Teyssière, G., 2022. Review of "Belkin, M. (2021). Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation". Mathematical Reviews, American Mathematical Society.
Teyssière, G., 2023. Review of "Dixon, M.F., Halperin, I. and Bilokon, P. 2020. Machine learning in finance—from theory to practice." Mathematical Reviews, American Mathematical Society.
Teyssière, G., 2023. Review of "Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes". Mathematical Reviews, American Mathematical Society.
Teyssière, G., 2025. Review of "Long, J. and Han, J., 2022. Perturbational complexity by distribution mismatch: a systematic analysis of reinforcement learning in reproducing kernel Hilbert space". Mathematical Reviews, American Mathematical Society.
Teyssière, G., 2025. Review of "Li, G., Wei, Y., Chi, Y. and Chen, Y., 2024. Breaking the sample size barrier in model-based reinforcement learning with a generative model". Mathematical Reviews, American Mathematical Society.
Trinh, T.H., Wu, Y., Le, Q.V., He, H. and Luong, T., 2024. Solving olympiad geometry without human demonstrations. Nature, 625(7995), pp.476-482.
Vaswani, A. et al., 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
Wang, J., Gao, R., Zha, H., 2024. Reliable off-policy evaluation for reinforcement learning. Operation Research, 72, no. 2, 699-716. MR4677384
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., and Zhou, D., 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22), 35, Article No 1800, 24824-24837.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K., 2024. Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23), 36, Article No 517, 11809-11822.
Yeo, E., Tong, Y., Niu, M., Neubig, G. and Yue, X., 2025. Demystifying Long Chain-of-Thought Reasoning in LLMs. arXiv preprint arXiv:2502.03373.
Zbontar, J., Jing, L., Misra, I., LeCun, Y. and Deny, S., 2021. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. Proceedings of the 38th International Conference on Machine Learning, 139, 12310-12320.

Updated April 2025.

Digital Business Card