The current technology behind machine learning (ML) and artificial intelligence (AI) relies on the use of advanced algorithms and computational power for handling very large datasets to train models, based on very large neural networks, with up to trillions of hyperparameters, to enable intelligent behavior.
The empirical astonishing performance of these massive neural networks is not fortuitous, but has strong mathematical foundations: these large learning systems verify two very useful and powerful properties, demonstrated by Belkin and its co-authors and exposed in the
AI & Math page.
The set of learning systems used in Cat.IA-AI.Innovations draw from three classes of learning models:
- The generative large language models (LLM), enhanced by adding reasoning abilities,
- The supervised, semi-supervised, unsupervised, and self-supervised learning models,
- The reinforcement learning (RL) models, with a large mathematical apparitus providing proved guarantees on the performance of its algorithms.
Generative AI is a part of automation technologies: it generates text, programming code and images using as database quite the exhaustive content of the web, using attention mechanisms, transformer architectures and autoregressive probabilistic models.
These combinatorial generative models, based on the linguistic representation of the physical world, are working inside the box of that representation, and then are directly unable to plan and produce any content outside that box. They do not possess complete deductive closure as they might not always derive every possible conclusion from the information they have.
The output of LLM is similar to Kahneman's (2011) Type 1 system.
Although the syntax of the generated text is correct, as the correctness of the grammar is the straight consequence of the autoregressive probabilistic tools of generative AI, the meaning of this text must be evaluated.
Generally, the output of these generative models must still be checked, e.g., the generated code must be verified by an interpreter/compiler, the multi-step reasonning of a maths solver must be verified (Cobbe et al., 2021), historical images inconsistencies must be corrected, etc.
Recent research has shown that the reasoning properties of large language models (LLMs) can be enhanced by equipping them with reasoning and planning capabilities that are closer to Kahneman's (2011) Type 2 system.
The Chain-of-Thought (CoT) prompting (Wei et al., 2022) is a series of intermediate natural language reasoning steps along a single path that leads to the final output.
This is an extension of the few-shots prompting properties of LLM (Brown et al., 2020):
for the few-shot prompting, the LLM is provided with a small number of in-context (input, output) pairs of examples, or 'shots'. In contrast, for the CoT prompting, the LLM is provided with triples (input, chain-of-thought, output), where the chain-of-thought consists of a series of intermediate reasoning steps that lead to the solution of the studied problem.
CoT prompting enhances the model's performance without requiring extensive retraining.
The single pathway Chain of Thought (CoT) prompting has been further generalized by Tree of Thoughts (ToT) prompting (Yao et al., 2024), which is characterized by multiple competing reasoning paths modeled as a tree. The formalization of a human problem solving as a search problem along a tree has been advocated by Newell and Simon (1972).
Amazon Bedrock is the standard framework used for our Generative AI services.
Reinforcement learning (RL) is a fundamental learning method at the intersection of artificial intelligence and machine learning.
In the standard supervised, unsupervised, self-supervised frameworks, the learner receives the data and learns a model from them.
In the RL framework, the learner is more active: it is an
agent interacting with its
environment through a sequence of
actions. In response to the action just taken, the agent receives two types of information through a standard feedback mechanism:
- Its current state in the environment,
- A real-valued immediate reward for its actions. No future or long-term reward is provided.
The agent takes into account this feedback for its future actions.
The objective of the agent is to maximize its reward and thus to select the best course of actions, the
policy, for maximizing the cumulative reward over time. If the environment is known, this is a
planning problem, whilst if the environment is unknow, this is a
learning problem.
In the latter case, the agent balances the trade-off between the exploration, i.e., obtaining more information on the environmemt and rewards, and the exploitation of the information already collected for optimizing the rewards.
The standard framework for formalizing the reinforcement learning is the Markov decision process (MDP), defined by $M=(\mathcal{S,A,P},r,\gamma )$ where $\mathcal{S,A}$ are respectively the state space and the action space, which might be infinite, $r : \mathcal{S\times A} \rightarrow [0,1]$ is the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing the action $a \in \mathcal{A}$ while in the state $s \in \mathcal{S}$.
The probability transition kernel is defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(S)$, where $\mathcal{P}(s^{'}|s,a)$ is the probability of transiting from state $s$ to state $s^{'}$ when $a$ is selected; $\Delta(S)$ is the probability simplex over $\mathcal{S}$.
The model is Markovian since the transition and reward probabilities depend only on the current state $s$ and not the entire history of states and actions.
A deterministic policy $\pi$ is a mapping $\pi: \mathcal{S}\rightarrow \mathcal{A}$.
The objective of the agent is to find the optimal policy $\pi^\star$ maximizing the value function $V^\pi(s)$ , which is the expected discounted total reward starting from the initial state $s^0=s$ , i.e.,
\begin{eqnarray*}
\pi^\star &:= &\arg \max_\pi V^\pi(s),\\
V^\pi(s)& := &E \Big[ \sum_{t=0}^\infty \gamma^t r(s^t, a^t)\big | s^0 = s \Big ].
\end{eqnarray*}
The corresponding action value function $Q^\pi: \mathcal{S\times A} \rightarrow \mathbb{R}$ of a policy $\pi$ is defined
as
$$
Q^\pi(s,a) := E \Big [ \sum_{t=0}^\infty \gamma^t r(s^t, a^t) \big | (s^0, a^0)=(s, a) \Big ],
$$
$\forall (s,a) \in \mathcal{S \times A}$. There exists an optimal policy $\pi^\star$ that simultaneoulsy maximizes $V^\pi(s) \; \forall s \in \mathcal{S}$, and $Q^\pi(s,a) \; \forall (s,a) \in \mathcal {S \times A}$.
This MDP setup is flexible enough to deal with several configurations, e.g., off-policy RL. In the generative case, the probability transition kernel $\mathcal{P}$ is empirically estimated from the data.
lt allows to provides guarantees on the reliability of the algorithms used for estimating the optimal value function $V^\star := V^{\pi^\star}$,
or optimal action value function $Q^\star := Q^{\pi^\star}$. In particular, new results show that the minimum sample size for achieving the minimax sample complexity is $ N \geqslant O\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\right)$.