This page summarizes some recent mathematical advances on sample complexity in reinforcement learning (RL).
Imagine you are teaching a robot to play chess. You want it to learn the best moves, but you only have a limited number of games to train it with. Sample complexity is all about how efficiently the robot can learn from those limited games: it measures how many games (or interactions) the robot needs before it can figure out the best moves. A lower sample complexity means it learns faster, even with limited training data. In real-world scenarios such as self-driving cars, collecting training data can be very expensive and time-consuming, which is where sample complexity becomes crucial: algorithms with low sample complexity can learn effectively even with limited real-world driving data. So, how can we design algorithms that learn quickly and efficiently, even with limited data? This is the question addressed by the study of sample complexity.
The standard framework for RL is the Markov decision process (MDP). For the sake of generality, we consider
the infinite horizon discounted MDP defined by $\mathcal{M}=(\mathcal{S,A,P},r,\gamma )$ where $\mathcal{S,A}$ are respectively the state space and the action space, which might be infinite, $r : \mathcal{S\times A} \rightarrow [0,1]$ is the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing the action $a \in \mathcal{A}$ while in the state $s \in \mathcal{S}$, and $\gamma \in (0,1)$ is the discount factor.
The probability transition kernel is defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(\mathcal{S})$, where $\mathcal{P}(s^{'}|s,a)$ is the probability of transitioning from state $s$ to state $s^{'}$ when action $a$ is selected, and $\Delta(\mathcal{S})$ is the probability simplex over $\mathcal{S}$. In the generative-model setting, the probability transition kernel $\mathcal{P}$ is empirically estimated from data.
We consider here the standard tabular case: the cardinalities $|\mathcal{S}|$ and $|\mathcal{A}|$ are finite.
A deterministic policy $\pi$ is a mapping $\pi: \mathcal{S}\rightarrow \mathcal{A}$.
The objective of the agent is to find the optimal policy $\pi^\star$ maximizing the value function $V^\pi(s)$. For the infinite horizon discounted MDP, $V^\pi(s)$ is the expected discounted total reward starting from the initial state $s^0=s$, i.e.,
\begin{eqnarray*}
\pi^\star &:= &\arg \max_\pi V^\pi(s),\\
V^\pi(s)& := &E \Big[ \sum_{t=0}^\infty \gamma^t r(s^t, a^t)\big | s^0 = s \Big ].
\end{eqnarray*}
The corresponding action value function $Q^\pi: \mathcal{S\times A} \rightarrow \mathbb{R}$ of a policy $\pi$ is defined
as
$$
Q^\pi(s,a) := E \Big [ \sum_{t=0}^\infty \gamma^t r(s^t, a^t) \big | (s^0, a^0)=(s, a) \Big ],
$$
$\forall (s,a) \in \mathcal{S \times A}$. There exists an optimal policy $\pi^\star$ that simultaneously maximizes $V^\pi(s) \; \forall s \in \mathcal{S}$ and $Q^\pi(s,a) \; \forall (s,a) \in \mathcal {S \times A}$. Denote by $V^\star := V^{\pi^\star}$ the optimal value function and by $Q^\star := Q^{\pi^\star}$ the optimal action value function.
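As a concrete illustration of these definitions, the following sketch runs value iteration on a small tabular MDP to recover $V^\star$, $Q^\star$, and a greedy optimal policy. The transition tensor, rewards, and discount factor are purely illustrative and are not taken from any of the cited works.

```python
# Minimal sketch: value iteration on a tabular discounted MDP (illustrative data).
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """P: (S, A, S) tensor with P[s, a, :] = P(.|s, a); r: (S, A) rewards in [0, 1]."""
    V = np.zeros(r.shape[0])
    while True:
        Q = r + gamma * P @ V            # Q(s,a) = r(s,a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]
        V_new = Q.max(axis=1)            # Bellman optimality update
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q, Q.argmax(axis=1)   # approximations of V*, Q*, and a greedy optimal policy
        V = V_new

# Toy example: 3 states, 2 actions, random rewards and kernel (illustrative only).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))       # each P[s, a, :] lies in the simplex over S
r = rng.uniform(size=(3, 2))
V_star, Q_star, pi_star = value_iteration(P, r, gamma=0.9)
```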
Given a finite number of data samples, the aim of policy learning is to identify a policy that maximizes the expected discounted reward.
Formally, let $\varepsilon$ be the target accuracy level; an $\varepsilon$-accurate policy $\pi_\text{est}$ is defined as a policy satisfying
\begin{eqnarray*}
V^{\pi_{\text{est}}}(s) &\geqslant& V^\star(s) -\varepsilon,\\
Q^{\pi_{\text{est}}}(s,a) &\geqslant& Q^\star(s,a) -\varepsilon,
\end{eqnarray*}
$\forall (s,a) \in \mathcal{S\times A}$.
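To make the $\varepsilon$-accuracy criterion concrete, here is a small sketch that evaluates a candidate policy exactly on a known tabular MDP and checks $V^{\pi_\text{est}}(s) \geqslant V^\star(s) -\varepsilon$ for every state. The helper names are assumptions; $V^\star$ could come from the value-iteration sketch above.

```python
# Minimal sketch: exact policy evaluation plus an epsilon-accuracy check.
import numpy as np

def policy_value(P, r, gamma, pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi for V^pi."""
    S = r.shape[0]
    P_pi = P[np.arange(S), pi]           # (S, S) matrix whose rows are P(.|s, pi(s))
    r_pi = r[np.arange(S), pi]           # (S,) vector with entries r(s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def is_eps_accurate(P, r, gamma, pi, V_star, eps):
    """Check V^pi(s) >= V*(s) - eps for all states s."""
    return bool(np.all(policy_value(P, r, gamma, pi) >= V_star - eps))
```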
The sample complexity of an RL algorithm is the minimum number of calls to a sampling model required for the algorithm to learn, with high probability, a policy within a specified accuracy level $\varepsilon$ of the optimal policy. Formally, for any $\varepsilon > 0$ and $\delta \in (0,1)$, the sample complexity $m(\varepsilon, \delta)$ is the smallest number of samples such that the learned policy $\pi$ is an $\varepsilon$-accurate policy, i.e.,
\begin{eqnarray*}
V^{\pi}(s) &\geqslant& V^\star(s) -\varepsilon,\\
Q^{\pi}(s,a) &\geqslant& Q^\star(s,a) -\varepsilon,
\end{eqnarray*}
with probability at least $1 - \delta$. In RL, there is a trade-off between sample complexity and statistical accuracy:
exploring more to gather information improves statistical accuracy, whereas exploiting actions already known to be good maximizes reward and reduces sample complexity.
The two common frameworks for analyzing sample complexity are:
- PAC bounds, which determine the number of samples needed to guarantee that the algorithm learns a policy with a small accuracy error $\varepsilon$ compared to the optimal policy, see e.g., Azar et al. (2013);
- regret bounds, which measure the cumulative loss incurred before reaching the optimal policy, i.e., how much reward is missed during the learning process, see e.g., Azar et al. (2017); a toy sketch contrasting the two metrics is given below.
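The following toy sketch contrasts the two viewpoints on simulated per-episode suboptimality gaps; the decay profile of the gaps is made up purely for illustration and does not come from any cited analysis.

```python
# Illustrative contrast: a PAC-style mistake count versus cumulative regret.
import numpy as np

rng = np.random.default_rng(1)
episodes = 1000
# Made-up per-episode suboptimality gaps V*(s) - V^{pi_t}(s), decaying as learning proceeds.
gaps = np.abs(rng.normal(0.5, 0.3, size=episodes)) / np.sqrt(np.arange(1, episodes + 1))
eps = 0.05

pac_mistakes = int(np.sum(gaps > eps))   # PAC view: episodes where the policy was not eps-accurate
regret = float(np.sum(gaps))             # regret view: total value lost while learning
```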
Lan (2023) studied the sample complexity of two stochastic versions of a new policy mirror descent (PMD) method for RL problems with either convex or strictly convex regularizers. Li, Shi et al. (2024) consider both the discounted infinite-horizon and the finite-horizon MDP frameworks, and design an offline model-based RL algorithm achieving the minimax-optimal sample complexity without burn-in costs. Zhang et al. (2023) study the sample complexity of a model-based multi-agent RL (MARL) algorithm in zero-sum discounted Markov games when a generative model is available. Sample complexity can also be improved by hierarchical reinforcement learning (HRL) models, see e.g., Robert et al. (2023), Brunskill and Li (2014). The goal-conditioned HRL approach achieves lower sample complexity while maintaining efficiency.
Li, Wei et al. (2024) consider the framework of the infinite horizon discounted MDP with a generative model. Denote by $\widehat{\mathcal{M}} =(\mathcal{S,A},\widehat{\mathcal{P}}, r,\gamma)$ the empirical MDP, where
the empirical kernel $\widehat{\mathcal{P}}$ is estimated, for each pair $(s,a)$, from a collection of $N$ i.i.d. samples $s_{s,a}^i \sim \mathcal{P}(\cdot|s,a)$:
$$
\forall s^{'} \in \mathcal{S}, \quad \widehat{\mathcal{P}}(s^{'}|s,a):= \frac{1}{N}\sum_{i=1}^N\mathbf{1}\{s_{s,a}^i = s^{'} \}.
$$
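The estimator above can be written in a few lines. In the sketch below, `sample_next_state` is a hypothetical generative-model handle returning one draw from $\mathcal{P}(\cdot|s,a)$; everything else follows the counting formula directly.

```python
# Minimal sketch: empirical transition kernel from N generative-model draws per (s, a).
import numpy as np

def estimate_kernel(sample_next_state, S, A, N, seed=0):
    """Empirical kernel P_hat(s'|s, a) = (1/N) * #{i : s_{s,a}^i = s'}."""
    rng = np.random.default_rng(seed)
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                s_next = sample_next_state(s, a, rng)   # one draw from P(.|s, a); hypothetical simulator
                P_hat[s, a, s_next] += 1.0
    return P_hat / N                                     # empirical frequencies
```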
Two algorithms are studied (a joint sketch of both is given after the second item):
- The first algorithm, the perturbed model-based planning (PBMP), is defined by adding a random term $\zeta(s,a)$, with $(s,a) \in \mathcal{S}\times \mathcal{A}$, to the reward function, i.e.,
\begin{equation*}
r_p(s,a) = r(s,a)+ \zeta(s,a), \quad \zeta(s,a) \overset{\text{i.i.d.}}{\sim} U(0,\xi),
\end{equation*}
where $U(0, \xi)$ denotes the uniform distribution on the interval $(0,\xi)$, $\xi$ being the perturbation size. The optimal policy $\hat{\pi}^\star_p$ of the perturbed empirical MDP $\widehat{\mathcal{M}}_p =(\mathcal{S,A},\widehat{\mathcal{P}},r_p,\gamma)$ is defined as
\begin{equation*}
\label{e:PBMP}
\hat{\pi}^\star_p := \arg \max_\pi \widehat{V}^\pi_p.
\end{equation*}
- The second algorithm selects an approximately optimal action for the empirical MDP $\widehat{\mathcal{M}}$ as follows:
$$
\hat{\pi}_\varsigma(s) := \min \{ a \in \mathcal{A} : \widehat{Q}^\star(s,a) > \widehat{V}^\star(s) - \varsigma\},
$$
$\forall s \in \mathcal{S}$, where $\varsigma \sim U(0,\xi)$.
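The following sketch puts the two procedures side by side on the empirical MDP $\widehat{\mathcal{M}}$. The planner is a plain value iteration, and all names are assumptions: this is a direct reading of the displays above, not the cited paper's reference implementation.

```python
# Minimal sketch of the two algorithms on the empirical MDP (S, A, P_hat, r, gamma).
import numpy as np

def plan(P_hat, r, gamma, tol=1e-8):
    """Return (V_hat, Q_hat) for the empirical MDP by value iteration."""
    V = np.zeros(r.shape[0])
    while True:
        Q = r + gamma * P_hat @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

def perturbed_planning(P_hat, r, gamma, xi, seed=0):
    """First algorithm (PBMP): plan with U(0, xi)-perturbed rewards r_p = r + zeta."""
    rng = np.random.default_rng(seed)
    r_p = r + rng.uniform(0.0, xi, size=r.shape)
    _, Q_p = plan(P_hat, r_p, gamma)
    return Q_p.argmax(axis=1)            # optimal policy of the perturbed empirical MDP

def threshold_selection(P_hat, r, gamma, xi, seed=0):
    """Second algorithm: smallest-index action with Q_hat(s, a) > V_hat(s) - varsigma."""
    rng = np.random.default_rng(seed)
    varsigma = rng.uniform(0.0, xi)      # a single random threshold
    V_hat, Q_hat = plan(P_hat, r, gamma)
    return np.array([np.flatnonzero(Q_hat[s] > V_hat[s] - varsigma).min()
                     for s in range(r.shape[0])])
```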
The sample complexity of these two algorithms is of order $ \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2} $ (up to logarithmic factors), as in Agarwal et al. (2020), Sidford et al. (2018), and Azar et al. (2013).
However, for both algorithms, the minimum sample size for achieving this minimax sample complexity is $ N \geqslant \widetilde{\Omega}\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\right)$.
This improves upon the previous requirements $N\geqslant \widetilde{\Omega}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}\right)$ obtained by Agarwal et al. (2020) and $N\geqslant \widetilde{\Omega}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3}\right)$ by Sidford et al. (2018).
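For a sense of scale, the following back-of-the-envelope computation plugs one illustrative problem size into the orders above; constants and logarithmic factors are dropped, so these are orders of magnitude rather than exact sample counts.

```python
# Orders of magnitude implied by the displayed bounds for one illustrative problem size.
S, A, gamma, eps = 100, 10, 0.99, 0.1

minimax_order   = S * A / ((1 - gamma) ** 3 * eps ** 2)   # ~ 1e11 samples for eps-accuracy
burn_in_li_wei  = S * A / (1 - gamma)                     # ~ 1e5  (Li, Wei et al., 2024)
burn_in_agarwal = S * A / (1 - gamma) ** 2                # ~ 1e7  (Agarwal et al., 2020)
burn_in_sidford = S * A / (1 - gamma) ** 3                # ~ 1e9  (Sidford et al., 2018)
```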
Define $\chi := \Big (|\mathcal{S}|,|\mathcal{A}|,\frac{1}{1-\gamma},\frac{1}{\varepsilon} \Big )$.
The standard notation $f(\chi)= \mathcal{O}(g(\chi))$ means that there exists $C >0$ such that $f \leqslant Cg$; $f(\chi)= \Omega(g(\chi))$ means that $g(\chi)= \mathcal{O}(f(\chi))$; and $f(\chi)= \widetilde{\Omega}(g(\chi))$ is defined as $\Omega(\cdot)$ up to logarithmic factors in $\chi$.