AI & Math Blog

Reinforcement Learning: Sample Complexity

This page summarizes some recent mathematical advances on sample complexity in reinforcement learning (RL).
Imagine you're teaching a robot to play chess, but you only have a limited number of games to train it with. Sample complexity is all about how efficiently the robot can learn from those limited games: it measures how many games (or interactions) the robot needs before it can figure out the best moves. A lower sample complexity means it learns faster, even with limited training data. In real-world settings such as self-driving cars, collecting training data can be very expensive and time-consuming, so algorithms with low sample complexity can learn effectively even from limited real-world driving data. How, then, can we design algorithms that learn quickly and efficiently with limited data? This is where the study of sample complexity comes in.

The standard framework for RL is the Markov decision process (MDP). For the sake of generality, we consider the infinite horizon discounted MDP defined by $\mathcal{M}=(\mathcal{S,A,P},r,\gamma )$ where $\mathcal{S,A}$ are respectively the state space and the action space, which might be infinite, $r : \mathcal{S\times A} \rightarrow [0,1]$ is the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing the action $a \in \mathcal{A}$ while in the state $s \in \mathcal{S}$, and $\gamma \in (0,1)$ is the discount factor.
The probability transition kernel is defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(\mathcal{S})$, where $\mathcal{P}(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ when action $a$ is selected, and $\Delta(\mathcal{S})$ is the probability simplex over $\mathcal{S}$. In the generative-model setting, the probability transition kernel $\mathcal{P}$ is estimated empirically from the data.
We consider here the standard tabular case: the cardinalities $|\mathcal{S}|$ and $|\mathcal{A}|$ are finite.

A deterministic policy $\pi$ is a mapping $\pi: \mathcal{S}\rightarrow \mathcal{A}$. The objective of the agent is to find the optimal policy $\pi^\star$ maximizing the value function $V^\pi(s)$. For the infinite horizon discounted MDP, $V^\pi(s)$ is the expected discounted total reward starting from the initial state $s^0=s$, i.e., \begin{eqnarray*} \pi^\star &:= &\arg \max_\pi V^\pi(s),\\ V^\pi(s)& := &E \Big[ \sum_{t=0}^\infty \gamma^t r(s^t, a^t)\big | s^0 = s \Big ]. \end{eqnarray*} The corresponding action value function $Q^\pi: \mathcal{S\times A} \rightarrow \mathbb{R}$ of a policy $\pi$ is defined as $$ Q^\pi(s,a) := E \Big [ \sum_{t=0}^\infty \gamma^t r(s^t, a^t) \big | (s^0, a^0)=(s, a) \Big ], $$ $\forall (s,a) \in \mathcal{S \times A}$. There exists an optimal policy $\pi^\star$ that simultaneously maximizes $V^\pi(s) \; \forall s \in \mathcal{S}$, and $Q^\pi(s,a) \; \forall (s,a) \in \mathcal{S \times A}$. Denote the optimal value function $V^\star := V^{\pi^\star}$, and the optimal action value function $Q^\star := Q^{\pi^\star}$.
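As a concrete (and purely illustrative) companion to these definitions, the following sketch runs value iteration on a small, randomly generated tabular MDP to approximate $V^\star$, $Q^\star$ and a greedy optimal policy. The sizes, the random MDP and the tolerance are my own choices and do not come from any of the papers discussed below.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 4, 2, 0.9

    # Transition kernel P[s, a, s'] (each row sums to 1) and rewards r[s, a] in [0, 1].
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((n_states, n_actions))

    def value_iteration(P, r, gamma, tol=1e-8):
        # Iterate the Bellman optimality operator until the sup-norm update is below tol.
        V = np.zeros(P.shape[0])
        while True:
            Q = r + gamma * P @ V          # Q[s, a] = r(s, a) + gamma * E_{s' ~ P(.|s,a)} V(s')
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q, Q.argmax(axis=1)   # approximations of V*, Q*, and a greedy policy
            V = V_new

    V_star, Q_star, pi_star = value_iteration(P, r, gamma)
    print("V* =", np.round(V_star, 3), " greedy policy =", pi_star)

Since the Bellman optimality operator is a $\gamma$-contraction in the sup-norm, the loop above converges geometrically, and the greedy policy extracted from $Q^\star$ is an optimal deterministic policy.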

Given a limited number of data samples, the aim of policy learning is to identify a policy that maximizes the expected discounted reward. Formally, let $\varepsilon$ be the target accuracy level; an $\varepsilon$-accurate policy $\pi_\text{est}$ is defined by \begin{eqnarray*} V^{\pi_{\text{est}}}(s) &\geqslant& V^\star(s) -\varepsilon,\\ Q^{\pi_{\text{est}}}(s,a) &\geqslant& Q^\star(s,a) -\varepsilon, \end{eqnarray*} $\forall (s,a) \in \mathcal{S\times A}$. The sample complexity of an RL algorithm is the minimum number of calls to a sampling model required for the algorithm to learn, with high probability, a policy within the target accuracy level $\varepsilon$ of the optimal policy. Formally, the sample complexity $m(\varepsilon, \delta)$ is the smallest number of samples such that, for any $\delta \in (0,1)$, the learned policy $\pi$ is an $\varepsilon$-accurate policy with probability at least $1 - \delta$. In RL, there is a trade-off between sample complexity and statistical accuracy: exploring to gather more information improves statistical accuracy, while exploiting actions already known to yield high reward keeps the number of required samples low.
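To make the $\varepsilon$-accuracy condition concrete, here is a small illustrative sketch (my own, under the same toy tabular setup as above): it evaluates a deterministic policy exactly via the linear system $V^\pi = (I - \gamma P_\pi)^{-1} r_\pi$ and then checks $V^\pi(s) \geqslant V^\star(s) - \varepsilon$ for all $s$.

    import numpy as np

    def evaluate_policy(P, r, gamma, pi):
        # V^pi for a deterministic policy pi (array: state -> action), via the
        # linear system (I - gamma * P_pi) V^pi = r_pi.
        n_states = P.shape[0]
        P_pi = P[np.arange(n_states), pi]    # P_pi[s, s'] = P(s' | s, pi(s))
        r_pi = r[np.arange(n_states), pi]    # r_pi[s]     = r(s, pi(s))
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    def is_eps_accurate(V_pi, V_star, eps):
        # True iff V^pi(s) >= V*(s) - eps for every state s.
        return bool(np.all(V_pi >= V_star - eps))

    # Toy usage on a random tabular MDP; V* is approximated by a short value-iteration loop.
    rng = np.random.default_rng(1)
    n_states, n_actions, gamma = 4, 2, 0.9
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((n_states, n_actions))

    V_star = np.zeros(n_states)
    for _ in range(2000):
        V_star = (r + gamma * P @ V_star).max(axis=1)

    pi = (r + gamma * P @ V_star).argmax(axis=1)   # greedy policy w.r.t. the approximate V*
    print(is_eps_accurate(evaluate_policy(P, r, gamma, pi), V_star, eps=1e-3))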
The two common frameworks for analyzing sample complexity are:
  1. PAC (probably approximately correct) bounds, which determine the number of samples needed to guarantee that the algorithm learns a policy within a small accuracy error $\varepsilon$ of the optimal policy; see, e.g., Azar et al. (2013).
  2. Regret bounds, which measure the cumulative loss incurred before reaching the optimal policy, i.e., how much reward is missed during the learning process; see, e.g., Azar et al. (2017).

Recent Advances

Lan (2023) studied the sample complexity of two stochastic versions of a new policy mirror descent (PMD) method for RL problems with either convex or strictly convex regularizers. Li, Shi et al. (2024) consider both the discounted infinite-horizon and the finite-horizon MDP frameworks, and design an offline model-based RL algorithm achieving the minimax-optimal sample complexity without burn-in costs. Zhang et al. (2023) study the sample complexity of model-based multi-agent RL (MARL) in zero-sum discounted Markov games when a generative model is available. Sample complexity can also be improved by hierarchical reinforcement learning (HRL) models, see e.g., Robert et al. (2023), Brunskill and Li (2014); in particular, the goal-conditioned HRL approach achieves lower sample complexity while maintaining efficiency.
Li, Wei et al. (2024) consider the framework of the infinite horizon discounted MDP with a generative model. Denote by $\widehat{\mathcal{M}} =(\mathcal{S,A},\widehat{\mathcal{P}}, r,\gamma)$ the empirical MDP, where the empirical kernel $\widehat{\mathcal{P}}$ is estimated from a collection of $N$ i.i.d. samples $s_{s,a}^i \sim \mathcal{P}(\cdot|s,a)$ for each pair $(s,a)$: $$ \forall s' \in \mathcal{S}, \quad \widehat{\mathcal{P}}(s'|s,a):= \frac{1}{N}\sum_{i=1}^N\mathbf{1}\{s_{s,a}^i = s' \}. $$ Two algorithms are studied (a schematic sketch of both is given after the list below):
  1. The first algorithm, perturbed model-based planning (PBMP), adds a random term $\zeta(s,a)$, $(s,a) \in \mathcal{S}\times \mathcal{A}$, to the reward function, i.e., \begin{equation*} r_p(s,a) =r(s,a)+ \zeta(s,a), \quad \zeta(s,a) \sim \text{i.i.d. } U(0,\xi), \end{equation*} where $U(0, \xi)$ denotes the uniform distribution on the interval $(0,\xi)$ and $\xi$ is the perturbation size. The optimal policy $\hat{\pi}^\star_p$ of the perturbed empirical MDP $\widehat{\mathcal{M}}_p =(\mathcal{S,A},\widehat{\mathcal{P}},r_p,\gamma)$ is defined as \begin{equation*} \hat{\pi}^\star_p := \arg \max_\pi \widehat{V}^\pi_p. \end{equation*}
  2. The second algorithm selects an approximately optimal action for the empirical MDP $\widehat{\mathcal{M}}$ as follows: $$ \hat{\pi}_\varsigma(s) := \min \{ a \in \mathcal{A} : \widehat{Q}^\star(s,a) > \widehat{V}^\star(s) - \varsigma\}, $$ $\forall s \in \mathcal{S}$, where $\varsigma \sim U(0,\xi)$.
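As announced above, here is a hedged, schematic sketch of both procedures (my own illustrative implementation in the tabular setting, not code from Li, Wei et al. (2024)): it draws $N$ generative samples per pair $(s,a)$ to form $\widehat{\mathcal{P}}$, perturbs the rewards with i.i.d. $U(0,\xi)$ noise for the first algorithm, and applies the randomized selection rule of the second; the planner is plain value iteration and all numerical values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    n_states, n_actions, gamma, N, xi = 5, 3, 0.9, 200, 1e-3

    # Ground-truth MDP (unknown to the learner in the generative-model setting).
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((n_states, n_actions))

    # Empirical kernel: for each (s, a), draw N i.i.d. next states from P(.|s, a).
    P_hat = np.zeros_like(P)
    for s in range(n_states):
        for a in range(n_actions):
            samples = rng.choice(n_states, size=N, p=P[s, a])
            P_hat[s, a] = np.bincount(samples, minlength=n_states) / N

    def plan(P, r, gamma, iters=2000):
        # Plain value iteration; returns V, Q and a greedy deterministic policy.
        V = np.zeros(P.shape[0])
        for _ in range(iters):
            Q = r + gamma * P @ V
            V = Q.max(axis=1)
        return V, Q, Q.argmax(axis=1)

    # First algorithm (PBMP): perturb the rewards, then plan on the perturbed empirical MDP.
    r_p = r + rng.uniform(0.0, xi, size=r.shape)
    _, _, pi_pbmp = plan(P_hat, r_p, gamma)

    # Second algorithm: smallest-index action whose empirical Q-value is within varsigma of V_hat*.
    V_hat, Q_hat, _ = plan(P_hat, r, gamma)
    varsigma = rng.uniform(0.0, xi)
    pi_varsigma = np.array([np.min(np.where(Q_hat[s] > V_hat[s] - varsigma)[0])
                            for s in range(n_states)])
    print("PBMP policy:", pi_pbmp, " tie-broken policy:", pi_varsigma)

Roughly speaking, both random devices serve the same purpose: breaking ties among near-optimal actions so that the policy learned from the empirical MDP is well separated and uniquely defined.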
The sample complexity of these two algorithms is $\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2}\right)$, as in Agarwal et al. (2020), Sidford et al. (2018), and Azar et al. (2013). However, for both algorithms, the minimum total sample size required to achieve this minimax-optimal sample complexity is $\widetilde{\Omega}\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\right)$. This improves upon the previous requirements of $\widetilde{\Omega}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}\right)$ obtained by Agarwal et al. (2020) and of $\widetilde{\Omega}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3}\right)$ by Sidford et al. (2018).
Define $\chi := \Big (|\mathcal{S}|,|\mathcal{A}|,\frac{1}{1-\gamma},\frac{1}{\varepsilon} \Big )$. The standard notation $f(\chi)= \mathcal{O}(g(\chi))$ means that there exists $C >0$ such that $f \leqslant Cg$; $f(\chi)= \Omega(g(\chi))$ means that $g(\chi)= \mathcal{O}(f(\chi))$; and $f(\chi)= \widetilde{\Omega}(g(\chi))$ (resp. $f(\chi)= \widetilde{\mathcal{O}}(g(\chi))$) is defined as $\Omega(\cdot)$ (resp. $\mathcal{O}(\cdot)$) up to logarithmic factors in $\chi$.
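To give a rough sense of the magnitudes involved, the short computation below plugs illustrative values, $|\mathcal{S}|=100$, $|\mathcal{A}|=10$, $\gamma=0.99$ and $\varepsilon=0.1$ (my own choices), into the bounds above, ignoring constants and logarithmic factors.

    # Back-of-envelope magnitudes; constants and log factors are ignored, and the
    # concrete values S=100, A=10, gamma=0.99, eps=0.1 are illustrative assumptions.
    S, A, gamma, eps = 100, 10, 0.99, 0.1
    horizon = 1.0 / (1.0 - gamma)                    # effective horizon, here 100

    print(f"minimax sample complexity ~ {S * A * horizon**3 / eps**2:.1e}")  # ~1.0e+11
    print(f"burn-in, Li, Wei et al. (2024) ~ {S * A * horizon:.1e}")         # ~1.0e+05
    print(f"burn-in, Agarwal et al. (2020) ~ {S * A * horizon**2:.1e}")      # ~1.0e+07
    print(f"burn-in, Sidford et al. (2018) ~ {S * A * horizon**3:.1e}")      # ~1.0e+09

In this example, the sample size needed before the minimax rate applies drops from roughly $10^9$ (Sidford et al., 2018) to roughly $10^5$ (Li, Wei et al., 2024).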

References

  1. Abernethy, J., Agarwal, A., Marinov, T.V. and Warmuth, M.K., 2024. A mechanism for sample-efficient in-context learning for sparse retrieval tasks. Proceedings of Machine Learning Research, PMLR 237, pp. 3-46. MR4740478
  2. Agarwal, A., Kakade, S. and Yang, L.F., 2020. Model-based reinforcement learning with a generative model is minimax optimal. Proceedings of Thirty Third Conference on Learning Theory, PMLR 125, pp. 67-83.
  3. Agarwal, A., Kakade, S.M., Lee, J.D. and Mahajan, G., 2021. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98), pp.1-76. MR4279749
  4. Azar, M.G., Osband, I. and Munos, R., 2017. Minimax regret bounds for reinforcement learning. Proceedings of the 34th International Conference on Machine Learning, PMLR 70, pp.263-272.
  5. Azar, M.G., Munos, R. and Kappen, H.J., 2013. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91, pp.325-349. MR3064431
  6. Brunskill, E. and Li, L., 2014. PAC-inspired option discovery in lifelong reinforcement learning. Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2), pp.316-324.
  7. Kakade, S.M., 2003. On the sample complexity of reinforcement learning. PhD thesis, University College London, United Kingdom.
  8. Kearns, M. and Singh, S., 2002. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, pp.209-232.
  9. Kearns, M. and Singh, S., 1998. Finite-sample convergence rates for $Q$-learning and indirect algorithms. Advances in Neural Information Processing Systems, 11.
  10. Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming, 198(1), Ser. A, pp.1059-1106. MR4550970
  11. Li, G., Cai, C., Chen, Y., Wei, Y. and Chi, Y., 2024. Is $Q$-learning minimax optimal? A tight sample complexity analysis. Operations Research, 72(1), 222-236. MR4705835
  12. Li, G., Wei, Y., Chi, Y. and Chen, Y., 2024. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Operations Research, 72(1), pp.203-221. MR4705834
  13. Li, G., Shi, L., Chen, Y., Chi, Y. and Wei, Y., 2024. Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics, 52(1), pp.233-260. MR4718414
  14. Li, T., Lan, G. and Pananjady, A., 2023. Accelerated and instance-optimal policy evaluation with linear function approximation. SIAM Journal on Mathematics of Data Science, 5(1), pp.174-200. MR4562584
  15. Pananjady, A. and Wainwright, M.J., 2021. Instance-dependent $\ell_\infty$-bounds for policy evaluation in tabular reinforcement learning. IEEE Transactions on Information Theory, 67(1), pp.566-585. MR4231973
  16. Robert, A., Pike-Burke, C. and Faisal, A.A., 2023. Sample complexity of goal-conditioned hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 36, pp.62696-62712.
  17. Sidford, A., Wang, M., Wu, X., Yang, L. and Ye, Y., 2018. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Advances in Neural Information Processing Systems, 31.
  18. Strehl, A.L. and Littman, M.L., 2008. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8), pp.1309-1331. MR2460287
  19. Sutton, R.S. and Barto, A.G., 2018. Reinforcement learning: an introduction. Second edition, Adaptive Computation and Machine Learning series, MIT Press, Cambridge, MA. MR3889951
  20. Szepesvári, C., 2022. Algorithms for reinforcement learning. Reprint of the 2010 original, Synthesis Lectures on Artificial Intelligence and Machine Learning, 9, Springer, Cham. MR4647095
  21. Teyssière, G., 2023. Review of "Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes". Mathematical Reviews, American Mathematical Society.
  22. Teyssière, G., 2025. Review of "Li, G., Wei, Y., Chi, Y. and Chen, Y., 2024. Breaking the sample size barrier in model-based reinforcement learning with a generative model". Mathematical Reviews, American Mathematical Society.
  23. Wang, M., 2020. Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time. Mathematics of Operations Research, 45(2), pp.517-546. MR4100273
  24. Zhang, K., Kakade, S.M., Basar, T. and Yang, L.F., 2023. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Journal of Machine Learning Research, 24(175), pp.1-53. MR4633564
Updated April 2025.