Policy Gradient Methods and Iteration Complexity
Reinforcement learning (RL) requires efficient algorithms for optimal policy learning. We examine the computational efficiency of policy gradient (PG) methods, presenting recent advancements that improve scalability by reducing iteration complexity.
The class of PG methods is an alternative to the class of action-value ($Q^\pi(s,a)$) methods
for determining the optimal policy, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space,
$r : \mathcal{S\times A} \rightarrow [0,1]$ the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing action $a \in \mathcal{A}$ while in state $s \in \mathcal{S}$, and $\gamma \in (0,1)$ the discount factor.
In the PG framework, policies are parameterized by a vector $\theta \in \Theta \subset \mathbb{R}^d$.
The purpose of PG methods is to search, over a class of parameterized policies, for the optimal policy, i.e., the one that optimizes a performance measure $J_\pi(\theta)$ that is differentiable with respect to $\theta$. The most common performance measure $J_\pi(\theta)$ is the expected cumulative reward:
$$
J_\pi(\theta) = E_\pi \Big [\sum_{t=0}^\infty \gamma^t r_t \Big ].
$$
In that case we maximize $J_\pi(\theta)$ using gradient ascent methods $\theta_{t+1} = \theta_{t}+\alpha \nabla_\theta J_\pi(\theta_t) $, where the step size $\alpha >0$ is a constant learning rate. Alternatively, $J_\pi(\theta)$ might represent a loss or a risk, in which case the aim is to minimize $J_\pi(\theta)$ using gradient descent methods. We consider here the maximization of $J_\pi(\theta)$, formally:
$$
\hat{\theta} := \arg \max_\theta J_\pi(\theta).
$$
In this framework, policies are parameterized as $\pi(a|s,\theta)$, with $(s,a) \in \mathcal{S\times A}$. The policy $\pi(a|s,\theta) \in (0,1)$ is differentiable in $\theta$, where $\nabla \pi(a|s,\theta) $ is the vector of partial derivatives.
Advantages and Importance of PG Methods
By directly optimizing the policy, the function that maps states to actions, PG methods offer several advantages: effective handling of high-dimensional or continuous action spaces; support for stochastic policies, which are often required in fields such as robotics, control, and games; and the ability to perform end-to-end learning, where the policy can be optimized directly from raw sensory inputs to actions, leveraging the power of deep learning.
However, by their nature, PG methods share the challenges of gradient-based optimization in non-convex settings, potentially leading to NP-hard optimization problems.
Note that in very high dimensions the notion of convexity is no longer the relevant one, as these large learning systems satisfy two very useful and powerful properties, demonstrated by Belkin and his co-authors and discussed in the
AI & Math Blog.
PG methods can be sample inefficient, requiring a large number of interactions with the environment to converge. Techniques like experience replay and off-policy learning can mitigate this issue.
PG Theorem
The gradient ascent method defines the sequence $\theta_k$ as follows: $\theta_{k+1}= \theta_k + \alpha \nabla_\theta J_\pi(\theta_k)$, and thus requires an estimate of the gradient, since perfect knowledge of the gradient in this stochastic setting is an assumption made only in theoretical work. In this stochastic context, an expression of the gradient suitable for estimation is given by the PG theorem:
$$
\nabla _\theta J_\pi(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in \mathcal{A}} \nabla_\theta \pi(a|s,\theta) Q_\pi(s,a),
$$
where $\mathbf{P}(s_0 \rightarrow s,k,\pi)$ is the probability of transitioning from $s_0$ to $s$ in $k$ steps under policy $\pi$, so that $\eta(s) = \sum_{k=0}^\infty \mathbf{P}(s_0 \rightarrow s,k,\pi)$ is the corresponding (unnormalized) state visitation measure. The gradient $
\nabla _\theta J_\pi(\theta) $ can be expressed as the following expectation
$$
\nabla _\theta J_\pi(\theta) = E_{S\sim \pi, A\sim \pi(S,\theta)}\Big [ \nabla_\theta \ln \pi(A|S,\theta) Q_\pi (S,A)\Big ],
$$
which requires that $\pi(A|S,\theta) >0$, a condition fulfilled by the softmax policy presented below. This is the general form of the PG theorem, the specific expressions of the gradients depending on the definition of $J_\pi(\theta)$.
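The passage from the double sum to this expectation rests on the log-derivative (likelihood-ratio) identity: since $\pi(a|s,\theta)>0$,
$$
\nabla_\theta \pi(a|s,\theta) = \pi(a|s,\theta)\, \nabla_\theta \ln \pi(a|s,\theta),
$$
so weighting each term of the inner sum by $\pi(a|s,\theta)$ turns the sums over states and actions into an expectation over states visited under $\pi$ and actions drawn from $\pi(\cdot|S,\theta)$.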
REINFORCE Algorithm
The general form for the gradient ascent algorithm is defined by the system of equations:
\begin{eqnarray*}
\theta_{k+1}&= &\theta_k + \alpha \nabla_\theta J_\pi(\theta_k),\\
\nabla _\theta J_\pi(\theta) &=& E_{S\sim \pi, A\sim \pi(S,\theta)}\Big [ \nabla_\theta \ln \pi(A|S,\theta) Q_\pi (S,A)\Big ],
\end{eqnarray*}
where $\nabla _\theta J_\pi(\theta)$ is unknown but can be approximated by a stochastic gradient, so that
$$
\theta_{k+1}= \theta_k + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_k) Q_t (s_t,a_t).
$$
The REINFORCE algorithm consists of estimating $Q_\pi (s_t,a_t)$ by a Monte Carlo approximation $Q_t (s_t,a_t)$, namely the discounted return observed from time $t$ onwards.
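As an illustration, here is a minimal sketch of REINFORCE for a tabular softmax policy. The environment interface (`env.reset()` and `env.step(a)` returning `(next_state, reward, done)`) and the tabular parameterization are assumptions made for this example, not part of the original description.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def reinforce(env, n_states, n_actions, gamma=0.99, alpha=0.01, episodes=1000):
    """Monte Carlo policy gradient (REINFORCE) with a tabular softmax policy.

    theta[s, a] plays the role of the preference Phi(s, a, theta)."""
    rng = np.random.default_rng(0)
    theta = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        # Sample one trajectory under the current policy.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = rng.choice(n_actions, p=softmax(theta[s]))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # Monte Carlo estimate Q_t(s_t, a_t): the discounted return from time t.
        returns = np.zeros(len(rewards))
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            returns[t] = G
        # Stochastic gradient ascent: theta <- theta + alpha * G_t * grad log pi(a_t|s_t).
        for t, (s_t, a_t) in enumerate(zip(states, actions)):
            grad_log_pi = -softmax(theta[s_t])
            grad_log_pi[a_t] += 1.0      # d/d theta[s_t, .] of log pi(a_t | s_t)
            # The gamma**t factor comes from the episodic objective; it is often dropped in practice.
            theta[s_t] += alpha * (gamma ** t) * returns[t] * grad_log_pi
    return theta
```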
Natural Policy Gradient Methods
The convergence rate and the sample efficiency of PG methods can be improved by preconditioning the gradient, i.e., by pre-multiplying $\nabla_\theta \ln \pi(a_t|s_t,\theta)$ by the inverse of the Fisher information matrix $\mathcal{I}(\theta)$. The natural policy gradient (NPG) utilizes the natural gradient as its search direction instead of the standard gradient, achieving faster convergence (Amari, 1998). The update formula then becomes:
$$
\theta_{k+1}= \theta_k + \alpha \mathcal{I}^{-1}(\theta_k) \nabla_\theta \ln \pi(a_t|s_t,\theta_k) Q_t (s_t,a_t),
$$
where $\mathcal{I}(\theta)$ is empirically estimated from the data. The improvements in convergence rate of NPG over PG have been demonstrated by Fazel et al. (2018).
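Below is a minimal sketch of one natural gradient step with an empirical Fisher matrix. The helper `score(theta, s, a)` (returning $\nabla_\theta \ln \pi(a|s,\theta)$), the sampled triples `(s_t, a_t, Q_t)`, and the ridge term keeping the estimated Fisher matrix invertible are assumptions made for this example.

```python
import numpy as np

def natural_pg_step(theta, samples, score, alpha=0.1, ridge=1e-3):
    """One natural policy gradient step.

    samples: list of (s_t, a_t, Q_t) collected under the current policy.
    score(theta, s, a): returns grad_theta log pi(a|s, theta), shape (d,).
    The Fisher information matrix is estimated empirically as the second moment
    of the score vectors; a small ridge term keeps the estimate invertible."""
    d = theta.shape[0]
    fisher = np.zeros((d, d))
    grad = np.zeros(d)
    for s, a, q in samples:
        g = score(theta, s, a)
        fisher += np.outer(g, g)
        grad += q * g
    fisher /= len(samples)
    grad /= len(samples)
    # Natural gradient direction: I(theta)^{-1} grad_theta J_pi(theta).
    direction = np.linalg.solve(fisher + ridge * np.eye(d), grad)
    return theta + alpha * direction
```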
Policy Gradient Techniques
Policy Representation: Softmax Policy
The softmax policy is a particular policy representation, used if the cardinality $|\mathcal{A}|$ is not too large,
$$
\pi(a|s,\theta) = \frac{\exp \Big ( \Phi(s,a,\theta)\Big ) }{\sum_{b \in \mathcal{A}} \exp \Big( \Phi(s,b,\theta) \Big )},
$$
where $\Phi(s,a,\theta)$ is a functional form representing the preferences of the agent. PG methods with softmax parameterization are
referred to as softmax policy gradient methods.
Softmax PG methods are often benchmarked against natural policy gradient (NPG) methods.
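As a concrete illustration, here is a minimal sketch of the softmax parameterization with linear preferences $\Phi(s,a,\theta) = \theta^\top \phi(s,a)$; the feature map `phi(s, a)` is an assumption made for this example.

```python
import numpy as np

def softmax_policy(theta, s, actions, phi):
    """Softmax policy pi(a|s, theta) with linear preferences theta . phi(s, a)."""
    prefs = np.array([theta @ phi(s, a) for a in actions])
    prefs -= prefs.max()                     # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def grad_log_softmax(theta, s, a, actions, phi):
    """grad_theta log pi(a|s, theta) = phi(s, a) - sum_b pi(b|s) phi(s, b)."""
    probs = softmax_policy(theta, s, actions, phi)
    expected_phi = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, a) - expected_phi
```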
Proximal Policy Optimization
PG methods rely on the choice of the stepsize $\alpha$, or learning rate, which involves a trade-off between overshooting and speed of convergence.
Trust Region Policy Optimization (TRPO) and its extension Proximal Policy Optimization (PPO) aim to define a "trust region" for the next value of $\theta$.
TRPO and PPO can be viewed as extensions of NPG.
For TRPO, the optimization problem consists of maximizing a surrogate objective function subject to a constraint on the size of the policy update, i.e.,
\begin{eqnarray*}
&\max_\theta& \widehat{\mathbf{E}}_t \Big [ \frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \widehat{A}_t \Big ],\\
&s.t.& \widehat{\mathbf{E}}_t \Big[ \text{KL}[\pi_{\theta_{\text{old}}}(\cdot|s_t),\pi_\theta(\cdot|s_t)] \Big ]\leqslant \delta,
\end{eqnarray*}
where KL denotes the Kullback-Leibler divergence, and $\widehat{A}_t $ is an estimate of the advantage function; see Schulman
et al. (2018) for a general approach to advantage function estimation. Defining the probability ratio $\lambda_t(\theta)= \frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$, the constrained optimization problem above can be rewritten as a penalized problem with penalty weight $\tau > 0$:
$$
\max_\theta \widehat{\mathbf{E}}_t \Big [ \lambda_t(\theta) \widehat{A}_t -\tau \text{KL}[\pi_{\theta_{\text{old}}}(\cdot|s_t),\pi_\theta(\cdot|s_t)] \Big ].
$$
As there is no fixed choice of $\tau$ that works well across all steps of the optimization, a sensible alternative is the adaptive penalty scheme provided in Schulman
et al. (2017).
Alternatively, the constraint on the policy update can be enforced implicitly by maximizing a clipped surrogate objective, $\max_\theta L^{\text{CLIP}}(\theta)$, with
$$
L^{\text{CLIP}}(\theta) = \widehat{\mathbf{E}}_t \Big [ \min \big (\lambda_t(\theta)\widehat{A}_t, \text{clip}(\lambda_t(\theta),1-\varepsilon,1+\varepsilon) \widehat{A}_t \big) \Big ],
$$
where the function $\text{clip}(\lambda_t(\theta),1-\varepsilon,1+\varepsilon)$ clips the probability ratio $\lambda_t(\theta)$ inside the interval $[1-\varepsilon,1+\varepsilon] $. The minimum is taken over the clipped and un-clipped objectives. Schulman
et al. (2017) provide several comparisons between PPO, TRPO, and other algorithms.
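For illustration, here is a minimal sketch of the clipped and KL-penalized surrogate objectives, assuming the probability ratios $\lambda_t(\theta)$, advantage estimates $\widehat{A}_t$, and per-sample KL values have already been computed for a batch of transitions.

```python
import numpy as np

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Clipped objective L^CLIP: mean of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratios = np.asarray(ratios)
    advantages = np.asarray(advantages)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

def kl_penalized_surrogate(ratios, advantages, kl, tau=0.01):
    """KL-penalized surrogate: mean(ratio * A) - tau * mean(KL), cf. the penalized form above."""
    return np.mean(np.asarray(ratios) * np.asarray(advantages)) - tau * np.mean(np.asarray(kl))
```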
Convergence and Iteration Complexity: Recent Advances
Establishing information-theoretic or algorithmic-specific lower bounds on the statistical and computational complexities of RL algorithms plays an instrumental role in understanding the bottlenecks of RL algorithms.
RL algorithms must balance exploration—discovering new states and actions—with exploitation—leveraging existing knowledge to maximize rewards. This balance prevents premature convergence to suboptimal policies and inefficient exploration. While stochastic policies inherently encourage exploration, techniques like entropy regularization can further enhance exploration, ensuring a more robust and effective learning process.
Agarwal
et al. (2021) consider the framework of the MDP $\mathcal{M} =\left (\mathcal{S,A,P},r,\gamma,\rho \right)$ where both $\mathcal{S}$ and $\mathcal{A}$ are finite, $\mathcal{P}$ is the transition kernel defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(\mathcal{S})$, i.e., $\mathcal{P}(s^{'}|s,a)$ is the probability of transitioning from state $s$ to state $s^{'}$ when $a$ is selected, $\Delta(\mathcal{S})$ is the probability simplex over $\mathcal{S}$, and $\rho$ is the starting state distribution over $\mathcal{S}$. They demonstrate that PG methods satisfy a Poljak (1963)-like gradient domination property, i.e., the difference between the value function of the optimal policy $\pi^\star$ and the value function of a policy $\pi$ satisfies:
$$
V^\star (\rho) - V^\pi(\rho) \leqslant \frac{1}{1-\gamma}\Big \| \frac{d_\rho^{\pi^\star} }{\mu} \Big \|_\infty \max_{\bar{\pi}} (\bar{\pi} - \pi)^\top \nabla_\pi V^\pi(\mu),
$$
where $V^\pi(\rho)$ is the expected value of a policy $\pi$ when the initial state is drawn from a distribution $\rho$ over $\mathcal{S}$,
the maximum is taken over $\bar{\pi} \in \Delta(\mathcal{A})^{|\mathcal{S}|}$, the set of all policies,
$\Big \| \frac{d_\rho^{\pi^\star} }{\mu} \Big \|_\infty $ is the distribution mismatch coefficient of $\pi^\star$ relative to the measure $\mu \in \Delta(\mathcal{S})$,
and the discounted state visitation distribution $ d_\rho^\pi$ is defined as:
$$
d_\rho^\pi(s) := \mathbf{E}_{s_0\sim \rho} \Big [ (1-\gamma)\sum_{t=0}^\infty \gamma^t \mathbf{P}(s_t = s|s_0) \Big ], \quad \forall s \in \mathcal{S}.
$$
This gradient domination property entails the global convergence of PG. Note that a non-uniform Łojasiewicz (1962) inequality entails convergence of the PG; see Mei
et al. (2020) for the case of softmax PG. Agarwal
et al. (2021) study the iteration complexity of several algorithms in the tabular case:
- the projected gradient ascent on the simplex, achieving $\mathcal{O}\Big ( \frac{D^2_\infty |\mathcal{S}| |\mathcal{A}|}{(1-\gamma)^6 \varepsilon^2}\Big ) $, where $D_\infty$ denotes the distribution mismatch coefficient $\Big \| \frac{d_\rho^{\pi^\star} }{\mu} \Big \|_\infty$;
- the PG with softmax parameterization (asymptotic convergence only);
- the PG with log barrier regularization and softmax parameterization, with complexity $\mathcal{O}\Big ( \frac{D^2_\infty |\mathcal{S}|^2 |\mathcal{A}|^2}{(1-\gamma)^6 \varepsilon^2}\Big ) $;
- the NPG with softmax parameterization, achieving complexity $\frac{2}{(1-\gamma)^2 \varepsilon}$.
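The quantities appearing in these bounds can be computed exactly for small tabular MDPs. The sketch below evaluates the discounted state visitation distribution $d_\rho^\pi$ and the mismatch coefficient $\big \| d_\rho^{\pi^\star}/\mu \big \|_\infty$ from a transition kernel; the arrays `P`, `pi_star`, `rho`, and `mu` are placeholders assumed for this example.

```python
import numpy as np

def visitation_distribution(P, pi, rho, gamma):
    """Discounted state visitation d_rho^pi for a tabular MDP.

    P: transition kernel, shape (S, A, S); pi: policy, shape (S, A);
    rho: initial state distribution, shape (S,).
    Uses d = (1 - gamma) * rho^T (I - gamma * P_pi)^{-1}."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)      # state-to-state kernel under pi
    return (1.0 - gamma) * rho @ np.linalg.inv(np.eye(S) - gamma * P_pi)

def mismatch_coefficient(P, pi_star, rho, mu, gamma):
    """Distribution mismatch coefficient || d_rho^{pi*} / mu ||_inf."""
    d_star = visitation_distribution(P, pi_star, rho, gamma)
    return np.max(d_star / mu)
```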
Entropy Regularization and Iteration Complexity
Entropy regularization addresses the exploration-exploitation dilemma by encouraging the agent to maintain a diverse range of actions. This is achieved by adding an entropy term to the objective function, which penalizes policies that are too deterministic. The entropy term incentivizes exploration, preventing the agent from getting stuck in local optima.
The addition of entropy regularization has a significant impact on the iteration complexity of RL algorithms: by promoting exploration, entropy regularization often reduces the number of iterations required to find a near-optimal policy as RL algorithms avoid getting trapped in suboptimal states or actions. Note, however, that entropy regularization might also increase the total number of iterations needed to achieve absolute convergence, as it encourages exploration beyond what a purely greedy approach might do.
Furthermore, entropy regularization can lead to smoother convergence curves, reducing the likelihood of large fluctuations in performance during training. This is particularly beneficial in complex environments.
Cen
et al. (2022) consider the standard MDP framework $\mathcal{M} = \left( \mathcal{S,A,P},r,\gamma \right)$. The entropy regularized value function is defined as:
$$
V_\tau^\pi (\rho) := V^\pi(\rho) + \tau \mathcal{H}(\rho,\pi),
$$
where $\tau \geqslant 0$ is the regularization parameter, and $\mathcal{H}(\rho,\pi)$ is the discounted entropy defined as:
$$
\mathcal{H}(\rho,\pi) := \frac{1}{1-\gamma} \mathbf{E}_{s \sim d_\rho^\pi} \Big [ \sum_{a \in \mathcal{A}} \pi(a|s) \log\frac{1}{\pi(a|s)} \Big ].
$$
The algorithm is based on the following multiplicative update for the policy
$$
\pi^{(t+1)} (a|s) = \frac{1}{Z^{(t)}(s)} \left ( \pi^{(t)}(a|s) \right )^{1-\frac{\alpha \tau}{1-\gamma}} \exp \Big ( \frac{\alpha Q_\tau^{\pi^{(t)}} (s,a)}{1-\gamma} \Big ),
$$
where
$$
Z^{(t)}(s) = \sum_{a^{'} \in \mathcal{A}} \left ( \pi^{(t)}(a^{'}|s) \right )^{1-\frac{\alpha \tau}{1-\gamma}} \exp \Big( \frac{\alpha Q_\tau^{\pi^{(t)}} (s,a^{'})}{1-\gamma} \Big ) .
$$
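A minimal sketch of this multiplicative update for a single state is given below; the arrays `pi_t` and `q_tau` (the current policy and the regularized action values $Q_\tau^{\pi^{(t)}}(s,\cdot)$ at that state) are placeholders assumed for this example.

```python
import numpy as np

def soft_npg_update(pi_t, q_tau, alpha, tau, gamma):
    """Entropy-regularized softmax NPG update for one state.

    pi_t:  current policy over the actions at state s, shape (A,).
    q_tau: regularized action values Q_tau^{pi^(t)}(s, .), shape (A,).
    Implements pi^(t+1) proportional to pi^(t)^{1 - alpha*tau/(1-gamma)} * exp(alpha*Q/(1-gamma))."""
    exponent = 1.0 - alpha * tau / (1.0 - gamma)
    unnormalized = pi_t ** exponent * np.exp(alpha * q_tau / (1.0 - gamma))
    return unnormalized / unnormalized.sum()    # division by Z^(t)(s)
```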
Under the assumption that $Q_\tau^{\pi^{(t)}}$ is accurately evaluated for every $t$, the entropy-regularized softmax NPG method with the above updating formula converges linearly and requires at most
$$
\frac{1}{\alpha \tau} \log \Big (\frac{C_1 \gamma}{\varepsilon} \Big )
$$
iterations to reach $\| Q^\star_\tau - Q_\tau^{\pi^{(t)}} \|_\infty \leqslant \varepsilon$,
where
$$
C_1 := \| Q^\star_\tau - Q_\tau^{\pi^{(0)}} \|_\infty +2\tau \Big ( 1-\frac{\alpha \tau}{1-\gamma} \Big )
\| \log \pi^\star_\tau- \log \pi^{(0)} \|_\infty.
$$
Note that this convergence is nonasymptotic.
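To get a feel for this bound, it can be evaluated numerically for hypothetical parameter values; the choices of $\gamma$, $\tau$, $\varepsilon$, $C_1$, and the step size below are arbitrary illustrations, not values taken from Cen et al. (2022).

```python
import math

# Hypothetical values, chosen only to illustrate the order of magnitude of the bound.
gamma, tau, eps, C1 = 0.99, 0.01, 1e-3, 10.0
alpha = (1.0 - gamma) / tau   # largest step size keeping the exponent 1 - alpha*tau/(1-gamma) nonnegative

iterations = (1.0 / (alpha * tau)) * math.log(C1 * gamma / eps)
print(f"at most {iterations:.0f} iterations to reach an epsilon-accurate regularized Q-function")
# With these values, 1/(alpha*tau) = 1/(1-gamma) = 100 and log(C1*gamma/eps) ~ 9.2,
# giving roughly 9.2e2 iterations.
```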
Iteration Complexity of Specific MDPs
Lower bounds on the iteration complexity of softmax PG methods, and their dependence on salient parameters of specific MDPs, are of particular interest. Current deep reinforcement learning architectures can be very large, as overparameterized systems enjoy two important properties concerning loss function minimization and the convergence of gradient methods, see, e.g., Belkin (2021), Belkin et al. (2019), Liu et al. (2022), so that some of the parameters of these massive models are of larger magnitude than usual.
Li et al. (2023) construct an MDP with a specific configuration for which the convergence time of softmax PG is exponential.
This MDP is defined as $\mathcal{M}=(\mathcal{S}, \{\mathcal{A}_s\}_{s\in \mathcal{S}},\mathcal{P},r,\gamma )$, where $\mathcal{A}_s\subseteq \{a_0,a_1,a_2\}$ is the action set associated with the state $s\in \mathcal{S}$,
$\mathcal{P}$ is the transition kernel, and the reward function is normalized so that $r(s,a) \in [-1,1]$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}_s$.
Denote by $V^\star$ the optimal value function and by $V^{(t)}$ the value function at the $t$-th iteration of the softmax PG method. The authors make the following standard assumptions: the initial state distribution is uniform, the policy is initialized uniformly, and the analytical form of the gradient is known and used. Furthermore, assuming that $0 < \alpha < (1-\gamma)^{2/5}$, there exist constants $c_1,c_2, c_3 >0$ such that $\forall \gamma \in (0.96,1)$ and $|\mathcal{S}| \geqslant c_3(1-\gamma)^{-6} $, one can find an MDP such that the softmax PG method requires at least
$$
\frac{c_1}{\alpha} \left | \mathcal{S} \right |^{2^{\frac{c_2}{1-\gamma}}}
$$
iterations to reach $\frac{1}{|\mathcal{S}|} \sum_{s\in\mathcal{S}}\big [V^\star(s) -V^{(t)}(s) \big ] \leqslant 0.07$.
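To see how quickly this lower bound grows, one can evaluate the exponent $2^{c_2/(1-\gamma)}$ for hypothetical values; the constant $c_2$ is unspecified in the theorem and is set to $1$ here purely for illustration.

```python
# Illustrative only: c2 is an unspecified constant from the theorem, set to 1 here.
c2 = 1.0
for gamma in (0.97, 0.99):
    exponent = 2.0 ** (c2 / (1.0 - gamma))
    print(f"gamma = {gamma}: |S| is raised to the power 2^(c2/(1-gamma)) ~ {exponent:.3g}")
# gamma = 0.97 gives an exponent of about 1e10; gamma = 0.99 gives about 1.3e30.
```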
References
- Agarwal, A., Kakade, S.M., Lee, J.D. and Mahajan, G., 2021. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98), pp. 1-76. MR4279749
- Ahmed, Z., Le Roux, N., Norouzi, M. and Schuurmans, D., 2019. Understanding the impact of entropy on policy optimization. Proceedings of the 36th International Conference on Machine Learning, PMLR 97, pp. 151-160.
- Amari, S.I., Ba, J., Grosse, R., Li, X., Nitanda, A., Suzuki, T., Wu, D. and Xu, J., 2020. When does preconditioning help or hurt generalization? arXiv preprint arXiv:2006.10732.
- Amari, S.I., 1998. Natural gradient works efficiently in learning. Neural Computation, 10(2), pp. 251-276.
- Amari, S.I., 1998. The natural gradient learning algorithm for neural networks. Theoretical Aspects of Neural Computation (Hong Kong, 1997), pp. 1-15. Springer-Verlag Singapore. MR1655598
- Antos, A., Szepesvári, C. and Munos, R., 2008. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71, pp. 89-129.
- Azar, M.G., Gómez, V. and Kappen, H.J., 2012. Dynamic policy programming. Journal of Machine Learning Research, 13(1), pp. 3207-3245. MR3005885
- Belkin, M., 2021. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica, 30, pp. 203-248. MR4298218
- Belkin, M., Hsu, D., Ma, S. and Mandal, S., 2019. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA, 116(32), pp. 15849-15854. MR3997901
- Bhandari, J. and Russo, D., 2024. Global optimality guarantees for policy gradient methods. Operations Research, 72(5), pp. 1906-1927. MR4818024
- Cen, S., Cheng, C., Chen, Y., Wei, Y. and Chi, Y., 2022. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4), pp. 2563-2578. MR4484422
- Daneshmand, H., Kohler, J., Lucchi, A. and Hofmann, T., 2018. Escaping saddles with stochastic gradients. Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 1155-1164.
- Ding, Y., Zhang, J., Lee, H. and Lavaei, J., 2025. Beyond exact gradients: Convergence of stochastic soft-max policy gradient methods with entropy regularization. IEEE Transactions on Automatic Control, forthcoming.
- Even-Dar, E., Kakade, S.M. and Mansour, Y., 2009. Online Markov decision processes. Mathematics of Operations Research, 34(3), pp. 726-736. MR2555346
- Fazel, M., Ge, R., Kakade, S. and Mesbahi, M., 2018. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 1467-1476.
- Haarnoja, T., Zhou, A., Abbeel, P. and Levine, S., 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 1861-1870.
- Konda, V.R. and Borkar, V.S., 1999. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1), pp. 94-123. MR1740605
- Konda, V. and Tsitsiklis, J., 1999. Actor-critic algorithms. Proceedings of the 13th International Conference on Neural Information Processing Systems, 12, pp. 1008-1014.
- Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming, 198(1), Ser. A, pp. 1059-1106. MR4550970
- Li, G., Wei, Y., Chi, Y., Gu, Y. and Chen, Y., 2023. Softmax policy gradient methods can take exponential time to converge. Mathematical Programming, 201(1-2), Ser. A, pp. 707-802. MR4620239
- Li, H. and He, H., 2024. Multiagent trust region policy optimization. IEEE Transactions on Neural Networks and Learning Systems, 35(9), pp. 12873-12887. MR4796425
- Li, Z., Liu, B., Yang, Z., Wang, Z. and Wang, M., 2023. Double duality: variational primal-dual policy optimization for constrained reinforcement learning. Journal of Machine Learning Research, 24(385), pp. 1-43. MR4720841
- Liu, C., Zhu, L. and Belkin, M., 2022. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59, pp. 85-116. MR4412181
- Łojasiewicz, S., 1962. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles, pp. 87-89. MR0160856
- Mei, J., Xiao, C., Szepesvari, C. and Schuurmans, D., 2020. On the global convergence rates of softmax policy gradient methods. Proceedings of the 37th International Conference on Machine Learning, PMLR 119, pp. 6820-6829.
- Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. and Kavukcuoglu, K., 2016. Asynchronous methods for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, PMLR 48, pp. 1928-1937.
- Poljak, B.T., 1963. Gradient methods for minimizing functionals. Ž. Vyčisl. Mat i Mat. Fiz., 3, pp. 643-653. MR0158568
- Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P., 2015. Trust region policy optimization. Proceedings of the 32nd International Conference on Machine Learning, PMLR 37, pp. 1889-1897.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2018. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
- Shani, L., Efroni, Y. and Mannor, S., 2020. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), pp. 5668-5675.
- Sutton, R.S. and Barto, A.G., 2018. Reinforcement learning: an introduction. Second edition. Adapt. Comput. Mach. Learn., MIT Press, Cambridge, MA. MR3889951
- Sutton, R.S., McAllester, D., Singh, S. and Mansour, Y., 1999. Policy gradient methods for reinforcement learning with function approximation. Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS'99), MIT Press, Cambridge, MA, pp. 1057-1063.
- Teyssière, G., 2023. Review of "Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes". Mathematical Reviews, American Mathematical Society.
- Teyssière, G., 2024. Review of "Li, G., Wei, Y., Chi, Y., Gu, Y. and Chen, Y., 2023. Softmax policy gradient methods can take exponential time to converge." Mathematical Reviews, American Mathematical Society.
- Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, pp. 229-256.
- Williams, R.J. and Peng, J., 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3), pp. 241-268.
- Xiao, L., 2022. On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282), pp. 1-36. MR4577721
- Xu, P., Gao, F. and Gu, Q., 2019. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610.
- Zhang, K., Koppel, A., Zhu, H. and Başar, T., 2020. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6), pp. 3586-3612. MR4182900
- Zhao, S., 2025. Mathematical foundations of reinforcement learning. Springer Verlag, Singapore. MR4882001
- Zhao, F., You, K. and Başar, T., 2023. Global convergence of policy gradient primal-dual methods for risk-constrained LQRs. IEEE Transactions on Automatic Control, 68(5), pp. 2934-2949. MR4583467
Updated May 2025.