# Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning

@article{Zanette2021ProvableBO,
  title   = {Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning},
  author  = {Andrea Zanette and Martin J. Wainwright and Emma Brunskill},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2108.08812}
}

Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically. We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor’s policies; this is a more general setting than the low-rank MDP model…
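The pessimism principle mentioned in the abstract can be illustrated with a deliberately tiny sketch (this is not the paper's algorithm, and `beta` and the one-state setup are illustrative assumptions): penalize each action's empirical value by a count-based uncertainty term, then act greedily on the resulting lower confidence bound.

```python
import numpy as np

def pessimistic_policy(dataset, n_actions, beta=1.0):
    """Toy illustration of the pessimism principle on a one-state
    offline problem: penalize each action's empirical mean reward by a
    count-based uncertainty term, then act greedily on the lower
    confidence bound.  `beta` is a hypothetical confidence width."""
    counts = np.zeros(n_actions)
    sums = np.zeros(n_actions)
    for a, r in dataset:  # (action, reward) pairs logged by the behavior policy
        counts[a] += 1
        sums[a] += r
    q_hat = np.divide(sums, counts,
                      out=np.full(n_actions, -np.inf), where=counts > 0)
    # Lower confidence bound: rarely observed actions are penalized heavily.
    q_lcb = q_hat - beta / np.sqrt(np.maximum(counts, 1e-8))
    return int(np.argmax(q_lcb))

# Action 0 is well covered with modest reward; action 1 was tried once
# with a lucky draw.  Pessimism prefers the well-covered action.
data = [(0, 0.5)] * 50 + [(1, 1.0)]
chosen = pessimistic_policy(data, n_actions=2)
```

A greedy policy on the raw empirical means would pick action 1 here; the penalty term is exactly what prevents the offline learner from exploiting a single lucky sample.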

#### 2 Citations

Representation Learning for Online and Offline RL in Low-rank MDPs

- Computer Science, Mathematics
- ArXiv
- 2021

REP-UCB (Upper Confidence Bound driven REPresentation learning for RL), an algorithm that significantly improves the sample complexity of, and is simpler than, FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation.

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

- Computer Science, Mathematics
- 2021

We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using data collected by a policy…

#### References

Showing 1–10 of 93 references

Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

- Computer Science
- ICML
- 2021

Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and accordingly down-weights their contribution in the training objectives, is proposed; UWAC is observed to substantially improve model stability during training.

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

- Computer Science, Mathematics
- ICML
- 2016

This work extends the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators.
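The one-step (bandit) building block that this work extends to sequential decisions can be sketched as follows; the function name and shapes are illustrative, not from the cited paper. The estimator combines a model-based term with an importance-weighted residual, so it is unbiased whenever either the reward model or the importance weights are correct.

```python
import numpy as np

def doubly_robust_bandit(logs, pi, mu, q_hat):
    """Doubly robust off-policy value estimate for a one-step (bandit)
    problem.  `pi`/`mu` are target/behavior action probabilities and
    `q_hat` is any (possibly biased) reward model."""
    values = []
    for a, r in logs:  # logged (action, reward) pairs drawn from mu
        direct = sum(pi[b] * q_hat[b] for b in range(len(pi)))  # model-based term
        correction = (pi[a] / mu[a]) * (r - q_hat[a])           # importance-weighted residual
        values.append(direct + correction)
    return float(np.mean(values))

# With an accurate reward model the correction term vanishes and the
# estimate is purely model-based; with exact importance weights the
# estimator stays unbiased even when q_hat is wrong.
v = doubly_robust_bandit([(0, 1.0), (1, 0.0)],
                         pi=[1.0, 0.0], mu=[0.5, 0.5], q_hat=[1.0, 0.0])
```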

Provably Good Batch Reinforcement Learning Without Great Exploration

- Computer Science, Mathematics
- ArXiv
- 2020

It is shown that a small modification of the Bellman optimality and evaluation back-ups toward a more conservative update yields much stronger guarantees on the performance of the output policy; in certain settings, the resulting algorithms can find an approximately best policy within the state-action space explored by the batch data, without requiring a priori concentrability assumptions.

Is Pessimism Provably Efficient for Offline RL?

- Computer Science, Mathematics
- ICML
- 2021

A pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function and establishes a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs).
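A penalized Bellman backup in the spirit of PEVI can be sketched in tabular form; all shapes, names, and the 1/√count penalty here are illustrative assumptions, not the cited paper's exact uncertainty quantifier.

```python
import numpy as np

def pevi_backup(q_next, rewards, trans, counts, gamma=0.9, beta=1.0):
    """One step of pessimistic value iteration: the empirical Bellman
    backup is penalized by an uncertainty quantifier that scales as
    1/sqrt(count).  Shapes are illustrative: rewards[s, a],
    trans[s, a, s'], counts[s, a] are estimated from offline data."""
    v_next = q_next.max(axis=1)                        # greedy value at the next step
    backup = rewards + gamma * trans @ v_next          # empirical Bellman backup
    penalty = beta / np.sqrt(np.maximum(counts, 1.0))  # data-dependent penalty Gamma
    return np.maximum(backup - penalty, 0.0)           # truncate below at zero

# Two actions with identical empirical backups: the poorly covered one
# receives a larger penalty and hence a lower pessimistic value.
q = pevi_backup(q_next=np.zeros((1, 2)),
                rewards=np.array([[1.0, 1.0]]),
                trans=np.ones((1, 2, 1)),
                counts=np.array([[100.0, 1.0]]))
```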

Behavior Regularized Offline Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2019

A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

- Computer Science, Mathematics
- ICML
- 2018

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.

AlgaeDICE: Policy Gradient from Arbitrary Experience

- Computer Science
- ArXiv
- 2019

A new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution, and shows that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting.

Policy Gradient Methods for Reinforcement Learning with Function Approximation

- Mathematics, Computer Science
- NIPS
- 1999

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Provably Efficient Reinforcement Learning with Linear Function Approximation

- Computer Science, Mathematics
- COLT
- 2020

This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves regret of order √(d³H³T) up to logarithmic factors, where d is the ambient dimension of the feature space, H is the length of each episode, and T is the total number of steps; the bound is independent of the number of states and actions.
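The optimism in this line of work typically enters through an elliptical confidence bonus on top of regularized least squares; a minimal sketch of that bonus computation follows, with illustrative names and a hypothetical `beta` width (the paper's actual constants depend on d, H, and T).

```python
import numpy as np

def elliptical_bonus(features, lam=1.0, beta=1.0):
    """Elliptical confidence bonus used by optimistic least-squares
    value iteration: for feature map phi,
        bonus(s, a) = beta * sqrt(phi^T Lambda^{-1} phi),
    with Lambda = lam * I + sum_i phi_i phi_i^T over the observed data.
    `features` holds one d-dimensional row per observed (s, a) pair."""
    d = features.shape[1]
    Lambda = lam * np.eye(d) + features.T @ features
    inv = np.linalg.inv(Lambda)
    # Per-row quadratic form phi^T Lambda^{-1} phi.
    return beta * np.sqrt(np.einsum('id,de,ie->i', features, inv, features))

# The bonus shrinks in directions where data accumulates: a direction
# observed 100 times gets a much smaller bonus than one observed once.
phi = np.array([[1.0, 0.0]])
b_once = elliptical_bonus(phi)
b_many = elliptical_bonus(np.repeat(phi, 100, axis=0))
```

The same quadratic form reappears with the opposite sign in pessimistic offline methods, which is one way to read the connection between this reference and the main paper.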

Bellman-consistent Pessimism for Offline Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2021

The notion of Bellman-consistent pessimism for general function approximation is introduced: instead of calculating a point-wise lower bound for the value function, pessimism is implemented at the initial state over the set of functions consistent with the Bellman equations.