Open Access
Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity
Author(s) -
Bo Liu,
Ian Gemp,
Mohammad Ghavamzadeh,
Ji Liu,
Sridhar Mahadevan,
Marek Petrik
Publication year - 2018
Publication title -
Journal of Artificial Intelligence Research
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.79
H-Index - 123
eISSN - 1943-5037
pISSN - 1076-9757
DOI - 10.1613/jair.1.11251
Subject(s) - reinforcement learning , temporal difference learning , stochastic approximation , sample complexity , saddle point , rate of convergence , function approximation , mathematical optimization , algorithm , artificial intelligence , computer science , mathematics
In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms used stochastic approximation techniques to prove asymptotic convergence and did not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms is comparable to, and may indeed be preferred over, existing least-squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
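To make the class of algorithms discussed above concrete, the following is a minimal sketch of the standard GTD2 update rule (which the paper reinterprets as a primal-dual saddle-point method). The feature vectors, step sizes, and random toy transitions below are illustrative assumptions for demonstration, not taken from the paper; the point is only the per-step update, which has linear cost in the feature dimension.

```python
import numpy as np

def gtd2_update(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One GTD2 step with linear function approximation.

    theta -- primal weight vector (value-function parameters)
    w     -- auxiliary (dual) weight vector
    phi, phi_next -- feature vectors of the current and next state
    """
    # TD error for this transition
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)
    # Primal update uses the current dual estimate w
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    # Dual update tracks the projected TD error
    w = w + beta * (delta - phi @ w) * phi
    return theta, w

# Toy run on random transitions (illustrative only)
rng = np.random.default_rng(0)
d = 4
theta, w = np.zeros(d), np.zeros(d)
for _ in range(200):
    phi = rng.standard_normal(d)
    phi_next = rng.standard_normal(d)
    reward = rng.standard_normal()
    theta, w = gtd2_update(theta, w, phi, phi_next, reward,
                           gamma=0.9, alpha=0.01, beta=0.02)
print(theta.shape)
```

Each step touches only the d-dimensional feature vectors, which is the linear-complexity property the abstract contrasts with least-squares TD methods.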
