A least squares temporal difference actor–critic algorithm with applications to warehouse management

Author(s) - Estanjini Reza Moazzez, Li Keyong, Paschalidis Ioannis Ch.

Publication year - 2012

Publication title - Naval Research Logistics (NRL)

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.665

H-Index - 68

eISSN - 1520-6750

pISSN - 0894-069X

DOI - 10.1002/nav.21481

Subject(s) - heuristics, computer science, markov decision process, mathematical optimization, algorithm, set (abstract data type), partially observable markov decision process, state (computer science), dynamic programming, markov chain, markov process, mathematics, markov model, machine learning, statistics, programming language

This article develops a new approximate dynamic programming (DP) algorithm for Markov decision problems and applies it to a vehicle dispatching problem arising in warehouse management. The algorithm is of the actor‐critic type and uses a least squares temporal difference learning method. It operates on a sample‐path of the system and optimizes the policy within a prespecified class parameterized by a parsimonious set of parameters. The method is applicable to a partially observable Markov decision process setting where the measurements of state variables are potentially corrupted, and the cost is only observed through the imperfect state observations. We show that under reasonable assumptions, the algorithm converges to a locally optimal parameter set. We also show that the imperfect cost observations do not affect the policy and the algorithm minimizes the true expected cost. In the warehouse application, the problem is to dispatch sensor‐equipped forklifts in order to minimize operating costs involving product movement delays and forklift maintenance. We consider instances where standard DP is computationally intractable. Simulation results confirm the theoretical claims of the article and show that our algorithm converges more smoothly than earlier actor–critic algorithms while substantially outperforming heuristics used in practice. © 2012 Wiley Periodicals, Inc. Naval Research Logistics, 2012
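To make the least squares temporal difference idea concrete, the following is a minimal sketch of LSTD(0) policy evaluation on a sample path. It is not the article's algorithm: it covers only a critic-style evaluation step for a fixed policy, and it assumes a discounted-cost criterion, a hypothetical three-state chain, and one-hot features, all chosen for illustration. The names `solve_linear`, `lstd0`, and `demo` are hypothetical helpers. In an actor-critic method, an estimate of this kind would then guide an update of the policy parameters, a step that is omitted here.

```python
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstd0(transitions, features, gamma):
    """LSTD(0) weights w solving A w = b, where
    A = sum phi(s) (phi(s) - gamma * phi(s'))^T and b = sum phi(s) * cost."""
    k = len(features(transitions[0][0]))
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for s, cost, s_next in transitions:
        phi, phi_next = features(s), features(s_next)
        for i in range(k):
            b[i] += phi[i] * cost
            for j in range(k):
                A[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve_linear(A, b)


def demo(num_steps=20000, seed=0):
    """Evaluate a fixed policy on a toy chain (assumed data, illustration only)."""
    rng = random.Random(seed)
    P = [[0.7, 0.2, 0.1],   # transition probabilities under the fixed policy
         [0.3, 0.4, 0.3],
         [0.2, 0.3, 0.5]]
    cost = [1.0, 0.0, 2.0]  # one-step cost incurred in each state
    gamma = 0.9             # discount factor (an assumption of this sketch)

    def one_hot(s):
        return [1.0 if i == s else 0.0 for i in range(3)]

    # Generate a single sample path of (state, cost, next state) triples.
    s, transitions = 0, []
    for _ in range(num_steps):
        s_next = rng.choices(range(3), weights=P[s])[0]
        transitions.append((s, cost[s], s_next))
        s = s_next

    w = lstd0(transitions, one_hot, gamma)

    # Exact discounted cost-to-go for comparison: (I - gamma * P) v = cost.
    lhs = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(3)]
           for i in range(3)]
    v_exact = solve_linear(lhs, cost)
    return w, v_exact


if __name__ == "__main__":
    w, v_exact = demo()
    print("LSTD estimate:", [round(x, 3) for x in w])
    print("Exact values: ", [round(x, 3) for x in v_exact])
```

With one-hot features the LSTD estimate coincides with the cost-to-go of the empirical transition model, so it should approach the exact values as the sample path grows. With a lower-dimensional feature map, the same routine returns a projected approximation instead.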
