Sublinear regret for learning POMDPs
Author(s) - Xiong Yi, Chen Ningyuan, Gao Xuefeng, Zhou Xiang
Publication year - 2022
Publication title - Production and Operations Management
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.279
H-Index - 110
eISSN - 1937-5956
pISSN - 1059-1478
DOI - 10.1111/poms.13778
Subject(s) - regret , markov decision process , oracle , reinforcement learning , sublinear function , computer science , partially observable markov decision process , mathematical optimization , upper and lower bounds , time horizon , artificial intelligence , markov chain , machine learning , markov process , mathematics , markov model , discrete mathematics , statistics , mathematical analysis , software engineering
We study model‐based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, measured by the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method‐of‐moments estimation for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
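For context, the following is a minimal sketch of how regret against an average-reward oracle is typically formalized, consistent with the setup described in the abstract; the symbols $\rho^{*}$ (the optimal infinite-horizon average reward under the known-model oracle policy) and $r_t$ (the reward collected by the learner at step $t$) are assumed notation, not taken from the paper.

% Sketch: standard undiscounted regret against an average-reward oracle.
% \rho^* and r_t are assumed notation, not defined in the abstract above.
\[
  \mathrm{Regret}(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_{t},
  \qquad
  \mathbb{E}\big[\mathrm{Regret}(T)\big] \;=\; O\!\big(T^{2/3}\sqrt{\log T}\big).
\]

Under a bound of this form, the per-step regret $\mathbb{E}[\mathrm{Regret}(T)]/T = O(T^{-1/3}\sqrt{\log T})$ vanishes as $T$ grows, which is what makes the guarantee sublinear.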