Sublinear regret for learning POMDPs
Author(s) - Xiong Yi, Chen Ningyuan, Gao Xuefeng, Zhou Xiang
Publication year - 2022
Publication title - Production and Operations Management
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.279
H-Index - 110
eISSN - 1937-5956
pISSN - 1059-1478
DOI - 10.1111/poms.13778
Subject(s) - regret, Markov decision process, oracle, reinforcement learning, sublinear function, computer science, partially observable Markov decision process, mathematical optimization, upper and lower bounds, time horizon, artificial intelligence, Markov chain, machine learning, Markov process, mathematics, Markov model, discrete mathematics, statistics, mathematical analysis, software engineering
We study model‐based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method‐of‐moments estimation for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
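
For reference, regret against this average-reward oracle is typically defined as follows (a standard formulation; the paper's exact definition may differ in minor details):

$$\mathrm{Regret}(T) \;=\; T\,\rho^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right],$$

where $\rho^{*}$ denotes the optimal long-run average reward of the POMDP with known parameters and $r_t$ is the reward collected by the learning algorithm at time $t$. The stated bound says this quantity grows as $O(T^{2/3}\sqrt{\log T})$, which is sublinear in $T$.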
