We show two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton 2021a) and RVI Q-learning (Abounadi, Bertsekas, & Borkar 2001), converge in weakly communicating MDPs. Weakly communicating MDPs are the most general MDPs that can be solved by a learning algorithm with a single stream of experience. The original convergence proofs of the two algorithms require that the solution set of the average-reward optimality equation only has one degree of freedom, which is not necessarily true for weakly communicating MDPs. To the best of our knowledge, our results are the first showing average-reward off-policy control algorithms converge in weakly communicating MDPs. As a direct extension, we show that average-reward options algorithms for temporal abstraction introduced by Wan, Naik, & Sutton (2021b) converge if the Semi-MDP induced by options is weakly communicating.

On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly Communicating MDPs