
Computational Benefits of Intermediate Rewards for Goal-Reaching Policy Learning
Author(s) -
Yuexiang Zhai,
Christina Baek,
Zhengyuan Zhou,
Jiantao Jiao,
Yue Ma
Publication year - 2022
Publication title -
journal of artificial intelligence research/the journal of artificial intelligence research
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.79
H-Index - 123
eISSN - 1943-5037
pISSN - 1076-9757
DOI - 10.1613/jair.1.13326
Subject(s) - computer science , reinforcement learning , shortest path problem , path (computing) , convergence (economics) , computational complexity theory , artificial intelligence , state (computer science) , mathematical optimization , machine learning , theoretical computer science , algorithm , mathematics , graph , economics , programming language , economic growth
Many goal-reaching reinforcement learning (RL) tasks have empirically verified that rewarding the agent on subgoals improves convergence speed and practical performance. We attempt to provide a theoretical framework to quantify the computational benefits of rewarding the completion of subgoals, in terms of the number of synchronous value iterations. In particular, we consider subgoals as one-way intermediate states, which can only be visited once per episode and propose two settings that consider these one-way intermediate states: the one-way single-path (OWSP) and the one-way multi-path (OWMP) settings. In both OWSP and OWMP settings, we demonstrate that adding intermediate rewards to subgoals is more computationally efficient than only rewarding the agent once it completes the goal of reaching a terminal state. We also reveal a trade-off between computational complexity and the pursuit of the shortest path in the OWMP setting: adding intermediate rewards significantly reduces the computational complexity of reaching the goal but the agent may not find the shortest path, whereas with sparse terminal rewards, the agent finds the shortest path at a significantly higher computational cost. We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and some popular deep RL algorithms.