Optimal tracking agent: a new framework of reinforcement learning for multiagent systems
Author(s) -
Cao Weihua,
Chen Gang,
Chen Xin,
Wu Min
Publication year - 2012
Publication title -
Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.2870
Subject(s) - reinforcement learning , computer science , action selection , curse of dimensionality , bellman equation , convergence , estimator , artificial intelligence , mathematical optimization , q learning , multiagent systems
SUMMARY The curse of dimensionality is a ubiquitous problem in multiagent reinforcement learning: the learning and storage space grows exponentially with the number of agents, which hinders the application of multiagent reinforcement learning. To relieve this problem, we propose a new framework named the optimal tracking agent (OTA). The OTA views the other agents as part of the environment and uses a reduced form to learn the optimal decision. Although merging the other agents into the environment reduces the dimension of the action space, the environment characterized in this form is dynamic and does not satisfy the convergence conditions of reinforcement learning (RL). We therefore develop an estimator to track the dynamics of the environment. The estimator obtains a dynamic model, and model‐based RL can then be used to react optimally to the dynamic environment. Because the Q‐function in the OTA is itself a dynamic process, owing to the other agents' dynamics, it differs from traditional RL, in which learning is a stationary process and the usual action selection mechanisms suit only such stationary processes; we therefore improve the greedy action selection mechanism to adapt to these dynamics, so that the OTA converges. An experiment illustrates the validity and efficiency of the OTA. Copyright © 2012 John Wiley & Sons, Ltd.
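The record contains only the abstract, not the paper's algorithm. As a rough illustration of the approach it describes, the sketch below folds the other agents into a single environment whose transition model is tracked online by a count-based estimator, runs a model-based Q backup against that estimate, and keeps a nonzero exploration rate in the greedy selection so the agent continues to adapt as the environment (i.e., the other agents) drifts. All class and method names are hypothetical; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

class OptimalTrackingAgent:
    """Illustrative sketch (hypothetical names): other agents are treated as
    part of the environment; an estimator tracks its transition dynamics,
    and Q-values are backed up from the estimated model (model-based RL)."""

    def __init__(self, actions, gamma=0.9, epsilon=0.2):
        self.actions = actions
        self.gamma = gamma
        self.epsilon = epsilon  # kept > 0 so exploration persists as dynamics drift
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visits}
        self.rewards = defaultdict(float)                    # (s, a) -> running mean reward
        self.q = defaultdict(float)                          # (s, a) -> value estimate

    def observe(self, s, a, r, s_next):
        """Estimator step: update the tracked model of the (dynamic) environment."""
        self.counts[(s, a)][s_next] += 1
        n = sum(self.counts[(s, a)].values())
        self.rewards[(s, a)] += (r - self.rewards[(s, a)]) / n
        self._backup(s, a)

    def _backup(self, s, a):
        """Model-based Q update using the estimated transition probabilities."""
        total = sum(self.counts[(s, a)].values())
        expected_next = sum(
            c / total * max(self.q[(s2, b)] for b in self.actions)
            for s2, c in self.counts[(s, a)].items()
        )
        self.q[(s, a)] = self.rewards[(s, a)] + self.gamma * expected_next

    def act(self, s):
        """Greedy selection with persistent exploration, so the policy can
        keep re-adapting to the other agents' changing behavior."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])
```

With the full state-action space of all agents replaced by this reduced form, the table the agent stores grows with one agent's action set rather than with the joint action space, which is the dimensionality relief the abstract refers to.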