Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network
Author(s) -
Hamid Ali,
Hammad Majeed,
Imran Usman,
Khaled A. Almejalli
Publication year - 2021
Publication title -
Wireless Communications and Mobile Computing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.42
H-Index - 64
eISSN - 1530-8677
pISSN - 1530-8669
DOI - 10.1155/2021/9920591
Subject(s) - computer science , randomness , reinforcement learning , entropy (arrow of time) , mathematical optimization , artificial intelligence , principle of maximum entropy , machine learning , mathematics , statistics , physics , quantum mechanics
In reinforcement learning (RL), an agent learns an environment through trial and error. This behavior allows the agent to learn in complex and difficult environments. In RL, the agent normally learns the given environment by exploring or exploiting. Most algorithms suffer from under-exploration in the later stages of the episodes. Recently, an off-policy algorithm called soft actor critic (SAC) was proposed that overcomes this problem by maximizing entropy as it learns the environment. In SAC, the agent tries to maximize entropy along with the expected discounted rewards; that is, it tries to be as random as possible while moving towards the maximum reward. This randomness allows the agent to explore the environment and stops it from getting stuck in local optima. We believe that maximizing the entropy causes overestimation of the entropy term, which results in slow policy learning. This is because of the drastic change in the action distribution whenever the agent revisits similar states. To overcome this problem, we propose a dual policy optimization framework in which two independent policies are trained. Both policies try to maximize entropy by choosing actions against the minimum entropy to reduce the overestimation. The use of two policies results in better and faster convergence. We demonstrate our approach on different well-known continuous control simulated environments. Results show that our proposed technique achieves better results than the state-of-the-art SAC algorithm and learns better policies.
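As a rough illustration of the minimum-entropy idea, the sketch below computes an entropy bonus from two policies and keeps the smaller of the two values. It is only one possible reading of the abstract, not the authors' implementation. The function names (`gaussian_entropy`, `min_entropy_bonus`, `soft_reward`), the use of unsquashed diagonal Gaussian policies, and the temperature value `alpha=0.2` are all assumptions made for this example.

```python
import math


def gaussian_entropy(log_stds):
    """Closed-form entropy of a diagonal Gaussian policy.

    log_stds holds the per-dimension log standard deviations.
    Assumption: no tanh squashing is applied to the actions.
    """
    d = len(log_stds)
    return 0.5 * d * (1.0 + math.log(2.0 * math.pi)) + sum(log_stds)


def min_entropy_bonus(log_stds_a, log_stds_b, alpha):
    """Entropy bonus that uses the smaller of the two policies' entropies.

    Taking the minimum is the assumed mechanism for damping an
    overestimated entropy term; alpha is the entropy temperature.
    """
    h_a = gaussian_entropy(log_stds_a)
    h_b = gaussian_entropy(log_stds_b)
    return alpha * min(h_a, h_b)


def soft_reward(reward, log_stds_a, log_stds_b, alpha=0.2):
    """Reward augmented with the minimum-entropy bonus (hypothetical helper)."""
    return reward + min_entropy_bonus(log_stds_a, log_stds_b, alpha)


if __name__ == "__main__":
    # Policy A is broader (higher entropy) than policy B in this toy state.
    policy_a = [0.0, 0.0]
    policy_b = [-1.0, -0.5]
    print("H(A) =", gaussian_entropy(policy_a))
    print("H(B) =", gaussian_entropy(policy_b))
    print("augmented reward =", soft_reward(1.0, policy_a, policy_b))
```

In this toy setting the bonus follows the narrower policy, so a single policy that suddenly becomes very random in a revisited state cannot inflate the entropy term on its own.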