
Proximal Policy Optimization with Future Rewards
Author(s) -
Chengcheng Yu,
Lijun Zhang,
Dawei Yin,
Dezhong Peng,
Haixiao Huang
Publication year - 2021
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/2010/1/012085
Subject(s) - overfitting , reinforcement learning , computer science , stability (learning theory) , asynchronous communication , mathematical optimization , position (finance) , disadvantage , artificial intelligence , trust region , machine learning , algorithm , artificial neural network , mathematics , computer network , radius , computer security , finance , economics
Among current reinforcement learning algorithms, the Policy Gradient (PG) algorithm [7] is one of the most traditional and widely used, but it suffers from unstable gradient estimation. The more recent Proximal Policy Optimization (PPO) algorithm [8] solves this stability problem, but its policy updates are slow and it is prone to overfitting when trained for too many iterations. In this article, a new method is proposed: following Asynchronous Advantage Actor-Critic (A3C) [9], the basic PPO algorithm is trained in parallel, and a method that considers future rewards is introduced, folding future rewards into the current reward. Experimental results on the OpenAI Gym platform, with a robotic arm grasping at arbitrary positions, show that our method achieves faster training while also avoiding overfitting during long-term training.
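The abstract does not give the exact formula for how future rewards are folded into the current reward; a common choice, sketched below under that assumption, is the discounted reward-to-go used as a return target in PPO-style training. The function name and discount factor are illustrative, not taken from the paper.

```python
# Illustrative sketch (not the paper's exact formulation): computing
# discounted rewards-to-go, i.e. folding future rewards into each step's
# current reward, as commonly done for PPO return/advantage targets.

def rewards_to_go(rewards, gamma=0.99):
    """Return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for each t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each step reuses the already-discounted tail.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: with gamma = 0.5, G_0 = 1 + 0.5*2 + 0.25*3 = 2.75
print(rewards_to_go([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

A larger discount factor (closer to 1) weights distant future rewards more heavily, which matches the paper's goal of letting future rewards influence the current one.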