Adapting Behavior via Intrinsic Reward: A Survey and Empirical Study
Author(s) - Cam Linke, Nadia M. Ady, Martha White, Thomas Degris, Adam White
Publication year - 2020
Publication title - Journal of Artificial Intelligence Research
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.79
H-Index - 123
eISSN - 1943-5037
pISSN - 1076-9757
DOI - 10.1613/jair.1.12087
Subject(s) - reinforcement learning, computer science, active learning (machine learning), artificial intelligence, introspection, proactive learning, representation learning, testbed, machine learning, cognitive psychology, robot learning, psychology
Abstract - Learning about many things can provide numerous benefits to a reinforcement learning system. For example, learning many auxiliary value functions, in addition to optimizing the environmental reward, appears to improve both exploration and representation learning. The question we tackle in this paper is how to sculpt the stream of experience—how to adapt the learning system’s behavior—to optimize the learning of a collection of value functions. A simple answer is to compute an intrinsic reward based on the statistics of each auxiliary learner, and use reinforcement learning to maximize that intrinsic reward. Unfortunately, implementing this simple idea has proven difficult, and thus has been the focus of decades of study. It remains unclear which of the many possible measures of learning would work well in a parallel learning setting where environmental reward is extremely sparse or absent. In this paper, we investigate and compare different intrinsic reward mechanisms in a new bandit-like parallel-learning testbed. We discuss the interaction between reward and prediction learners and highlight the importance of introspective prediction learners: those that increase their rate of learning when progress is possible, and decrease it when progress is not. We provide a comprehensive empirical comparison of 14 different rewards, including well-known ideas from reinforcement learning and active learning. Our results highlight a simple but seemingly powerful principle: intrinsic rewards based on the amount of learning can generate useful behavior, if each individual learner is introspective.

1. Balancing the Needs of Many Learners

Learning about many things can provide numerous benefits to a reinforcement learning system. Adding many auxiliary losses to a deep learning system can act as a regularizer on the representation, ultimately resulting in better final performance in reward-maximization problems, as demonstrated with Unreal (Jaderberg et al., 2016). A collection of value functions encoding goal-directed behaviors can be combined to generate new policies that generalize to goals unseen during training (Schaul et al., 2015). Learning in hierarchical robot-control problems can be improved with persistent exploration, given call-return execution of a collection of subgoal policies or skills (Riedmiller et al., 2018), even if those policies are imperfectly learned. In all these examples, a collection of general value functions (see Sutton et al., 2011) is updated from a single stream of experience.

The question we tackle in this paper is how to sculpt that stream of experience—how to adapt the learning system’s behavior—to optimize the learning of a collection of value functions. One answer is to simply maximize the environmental (extrinsic) reward. This was the approach explored in Unreal, and it resulted in significant performance improvements in challenging visual navigation problems. However, it is not hard to imagine situations where this approach would be limited. In general, the reward may be delayed and sparse: what should the agent do in the absence of external (environmental) motivations? Learning reusable knowledge such as skills (Sutton et al., 1999) or a model of the world might result in more long-term reward. Such auxiliary learning objectives could emerge automatically during learning (Silver et al., 2017). Most agent architectures, however, include explicit skill and model learning components.
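To make this parallel-learning setting concrete, the sketch below shows several general value function (GVF) learners updating from one shared stream of experience. It is a minimal illustration under our own assumptions: tabular TD(0) learners, hand-picked cumulants, and random stand-in dynamics. The class and variable names are ours, not the paper's code.

```python
import numpy as np

class GVFLearner:
    """One prediction learner: a GVF with its own cumulant and discount."""

    def __init__(self, n_states, gamma, step_size=0.1):
        self.v = np.zeros(n_states)  # value estimate per state
        self.gamma = gamma           # discount for this GVF
        self.alpha = step_size

    def update(self, s, cumulant, s_next):
        # TD(0) update toward this GVF's own cumulant signal.
        td_error = cumulant + self.gamma * self.v[s_next] - self.v[s]
        self.v[s] += self.alpha * td_error
        return abs(td_error)  # a per-learner statistic an intrinsic reward could use

# Hypothetical setup: 5 states, 3 GVFs with different cumulants and discounts.
n_states = 5
cumulants = [lambda s: float(s == 0),        # "did I reach state 0?"
             lambda s: float(s),             # state index as a signal
             lambda s: np.random.randn()]    # pure noise: unlearnable
learners = [GVFLearner(n_states, g) for g in (0.0, 0.9, 0.5)]

s = 0
for t in range(1000):
    s_next = np.random.randint(n_states)     # stand-in for environment dynamics
    for learner, c in zip(learners, cumulants):
        learner.update(s, c(s_next), s_next) # every learner sees the same transition
    s = s_next
```

The point of the sketch is only that all learners consume the same transitions, so the behavior that generates those transitions determines how well each of them can learn.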
It seems natural that progress towards these auxiliary learning objectives could positively influence the agent’s behavior, resulting in improved learning overall.

Learning many value functions off-policy from a shared stream of experience—with function approximation and an unknown environment—provides a natural setting to investigate no-reward intrinsically motivated learning. The basic idea is simple. The aim is to accurately estimate many value functions, each with an independent learner—where there is no external reward signal. Directly optimizing the data collection for all learners jointly is difficult, both because we cannot directly measure this total learning objective and because actions have only an indirect impact on learning efficiency.

There is a large related literature in active learning (Cohn et al., 1996; Balcan et al., 2009; Settles, 2009; Golovin and Krause, 2011; Konyushkova et al., 2017) and active perception (Bajcsy et al., 2018) from which to draw inspiration, but neither applies directly to this problem. In active learning, the agent must select which points to label from a larger pool of items. Active perception is a subfield of vision and robotics. Much of the work in active perception has focused on specific settings—namely visual attention (Bylinskii et al., 2015), localization in robotics (Patten et al., 2018), and sensor selection (Satsangi et al., 2018, 2020)—or assumes knowledge of the dynamics of the world (see Bajcsy et al., 2018).

An alternative strategy is to formulate our task as a reinforcement learning problem. We can use an intrinsic reward, internal to the learning system, that approximates the total learning across all learners. The behavior can then be adapted to choose actions in each state that maximize the intrinsic reward, towards the goal of maximizing the total learning of the system. The choice of intrinsic reward can have a significant impact on the sample efficiency of such intrinsically motivated learning systems. This paper provides the first formulation of parallel value function learning as a reinforcement learning task.

Fortunately, there are many ideas from related areas that can inform our choice of intrinsic rewards. Rewards computed from internal statistics about the learning process have been explored in many contexts over the years. Intrinsic rewards have been shown to induce behavior that resembles the development stages exhibited by young humans and animals (Barto, 2013;
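As a hedged illustration of this formulation, the sketch below casts the problem as a simple bandit, loosely in the spirit of the paper's bandit-like testbed: each arm samples one prediction target, that target's learner updates, and the behavior maximizes an intrinsic reward equal to the magnitude of the learner's weight change, one of the learning-based rewards the paper compares. The specific targets, step sizes, and epsilon-greedy behavior are our own assumptions, not the paper's experimental details.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 3
targets = [lambda: 1.0,                   # constant: quickly learned, then no progress
           lambda: rng.normal(5.0, 1.0),  # noisy but learnable mean
           lambda: rng.normal(0.0, 10.0)] # high-noise distractor

estimates = np.zeros(n_arms)  # each arm's prediction learner (a running mean)
alpha = 0.1                   # prediction step size
q = np.zeros(n_arms)          # behavior's estimate of intrinsic reward per arm
beta, eps = 0.1, 0.1          # behavior step size and exploration rate

for t in range(5000):
    # Epsilon-greedy behavior over intrinsic-reward estimates.
    a = rng.integers(n_arms) if rng.random() < eps else int(np.argmax(q))
    y = targets[a]()                  # sample the chosen prediction target
    delta = y - estimates[a]
    estimates[a] += alpha * delta     # prediction learner update
    r_int = abs(alpha * delta)        # intrinsic reward: |weight change|
    q[a] += beta * (r_int - q[a])     # behavior learns where learning is happening
```

Note that with this naive running-mean learner, the high-noise third arm keeps producing large weight changes, so the behavior is drawn toward unlearnable noise. This is precisely why the paper emphasizes introspective learners: if each learner's rate of learning falls when no further progress is possible, a weight-change reward stops paying out on noise.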