Scalable reinforcement learning on Cray XC
Author(s) -
Kommaraju Ananda V.,
Maschhoff Kristyn J.,
Ringenburg Michael F.,
Robbins Benjamin
Publication year - 2020
Publication title -
Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.5636
Subject(s) - reinforcement learning, deep learning, artificial intelligence, computer science, scalability, workflow, distributed computing, parallel computing, computer architecture
Summary - Recent advancements in deep learning have made reinforcement learning (RL) applicable to a much broader range of decision-making problems. However, the emergence of reinforcement learning workloads brings multiple challenges to system resource management. RL applications continuously train a deep learning or machine learning model while interacting with uncertain simulation models. This new generation of AI applications imposes significant demands on system resources such as memory, storage, network, and compute. In this paper, we describe a typical RL application workflow and introduce the Ray distributed execution framework developed at the UC Berkeley RISELab. Ray includes the RLlib library for executing distributed reinforcement learning applications. We describe a recipe for deploying the Ray execution framework on Cray XC systems and demonstrate the scaling of RLlib algorithms across multiple nodes of the system. We also explore performance characteristics across multiple CPU and GPU node types.
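
To make the workflow in the abstract concrete, the sketch below drives a distributed RLlib training run from the head node of a Ray cluster. It is a minimal illustration, not the paper's actual deployment recipe: it assumes a Ray cluster has already been started across the allocated nodes (e.g., `ray start --head` on one node and `ray start --address=<head-ip>:6379` on the rest), and the environment, algorithm, and worker/GPU counts are placeholder choices.

```python
# Minimal sketch of a distributed RLlib training run, assuming a Ray cluster
# is already running across the allocated nodes. The environment, algorithm,
# and worker/GPU counts below are illustrative, not the paper's configuration.
import ray
from ray.rllib.agents.ppo import PPOTrainer

# Attach this driver script to the existing Ray cluster (head + worker nodes).
ray.init(address="auto")

trainer = PPOTrainer(config={
    "env": "CartPole-v0",   # placeholder environment
    "num_workers": 32,      # rollout workers distributed across cluster nodes
    "num_gpus": 0,          # set > 0 on GPU nodes to accelerate the learner
})

# Each train() call performs one iteration of distributed rollouts followed
# by a centralized policy update on the driver.
for i in range(10):
    result = trainer.train()
    print(f"iter {i}: episode_reward_mean = {result['episode_reward_mean']}")

ray.shutdown()
```

Under these assumptions, scaling across additional nodes amounts to raising `num_workers` so that Ray's resource-aware scheduler places rollout actors onto the extra nodes.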
