Dataflow management, dynamic load balancing, and concurrent processing for real‐time embedded vision applications using Quasar
Author(s) - Goossens Bart
Publication year - 2018
Publication title - International Journal of Circuit Theory and Applications
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.364
H-Index - 52
eISSN - 1097-007X
pISSN - 0098-9886
DOI - 10.1002/cta.2494
Subject(s) - computer science , parallel computing , compiler , concurrency , CUDA , dataflow , kernel (computing) , load balancing (computing) , runtime system , shared memory , programming paradigm , distributed computing , programming language
Summary - Programming modern embedded vision systems brings various challenges, owing to the steep learning curve for programmers and the differing characteristics of the devices. Quasar, a new high-level programming language and development environment, considerably simplifies this development. Quasar has a compiler that detects and optimizes parallel programming patterns and a heterogeneous runtime that distributes the computational load over the available compute devices (CPUs and graphics processing units [GPUs]). In this paper, we focus on the runtime aspects of Quasar. We show that, to a good approximation, the execution time of a GPU kernel function can be factorized into a compile-time-specific component and a runtime-specific component. We show that this approximation leads to a computationally simple runtime load-balancing rule. Moreover, the load-balancing rule permits efficient implicit concurrency of kernel functions and automatic scaling to multiple compute devices (e.g., multi-CPU/GPU systems). Based on an appropriate mathematical scheduling model, we investigate the command-queue-size trade-off between memory usage and device utilization. The result is a programming environment for embedded vision systems in which automatic parallelization and implicit concurrency detection allow scaling a program efficiently to multi-CPU/GPU systems. Finally, benchmark results are provided to demonstrate the performance of our approach compared with OpenACC and CUDA (Compute Unified Device Architecture).
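The abstract does not spell out the load-balancing rule itself. As a rough illustration only, the following Python sketch assumes the factorization takes a multiplicative form, T(kernel, device) ≈ c_kernel · r_device, where c_kernel is a compile-time, kernel-specific cost factor and r_device a runtime, device-specific rate; under that assumption a greedy rule can dispatch each kernel to the device with the earliest estimated finish time. All identifiers here (Device, dispatch, rate, queued) are hypothetical and are not Quasar's actual API.

```python
# Illustrative sketch, not Quasar's implementation: greedy dispatch under
# the assumed factorized cost model T(kernel, device) ~ c_kernel * r_device.

from dataclasses import dataclass

@dataclass
class Device:
    name: str
    rate: float          # assumed runtime-specific factor r_device (seconds per cost unit)
    queued: float = 0.0  # estimated time until this device's command queue drains

def dispatch(kernel_cost: float, devices: list[Device]) -> Device:
    """Pick the device that would finish this kernel soonest:
    current backlog plus the factorized execution-time estimate."""
    best = min(devices, key=lambda d: d.queued + kernel_cost * d.rate)
    best.queued += kernel_cost * best.rate  # account for the newly queued work
    return best

if __name__ == "__main__":
    devices = [Device("CPU", rate=1.0), Device("GPU", rate=0.05)]
    # Hypothetical compile-time cost factors for a stream of kernels.
    for cost in [10.0, 200.0, 5.0, 300.0, 50.0]:
        d = dispatch(cost, devices)
        print(f"cost={cost:6.1f} -> {d.name} (backlog now {d.queued:.2f}s)")
```

In this toy model, capping the `queued` backlog per device would correspond to bounding the command queue size, which is the memory-usage versus device-utilization trade-off the paper analyzes with its scheduling model.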