Building reliable and efficient data transfer and processing pipelines
Author(s) - Kosar T., Kola G., Livny M.
Publication year - 2006
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.969
Subject(s) - terabyte, computer science, pipeline (software), process (computing), replicate, data transmission, data flow diagram, data processing, software, distributed computing, real time computing, database, computer hardware, operating system
Abstract - Scientific distributed applications have an increasing need to process and move large amounts of data across wide area networks. Existing systems either closely couple computation and data movement, or they require substantial human involvement during the end-to-end process. We propose a framework that enables scientists to build reliable and efficient data transfer and processing pipelines. Our framework provides a universal interface to different data transfer protocols and storage systems. It has sophisticated flow control and recovers automatically from network, storage system, software, and hardware failures. We successfully used data pipelines to replicate and process three terabytes of the DPOSS astronomy image dataset and several terabytes of the WCER educational video dataset. In both cases, the entire process was performed without any human intervention, and the data pipeline recovered automatically from various failures. Copyright © 2005 John Wiley & Sons, Ltd.
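To make the abstract's two central claims concrete (a universal interface over different transfer protocols, plus automatic recovery from transient failures), the following is a minimal illustrative sketch in Python. The paper does not specify an implementation language, and every name here (TransferProtocol, HttpTransfer, reliable_transfer) and the retry/backoff policy are hypothetical assumptions for illustration, not the authors' actual framework.

import time
from abc import ABC, abstractmethod


class TransferProtocol(ABC):
    """Uniform interface hiding protocol- and storage-specific details."""

    @abstractmethod
    def transfer(self, src: str, dst: str) -> None:
        ...


class HttpTransfer(TransferProtocol):
    """One concrete backend; others (e.g. FTP-style protocols) would plug in
    behind the same interface without changing pipeline logic."""

    def transfer(self, src: str, dst: str) -> None:
        # Placeholder body: a real backend would stream src to dst here.
        print(f"HTTP transfer: {src} -> {dst}")


def reliable_transfer(proto: TransferProtocol, src: str, dst: str,
                      max_retries: int = 5, backoff_s: float = 2.0) -> None:
    """Retry a transfer until it succeeds, so the pipeline survives
    transient network or storage failures without human intervention."""
    for attempt in range(1, max_retries + 1):
        try:
            proto.transfer(src, dst)
            return
        except Exception:
            if attempt == max_retries:
                raise  # give up only after exhausting all retries
            # Exponential backoff before the next attempt (assumed policy).
            time.sleep(backoff_s * 2 ** (attempt - 1))


if __name__ == "__main__":
    reliable_transfer(HttpTransfer(), "site-a:/data/img001", "site-b:/data/img001")

Because every backend implements the same transfer interface, the retry logic in reliable_transfer is written once and reused across protocols and storage systems, which is the decoupling of data movement from computation that the abstract argues for.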