Premium
A data dependency based strategy for intermediate data storage in scientific cloud workflow systems
Author(s) -
Yuan Dong,
Yang Yun,
Liu Xiao,
Zhang Gaofeng,
Chen Jinjun
Publication year - 2010
Publication title -
concurrency and computation: practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.1636
Subject(s) - workflow , cloud computing , computer science , cloud storage , reuse , distributed computing , cloud service provider , database , operating system , engineering , cloud computing security , waste management
SUMMARY Many scientific workflows are data intensive where large volumes of intermediate data are generated during their execution. Some valuable intermediate data need to be stored for sharing or reuse. Traditionally, they are selectively stored according to the system storage capacity, determined manually. As doing science in the cloud has become popular nowadays, more intermediate data can be stored in scientific cloud workflows based on a pay‐for‐use model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate data can be regenerated, and as such we develop a novel intermediate data storage strategy that can reduce the cost of scientific cloud workflow systems by automatically storing appropriate intermediate data sets with one cloud service provider. The strategy has significant research merits, i.e. it achieves a cost‐effective trade‐off of computation cost and storage cost and is not strongly impacted by the forecasting inaccuracy of data sets' usages. Meanwhile, the strategy also takes the users' tolerance of data accessing delay into consideration. We utilize Amazon's cost model and apply the strategy to general random as well as specific astrophysics pulsar searching scientific workflows for evaluation. The results show that our strategy can reduce the overall cost of scientific cloud workflow execution significantly. Copyright © 2010 John Wiley & Sons, Ltd.