Scalable network analytics for characterization of outbreak influence in voluminous epidemiology datasets
Author(s) - Shah Naman, Malensek Matthew, Shah Harshil, Pallickara Shrideep, Pallickara Sangmi Lee
Publication year - 2018
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.4998
Subject(s) - computer science, scalability, data science, analytics, data mining, scale (ratio), spark (programming language), set (abstract data type), identification (biology), transmission (telecommunications), field (mathematics), geography, database, cartography, telecommunications, botany, mathematics, pure mathematics, biology, programming language
Summary - Planning for large-scale epidemiological outbreaks in livestock populations often involves executing compute-intensive disease spread simulations. To capture the probabilities of various outcomes, these simulations are executed several times over a collection of representative input scenarios, producing voluminous data. The resulting datasets contain valuable insights, including sequences of events that lead to extreme outbreaks. However, discovering and leveraging such information is also computationally expensive. In this study, we set out to achieve two goals: (1) providing a distributed framework for modeling disease transmission at scale using Spark, including improvements to the default GraphX partitioning strategy, and (2) giving planners and epidemiologists a means to analyze interactions between entities (herds) during simulated disease outbreaks. Using our disease transmission network (DTN), planners or analysts can isolate herds that have a disproportionate effect on epidemiological outcomes, enabling effective allocation of limited resources such as vaccinations and field personnel. We use a representative dataset to verify our approach and optimize the underlying graph partitioning algorithm so that the system scales with increases in dataset size or the number of participating machines. Our analysis includes identification of influential herds as well as the creation of machine learning models for accurate classifications that generalize to other datasets.
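To make the idea of a disease transmission network concrete, the following is a minimal single-machine sketch in Python. It is not the authors' Spark/GraphX implementation, and the influence measure used here (the number of herds reachable downstream of a given herd) is an assumption chosen for illustration; the summary does not state which metric the study uses. The function names and the sample infection events are hypothetical.

```python
from collections import defaultdict, deque


def build_dtn(events):
    """Build a directed graph from (source_herd, target_herd) infection events."""
    graph = defaultdict(set)
    for source, target in events:
        graph[source].add(target)
    return graph


def downstream_count(graph, herd):
    """Count herds reachable from `herd` via breadth-first search, excluding itself."""
    seen = {herd}
    queue = deque([herd])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) - 1


def rank_herds(events):
    """Rank herds by assumed influence: downstream reach, ties broken by name."""
    graph = build_dtn(events)
    herds = {herd for edge in events for herd in edge}
    scores = {herd: downstream_count(graph, herd) for herd in herds}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical infection events from one simulated outbreak.
events = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("E", "A")]
ranking = rank_herds(events)
print(ranking)
```

In this toy outbreak, herd E ranks first because every other herd is infected downstream of it, which is the kind of disproportionate effect a planner might want to target with limited vaccinations. A distributed version would need to partition the graph across machines, which is where the partitioning strategy mentioned in the summary becomes relevant.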