An investigation of the effects of hard and soft errors on graphics processing unit‐accelerated molecular dynamics simulations
Author(s) - Betz Robin M., DeBardeleben Nathan A., Walker Ross C.
Publication year - 2014
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.3232
Subject(s) - computer science, graphics processing unit, supercomputer, graphics, gpu cluster, general purpose computing on graphics processing units, parallel computing, computational science, cuda, single precision floating point format, soft error, algorithm, computer graphics (images), floating point, electronic engineering, engineering
SUMMARY Molecular dynamics (MD) simulations rely on the accurate evaluation and integration of Newton's equations of motion to propagate the positions of atoms in proteins during a simulation. As such, one can expect them to be sensitive to any form of numerical error that may occur during a simulation. Increasingly, graphics processing units (GPUs) are being used to accelerate MD simulations. Current GPU architectures designed for high performance computing applications support error‐correcting codes (ECC) that detect and correct single bit‐flip soft error events in GPU memory; however, this error checking carries a penalty in terms of simulation speed. ECC is also a major distinguishing feature between the NVIDIA Tesla cards intended for high performance computing and the considerably more cost‐effective NVIDIA GeForce gaming cards. An argument often put forward for not using GeForce cards is that their results are unreliable because of the lack of ECC. In an initial attempt to quantify these concerns, an investigation of the reproducibility of GPU‐accelerated MD simulations using the AMBER software was conducted on the XSEDE supercomputer Keeneland, a cluster at Los Alamos National Laboratory, and a cluster at the San Diego Supercomputer Center. While the data collected are insufficient to draw solid conclusions and more extensive testing is needed to provide quantitative statistics, the absence of ECC events and the lack of any silent errors in all the simulations conducted to date suggest that these errors are exceedingly rare and that, as such, the time and memory penalty of ECC may outweigh the utility of the error‐checking functionality. However, a considerable amount of error originating from defective hardware was observed, which suggests that rigorous acceptance testing should be performed on new GPU‐based systems by repeatedly running reproducible yet realistic calculations.
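The acceptance testing recommended in the summary, repeatedly running a reproducible calculation and checking that every run agrees, can be illustrated with a minimal sketch. It is not the authors' test harness and does not use AMBER: `toy_trajectory` is a hypothetical stand-in (a velocity Verlet integration of a one-dimensional harmonic oscillator), and `digest` and `find_irreproducible_runs` are illustrative names. The sketch assumes the calculation is deterministic, so that any run whose output differs in even one bit from the majority points to a possible hardware fault.

```python
import hashlib
import struct
from collections import Counter


def toy_trajectory(n_steps=1000, dt=0.01):
    """Hypothetical stand-in for an MD run: velocity Verlet on a 1-D harmonic oscillator."""
    x, v = 1.0, 0.0
    positions = []
    for _ in range(n_steps):
        a = -x                        # force for unit mass and unit spring constant
        v_half = v + 0.5 * dt * a     # half-step velocity update
        x = x + dt * v_half           # full-step position update
        v = v_half + 0.5 * dt * (-x)  # second half-step with the new force
        positions.append(x)
    return positions


def digest(values):
    """Hash the exact bit pattern of every float so a single flipped bit changes the digest."""
    h = hashlib.sha256()
    for value in values:
        h.update(struct.pack("<d", value))
    return h.hexdigest()


def find_irreproducible_runs(run, n_repeats):
    """Repeat `run` and return the indices of runs whose digest differs from the majority."""
    digests = [digest(run()) for _ in range(n_repeats)]
    majority, _ = Counter(digests).most_common(1)[0]
    return [i for i, d in enumerate(digests) if d != majority]


if __name__ == "__main__":
    suspects = find_irreproducible_runs(toy_trajectory, n_repeats=5)
    print("irreproducible runs:", suspects)
```

In a real acceptance test the stand-in would be replaced by a realistic simulation launched on the hardware under test, with the digest computed over its output files. Comparing against the majority digest rather than the first run avoids misclassifying every later run when the first one happens to be the faulty one.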