
Better models by discarding data?
Author(s) - Diederichs K., Karplus P. A.
Publication year - 2013
Publication title - Acta Crystallographica Section D
Language(s) - English
Resource type - Journals
ISSN - 1399-0047
DOI - 10.1107/S0907444913001121
In macromolecular X-ray crystallography, typical data sets have substantial multiplicity. This can be used to calculate the consistency of repeated measurements and thereby to assess data quality. Recently, the properties of a correlation coefficient, CC1/2, that can be used for this purpose were characterized, and it was shown that CC1/2 has superior properties compared with 'merging' R values. A derived quantity, CC*, links data quality and model quality. Using experimental data sets, CC1/2 and the more conventional indicators were compared in two situations of practical importance: merging data sets from different crystals and selectively rejecting weak observations or (merged) unique reflections from a data set. In these situations, controlled 'paired-refinement' tests show that even though discarding the weaker data leads to improvements in the merging R values, the refined models based on these data are of lower quality. These results show the folly of data-filtering practices aimed at improving the merging R values. Interestingly, in all of these tests CC1/2 is the one data-quality indicator whose behaviour accurately reflects which of the alternative data-handling strategies yields the best-quality refined model. Its properties in the presence of systematic error are also documented and discussed.
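As background for the two statistics named in the abstract, here is a minimal sketch: CC1/2 is the Pearson correlation between the averaged intensities of two randomly chosen half-data sets, and CC* = sqrt(2*CC1/2 / (1 + CC1/2)) converts it into an estimate of the correlation of the merged data with the (unmeasurable) true signal, as defined by Karplus & Diederichs (2012). The function names and the synthetic test data below are illustrative assumptions, not code from the paper.

```python
import numpy as np

def cc_half(i_half1, i_half2):
    # Pearson correlation between the mean intensities of two random
    # half-data sets (one value per unique reflection).
    return np.corrcoef(i_half1, i_half2)[0, 1]

def cc_star(cc12):
    # CC* = sqrt(2*CC1/2 / (1 + CC1/2)): estimates the correlation of the
    # merged data against the true signal, putting data quality on the
    # same scale as model-vs-data correlations.
    return np.sqrt(2.0 * cc12 / (1.0 + cc12))

# Synthetic example (hypothetical data): a common true signal plus
# independent noise in each half-data set.
rng = np.random.default_rng(0)
true_i = rng.exponential(scale=100.0, size=5000)     # "true" intensities
half1 = true_i + rng.normal(0.0, 40.0, size=5000)    # half-set 1 with noise
half2 = true_i + rng.normal(0.0, 40.0, size=5000)    # half-set 2 with noise

cc12 = cc_half(half1, half2)
print(f"CC1/2 = {cc12:.3f}, CC* = {cc_star(cc12):.3f}")
```

Because CC* is expressed on the same scale as correlations of the model against the data, it provides the link between data quality and model quality that the abstract refers to.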