Letter from the Editor
Author(s) -
Ribeiro Celso C.
Publication year - 2007
Publication title -
International Transactions in Operational Research
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.032
H-Index - 52
eISSN - 1475-3995
pISSN - 0969-6016
DOI - 10.1111/j.1475-3995.2007.00603.x
Subject(s) - nomination, publication, editorial board, political science, duty, mandate, impact factor, library science, milestone, public relations, operations research, law, computer science, history, engineering, archaeology
I write to comment on the article by Muralidhar and Sarathy (MS) that appeared in the September 2006 issue of JOS. MS compared a particular form of data perturbation to multiple imputation as methods for limiting the risks of identification disclosures in public use data. Their primary conclusion was that the perturbation techniques are more effective than the multiple imputation techniques because they exactly preserve means and covariances, whereas the multiple imputation approach preserves these quantities only up to random noise. While I compliment MS for comparing different methods of disclosure limitation and for writing a clear and interesting paper, I believe their evaluation was too limited to compare the two approaches adequately. The comparison does not justify the recommendation that the particular method of data perturbation should be preferred to multiple imputation, as I describe below.
Multiple imputation is designed to handle categorical, continuous, or mixed data. For example, research has demonstrated inferentially valid simulation of categorical identifiers like race, sex, and marital status using various forms of logistic regression (Reiter 2005a). It also has been used to protect elaborate, relationally linked data products where specification of the complete data-generating process is not feasible (Abowd and Woodcock 2002). In contrast, the perturbative techniques of MS operate only on continuous variables. They are not appropriate for datasets with many categorical and mixed variables.
Imputation models can mimic the distributions of the data; they need not be confined to linear regressions or normally distributed errors. For example, recent synthesis projects for mixed and highly skewed data are based on CART models (Reiter 2005b) and density regressions (Abowd and Woodcock 2004).
These projects show that it is possible (with reasonable mean squared error) to preserve univariate distributions, maintain interaction and nonlinear effects, and enable valid estimation of sub-domain relationships. In contrast, MS do not provide evidence that the perturbative methods can preserve these and other fine features of distributions.
Even preserving means and covariances may not guarantee that the analyst obtains the same results from the released and original data. If the distributional features are badly distorted in the perturbed data, the analyst could arrive at a model that fits poorly on the original data, because model diagnostics based on the perturbed data may suggest entirely different (and inappropriate) models. This issue has received little attention in the evaluations of disclosure limitation methods.
The multiple imputation approach is, or at least can be, relatively transparent to the public analyst. The agency can release metadata describing the imputation models. When the analyst seeks to estimate relationships that are not included in the imputation models,