Evaluating Common De-Identification Heuristics for Personal Health Information



Open Access


Author(s) - Khaled El Emam, Sam Jabbouri, Scott Sams, Youenn Drouet, Michael Power

Publication year - 2006

Publication title - Journal of Medical Internet Research

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 1.446

H-Index - 142

eISSN - 1439-4456

pISSN - 1438-8871

DOI - 10.2196/jmir.8.4.e28

Subject(s) - observational study, heuristics, health insurance portability and accountability act, identification (biology), protected health information, agency (philosophy), personally identifiable information, confidentiality, internet privacy, accountability, data science, computer science, medicine, nursing, computer security, hrhis, public health, health policy, political science, sociology, botany, pathology, biology, operating system, social science, law

Background: With the growing adoption of electronic medical records, there are increasing demands for the use of this electronic clinical data in observational research. A frequent ethics board requirement for such secondary use of personal health information in observational research is that the data be de-identified. De-identification heuristics are provided in the Health Insurance Portability and Accountability Act Privacy Rule, funding agency and professional association privacy guidelines, and common practice.

Objective: The aim of the study was to evaluate whether the re-identification risks due to record linkage are sufficiently low when following common de-identification heuristics and whether the risk is stable across sample sizes and data sets.

Methods: Two methods were followed to construct identification data sets. Re-identification attacks were simulated on these. For each data set we varied the sample size down to 30 individuals, and for each sample size evaluated the risk of re-identification for all combinations of quasi-identifiers. The combinations of quasi-identifiers that were low risk more than 50% of the time were considered stable.

Results: The identification data sets we were able to construct were the list of all physicians and the list of all lawyers registered in Ontario, using 1% sampling fractions. The quasi-identifiers of region, gender, and year of birth were found to be low risk more than 50% of the time across both data sets. The combination of gender and region was also found to be low risk more than 50% of the time. We were not able to create an identification data set for the whole population.

Conclusions: Existing Canadian federal and provincial privacy laws help explain why it is difficult to create an identification data set for the whole population. That such examples of high re-identification risk exist for mainstream professions makes a strong case for not disclosing the high-risk variables and their combinations identified here. For professional subpopulations with published membership lists, many variables often needed by researchers would have to be excluded or generalized to ensure consistently low re-identification risk. Data custodians and researchers need to consider other statistical disclosure techniques for protecting privacy.
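
The evaluation loop described under Methods (vary the sample size, score every combination of quasi-identifiers, and call a combination stable if it is low risk more than 50% of the time) can be illustrated with a minimal Python sketch. The abstract does not state how risk was measured or what cut-off counted as low risk, so everything specific here is an assumption: risk is taken as the share of sampled records whose quasi-identifier values are unique in the identification data set, and `risk_threshold`, `draws`, and the function names are hypothetical. It is not the authors' implementation.

```python
import random
from collections import Counter
from itertools import combinations


def unique_fraction(sample, register_counts, combo):
    """Share of sampled records whose values on `combo` occur exactly once
    in the identification data set (assumed risk measure, not the paper's)."""
    hits = sum(
        1 for rec in sample
        if register_counts[tuple(rec[q] for q in combo)] == 1
    )
    return hits / len(sample)


def stable_combinations(register, quasi_identifiers, sample_sizes,
                        risk_threshold=0.05, draws=20, seed=0):
    """Map each quasi-identifier combination to True when it is low risk
    in more than 50% of the simulated samples.

    `risk_threshold` and `draws` are hypothetical parameters chosen for
    illustration; the abstract does not report them.
    """
    rng = random.Random(seed)
    results = {}
    for r in range(1, len(quasi_identifiers) + 1):
        for combo in combinations(quasi_identifiers, r):
            # Count how often each value tuple appears in the full register.
            counts = Counter(tuple(rec[q] for q in combo) for rec in register)
            low_risk_trials = 0
            trials = 0
            for n in sample_sizes:
                for _ in range(draws):
                    sample = rng.sample(register, n)
                    if unique_fraction(sample, counts, combo) <= risk_threshold:
                        low_risk_trials += 1
                    trials += 1
            results[combo] = low_risk_trials / trials > 0.5
    return results
```

Called with a list of dictionaries (one per registered individual) and the names of the quasi-identifier fields, the function returns one stability flag per combination. A field that is unique per person drives any combination containing it to high risk, while coarse fields such as gender tend to stay low risk under this assumed measure.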
