
Using Logistic Regression to Estimate the False Positive Rate in The IDI (SoLinks)
Author(s) -
Anna L. Lin,
Soon H Song,
Nancy Wang
Publication year - 2020
Publication title -
international journal of population data science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.602
H-Index - 7
ISSN - 2399-4908
DOI - 10.23889/ijpds.v5i5.1484
Subject(s) - logistic regression , computer science , statistics , quality (philosophy) , sample (material) , econometrics , data mining , machine learning , mathematics , philosophy , chemistry , epistemology , chromatography
Stats NZ’s Integrated Data Infrastructure (IDI) is a linked longitudinal database combining administrative and survey data. Previously, false positive linkages (FP) in the IDI were assessed by clerical review of a sample of linked records, which was time consuming and subject to inconsistency.
Objectives and ApproachA modelled approach, ‘SoLinks’ has been developed in order to automate the FP estimation process for the IDI. It uses a logistic regression model to calculate the probability that a given link is a true match. The model is based on the agreement types defined for four key linking variables – first name, last name, sex, and date of birth. Exemptions have been given to some specific types of links that we believe to be high quality true matches. The training data used to estimate the model parameters was based on the outcomes of the clerical review process over several years.
ResultsWe have compared the FP rates estimated through clerical review to the ones estimated through the SoLinks model. Some SoLinks estimates fall outside the 95% confidence intervals of the clerically reviewed ones. This may be the result of the pre-defined probabilities for the specific types of links are too high.
ConclusionThe automation of FP checking has saved analyst time and resource. The modelled FP estimates have been more stable across time than the previous clerical reviews. As this model estimates the probability of a true match at the individual link level, we may provide this probability to researchers so that they can calculate linked quality indicators for their research populations.