
Big Data Matching Using the Identity Correlation Approach
Author(s) - Mary M. Smyth, Kevin McCormack
Publication year - 2016
Language(s) - English
Resource type - Conference proceedings
DOI - 10.4995/carma2016.2016.2991
Subject(s) - identifier, matching (statistics), census, computer science, earnings, big data, identity (music), correlation, government (linguistics), identification (biology), irish, relation (database), data mining, data science, econometrics, statistics, mathematics, sociology, business, finance, demography, population, linguistics, physics, geometry, philosophy, botany, biology, acoustics, programming language
The Identity Correlation Approach (ICA) is a statistical technique developed for matching big data where a unique identifier does not exist. It was developed to match the Irish Census 2011 dataset to Central Government Administrative Datasets in order to attach a unique identifier to each individual person in the Census dataset (McCormack & Smyth, 2015). The unique identifier attached is the PPS No. (Personal Public Service No.). Once the PPS No. is attached to the Census dataset, each individual can be linked to datasets held centrally by Public Sector Organisations, which expands the range of variables available for statistical analysis at the individual level. The statistical techniques developed here were used for a major European Structure of Earnings Survey (SES), which the CSO compiled using administrative data only, thus eliminating the need to conduct an expensive business survey (NES, 2007). This paper describes how the Identity Correlation Approach was developed and presents data matching results and conclusions in relation to the Structure of Earnings Survey (SES) results for 2011.