Dementias Platform UK (DPUK): Facilitating cross‐cohort analysis in a digital age
Author(s) - Bauermeister Sarah D
Publication year - 2020
Publication title - Alzheimer's and Dementia
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 6.713
H-Index - 118
eISSN - 1552-5279
pISSN - 1552-5260
DOI - 10.1002/alz.041085
Subject(s) - computer science, ontology, cohort, linked data, alphanumeric, information retrieval, data access, data mining, database, medicine, semantic web, epistemology, programming language, philosophy
Background
Dementias Platform UK (DPUK) is a £53M public‐private partnership established by the MRC to provide access to large‐scale cohort data and to accelerate research into, and discovery of, new treatments for dementia. The DPUK Data Portal provides multi‐modal remote access to data on 3.4M individuals from 42 cohorts through a secure, robust, persistent, and fully auditable data repository. To facilitate the analysis of multiple independent datasets, DPUK curates cohort data to the C‐Surv data model. This model standardises data structure, variable naming and value labelling conventions to shorten data pre‐processing and reduce unwanted variation. C‐Surv also enables the harmonisation of datasets for analysis feasibility testing. We present the C‐Surv ontology and demonstrate its impact on data discovery and knowledge management for multi‐cohort, multi‐modal analysis in dementia research.
Method
Ontology: The C‐Surv ontology has 22 level‐1 and 132 level‐2 categories, is written to be machine‐readable for mapping to other ontologies, and is optimised for cohort (survey) data. C‐Surv accommodates heterogeneous data types across cohorts. Higher‐order data (imaging, genetics, devices) are pre‐processed prior to curation. Researchers may request access to native or curated data.
Results
Standardisation: C‐Surv standardises the naming structure of variables. Variable names have a maximum of 12 uppercase alphanumeric characters, with 5 additional characters for the imaging, linkage and genomic categories and for accelerometer data. The coding structure is intended to be intuitive: it uses syllable‐based acronyms, word fragments, abbreviations and minimal numeric characters.
Harmonisation: C‐Surv provides a harmonised dataset of 30 common data elements (variables) that are widely used in neurodegenerative and bio‐epidemiological research and can be visualised in the DPUK Cohort Explorer.
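The naming constraint described under Standardisation can be illustrated with a small check. This is a minimal sketch, not the DPUK implementation: it assumes that "5 additional characters" means names in the imaging, linkage, genomic and accelerometer categories may be up to 17 characters long, and the category labels, the function name `is_valid_name` and the example variable names are all hypothetical.

```python
import re

# Assumption: base variable names use 1-12 uppercase letters or digits.
BASE_MAX = 12
# Assumption: the extended categories may use up to 5 further characters.
EXTRA_MAX = 5
# Hypothetical category labels, chosen for illustration only.
EXTENDED_CATEGORIES = {"imaging", "linkage", "genomic", "accelerometer"}


def is_valid_name(name: str, category: str = "survey") -> bool:
    """Return True if `name` fits the assumed length and character rules."""
    if category in EXTENDED_CATEGORIES:
        limit = BASE_MAX + EXTRA_MAX
    else:
        limit = BASE_MAX
    # fullmatch requires the whole string to consist of allowed characters.
    return re.fullmatch(rf"[A-Z0-9]{{1,{limit}}}", name) is not None


if __name__ == "__main__":
    # Hypothetical names; real C-Surv variable names may look different.
    for name, category in [
        ("COGMEMSCORE1", "survey"),
        ("COGMEMSCORE12", "survey"),
        ("IMGHIPPOVOLLEFT01", "imaging"),
    ]:
        print(name, category, is_valid_name(name, category))
```

A check of this kind could run at curation time to flag variable names that fall outside the convention before a dataset is released.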
The C‐Surv standard data structure allows efficient data discovery and data selection. Building on the DPUK Cohort Matrix and Cohort Directory, it provides a framework for the development of ergonomic data discovery and data selection tools.
Conclusion
The C‐Surv ontology is a data model optimised for the analysis of epidemiologic survey data. It enables analysts to conduct multi‐cohort, multi‐modal analyses rapidly, reducing the administrative load on researchers and data managers. It also reduces artifactual variation in data caused by non‐standard data pre‐processing.