The reliability of a deep learning model in external memory clinic MRI data: A multi‐cohort study
Author(s) -
Mårtensson Gustav,
Ferreira Daniel,
Granberg Tobias,
Cavallin Lena,
Oppedal Ketil,
Padovani Alessandro,
Rektorova Irena,
Bonanni Laura,
Pardini Matteo,
Kramberger Milica G.,
Taylor John-Paul,
Hort Jakub,
Snædal Jón,
Kulisevsky Jaime,
Blanc Frédéric,
Antonini Angelo,
Mecocci Patrizia,
Vellas Bruno,
Tsolaki Magda,
Kloszewska Iwona,
Soininen Hilkka,
Lovestone Simon,
Simmons Andrew,
Aarsland Dag,
Westman Eric
Publication year - 2020
Publication title -
Alzheimer's & Dementia
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 6.713
H-Index - 118
eISSN - 1552-5279
pISSN - 1552-5260
DOI - 10.1002/alz.042969
Subject(s) - reliability (semiconductor) , neuroimaging , cohort , kappa , medicine , artificial intelligence , test (biology) , computer science , psychology , medical physics , pathology , psychiatry , mathematics , paleontology , power (physics) , physics , geometry , quantum mechanics , biology
Background
Deep learning (DL) has produced impressive results in numerous domains in recent years, including medical image analysis. DL models need large data sets to perform well. Because medical data can be difficult to acquire, most studies rely on public research cohorts, which often have harmonized scanning protocols and strict exclusion criteria. Such cohorts are not representative of a clinical setting. In this study, we investigated the performance of a DL model on out-of-distribution data from multiple memory clinics and research cohorts.

Method
We trained multiple versions of AVRA, a DL model that predicts visual ratings on Scheltens' medial temporal atrophy (MTA) scale (Mårtensson et al., 2019). Each version was trained on a different combination of training data: we started with only harmonized MRI data from public research cohorts and then increased image heterogeneity in the training set by including external memory clinic data. We assessed performance in multiple test sets by comparing AVRA's MTA ratings with those of an experienced radiologist, who rated all images in this study. Data came from the Alzheimer's Disease Neuroimaging Initiative (ADNI), AddNeuroMed, and 13 European memory clinics in the E-DLB consortium.

Results
Models trained only on research cohorts generalized well to new data acquired with protocols similar to those of the training data (weighted kappa κw between 0.70 and 0.72), but less well to memory clinic data with more image variability (κw between 0.34 and 0.66). The drop was most prominent in one specific memory clinic, where the DL model systematically predicted MTA scores that were too low. When data from a wider range of scanners and protocols were included during training, agreement with the radiologist's ratings in external memory clinics increased (κw between 0.51 and 0.71).

Conclusion
In this study we showed that increasing heterogeneity in training data improves generalization to out-of-distribution data. Our findings suggest that the reliability of a DL model should be assessed in multiple cohorts, and that DL-based software needs to be rigorously evaluated before being certified for deployment to clinics.

References
Mårtensson, G. et al. (2019) 'AVRA: Automatic Visual Ratings of Atrophy from MRI images using Recurrent Convolutional Neural Networks', NeuroImage: Clinical. Elsevier, 23(March), p. 101872.
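
The agreement measure reported in the Results, weighted kappa between a model's ratings and a radiologist's ratings, can be illustrated with a minimal sketch. The abstract does not state which weighting scheme was used, so the function below supports both linear and quadratic weights as an assumption. The function name `weighted_kappa`, its parameters, and the example ratings are hypothetical and are not taken from the study; the only domain fact relied on is that the MTA scale has five ordinal levels (0 to 4).

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, n_categories=5, weights="linear"):
    """Cohen's weighted kappa for two raters on an ordinal scale 0..n_categories-1.

    Illustrative sketch only: the weighting scheme used in the study is not
    given in the abstract, so both common choices are offered here.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")

    n = len(rater_a)
    power = 1 if weights == "linear" else 2

    def disagreement(i, j):
        # Penalty grows with the distance between the two ratings.
        return (abs(i - j) / (n_categories - 1)) ** power

    pair_counts = Counter(zip(rater_a, rater_b))
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)

    # Observed weighted disagreement across all rated images.
    observed = sum(disagreement(i, j) * c / n for (i, j), c in pair_counts.items())

    # Disagreement expected by chance, from the two raters' marginal distributions.
    expected = sum(
        disagreement(i, j) * marginal_a[i] * marginal_b[j] / (n * n)
        for i in range(n_categories)
        for j in range(n_categories)
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")

    return 1.0 - observed / expected


if __name__ == "__main__":
    # Made-up ratings for illustration; these are not data from the study.
    radiologist = [0, 1, 2, 3, 4, 2, 1, 3]
    model = [0, 1, 1, 3, 3, 2, 1, 2]
    print(round(weighted_kappa(radiologist, model), 3))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance. Computing the statistic separately for each test cohort, rather than on pooled data, is what makes a site-specific drop in agreement visible.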
