Clinical information extraction using small data: An active learning approach based on sequence representations and word embeddings

Author(s) - Kholghi Mahnoosh, De Vine Lance, Sitbon Laurianne, Zuccon Guido, Nguyen Anthony

Publication year - 2017

Publication title - Journal of the Association for Information Science and Technology

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.903

H-Index - 145

eISSN - 2330-1643

pISSN - 2330-1635

DOI - 10.1002/asi.23936

Subject(s) - computer science, annotation, artificial intelligence, selection (genetic algorithm), sample (material), pipeline (software), machine learning, information extraction, word (group theory), natural language processing, sequence (biology), process (computing), active learning (machine learning), task (project management), information retrieval, mathematics, chemistry, geometry, management, chromatography, biology, economics, genetics, programming language, operating system

This article demonstrates the benefits of using sequence representations based on word embeddings to inform the seed selection and sample selection processes in an active learning pipeline for clinical information extraction. Seed selection refers to choosing an initial sample set to label to form an initial learning model. Sample selection refers to selecting informative samples to update the model at each iteration of the active learning process. Compared to supervised machine learning approaches, active learning offers the opportunity to build statistical classifiers with a reduced amount of training samples that require manual annotation. Reducing the manual annotation effort can support automating the clinical information extraction process. This is particularly beneficial in the clinical domain, where manual annotation is a time‐consuming and costly task, as it requires extensive labor from clinical experts. Our empirical findings demonstrate that (a) using sequence representations along with the length of sequence for seed selection shows potential towards more effective initial models, and (b) using sequence representations for sample selection leads to significantly lower manual annotation efforts, with up to 3% and 6% fewer tokens and concepts requiring annotation, respectively, compared to state‐of‐the‐art query strategies.
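
The two steps defined in the abstract, seed selection and sample selection, can be illustrated with a minimal Python sketch. It is not the authors' method. The toy two-dimensional word vectors, the mean-of-word-vectors sequence representation, the choice of the longest sentence as the first seed, and the least-similar-first query rule are all assumptions made for illustration. The names `select_seed`, `select_sample`, and `annotation_order` are hypothetical.

```python
import math

# Toy 2-d word vectors (hypothetical values standing in for trained embeddings).
EMBEDDINGS = {
    "patient": (0.9, 0.1),
    "denies": (0.2, 0.8),
    "chest": (0.8, 0.3),
    "pain": (0.7, 0.4),
    "aspirin": (0.1, 0.9),
    "daily": (0.3, 0.7),
}

# Unlabeled pool of tokenized sentences (invented examples).
POOL = [
    ["patient", "denies", "chest", "pain"],
    ["aspirin", "daily"],
    ["chest", "pain"],
    ["patient", "denies", "pain"],
]


def sequence_vector(tokens):
    """Represent a sequence as the mean of its known word vectors."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def cosine(u, v):
    """Cosine similarity of two 2-d vectors; 0.0 if either is all zeros."""
    norm = math.hypot(*u) * math.hypot(*v)
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def max_similarity(index, chosen, pool):
    """Highest similarity between pool[index] and any already chosen sequence."""
    vec = sequence_vector(pool[index])
    return max(cosine(vec, sequence_vector(pool[j])) for j in chosen)


def select_seed(pool, k):
    """Seed selection (assumed rule): start from the longest sequence, then
    greedily add the sequence least similar to those already chosen."""
    chosen = [max(range(len(pool)), key=lambda i: len(pool[i]))]
    while len(chosen) < min(k, len(pool)):
        rest = [i for i in range(len(pool)) if i not in chosen]
        chosen.append(min(rest, key=lambda i: max_similarity(i, chosen, pool)))
    return chosen


def select_sample(pool, labeled):
    """Sample selection (assumed rule): query the unlabeled sequence that is
    least similar to everything labeled so far."""
    unlabeled = [i for i in range(len(pool)) if i not in labeled]
    return min(unlabeled, key=lambda i: max_similarity(i, labeled, pool))


def annotation_order(pool, seed_size):
    """Return pool indices in the order they would be sent for annotation."""
    labeled = select_seed(pool, seed_size)
    while len(labeled) < len(pool):
        # A real pipeline would annotate the chosen sentence and retrain here.
        labeled.append(select_sample(pool, labeled))
    return labeled
```

Calling `annotation_order(POOL, 2)` returns the seed indices followed by the remaining sentences in query order. A real pipeline would retrain an extraction model after each query and could combine this representation-based score with the model's uncertainty.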
