
Leveraging of Weighted Ensemble Technique for Identifying Medical Concepts from Clinical Texts at Word and Phrase Level
Author(s) - Dipankar Das, Krishna Sharma
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5121/csit.2021.111213
Subject(s) - artificial intelligence, computer science, support vector machine, phrase, machine learning, random forest, naive bayes classifier, word2vec, classifier (uml), natural language processing, ensemble learning, feature extraction, pattern recognition (psychology), embedding
Concept identification from medical texts has become important due to digitization. However, it is not always feasible to identify all such medical concepts manually. In the present work, we apply five machine learning classifiers (Support Vector Machine, K-Nearest Neighbours, Logistic Regression, Random Forest and Naïve Bayes) and one deep learning classifier (Long Short-Term Memory) to identify medical concepts, training on a total of 27.383K sentences. In addition, we develop a rule-based phrase identification module to help the classifiers identify multi-word medical concepts. We employ the word2vec technique for feature extraction, and PCA and t-SNE to conduct an ablation study over various features and select the important ones. We then adopt two ensemble approaches, stacking and weighted sum, to improve on the performance of the individual classifiers, and observe significant improvements with respect to each of them. The phrase identification module plays an important role when an individual classifier is used to identify higher-order n-gram medical concepts. Finally, the ensemble approach improves on the results of SVM, which had shown initial improvement even after the application of the phrase-based module.
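The weighted sum ensemble mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the weights were chosen or what the classifiers output, so everything here is an assumption: the function name `weighted_sum_ensemble`, the toy probabilities, the weights, and the label convention (0 = not a medical concept, 1 = medical concept) are hypothetical.

```python
def weighted_sum_ensemble(probas, weights):
    """Combine per-classifier class probabilities by a normalized weighted sum.

    probas:  one entry per classifier; each entry is a list of rows,
             one row of class probabilities per sample.
    weights: one non-negative weight per classifier (hypothetical values;
             the abstract does not state how weights were derived).
    Returns (scores, labels): combined class scores and the argmax label per sample.
    """
    if len(probas) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and not all zero")
    norm = [w / total for w in weights]  # normalize so the weights sum to 1

    n_samples = len(probas[0])
    n_classes = len(probas[0][0])
    scores = [
        [sum(w * p[i][c] for w, p in zip(norm, probas)) for c in range(n_classes)]
        for i in range(n_samples)
    ]
    # Predicted label is the class with the highest combined score.
    labels = [max(range(n_classes), key=row.__getitem__) for row in scores]
    return scores, labels


# Toy probabilities for three samples from three hypothetical classifiers.
svm_proba = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
rf_proba = [[0.6, 0.4], [0.7, 0.3], [0.3, 0.7]]
nb_proba = [[0.3, 0.7], [0.6, 0.4], [0.6, 0.4]]

scores, labels = weighted_sum_ensemble(
    [svm_proba, rf_proba, nb_proba], weights=[0.5, 0.3, 0.2]
)
print(labels)
```

In this toy input the second sample shows the effect of the weighting: the highest-weighted classifier favours class 1, but the combined score still favours class 0 because the other two classifiers agree with each other. Stacking, the other ensemble approach named in the abstract, would instead train a separate model on the classifiers' outputs.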