Identifying bacterial biotope entities using sequence labeling: Performance and feature analysis



Author(s) - Mao Jin, Cui Hong

Publication year - 2018

Publication title - Journal of the Association for Information Science and Technology

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.903

H-Index - 145

eISSN - 2330-1643

pISSN - 2330-1635

DOI - 10.1002/asi.24032

Subject(s) - computer science, artificial intelligence, word embedding, natural language processing, named entity recognition, conditional random field, support vector machine, wordnet, classifier (uml), biomedical text mining, crfs, biotope, cluster analysis, task (project management), embedding, text mining, ecology, management, habitat, economics, biology

Habitat information is important to biodiversity conservation and research. Extracting bacterial biotope entities from scientific publications is important to the large-scale study of the relationships between bacteria and their living environments. To facilitate the further development of robust habitat text mining systems for biodiversity, following the BioNLP task framework, three sequence labeling techniques, CRFs (Conditional Random Fields), MEMM (Maximum Entropy Markov Model), and SVM-hmm (Support Vector Machine), and one classifier, SVM-multiclass, are compared on their performance in identifying three types of bacterial biotope entities: bacteria, habitats, and geographical locations. The effectiveness of a variety of basic word formation features, syntactic features, and semantic features is explored and compared for the three sequence labeling methods. Experiments on two publicly available BioNLP collections show that, in addition to a WordNet feature, clusters derived from word embeddings (although not trained on the task-specific corpus) consistently improve performance for all methods on all entity types in both collections. Other features produce mixed results. Our results also show that, when trained on limited corpora, Brown clusters yield better performance than word embedding clusters. Further analysis suggests that entity recognition performance can be greatly boosted by improving the accuracy of entity boundary identification.
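To make the abstract's terms concrete, the following Python sketch shows what per-token features for a sequence labeler might look like and how exact versus overlapping boundary matches can be counted. It is an illustration under stated assumptions, not the authors' implementation: the feature names, the toy cluster table `TOY_CLUSTERS`, the BIO tag scheme, the entity type labels, and the helper functions are all hypothetical and are not taken from the paper.

```python
# Hypothetical word -> cluster id table. In practice the ids would come from
# Brown clustering or from clustering pre-trained word embeddings.
TOY_CLUSTERS = {"soil": "c17", "water": "c17", "bacillus": "c03"}


def token_features(tokens, i, clusters=TOY_CLUSTERS):
    """Build a feature dict for tokens[i], of the kind fed to a sequence labeler."""
    word = tokens[i]
    lower = word.lower()
    return {
        # basic word formation features
        "lower": lower,
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "suffix3": lower[-3:],
        # context window of one token on each side
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # semantic feature: cluster membership (assumed lookup table)
        "cluster": clusters.get(lower, "<UNK>"),
    }


def bio_to_spans(tags):
    """Convert BIO tags into a set of (start, end_exclusive, entity_type) spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def count_matches(gold, pred):
    """Return (exact, overlapping) counts of predicted spans against gold spans."""
    exact = len(gold & pred)
    overlapping = sum(
        any(g[2] == p[2] and p[0] < g[1] and g[0] < p[1] for g in gold)
        for p in pred
    )
    return exact, overlapping


tokens = ["Bacillus", "subtilis", "was", "isolated", "from", "forest", "soil", "."]
gold_tags = ["B-Bacteria", "I-Bacteria", "O", "O", "O", "B-Habitat", "I-Habitat", "O"]
# The prediction finds the habitat but misses its left boundary ("forest").
pred_tags = ["B-Bacteria", "I-Bacteria", "O", "O", "O", "O", "B-Habitat", "O"]

features = [token_features(tokens, i) for i in range(len(tokens))]
exact, overlapping = count_matches(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
```

In this toy sentence the predicted habitat span overlaps the gold span but has the wrong left boundary, so it counts as an overlapping match and not as an exact one. The gap between the two counts is one simple way to see how much boundary errors, as opposed to missed entities, contribute to a labeler's error.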
