Chemical entity recognition in patents by combining dictionary-based and statistical approaches
Author(s) -
Saber A. Akhondi,
Ewoud Pons,
Zubair Afzal,
Herman van Haagen,
Benedikt Becker,
Kristina Hettne,
Erik M. van Mulligen,
Jan A. Kors
Publication year - 2016
Publication title - Database
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.406
H-Index - 62
ISSN - 1758-0463
DOI - 10.1093/database/baw061
Subject(s) - computer science, task (project management), search engine indexing, conditional random field, classifier (uml), artificial intelligence, word (group theory), natural language processing, training set, named entity recognition, set (abstract data type), test set, information retrieval, data mining, machine learning, linguistics, philosophy, management, programming language, economics
We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small.

Database URL: http://biosemantics.org/chemdner-patents
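The abstract does not say how the dictionary-based and statistical outputs were merged, so the following is only a minimal sketch of one plausible ensemble rule, not the authors' method. It assumes mentions are character-offset spans, keeps every span from the statistical recognizer, and adds a dictionary span only when it overlaps none of them. The names `combine_mentions`, `overlaps`, and `f_score` are hypothetical, and the F-score shown is the usual exact-match harmonic mean of precision and recall.

```python
from typing import Iterable, List, Tuple

# A mention is a (start, end) character-offset pair, end exclusive (assumed format).
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def combine_mentions(statistical: List[Span], dictionary: List[Span]) -> List[Span]:
    """Hypothetical ensemble rule: trust the statistical spans, and add a
    dictionary span only if it does not overlap any statistical span."""
    combined = list(statistical)
    for span in dictionary:
        if not any(overlaps(span, kept) for kept in statistical):
            combined.append(span)
    return sorted(set(combined))


def f_score(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """Exact-match F-score: harmonic mean of precision and recall."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy spans, invented purely for illustration.
statistical_spans = [(0, 7), (20, 28)]
dictionary_spans = [(0, 4), (40, 47)]   # (0, 4) overlaps (0, 7) and is dropped
gold_spans = [(0, 7), (40, 47), (60, 65)]

ensemble_spans = combine_mentions(statistical_spans, dictionary_spans)
print(ensemble_spans)
print(round(f_score(ensemble_spans, gold_spans), 4))
```

With a rule like this, the dictionary can only add mentions the statistical system missed. That is one way an ensemble could end up performing close to the statistical system alone, as the abstract reports.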
