Open Access
GeoBERTSegmenter: Word Segmentation of Chinese Texts in the Geoscience Domain Using the Improved BERT Model
Author(s) - Wei Dongqi, Liu Zhihao, Xu Dexin, Ma Kai, Tao Liufeng, Xie Zhong, Qiu Qinjun, Pan Shengyong
Publication year - 2022
Publication title - Earth and Space Science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.843
H-Index - 23
ISSN - 2333-5084
DOI - 10.1029/2022ea002511
Subject(s) - computer science , natural language processing , segmentation , artificial intelligence , word (group theory) , deep learning , transformer , text segmentation , encoder , domain (mathematical analysis) , conditional random field , linguistics , mathematical analysis , philosophy , physics , mathematics , quantum mechanics , voltage , operating system
Unlike English, Chinese has no natural separator, such as a space, between words, which makes Chinese word segmentation (CWS) a difficult information processing problem. Geological texts contain a large number of unregistered (out‐of‐vocabulary) geological terms, and existing rule‐based, machine‐learning, and deep learning methods still cannot solve word segmentation in the geosciences satisfactorily, particularly for these unregistered words. In this study, we propose GeoBERTSegmenter, a CWS model based on GeoBERT (Geoscience‐Bidirectional Encoder Representations from Transformers) that is designed with various linguistic irregularities in mind. The method extends a general model into a GeoBERT + BiLSTM + CRF model, which combines BERT with a bidirectional long short‐term memory (BiLSTM) recurrent network and a conditional random field (CRF), along with a number of features designed to address the CWS task in geological text. We also train GeoBERT, a domain‐specific pretrained language model, on a large amount of Chinese geological text. In open testing, the model obtains a precision of 94.77%, a recall of 96.31%, and an F1 of 95.44%, indicating that the proposed strategy performs much better than the alternative methods considered in our study. Unregistered geological terms are identified effectively while the recognition rate of common words is maintained, which lays a foundation for natural language processing in the geoscience domain through Chinese text word segmentation.
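To make the reported precision, recall, and F1 concrete, the following is a minimal sketch of how word segmentation output is commonly scored. It is not the authors' evaluation code. The abstract does not state the tagging scheme or the scoring procedure, so the B/M/E/S (begin, middle, end, single) character tags and the exact span-matching rule used here are assumptions, and the function names `tags_to_words`, `word_spans`, and `prf` are hypothetical.

```python
from typing import List, Set, Tuple


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Group characters into words from per-character B/M/E/S tags (assumed scheme)."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        # A new word starts at B or S; flush any unfinished word first.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # A word ends at E or S.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over exactly matching word spans."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative example: "花岗岩侵入体" (granite intrusion), two gold words.
gold_words = tags_to_words("花岗岩侵入体", list("BMEBME"))
# A hypothetical segmenter output that wrongly splits the term "花岗岩".
pred_words = ["花岗", "岩", "侵入体"]
precision, recall, f1 = prf(gold_words, pred_words)
print(gold_words, precision, recall, f1)
```

In this illustration, a domain term split into two pieces counts against both precision and recall, which is why unregistered geological terms weigh heavily on segmentation scores.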
