z-logo
open-access-imgOpen Access
Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets
Author(s) -
Dario Borrelli,
Gabriela Gongora-Svartzman,
Carlo Lipizzi
Publication year - 2020
Publication title -
plos one
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.99
H-Index - 332
ISSN - 1932-6203
DOI - 10.1371/journal.pone.0234214
Subject(s) - chunking (psychology) , n gram , natural language processing , computer science , natural language , artificial intelligence , gram , speech recognition , language model , biology , bacteria , genetics
Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, music sheets, and online social media posts. In this paper, an unsupervised approach for the chunking of idiomatic units of sequential text data is presented. Text chunking refers to the task of splitting a string of textual information into non-overlapping groups of related units. This is a fundamental problem in numerous fields where understanding the relation between raw units of symbolic sequential data is relevant. Existing methods are based primarily on supervised and semi-supervised learning approaches; however, in this study, a novel unsupervised approach is proposed based on the existing concept of n-grams, which requires no labeled text as an input. The proposed methodology is applied to two natural language corpora: a Wall Street Journal corpus and a Twitter corpus. In both cases, the corpus length was increased gradually to measure the accuracy with a different number of unitary elements as inputs. Both corpora reveal improvements in accuracy proportional with increases in the number of tokens. For the Twitter corpus, the increase in accuracy follows a linear trend. The results show that the proposed methodology can achieve a higher accuracy with incremental usage. A future study will aim at designing an iterative system for the proposed methodology.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here