Open Access
Sentence‐Chain Based Seq2seq Model for Corpus Expansion
Author(s) - Chung Euisok, Park Jeon Gue
Publication year - 2017
Publication title - ETRI Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.295
H-Index - 46
eISSN - 2233-7326
pISSN - 1225-6463
DOI - 10.4218/etrij.17.0116.0074
Subject(s) - perplexity, computer science, sentence, language model, natural language processing, artificial intelligence, recurrent neural network, encoder, artificial neural network, speech recognition
This study focuses on a method of sequential data augmentation that alleviates data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied to language generation; it is able to generate new sentences from given input sentences. We present a method of corpus expansion using a sentence‐chain based seq2seq model. For training the seq2seq model, sentence chains, each a triple of consecutive sentences, are used: the first two sentences of a triple feed the encoder, while the last sentence becomes the target sequence for the decoder. Using only internal resources, the expanded corpus yields an improvement of approximately 7.6% in relative perplexity over a baseline language model of Korean text. Additionally, in comparison with a previous study, the sentence‐chain approach reduces the size of the training data by 38.4% while generating 1.4 times the number of n‐grams, with superior performance on English text.
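
As a concrete illustration of the data preparation described in the abstract, below is a minimal sketch (not the authors' released code) of how sentence-chain triples could be assembled from a sentence-segmented corpus. The function name build_sentence_chain_triples and the toy corpus variable are illustrative assumptions; only the triple scheme itself (first two sentences to the encoder, third sentence as the decoder target) comes from the abstract.

```python
# Minimal sketch (assumed helper, not the authors' code) of sentence-chain
# triple construction for seq2seq training. Input: a corpus given as a list
# of documents, each an ordered list of sentence strings.

def build_sentence_chain_triples(corpus):
    """Yield (encoder_input, decoder_target) pairs.

    Per the sentence-chain scheme in the abstract, each chain is a triple
    of consecutive sentences: the first two form the encoder input and
    the third becomes the decoder's target sequence.
    """
    for document in corpus:
        for i in range(len(document) - 2):
            encoder_input = document[i] + " " + document[i + 1]
            decoder_target = document[i + 2]
            yield encoder_input, decoder_target

# Toy usage with a hypothetical two-document corpus:
corpus = [
    ["The cat sat.", "It purred.", "Then it slept.", "All was quiet."],
    ["Rain fell.", "Streets emptied.", "Night came."],
]
for src, tgt in build_sentence_chain_triples(corpus):
    print(src, "->", tgt)
# The cat sat. It purred. -> Then it slept.
# It purred. Then it slept. -> All was quiet.
# Rain fell. Streets emptied. -> Night came.
```

The sentences produced by the trained decoder are what expand the corpus before the language model is re-estimated. Assuming the usual definition of a relative improvement, the reported 7.6% figure corresponds to (PPL_baseline - PPL_expanded) / PPL_baseline ≈ 0.076.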
