
A Fast and Simple Online Synchronous Context Free Grammar Extractor
Author(s) - Paul Baltescu, Phil Blunsom
Publication year - 2014
Publication title - The Prague Bulletin of Mathematical Linguistics
Language(s) - English
Resource type - Journals
eISSN - 1804-0462
pISSN - 0032-6585
DOI - 10.2478/pralin-2014-0010
Subject(s) - computer science, synchronous context free grammar, natural language processing, machine translation, grammar, artificial intelligence, phrase, transfer based machine translation, context free grammar, rule based machine translation, example based machine translation, programming language, linguistics, philosophy
Hierarchical phrase-based machine translation systems rely on the synchronous context free grammar formalism to learn and use translation rules containing gaps. The grammars learned by such systems become unmanageably large even for medium-sized parallel corpora. The traditional approach of preprocessing the training data and loading all possible translation rules into memory does not scale well for hierarchical phrase-based systems. Online grammar extractors address this problem by constructing memory-efficient data structures on top of the source side of the parallel data (often based on suffix arrays), which are used to efficiently match phrases in the corpus and to extract translation rules on the fly during decoding. This paper describes an open-source implementation of an online synchronous context free grammar extractor. Our approach builds on the work of Lopez (2008a) and introduces a new technique for extending the lists of phrase matches for phrases containing gaps that reduces the extraction time by a factor of 4. Our extractor is available as part of the cdec toolkit (Dyer et al., 2010).
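To make the suffix-array idea concrete, the following is a minimal Python sketch of how contiguous source phrases can be located in a corpus. It is not the cdec implementation and does not cover the paper's technique for phrases containing gaps; the function names `build_suffix_array` and `find_phrase`, the toy corpus, and the naive construction are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting suffix slices; adequate for a toy
    corpus only (a real extractor would use a faster algorithm).
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_phrase(tokens: Sequence[str],
                suffix_array: Sequence[int],
                phrase: Sequence[str]) -> List[int]:
    """Return the sorted corpus positions where `phrase` occurs contiguously.

    Two binary searches over the suffix array delimit the block of
    suffixes whose first len(phrase) tokens equal the phrase.
    """
    target = list(phrase)
    m = len(target)

    def prefix(rank: int) -> List[str]:
        # First m tokens of the suffix stored at this rank.
        start = suffix_array[rank]
        return list(tokens[start:start + m])

    # Lower bound: first rank whose prefix is >= the phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose prefix is > the phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= target:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat slept".split()
    sa = build_suffix_array(corpus)
    print(find_phrase(corpus, sa, ["the", "cat"]))
```

Because every occurrence of a phrase forms one contiguous block in the suffix array, each lookup needs only a logarithmic number of prefix comparisons. An online extractor can then visit the returned positions to read off aligned target-side phrases at decoding time, rather than loading a precomputed grammar into memory.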