Open Access
Establishment of Parallel Text Corpus of Equipment Manufacturing Industry Based on Data Mining Technology
Author(s) - Dongxia Liu, Jianguo Liu, Xianghui Zhang, Manqian Chen
Publication year - 2021
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1881/4/042091
Subject(s) - computer science, sorting, the internet, transformation (genetics), manufacturing, base (topology), scale (ratio), natural language processing, artificial intelligence, information retrieval, world wide web, programming language, mathematical analysis, biochemistry, chemistry, physics, mathematics, quantum mechanics, political science, law, gene
In the era of language big data, traditional data analysis methods cannot handle semi-structured or unstructured data such as text, yet the entire content of the equipment manufacturing corpus is text data. The equipment manufacturing corpus is a linguistic information base for legal activities and equipment manufacturing research; it is intended to support the study of equipment manufacturing and to collect equipment manufacturing cases. At present, the construction of legal databases in China is incomplete, and many problems remain. This paper proposes a method based on template transformation to automatically acquire parallel corpora from the Internet, and verifies bilingual parallel texts using the number of transformation patterns together with the retrieval and sorting of those patterns. By automatically acquiring a large number of parallel texts from the Internet, the resulting system can build a large-scale parallel corpus for the equipment manufacturing industry.
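
The abstract does not spell out the transformation templates or the ranking procedure, so the following is only a minimal sketch of one plausible reading: language markers in a page's URL are substituted to guess the address of its translated counterpart, and patterns are then sorted by how many pairs they yield. The pattern list, the function names `candidate_pairs` and `rank_patterns`, and the example URLs are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical transformation patterns: (source-language marker, target-language marker).
# The paper's actual templates are not given in the abstract.
PATTERNS = [("/en/", "/zh/"), ("_en.", "_cn."), ("english", "chinese")]


def candidate_pairs(urls, patterns=PATTERNS):
    """Pair each URL with a counterpart reachable by one pattern substitution."""
    known = set(urls)
    pairs = []
    for url in sorted(known):
        for src, tgt in patterns:
            if src in url:
                # Replace only the first occurrence of the language marker.
                candidate = url.replace(src, tgt, 1)
                if candidate in known:
                    pairs.append((url, candidate, (src, tgt)))
    return pairs


def rank_patterns(pairs):
    """Count the pairs produced by each pattern, most productive first."""
    counts = Counter(pattern for _, _, pattern in pairs)
    return counts.most_common()


# Made-up crawl result used purely for illustration.
urls = [
    "http://example.com/en/products/lathe.html",
    "http://example.com/zh/products/lathe.html",
    "http://example.com/en/about.html",
    "http://example.com/zh/about.html",
    "http://example.com/news_en.html",
    "http://example.com/news_cn.html",
    "http://example.com/en/contact.html",  # no translated counterpart
]

pairs = candidate_pairs(urls)
ranked = rank_patterns(pairs)

for source_url, target_url, pattern in pairs:
    print(pattern, source_url, "->", target_url)
print(ranked)
```

In this sketch a URL pair only counts as a candidate when the transformed address was actually seen in the crawl. A real system would still need to verify the page contents before accepting a pair as parallel text.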
