
Document Parsing Tool for Language Translation and Web Crawling using Django REST Framework
Author(s) -
Kruthika Alnavar,
Ravinder Kumar,
C. Narendra Babu
Publication year - 2021
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1962/1/012018
Subject(s) - computer science , parsing , html , artificial intelligence , natural language processing , markup language , newspaper , world wide web , precision and recall , crawling , web crawler , python (programming language) , information retrieval , web page , xml , programming language , advertising , business , medicine , anatomy
The world has 7.5 billion inhabitants and more than 7,117 languages, yet only 20% of people speak English. To understand the wisdom and knowledge of other cultures, language translation becomes a basic need. This paper investigates a computer-assisted document parsing tool. The proposed approach uses a language translator that translates text directly from images, eliminating the need for a human translator and avoiding the scope for misinterpretation and misunderstanding among people of different ethnic groups. The tool can also perform web crawling using the Django REST (Representational State Transfer) framework. The approach employs the Python packages pytesseract, textblob and beautifulsoup to perform Optical Character Recognition (OCR), translation and extraction of Hypertext Markup Language (HTML) data, respectively. Translation experiments on four categories of images (Maps, Comics, Newspapers and Magazines, and Scientific Publications) demonstrate accuracies of 97.2%, 93.3%, 95.82% and 98.27%, respectively. Across websites such as e-commerce, magazine, blog, social media, news and educational sites, an average precision of 5.4, recall of 7.45 and F-score of 6.24 is achieved. The results indicate that the proposed system can be used as an improvement over a human translator and a data entry operator.
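The abstract reports precision, recall and F-score for the HTML extraction step but does not say how they were computed. The sketch below is one plausible reading, not the authors' implementation: it extracts visible text from a page and scores the result against a hand-labelled reference using the standard set-based definitions. Python's built-in `html.parser` stands in for beautifulsoup so the example runs without extra packages. The names `TextExtractor`, `extract_text`, `precision_recall_f1`, the sample page and the reference list are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text fragments, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.fragments.append(text)


def extract_text(html):
    """Return the non-empty text fragments of an HTML document, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


def precision_recall_f1(extracted, relevant):
    """Set-based precision, recall and F1 of extracted items vs. a reference."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical e-commerce page and hand-labelled reference (assumptions).
SAMPLE_HTML = """
<html><head><title>Shop</title><style>p {color: red}</style></head>
<body><h1>Blue Kettle</h1><p>Price: 25 USD</p>
<script>track()</script><p>Free shipping</p></body></html>
"""
expected = ["Blue Kettle", "Price: 25 USD", "In stock"]

fragments = extract_text(SAMPLE_HTML)
precision, recall, f1 = precision_recall_f1(fragments, expected)
print(fragments)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case, two of the four extracted fragments appear in the reference and two of the three reference items are found. An exact-match comparison like this is strict, so an evaluation of real crawled pages would likely need text normalisation or fuzzy matching before scoring.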