Open Access
Research on Methods of Parsing and Classification of Internet Super Large-scale Texts
Author(s) -
Miaojing Song,
Hang Zheng,
Tao Zhang,
Jia Jiang,
Bin Pan
Publication year - 2021
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1757/1/012121
Subject(s) - computer science , xml , parsing , naive bayes classifier , the internet , information retrieval , download , simple api for xml , world wide web , support vector machine , xml signature , artificial intelligence , efficient xml interchange
Web crawlers are an important part of modern search engines. With the rapid growth of data in recent years, mankind has entered the “big data” era. For example, Wikipedia, which gathers knowledge from all over the world, records the news of each day and provides users with a rich text search corpus [1]; its daily data updates can exceed 50 GB. This project focuses on solving the problems of data acquisition and data analysis: it downloads the latest Wikipedia dumps, parses the resulting XML files, and then applies the SVM and Naive Bayes algorithms to classify the articles, training models so that Wikipedia files can be downloaded and their XML parsed efficiently.
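The XML-parsing step described above can be sketched with an incremental (SAX-style) parser, which is the usual way to handle multi-gigabyte Wikipedia dumps without loading them into memory. This is a minimal sketch, not the authors' implementation: the `iter_pages` helper and the simplified dump structure (a `page` element containing `title` and `text`) are assumptions based on the MediaWiki export format.

```python
import xml.etree.ElementTree as ET

def iter_pages(source):
    """Stream (title, text) pairs from a MediaWiki-style XML export
    incrementally, so memory use stays bounded even for 50+ GB dumps."""
    title, text = None, None
    for event, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text
        elif tag == "page":
            yield title, text or ""
            elem.clear()  # release the subtree for already-yielded pages
```

For instance, feeding it a small in-memory sample shows the streaming behavior:

```python
import io

sample = ("<mediawiki>"
          "<page><title>A</title><revision><text>alpha</text></revision></page>"
          "<page><title>B</title><revision><text>beta</text></revision></page>"
          "</mediawiki>")
for title, text in iter_pages(io.StringIO(sample)):
    print(title, text)
```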