Relevant XML Documents - Approach Based on Vectors and Weight Calculation of Terms
Author(s) -
Abdeslem Dennai,
Mohammed Yacine DENNAI,
Sidi Mohamed Benslimane
Publication year - 2016
Publication title -
International Journal of Information Technology and Computer Science
Language(s) - English
Resource type - Journals
eISSN - 2074-9015
pISSN - 2074-9007
DOI - 10.5815/ijitcs.2016.11.03
Subject(s) - computer science , information retrieval , xml , relevance (law) , document structure description , set (abstract data type) , representation (politics) , tf–idf , exploit , term (time) , world wide web , programming language , physics , computer security , quantum mechanics , politics , political science , law
Three classes of documents, distinguished by their data, circulate on the web: unstructured documents (.doc, .html, .pdf ...), semi-structured documents (.xml, .owl ...) and structured documents (database tables, for example). A semi-structured document is organized around tags that are either predefined or defined by its author. However, many studies classify documents by their textual content alone and underestimate their structure. In this paper, we propose a representation of these semi-structured web documents based on weighted vectors, which allows their content to be exploited for further processing. The weight of terms is calculated using the normal frequency for a single document, and TF-IDF (Term Frequency-Inverse Document Frequency) and logical (Boolean) frequency for a set of documents. To assess and demonstrate the relevance of the proposed approach, we carry out several experiments on different corpora.
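The three weighting schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their formulas, so the TF-IDF variant used here (raw count times log(N/df)), the tokenizer, and the function names `terms` and `weight_vectors` are all assumptions made for illustration. The sketch also uses only the text content of each XML document and ignores the tag structure.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def terms(xml_text):
    """Return lowercase word tokens from the text content of an XML document."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return re.findall(r"[a-z0-9]+", text.lower())


def weight_vectors(xml_docs):
    """Build three term-weight vectors per document over a shared vocabulary.

    Assumed definitions (not taken from the paper):
      tf      : raw count of the term in the document
      boolean : 1 if the term occurs in the document, else 0
      tfidf   : tf * log(N / df), with N documents and df documents containing the term
    """
    counts = [Counter(terms(doc)) for doc in xml_docs]
    n_docs = len(counts)
    # Document frequency: each term is counted once per document that contains it.
    df = Counter(term for c in counts for term in c)
    vocab = sorted(df)
    vectors = []
    for c in counts:
        vectors.append({
            "tf": [c[t] for t in vocab],
            "boolean": [1 if c[t] else 0 for t in vocab],
            "tfidf": [c[t] * math.log(n_docs / df[t]) for t in vocab],
        })
    return vocab, vectors


if __name__ == "__main__":
    docs = [
        "<doc><title>xml vector</title><body>xml</body></doc>",
        "<doc><title>vector weight</title></doc>",
    ]
    vocab, vectors = weight_vectors(docs)
    print(vocab)
    for v in vectors:
        print(v)
```

With this variant, a term that appears in every document of the set receives a TF-IDF weight of zero, while the Boolean vector records only presence or absence and the normal frequency depends on the single document alone.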