
A Classification Framework of Identifying Major Documents with Search Engine Suggests and Unsupervised Subtopic Clustering
Publication year - 2021
Publication title - International Journal of Cognitive Informatics and Natural Intelligence
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.164
H-Index - 24
eISSN - 1557-3966
pISSN - 1557-3958
DOI - 10.4018/ijcini.20211001oa29
Subject(s) - computer science, information retrieval, ranking (information retrieval), set (abstract data type), search engine, relevance (law), task (project management), cluster analysis, document clustering, representation (politics), word embedding, similarity (geometry), baseline (sea), word (group theory), data mining, artificial intelligence, embedding, linguistics, oceanography, philosophy, management, politics, political science, law, economics, image (mathematics), programming language, geology
This paper addresses the problem of automatically recognizing out-of-topic documents in a small set of similar documents that are expected to share a common topic. The objective is to remove noise documents from the set. A topic-model-based classification framework is proposed for discovering out-of-topic documents. The paper introduces the concept of annotated search engine suggests, in which the search queries used to find a page are taken as representations of that page's content. Word embedding is adopted to create distributed representations of words and documents and to perform similarity comparison on search engine suggests. The results show that search engine suggests can be highly accurate semantic representations of textual content, and that the proposed document analysis algorithm, which uses this representation as a relevance measure, gives satisfactory in-topic content filtering performance compared with the baseline technique of topic probability ranking.
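
The similarity-based filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the toy word vectors, the centroid comparison, the 0.8 threshold, and the names `TOY_VECTORS`, `embed`, `cosine`, and `find_out_of_topic` are all assumptions made for illustration. Each document is represented by the average embedding of the words in its search engine suggests, and a document is flagged when its cosine similarity to the centroid of the set falls below the threshold.

```python
import math

# Hypothetical toy word vectors (assumption); a real system would load
# embeddings trained on a large corpus instead.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "programming": [0.8, 0.2, 0.0],
    "tutorial": [0.7, 0.3, 0.1],
    "code": [0.85, 0.15, 0.05],
    "recipe": [0.0, 0.1, 0.9],
    "cake": [0.05, 0.0, 0.95],
}


def embed(phrases):
    """Average the vectors of all known words in a document's suggest phrases."""
    words = [w for p in phrases for w in p.lower().split() if w in TOY_VECTORS]
    if not words:
        return None  # no known word, so no representation can be built
    dim = len(TOY_VECTORS[words[0]])
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_out_of_topic(doc_suggests, threshold=0.8):
    """Return ids of documents whose suggests are dissimilar to the set centroid.

    The centroid comparison and the threshold are illustrative assumptions.
    """
    vectors = {doc_id: embed(p) for doc_id, p in doc_suggests.items()}
    known = [v for v in vectors.values() if v is not None]
    if not known:
        return list(vectors)  # nothing can be embedded, so flag everything
    dim = len(known[0])
    centroid = [sum(v[i] for v in known) / len(known) for i in range(dim)]
    return [
        doc_id
        for doc_id, v in vectors.items()
        if v is None or cosine(v, centroid) < threshold
    ]
```

Called on a small set in which three documents carry programming-related suggests and one carries `"cake recipe"`, `find_out_of_topic` flags only the latter. A centroid is used here only because it is the simplest stand-in for a "common topic"; it is sensitive to the proportion of noise in the set, so a larger share of out-of-topic documents would require a more robust reference point.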