Data mining from web search queries: A comparison of Google Trends and Baidu Index
Author(s) - Vaughan Liwen, Chen Yue
Publication year - 2015
Publication title - Journal of the Association for Information Science and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.903
H-Index - 145
eISSN - 2330-1643
pISSN - 2330-1635
DOI - 10.1002/asi.23201
Subject(s) - information retrieval , computer science , volume (thermodynamics) , search engine , index (typography) , web search query , disadvantage , database , world wide web , physics , quantum mechanics , artificial intelligence
Numerous studies have explored the possibility of uncovering information from web search queries but few have examined the factors that affect web query data sources. We conducted a study that investigated this issue by comparing Google Trends and Baidu Index. Data from these two services are based on queries entered by users into Google and Baidu, two of the largest search engines in the world. We first compared the features and functions of the two services based on documents and extensive testing. We then carried out an empirical study that collected query volume data from the two sources. We found that data from both sources could be used to predict the quality of Chinese universities and companies. Despite the differences between the two services in terms of technology, such as differing methods of language processing, the search volume data from the two were highly correlated and combining the two data sources did not improve the predictive power of the data. However, there was a major difference between the two in terms of data availability. Baidu Index was able to provide more search volume data than Google Trends did. Our analysis showed that the disadvantage of Google Trends in this regard was due to Google's smaller user base in China. The implication of this finding goes beyond China. Google's user bases in many countries are smaller than that in China, so the search volume data related to those countries could result in the same issue as that related to China.
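
As an illustration of the kind of comparison the abstract describes, the sketch below computes a rank correlation between search volumes reported by two sources for the same set of queries. The abstract does not say which correlation statistic the authors used; Spearman's rank correlation is an assumption made here, and the function names and the numbers in the usage example are hypothetical placeholders, not data from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of entries tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical search volumes for five queries (made-up numbers).
source_a_volume = [120, 45, 300, 80, 15]
source_b_volume = [900, 400, 2500, 1100, 60]
rho = spearman(source_a_volume, source_b_volume)
print(f"Spearman rank correlation: {rho:.2f}")
```

A rank-based statistic is used in this sketch because search volumes from different services are on different scales and tend to be heavily skewed; a different statistic could be substituted by replacing `spearman` with `pearson` or another measure.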