How to normalize cooccurrence data? An analysis of some well‐known similarity measures
Author(s) - Nees Jan van Eck, Ludo Waltman
Publication year - 2009
Publication title - Journal of the American Society for Information Science and Technology
Language(s) - English
Resource type - Journals
eISSN - 1532-2890
pISSN - 1532-2882
DOI - 10.1002/asi.21075
Subject(s) - jaccard index, probabilistic logic, normalization (sociology), similarity (geometry), measure (data warehouse), data mining, computer science, similarity measure, set (abstract data type), cosine similarity, index (typography), mathematics, artificial intelligence, pattern recognition (psychology), world wide web, anthropology, image (mathematics), programming language, sociology
In scientometric research, the use of cooccurrence data is very common. In many cases, a similarity measure is employed to normalize the data. However, there is no consensus among researchers on which similarity measure is most appropriate for normalization purposes. In this article, we theoretically analyze the properties of similarity measures for cooccurrence data, focusing in particular on four well‐known measures: the association strength, the cosine, the inclusion index, and the Jaccard index. We also study the behavior of these measures empirically. Our analysis reveals that there exist two fundamentally different types of similarity measures, namely, set‐theoretic measures and probabilistic measures. The association strength is a probabilistic measure, while the cosine, the inclusion index, and the Jaccard index are set‐theoretic measures. Both our theoretical and our empirical results indicate that cooccurrence data can best be normalized using a probabilistic measure. This provides strong support for the use of the association strength in scientometric research.
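The abstract names the four measures but does not give their formulas. The sketch below uses the definitions as they are commonly stated in the scientometric literature; treat it as an illustration under those assumptions, not as the authors' own code. The symbols are assumptions introduced here: `c_ij` is the number of cooccurrences of objects i and j, and `s_i` and `s_j` are the total numbers of occurrences of i and j. Some formulations multiply the association strength by a constant, which this sketch omits.

```python
import math


def association_strength(c_ij: float, s_i: float, s_j: float) -> float:
    """Probabilistic measure: cooccurrences relative to the product of occurrences."""
    return c_ij / (s_i * s_j)


def cosine(c_ij: float, s_i: float, s_j: float) -> float:
    """Set-theoretic measure: cooccurrences over the geometric mean of occurrences."""
    return c_ij / math.sqrt(s_i * s_j)


def inclusion_index(c_ij: float, s_i: float, s_j: float) -> float:
    """Set-theoretic measure: cooccurrences over the smaller occurrence count."""
    return c_ij / min(s_i, s_j)


def jaccard_index(c_ij: float, s_i: float, s_j: float) -> float:
    """Set-theoretic measure: size of the intersection over size of the union."""
    return c_ij / (s_i + s_j - c_ij)


if __name__ == "__main__":
    # Hypothetical counts chosen only for illustration.
    c_ij, s_i, s_j = 10, 20, 50
    for measure in (association_strength, cosine, inclusion_index, jaccard_index):
        print(f"{measure.__name__}: {measure(c_ij, s_i, s_j):.4f}")
```

With the same cooccurrence count, the three set-theoretic measures depend on how large the overlap is relative to the sizes of the two objects, whereas the association strength divides by the product of the occurrence counts, which is proportional to the number of cooccurrences expected if the two objects occurred independently.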