
A Semantic Similarity Approach for Linking Tweet Messages to Library of Congress Subject Headings using Linked Resources: A Pilot Study
Author(s) - Kwan Yi
Publication year - 2014
Publication title - Advances in Classification Research Online
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.155
H-Index - 7
ISSN - 2324-9773
DOI - 10.7152/acro.v24i1.14676
Subject(s) - jaccard index, subject (documents), computer science, information retrieval, ranking (information retrieval), task (project management), similarity (geometry), cluster analysis, metric (unit), semantic similarity, natural language processing, artificial intelligence, world wide web, operations management, management, economics, image (mathematics)
The objective of this study is to propose, implement, and test a framework for assigning relevant Library of Congress (LC) subject headings to tweet messages. The assignment of LC headings is treated as an automatic classification task that identifies relevant LC subject headings for given tweets. The classification is conducted in two stages. In the first stage, tweets are clustered so that similar tweets are grouped together. In the second stage, the degree of similarity between a cluster of tweets and each LC subject heading is measured with a widely used similarity metric, the Jaccard Coefficient (JC). In this pilot study, five tweet clusters and nine LC subject headings were selected and used. The pilot study yields a positive result for the proposed approach of identifying subject headings for tweets. In three of the five clusters, JC gave the most relevant heading the highest degree of similarity. In the other two cases, JC did not rank the most relevant heading within the top three. In the next steps of this study, a more sophisticated clustering method will be explored and applied, and all possible LC subject headings will be employed to identify LC subjects for tweets.
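
The second stage can be illustrated with a minimal sketch. The Jaccard Coefficient of two term sets A and B is |A ∩ B| / |A ∪ B|, so a cluster and a heading that share most of their terms score close to 1, and a pair with no shared terms scores 0. The abstract does not say how terms are extracted from tweets or from the linked resources behind each heading, so the term sets below are hypothetical toy data, and the function names `jaccard` and `rank_headings` are illustrative assumptions rather than the study's implementation.

```python
def jaccard(a, b):
    """Return the Jaccard Coefficient |A ∩ B| / |A ∪ B| of two term collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


def rank_headings(cluster_terms, heading_terms):
    """Rank candidate headings by their JC with one tweet cluster, highest first."""
    scores = {
        heading: jaccard(cluster_terms, terms)
        for heading, terms in heading_terms.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data (not taken from the study): terms for one tweet
# cluster and for three candidate subject headings.
cluster_terms = {"library", "catalog", "metadata", "search"}
heading_terms = {
    "Libraries": {"library", "catalog", "book", "metadata"},
    "Cooking": {"recipe", "food", "kitchen"},
    "Information retrieval": {"search", "query", "index", "metadata"},
}

ranking = rank_headings(cluster_terms, heading_terms)
for heading, score in ranking:
    print(f"{heading}: {score:.3f}")
```

With this toy input the cluster shares three of five distinct terms with "Libraries", so that heading ranks first. Counting a result as successful when the most relevant heading falls within the top three of such a ranking matches the evaluation described in the abstract.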