z-logo
open-access-imgOpen Access
Social Graph Community Differentiated by Node Features with Partly Missing Information
Author(s) -
Vladislav Chesnokov,
Petr Klyucharev
Publication year - 2014
Publication title -
nauka i obrazovanie
Language(s) - English
Resource type - Journals
ISSN - 1994-0408
DOI - 10.7463/0915.0811704
Subject(s) - computer science , node (physics) , social graph , graph , social network (sociolinguistics) , data science , world wide web , information retrieval , theoretical computer science , social media , engineering , structural engineering

This paper proposes a new algorithm for community differentiation in social graphs, which uses information both on the graph structure and on the vertices. We consider user's ego-network i.e. his friends, with no himself, where each vertex has a set of features such as details on a workplace, institution, etc. The task is to determine missing or unspecified features of the vertices, based on their neighbors' features, and use these features to differentiate the communities in the social graph. Two vertices are believed to belong to the same community if they have a common feature. A hypothesis has been put forward that if most neighbors of a vertex have a common feature, there is a good probability that the vertex has this feature as well. The proposed algorithm is iterative and updates features of vertices, based on its neighbors, according to the hypothesis. Share of neighbors that form a majority is specified by the algorithm parameter. Complexity of single iteration depends linearly on the number of edges in the graph.

To assess the quality of clustering three normalized metrics were used, namely: expected density, silhouette index, and Hubert's Gamma Statistic. The paper describes a method for test sampling of 2.000 graphs of the user's social network \VKontakte". The API requests addressed \VKontakte" and parsing HTML-pages of user's profiles and search results provided crawling. Information on user's group membership, secondary and higher education, and workplace was used as features. To store data the PostgreSQL DBMS was used, and the gexf format was used for data processing. For the test sample, metrics for several values of algorithm parameter were estimated: the value of index silhouettes was low (0.14-0.20), but within the normal range; the value of expected density was high, i.e. 1.17-1.52; the value of Hubert's gamma statistic was 0.94-0.95 that is close to the maximum. The number of vertices with no features was calculated before and after applying the algorithm. With two-third of vertices before using the algorithm their number has fallen to 25-64% depending on algorithm parameter after applying it. The proposed algorithm can be used for clustering the social graphs, but it should be modified to differentiate overlapping communities.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here