
DCG++: A data-driven metric for geometric pattern recognition
Author(s) - Jiahui Guan, Fushing Hsieh, Patrice Koehl
Publication year - 2019
Publication title - PLOS ONE
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.99
H-Index - 332
ISSN - 1932-6203
DOI - 10.1371/journal.pone.0217838
Subject(s) - ultrametric space, cluster analysis, measure (data warehouse), computer science, metric (unit), pattern recognition (psychology), context (archaeology), similarity (geometry), data point, similarity measure, artificial intelligence, segmentation, synthetic data, mathematics, algorithm, data mining, metric space, image (mathematics), discrete mathematics, operations management, economics, paleontology, biology
Clustering large and complex data sets whose partitions may adopt arbitrary shapes remains a difficult challenge. Part of this challenge comes from the difficulty of defining a similarity measure between data points that captures their underlying geometry. In this paper, we propose an algorithm, DCG++, that generates a similarity measure that is both data-driven and ultrametric. DCG++ uses Markov chain random walks to capture the intrinsic geometry of the data, scans the possible scales, and combines all this information using a simple procedure that is shown to generate an ultrametric. We validate the effectiveness of this similarity measure in the context of clustering on synthetic data with complex geometry, on a real-world data set containing segmented audio records of frog calls described by mel-frequency cepstral coefficients, and on an image segmentation problem. The experimental results show a significant improvement in performance with the DCG-based ultrametric compared to using an empirical distance measure.
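To make the term "ultrametric" concrete, the following minimal sketch checks the strong triangle inequality d(i, j) <= max(d(i, k), d(k, j)) on a small distance matrix. It is not the DCG++ procedure: as a stand-in, it builds the classical minimax-path (single-linkage) ultrametric, which is a different and much simpler construction. The function names `minimax_ultrametric` and `is_ultrametric` and the four sample points are assumptions introduced only for illustration.

```python
import numpy as np


def minimax_ultrametric(dist):
    """Minimax-path distance: the smallest possible largest hop between i and j.

    NOTE: an illustrative stand-in, not the DCG++ algorithm from the paper.
    """
    u = np.array(dist, dtype=float)
    for k in range(u.shape[0]):
        # Allow paths through point k; a path costs as much as its largest hop.
        through_k = np.maximum(u[:, k][:, None], u[k, :][None, :])
        u = np.minimum(u, through_k)
    return u


def is_ultrametric(u, tol=1e-12):
    """Check u[i, j] <= max(u[i, k], u[k, j]) for every triple (i, k, j)."""
    u = np.asarray(u, dtype=float)
    # worst_hop[i, k, j] = max(u[i, k], u[k, j])
    worst_hop = np.maximum(u[:, :, None], u[None, :, :])
    return bool(np.all(u[:, None, :] <= worst_hop + tol))


# Four hypothetical points on a line: three close together, one far away.
points = np.array([0.0, 1.0, 2.0, 10.0])
dist = np.abs(points[:, None] - points[None, :])  # ordinary Euclidean distance
u = minimax_ultrametric(dist)

print(is_ultrametric(dist))  # False: d(0, 2) = 2 > max(d(0, 1), d(1, 2)) = 1
print(is_ultrametric(u))     # True
```

In an ultrametric, every triangle is isosceles with its two longest sides equal, which is why such a measure maps directly onto a hierarchy of nested clusters. Here the three nearby points sit at mutual distance 1 and the far point at distance 8 from all of them.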