Open Access
Document clustering using the LSI subspace signature model
Author(s) - Zhu W.Z., Allen R.B.
Publication year - 2013
Publication title - Journal of the American Society for Information Science and Technology
Language(s) - English
Resource type - Journals
eISSN - 1532-2890
pISSN - 1532-2882
DOI - 10.1002/asi.22623
Subject(s) - computer science, cluster analysis, subspace topology, initialization, document clustering, feature vector, pattern recognition (psychology), rank (graph theory), artificial intelligence, data mining, discriminative model, vector space model, ranking (information retrieval), mathematics, combinatorics, programming language
We describe the latent semantic indexing subspace signature model (LSISSM) for semantic content representation of unstructured text. Grounded in singular value decomposition, the model represents terms and documents by the distribution signatures of their statistical contribution across the top‐ranking latent concept dimensions. LSISSM matches term signatures with document signatures according to their mapping coherence between the latent semantic indexing (LSI) term subspace and the LSI document subspace. LSISSM performs feature reduction and finds a low‐rank approximation of scalable and sparse term‐document matrices. Experiments demonstrate that this approach significantly improves the performance of major clustering algorithms, such as standard K‐means and self‐organizing maps, compared with the vector space model and the traditional LSI model. The unique contribution ranking mechanism in LSISSM also improves the initialization of standard K‐means compared with the random seeding procedure, which sometimes makes clustering inefficient and ineffective. A two‐stage initialization strategy based on LSISSM significantly reduces the running time of standard K‐means procedures.
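
The general idea described in the abstract (SVD-based latent dimensions, per-document contribution distributions, and non-random K-means seeding) can be illustrated with a minimal Python sketch. The abstract does not define how signatures are computed or how the two-stage initialization works, so everything specific here is an assumption: a "signature" is taken to be a document's absolute LSI coordinates normalized to sum to 1, and seeds are the documents that contribute most to each latent dimension. The function names (`lsi_document_signatures`, `seed_by_contribution`, `kmeans`) and the toy matrix are hypothetical and are not taken from the paper.

```python
import numpy as np

def lsi_document_signatures(A, k):
    """A is a terms x documents matrix; returns a documents x k array.

    Assumed definition (not from the paper): a document's signature is its
    absolute coordinates in the top-k latent dimensions, normalized to sum to 1.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    doc_coords = Vt[:k].T * s[:k]          # one row per document
    contrib = np.abs(doc_coords)
    totals = contrib.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0              # avoid division by zero for empty documents
    return contrib / totals

def seed_by_contribution(signatures):
    """Assumed seeding rule: for each latent dimension, use the document
    with the largest share in that dimension as an initial center."""
    return signatures[np.argmax(signatures, axis=0)]

def kmeans(X, centers, n_iter=50):
    """Plain Lloyd-style K-means starting from the given centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):               # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the converged centers.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers

# Toy term-document matrix (hypothetical): documents 0-1 share terms 0-1,
# documents 2-3 share terms 2-3.
A = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

signatures = lsi_document_signatures(A, k=2)
labels, centers = kmeans(signatures, seed_by_contribution(signatures))
print(labels)
```

Seeding from the highest-contributing documents makes the run deterministic, which is the kind of benefit over random seeding that the abstract claims. The paper's actual ranking mechanism may differ from this sketch.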
