
Modified union feature selection method on English translation of hadith text clustering
Author(s) -
Arief Fatchul Huda,
Nanda Priatna,
Q. U. Safitri,
W Darmalaksana
Publication year - 2019
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1402/6/066052
Subject(s) - cluster analysis , intersection (aeronautics) , feature (linguistics) , computer science , data mining , artificial intelligence , feature selection , pattern recognition (psychology) , ranking (information retrieval) , set (abstract data type) , dimensionality reduction , engineering , linguistics , philosophy , programming language , aerospace engineering
The high dimensionality of the feature space is one of the main issues to be considered in text clustering. Various dimensionality reduction methods have therefore been introduced for selecting an informative feature subset. Each method uses a different strategy to select its subset, so the results differ even on the same data set. Typically, union and intersection methods are used to combine the feature subsets selected by different reduction methods. The union method keeps every feature selected by any of the methods, whereas the intersection method keeps only the features selected by all of them. Thus, the union approach increases the feature dimension, and the intersection approach loses some important features. To take advantage of each method while reducing its weaknesses, this research proposes a new approach, called modified union. This approach applies the union method to the top-ranked features and the intersection method to the rest of the features. Here, feature selection uses the Term Variance (TV) and Document Frequency (DF) methods to calculate the relevance value of each feature. The effectiveness of the proposed method is tested on the Hadith Shahih Bukhary data set. The results show that the proposed method improves clustering accuracy over the other methods, achieving a Davies-Bouldin (DB) index of 2.7.
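The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the commonly used definitions of TV (sum of squared deviations of a term's frequency from its mean over the documents) and DF (number of documents containing the term), which the abstract does not spell out. The function names, the toy matrix, and the parameters `n_select` (features kept by each method) and `n_top` (top-ranked features passed to the union step) are hypothetical.

```python
from typing import List


def term_variance(matrix: List[List[float]]) -> List[float]:
    """TV score per term; matrix[d][t] is the frequency of term t in document d."""
    n_docs = len(matrix)
    scores = []
    for t in range(len(matrix[0])):
        column = [row[t] for row in matrix]
        mean = sum(column) / n_docs
        scores.append(sum((x - mean) ** 2 for x in column))
    return scores


def document_frequency(matrix: List[List[float]]) -> List[int]:
    """DF score per term: number of documents in which the term occurs."""
    return [sum(1 for row in matrix if row[t] > 0) for t in range(len(matrix[0]))]


def rank(scores: List[float], n_select: int) -> List[int]:
    """Indices of the n_select best-scoring terms, best first (ties by index)."""
    order = sorted(range(len(scores)), key=lambda t: (-scores[t], t))
    return order[:n_select]


def modified_union(rankings: List[List[int]], n_top: int) -> List[int]:
    """Union of each ranking's first n_top terms plus the intersection of the rest."""
    top = set()
    rest = None
    for ranking in rankings:
        top |= set(ranking[:n_top])
        tail = set(ranking[n_top:])
        rest = tail if rest is None else rest & tail
    return sorted(top | (rest or set()))


# Toy term-frequency matrix (assumed data): 4 documents, 5 terms.
toy = [
    [3, 0, 1, 0, 1],
    [0, 0, 1, 2, 1],
    [3, 1, 1, 0, 0],
    [0, 0, 1, 2, 1],
]
tv_rank = rank(term_variance(toy), n_select=4)       # [0, 3, 1, 4]
df_rank = rank(document_frequency(toy), n_select=4)  # [2, 4, 0, 3]
print(modified_union([tv_rank, df_rank], n_top=1))   # [0, 2, 3, 4]
```

On this toy data a plain union of the two subsets would keep all five terms and a plain intersection would keep only terms 0, 3 and 4. The modified union keeps term 2 because it is ranked first by DF, and drops term 1 because only TV selects it and it is not top-ranked.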