An Empirical Study of Several Information Theoretic Based Feature Extraction Methods for Classifying High Dimensional Low Sample Size Data

Open Access

Author(s) - Sheena Leeza Verghese, Iman Yi Liao, Tomas H. Maul, Siang Yew Chong

Publication year - 2021

Publication title - IEEE Access

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.587

H-Index - 127

ISSN - 2169-3536

DOI - 10.1109/access.2021.3077958

Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation

A high dimensional low sample size (HDLSS) dataset typically contains many features but a limited number of samples. Such datasets are commonly found in domains such as microarray analysis and medical imaging. When the sample size is small, the population probability density function (PDF) of an HDLSS dataset may not be well represented, which makes it difficult to apply feature selection or feature extraction methods to HDLSS data classification. In this paper, we explore the possibility of designing feature selection and feature extraction methods for HDLSS data classification by making loose assumptions about the underlying PDF of an HDLSS dataset. Specifically, we propose to leverage Correlation Explanation (CorEx), a recent unsupervised probabilistic graphical model that assumes a (hierarchical) hidden structure, for generating subsets of features that are conditionally independent. We benchmark the proposed method against frequently cited information-theoretic feature extraction and feature selection methods, including Conditional Infomax Feature Extraction (CIFE), Maximum Relevance Minimum Redundancy (MRMR), Maximization of Mutual Information (MMI), Infomax Independent Component Analysis (Infomax ICA), and Kernel Entropy Component Analysis (KECA). The HDLSS datasets used in this study are the breast cancer datasets by Gravier et al. and West et al., the colon cancer dataset by Alon et al., the leukemia dataset by Golub et al., and the Gisette dataset used by Guyon et al. Experimental results demonstrate that the proposed method shows some improvement in classification performance over MMI and Infomax ICA and is competitive with MRMR and CIFE.
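For readers unfamiliar with the baselines, the greedy relevance-minus-redundancy idea commonly associated with MRMR can be sketched in a few lines. This is only an illustrative sketch and not the authors' implementation: it assumes features have already been discretised, uses a simple plug-in estimate of mutual information, and the names `mutual_information` and `mrmr_select` as well as the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedy selection: relevance I(f; y) minus mean redundancy with chosen features."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): 8 samples, 4 discretised features stored column-wise.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # f0: strongly informative
    [0, 0, 0, 1, 1, 1, 1, 1],  # f1: exact copy of f0, hence fully redundant
    [0, 0, 1, 1, 0, 0, 1, 1],  # f2: carries no information about the label
    [0, 0, 1, 0, 0, 1, 1, 1],  # f3: weakly informative, little overlap with f0
]
selected = mrmr_select(columns, labels, k=2)
print(selected)
```

On this toy data, ranking by relevance alone would pick f0 and its duplicate f1, whereas the redundancy penalty steers the second pick to f3. With very few samples the plug-in estimate itself becomes unreliable, which is the kind of difficulty the abstract attributes to HDLSS settings.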
