Open Access
An Empirical Study of Several Information Theoretic Based Feature Extraction Methods for Classifying High Dimensional Low Sample Size Data
Author(s) -
Sheena Leeza Verghese,
Iman Yi Liao,
Tomas H. Maul,
Siang Yew Chong
Publication year - 2021
Publication title -
IEEE Access
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.587
H-Index - 127
ISSN - 2169-3536
DOI - 10.1109/access.2021.3077958
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
A high dimensional low sample size (HDLSS) dataset typically contains many features but a limited number of samples. Such datasets are commonly found in domains such as microarray data analysis and medical imaging. When the sample size is small, the population probability density function (PDF) of an HDLSS dataset may not be well represented, which makes it difficult to apply feature selection or feature extraction methods to HDLSS data classification. In this paper, we explore the possibility of designing feature selection and feature extraction methods for HDLSS data classification by making loose assumptions about the underlying PDF of an HDLSS dataset. Specifically, we propose to leverage Correlation Explanation (CorEx), a recent unsupervised probabilistic graphical model that assumes a (hierarchical) hidden structure, for generating subsets of features that are conditionally independent. We benchmark the proposed method against frequently cited information-theoretic feature extraction and feature selection methods, including Conditional Infomax Feature Extraction (CIFE), Maximum Relevance Minimum Redundancy (MRMR), Maximization of Mutual Information (MMI), Infomax Independent Component Analysis (Infomax ICA), and Kernel Entropy Component Analysis (KECA). The HDLSS datasets used in this study are the Breast Cancer datasets by Gravier et al. and West et al., the Colon Cancer dataset by Alon et al., the Leukemia dataset by Golub et al., and the Gisette dataset used by Guyon et al. Experimental results demonstrate that the proposed method shows some improvement in classification performance over MMI and Infomax ICA, and is competitive with MRMR and CIFE.
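
To make the information-theoretic selection idea concrete, the following is a minimal sketch of one of the baselines named above, MRMR, and not of the proposed CorEx-based method. It assumes discrete-valued features, a plug-in (frequency count) estimate of mutual information, and the common "relevance minus mean redundancy" form of the criterion; the function names `mutual_information` and `mrmr_select` and the toy data are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # Sum of p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of raw counts.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr_select(features, labels, k):
    """Greedy MRMR sketch. `features` is a list of columns of discrete values."""
    relevance = [mutual_information(f, labels) for f in features]
    selected, remaining = [], list(range(len(features)))

    def score(j):
        if not selected:
            return relevance[j]
        # Mean mutual information with the features already chosen.
        redundancy = sum(mutual_information(features[j], features[s])
                         for s in selected) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed for illustration): 8 samples, 4 binary features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # informative
    [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of feature 0 (redundant)
    [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the labels
    [0, 0, 1, 0, 0, 1, 1, 1],  # weakly informative, little overlap with feature 0
]
selected = mrmr_select(features, labels, k=2)
print(selected)
```

On this toy data the duplicate column is penalised for redundancy, so the second pick is the weaker but less redundant feature. With very few samples, as in the HDLSS setting described above, such plug-in mutual information estimates can be unreliable, which is the difficulty the abstract points to.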
