Eigen-Entropy: A metric for multivariate sampling decisions

Jiajing Huang, Hyunsoo Yoon, Teresa Wu, Kasim Selcuk Candan, Ojas Pradhan, Jin Wen, Zheng O'Neill

Research output: Contribution to journal > Article > peer-review

Abstract

Sampling identifies a representative subset of data that captures the characteristics of the whole dataset. Most existing sampling algorithms require distributional assumptions about the multivariate data, and that information may not be available beforehand. This study proposes a new metric called Eigen-Entropy (EE), which is based on information entropy for the multivariate dataset. EE is a model-free metric because it is derived from eigenvalues extracted from the correlation coefficient matrix without any assumptions about data distributions. We prove that EE measures the composition of the dataset, such as its heterogeneity or homogeneity. As a result, EE can be used to support sampling decisions, such as which samples, and how many, to consider for the application of interest. To demonstrate the utility of the EE metric, two sets of use cases are considered. The first focuses on classification problems with an imbalanced dataset, where EE is used to guide the rendering of homogeneous samples from minority classes. On 10 public datasets, two oversampling techniques using the proposed EE method outperform reported methods from the literature in terms of precision, recall, F-measure, and G-mean. The second addresses building fault detection, where EE is used to sample heterogeneous data. Historical normal datasets collected from real building systems are used to construct the baselines by EE for 14 test cases, and experimental results indicate that the EE method outperforms benchmark methods in terms of recall. We conclude that EE is a viable metric to support sampling decisions.
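The abstract does not state the EE formula, so the following is only a rough sketch of one plausible reading: take the eigenvalues of the correlation coefficient matrix, normalize them to sum to one, and compute their Shannon entropy. The function name `eigen_entropy_sketch`, the choice to correlate features (columns) rather than samples, the normalization, and the natural logarithm are all assumptions made for illustration and may differ from the definition in the paper.

```python
import numpy as np


def eigen_entropy_sketch(X):
    """Entropy of the normalized eigenvalue spectrum of a correlation matrix.

    Assumed form, not taken from the paper: X has shape
    (n_samples, n_features) and correlations are computed between features.
    Constant columns are not handled (their correlations are undefined).
    """
    X = np.asarray(X, dtype=float)
    # Pearson correlation coefficient matrix between columns of X.
    corr = np.corrcoef(X, rowvar=False)
    # The matrix is symmetric, so eigvalsh returns real eigenvalues.
    eigvals = np.linalg.eigvalsh(corr)
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    # Normalize so the eigenvalues behave like a probability distribution.
    p = eigvals / eigvals.sum()
    # Drop exact zeros, using the convention 0 * log(0) = 0.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    independent = rng.normal(size=(5000, 3))
    x = np.linspace(0.0, 1.0, 50)
    redundant = np.column_stack([x, 2 * x + 1, 5 * x - 3])
    print("independent features:", eigen_entropy_sketch(independent))
    print("redundant features:  ", eigen_entropy_sketch(redundant))
```

Under these assumptions, nearly uncorrelated features give a flat eigenvalue spectrum and a value close to the maximum log(d) for d features, while perfectly correlated features concentrate the spectrum in one eigenvalue and give a value close to zero. This mirrors the heterogeneity versus homogeneity reading described in the abstract, without claiming to reproduce the authors' computation.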

Original language: English (US)
Pages (from-to): 84-97
Number of pages: 14
Journal: Information Sciences
Volume: 619
DOIs
State: Published - Jan 2023

Keywords

  • Correlation coefficient
  • Eigenvalues
  • Information entropy
  • Model-free
  • Sampling

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Control and Systems Engineering
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence
