Abstract

Feature engineering has found increasing interest in recent years because of its ability to improve the effectiveness of various machine learning models. Although tailored feature engineering methods have been designed for various domains, there are few that simulate the consistent effectiveness of kernel methods. At the core, the success of kernel methods is achieved by using similarity functions that emphasize local variations in similarity. Unfortunately, this ability comes at the price of the high level of computational resources required and the inflexibility of the representation as it only provides the similarity of two data points instead of vector representations of each data point; while the vector representations can be readily used as input to facilitate various models for different tasks. Furthermore, kernel methods are also highly susceptible to overfitting and noise and it cannot capture the variety of data locality. In this paper, we first analyze the inner working and weaknesses of kernel method, which serves as guidance for designing feature engineering. With the guidance, we explore the use of randomized methods for feature engineering by capturing multi-granular locality of data. This approach has the merit of being time and space efficient for feature construction. Furthermore, the approach is resistant to overfitting and noise because the randomized approach naturally enables fast and robust ensemble methods. Extensive experiments on a number of real world datasets are conducted to show the effectiveness of the approach for various tasks such as clustering, classification and outlier detection.

Original languageEnglish (US)
Title of host publicationKDD 2017 - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery
Pages485-494
Number of pages10
VolumePart F129685
ISBN (Electronic)9781450348874
DOIs
StatePublished - Aug 13 2017
Event23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017 - Halifax, Canada
Duration: Aug 13 2017Aug 17 2017

Other

Other23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017
CountryCanada
CityHalifax
Period8/13/178/17/17

Keywords

  • Randomized feature engineering
  • Unsupervised feature learning

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint Dive into the research topics of 'Randomized feature engineering as a fast and accurate alternative to kernel methods'. Together they form a unique fingerprint.

  • Cite this

    Wang, S., Aggarwal, C., & Liu, H. (2017). Randomized feature engineering as a fast and accurate alternative to kernel methods. In KDD 2017 - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Vol. Part F129685, pp. 485-494). Association for Computing Machinery. https://doi.org/10.1145/3097983.3098001