A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets

Kanchan Chowdhury, Venkata Vamsikrishna Meduri, Mohamed Sarwat

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Spatial datasets are used extensively to train machine learning (ML) models for applications such as spatial regression, classification, clustering, and deep learning. Most of the real-world spatial datasets are often too large, and many spatial ML algorithms represent the geographical region as a grid consisting of several spatial cells. If the granularity of the grid is too fine, that results in a large number of grid cells leading to long training time and high memory consumption issues during the model training. To alleviate this problem, we propose a machine learning-aware spatial data re-partitioning framework that substantially reduces the granularity of the spatial grid. Our spatial data re-partitioning approach combines fine-grained, adjacent spatial cells from a grid into coarser cells prior to training an ML model. During this re-partitioning phase, we keep the information loss within a user-defined threshold without significantly degrading the accuracy of the ML model. According to the empirical evaluation performed on several real-world datasets, the best results achieved by our spatial re-partitioning framework show that we can reduce the data volume and training time by up to 81%, while keeping the difference in prediction or classification error below 5% as compared to a model that is trained on the original input dataset, for most of the ML applications. Our re-partitioned framework also outperforms the state-of-the-art data reduction baselines by 2% to 20% w.r.t. prediction and classification errors.

Original languageEnglish (US)
Title of host publicationProceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
PublisherIEEE Computer Society
Pages2426-2439
Number of pages14
ISBN (Electronic)9781665408837
DOIs
StatePublished - 2022
Event38th IEEE International Conference on Data Engineering, ICDE 2022 - Virtual, Online, Malaysia
Duration: May 9 2022May 12 2022

Publication series

NameProceedings - International Conference on Data Engineering
Volume2022-May
ISSN (Print)1084-4627

Conference

Conference38th IEEE International Conference on Data Engineering, ICDE 2022
Country/TerritoryMalaysia
CityVirtual, Online
Period5/9/225/12/22

Keywords

  • Spatial Data
  • Spatial Machine Learning
  • Training Data Volume Reduction
  • Training Time Reduction

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this