A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets

Kanchan Chowdhury; Venkata Vamsikrishna Meduri; Mohamed Sarwat

doi:10.1109/ICDE53745.2022.00227

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Spatial datasets are used extensively to train machine learning (ML) models for applications such as spatial regression, classification, clustering, and deep learning. Real-world spatial datasets are often very large, and many spatial ML algorithms represent the geographical region as a grid of spatial cells. If the grid is too fine-grained, the resulting large number of cells leads to long training times and high memory consumption during model training. To alleviate this problem, we propose a machine learning-aware spatial data re-partitioning framework that substantially coarsens the spatial grid. Our approach combines fine-grained, adjacent spatial cells of a grid into coarser cells before an ML model is trained. During this re-partitioning phase, we keep the information loss within a user-defined threshold, so that the accuracy of the ML model does not degrade significantly. In an empirical evaluation on several real-world datasets, the best results achieved by our framework reduce the data volume and training time by up to 81%, while keeping the difference in prediction or classification error below 5% relative to a model trained on the original input dataset, for most of the ML applications. Our re-partitioning framework also outperforms state-of-the-art data reduction baselines by 2% to 20% with respect to prediction and classification errors.
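
The core idea in the abstract, merging adjacent fine-grained cells into coarser ones while the information loss stays within a user-defined threshold, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the merge strategy or the loss metric, so the fixed 2x2 blocks, the mean aggregation, the mean-absolute-deviation loss, and the name `merge_blocks` are all assumptions made for illustration.

```python
def merge_blocks(grid, threshold):
    """Greedily merge each 2x2 block of cells into one coarse cell.

    Hypothetical sketch: a block is merged (replaced by its mean) only if
    its loss, taken here as the mean absolute deviation from the block
    mean, is within `threshold`. Otherwise its four cells are kept as-is.
    Returns a list of (row, col, side, value) tuples.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("this sketch assumes even grid dimensions")

    cells = []
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            offsets = [(dr, dc) for dr in (0, 1) for dc in (0, 1)]
            block = [grid[r + dr][c + dc] for dr, dc in offsets]
            mean = sum(block) / 4
            loss = sum(abs(v - mean) for v in block) / 4
            if loss <= threshold:
                # Homogeneous enough: one coarse cell of side 2.
                cells.append((r, c, 2, mean))
            else:
                # Too heterogeneous: keep the four fine cells.
                cells.extend(
                    (r + dr, c + dc, 1, grid[r + dr][c + dc])
                    for dr, dc in offsets
                )
    return cells


# Toy 4x4 grid: three fairly uniform blocks and one heterogeneous block.
grid = [
    [1.0, 1.0, 5.0, 9.0],
    [1.0, 1.0, 2.0, 7.0],
    [3.0, 3.2, 4.0, 4.0],
    [3.1, 2.9, 4.0, 4.0],
]
cells = merge_blocks(grid, threshold=0.2)
print(len(cells), "cells instead of", len(grid) * len(grid[0]))
# prints "7 cells instead of 16"
```

In this toy example a larger threshold merges more blocks and shrinks the training input further, at the cost of more information loss; that is the trade-off the user-defined threshold controls.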

Original language	English (US)
Title of host publication	Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
Publisher	IEEE Computer Society
Pages	2426-2439
Number of pages	14
ISBN (Electronic)	9781665408837
DOIs	https://doi.org/10.1109/ICDE53745.2022.00227
State	Published - 2022
Event	38th IEEE International Conference on Data Engineering, ICDE 2022 - Virtual, Online, Malaysia; Duration: May 9 2022 → May 12 2022

Publication series

Name	Proceedings - International Conference on Data Engineering
Volume	2022-May
ISSN (Print)	1084-4627

Conference

Conference	38th IEEE International Conference on Data Engineering, ICDE 2022
Country/Territory	Malaysia
City	Virtual, Online
Period	5/9/22 → 5/12/22

Keywords

Spatial Data
Spatial Machine Learning
Training Data Volume Reduction
Training Time Reduction

ASJC Scopus subject areas

Software
Signal Processing
Information Systems

Access to Document

10.1109/ICDE53745.2022.00227

Cite this

Chowdhury, K., Meduri, V. V., & Sarwat, M. (2022). A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets. In Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022 (pp. 2426-2439). (Proceedings - International Conference on Data Engineering; Vol. 2022-May). IEEE Computer Society. https://doi.org/10.1109/ICDE53745.2022.00227

A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets. / Chowdhury, Kanchan; Meduri, Venkata Vamsikrishna; Sarwat, Mohamed.
Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022. IEEE Computer Society, 2022. p. 2426-2439 (Proceedings - International Conference on Data Engineering; Vol. 2022-May).

Chowdhury, K, Meduri, VV & Sarwat, M 2022, A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets. in Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022. Proceedings - International Conference on Data Engineering, vol. 2022-May, IEEE Computer Society, pp. 2426-2439, 38th IEEE International Conference on Data Engineering, ICDE 2022, Virtual, Online, Malaysia, 5/9/22. https://doi.org/10.1109/ICDE53745.2022.00227

@inproceedings{597ccf2fc0ae47b28288a1db71affd31,

title = "A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets",

abstract = "Spatial datasets are used extensively to train machine learning (ML) models for applications such as spatial regression, classification, clustering, and deep learning. Real-world spatial datasets are often very large, and many spatial ML algorithms represent the geographical region as a grid of spatial cells. If the grid is too fine-grained, the resulting large number of cells leads to long training times and high memory consumption during model training. To alleviate this problem, we propose a machine learning-aware spatial data re-partitioning framework that substantially coarsens the spatial grid. Our approach combines fine-grained, adjacent spatial cells of a grid into coarser cells before an ML model is trained. During this re-partitioning phase, we keep the information loss within a user-defined threshold, so that the accuracy of the ML model does not degrade significantly. In an empirical evaluation on several real-world datasets, the best results achieved by our framework reduce the data volume and training time by up to 81%, while keeping the difference in prediction or classification error below 5% relative to a model trained on the original input dataset, for most of the ML applications. Our re-partitioning framework also outperforms state-of-the-art data reduction baselines by 2% to 20% with respect to prediction and classification errors.",

keywords = "Spatial Data, Spatial Machine Learning, Training Data Volume Reduction, Training Time Reduction",

author = "Kanchan Chowdhury and Meduri, {Venkata Vamsikrishna} and Mohamed Sarwat",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 38th IEEE International Conference on Data Engineering, ICDE 2022 ; Conference date: 09-05-2022 Through 12-05-2022",

year = "2022",

doi = "10.1109/ICDE53745.2022.00227",

language = "English (US)",

series = "Proceedings - International Conference on Data Engineering",

publisher = "IEEE Computer Society",

pages = "2426--2439",

booktitle = "Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022",

}

TY - GEN

T1 - A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets

AU - Chowdhury, Kanchan

AU - Meduri, Venkata Vamsikrishna

AU - Sarwat, Mohamed

PY - 2022

Y1 - 2022

AB - Spatial datasets are used extensively to train machine learning (ML) models for applications such as spatial regression, classification, clustering, and deep learning. Real-world spatial datasets are often very large, and many spatial ML algorithms represent the geographical region as a grid of spatial cells. If the grid is too fine-grained, the resulting large number of cells leads to long training times and high memory consumption during model training. To alleviate this problem, we propose a machine learning-aware spatial data re-partitioning framework that substantially coarsens the spatial grid. Our approach combines fine-grained, adjacent spatial cells of a grid into coarser cells before an ML model is trained. During this re-partitioning phase, we keep the information loss within a user-defined threshold, so that the accuracy of the ML model does not degrade significantly. In an empirical evaluation on several real-world datasets, the best results achieved by our framework reduce the data volume and training time by up to 81%, while keeping the difference in prediction or classification error below 5% relative to a model trained on the original input dataset, for most of the ML applications. Our re-partitioning framework also outperforms state-of-the-art data reduction baselines by 2% to 20% with respect to prediction and classification errors.

KW - Spatial Data

KW - Spatial Machine Learning

KW - Training Data Volume Reduction

KW - Training Time Reduction

UR - http://www.scopus.com/inward/record.url?scp=85136418977&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85136418977&partnerID=8YFLogxK

U2 - 10.1109/ICDE53745.2022.00227

DO - 10.1109/ICDE53745.2022.00227

M3 - Conference contribution

AN - SCOPUS:85136418977

T3 - Proceedings - International Conference on Data Engineering

SP - 2426

EP - 2439

BT - Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022

PB - IEEE Computer Society

T2 - 38th IEEE International Conference on Data Engineering, ICDE 2022

Y2 - 9 May 2022 through 12 May 2022

ER -
