Enabling scientific data storage and processing on big-data systems

Saman Biookaghazadeh; Yiqi Xu; Shujia Zhou; Ming Zhao

doi:10.1109/BigData.2015.7363978

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

Big-data systems are increasingly important for solving data-driven problems in many science domains, including the geosciences. However, existing big-data systems cannot support self-describing data formats such as NetCDF, which are commonly used by scientific communities for data distribution and sharing. This limitation is a serious hurdle to the further adoption of big-data systems in science domains and prevents scientific users from leveraging these systems to improve their productivity. This paper presents a solution to this problem by enabling big-data systems to store and process scientific data directly. Specifically, it enables Hadoop to efficiently store NetCDF data on HDFS and process them in MapReduce using convenient APIs. It also enables Hive to support standard queries on NetCDF data, transparently to users. The paper also presents an evaluation of the proposed solution using several representative queries on a typical geoscientific dataset. The results show that the proposed approach achieves a substantial speedup (up to 20 times) and space saving (83% reduction) compared to the traditional approach, which must convert NetCDF data to CSV format before Hadoop and Hive can use them.
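The space saving reported in the abstract can be made more concrete with a rough illustration of why a CSV conversion tends to be larger than an array-oriented binary layout. The sketch below is not the paper's implementation and does not reproduce its measurement: the grid size, the synthetic values, and the function names (`binary_size`, `csv_size`, `space_reduction`) are all assumptions chosen for illustration. It assumes a layout in which each coordinate is stored once per dimension and each value as a packed 4-byte float, and compares it with a CSV in which every row repeats its coordinates as text.

```python
import struct


def binary_size(n_time, n_lat, n_lon):
    """Bytes for a packed float32 layout (hypothetical, NetCDF-like).

    Coordinates are stored once per dimension; values once per grid cell.
    """
    coords = n_time + n_lat + n_lon
    cells = n_time * n_lat * n_lon
    return struct.calcsize("<f") * (coords + cells)


def csv_size(n_time, n_lat, n_lon):
    """Bytes for a CSV in which every row repeats time, lat, and lon as text."""
    total = 0
    for t in range(n_time):
        for i in range(n_lat):
            lat = -90.0 + i * 180.0 / n_lat
            for j in range(n_lon):
                lon = j * 360.0 / n_lon
                # Synthetic value; not taken from any real dataset.
                value = 273.15 + 0.01 * ((t + i + j) % 100)
                row = f"{t},{lat:.2f},{lon:.2f},{value:.2f}\n"
                total += len(row.encode("utf-8"))
    return total


def space_reduction(n_time=4, n_lat=18, n_lon=36):
    """Fraction of space saved by the binary layout relative to CSV."""
    return 1.0 - binary_size(n_time, n_lat, n_lon) / csv_size(n_time, n_lat, n_lon)


if __name__ == "__main__":
    print(f"binary bytes: {binary_size(4, 18, 36)}")
    print(f"csv bytes:    {csv_size(4, 18, 36)}")
    print(f"reduction:    {space_reduction():.0%}")
```

The figure printed for this toy grid depends entirely on the assumed precision and grid shape, so it should not be compared with the 83% reported for the paper's dataset.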

Original language	English (US)
Title of host publication	Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015
Editors	Feng Luo, Kemafor Ogan, Mohammed J. Zaki, Laura Haas, Beng Chin Ooi, Vipin Kumar, Sudarsan Rachuri, Saumyadipta Pyne, Howard Ho, Xiaohua Hu, Shipeng Yu, Morris Hui-I Hsiao, Jian Li
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	1978-1984
Number of pages	7
ISBN (Electronic)	9781479999255
DOIs	https://doi.org/10.1109/BigData.2015.7363978
State	Published - Dec 22 2015
Event	3rd IEEE International Conference on Big Data, IEEE Big Data 2015 - Santa Clara, United States
Duration	Oct 29 2015 → Nov 1 2015

Publication series

Name	Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

Other

Other	3rd IEEE International Conference on Big Data, IEEE Big Data 2015
Country/Territory	United States
City	Santa Clara
Period	10/29/15 → 11/1/15

Keywords

Hadoop
NetCDF
Scientific data
big data

ASJC Scopus subject areas

Computer Networks and Communications
Computer Science Applications
Information Systems
Software

Access to Document

10.1109/BigData.2015.7363978

Cite this

Biookaghazadeh, S., Xu, Y., Zhou, S., & Zhao, M. (2015). Enabling scientific data storage and processing on big-data systems. In F. Luo, K. Ogan, M. J. Zaki, L. Haas, B. C. Ooi, V. Kumar, S. Rachuri, S. Pyne, H. Ho, X. Hu, S. Yu, M. H.-I. Hsiao, & J. Li (Eds.), Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015 (pp. 1978-1984). Article 7363978 (Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2015.7363978

@inproceedings{37f461b6a51341ab9e1a4b6eb516d8b6,

title = "Enabling scientific data storage and processing on big-data systems",

keywords = "Hadoop, NetCDF, Scientific data, big data",

author = "Saman Biookaghazadeh and Yiqi Xu and Shujia Zhou and Ming Zhao",

note = "Publisher Copyright: {\textcopyright} 2015 IEEE.; 3rd IEEE International Conference on Big Data, IEEE Big Data 2015 ; Conference date: 29-10-2015 Through 01-11-2015",

year = "2015",

month = dec,

day = "22",

doi = "10.1109/BigData.2015.7363978",

language = "English (US)",

series = "Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "1978--1984",

editor = "Feng Luo and Kemafor Ogan and Zaki, {Mohammed J.} and Laura Haas and Ooi, {Beng Chin} and Vipin Kumar and Sudarsan Rachuri and Saumyadipta Pyne and Howard Ho and Xiaohua Hu and Shipeng Yu and Hsiao, {Morris Hui-I} and Jian Li",

booktitle = "Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015",

}

TY - GEN

T1 - Enabling scientific data storage and processing on big-data systems

AU - Biookaghazadeh, Saman

AU - Xu, Yiqi

AU - Zhou, Shujia

AU - Zhao, Ming

PY - 2015/12/22

Y1 - 2015/12/22

N2 - Big-data systems are increasingly important for solving the data-driven problems in many science domains including geosciences. However, existing big-data systems cannot support the self-describing data formats such as NetCDF which are commonly used by scientific communities for data distribution and sharing. This limitation presents a serious hurdle to the further adoption of big-data systems by science domains and prevents scientific users from leveraging these systems to improve their productivity. This paper presents a solution to this problem by enabling big-data systems to directly store and process scientific data. Specifically, it enables Hadoop to efficiently store NetCDF data on HDFS and process them in MapReduce using convenient APIs. It also enables Hive to support standard queries on NetCDF data, transparently to users. The paper also presents an evaluation of the proposed solution using several representative queries on a typical geoscientific dataset. The results show that the proposed approach achieves substantial speedup (up to 20 times) and space saving (83% reduction), compared to the traditional approach which has to convert NetCDF data to CSV format for Hadoop and Hive to use them.

KW - Hadoop

KW - NetCDF

KW - Scientific data

KW - big data

UR - http://www.scopus.com/inward/record.url?scp=84963730678&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84963730678&partnerID=8YFLogxK

U2 - 10.1109/BigData.2015.7363978

DO - 10.1109/BigData.2015.7363978

M3 - Conference contribution

AN - SCOPUS:84963730678

T3 - Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

SP - 1978

EP - 1984

BT - Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

A2 - Luo, Feng

A2 - Ogan, Kemafor

A2 - Zaki, Mohammed J.

A2 - Haas, Laura

A2 - Ooi, Beng Chin

A2 - Kumar, Vipin

A2 - Rachuri, Sudarsan

A2 - Pyne, Saumyadipta

A2 - Ho, Howard

A2 - Hu, Xiaohua

A2 - Yu, Shipeng

A2 - Hsiao, Morris Hui-I

A2 - Li, Jian

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 3rd IEEE International Conference on Big Data, IEEE Big Data 2015

Y2 - 29 October 2015 through 1 November 2015

ER -
