RDF data storage techniques for efficient SPARQL query processing using distributed computation engines

Mahmudul Hassan, Srividya Bansal

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

The rapidly growing amount of linked open data demands semantic RDF services that are efficient, scalable, and distributed, with high availability for reuse and fault tolerance. To address this concern, the Big Data processing infrastructure Hadoop has been adopted for RDF data management systems. In this paper, we introduce two distributed RDF data stores, VPExp and 3CStore, based on the existing vertical partitioning (VP) approach. In the VPExp approach, we propose splitting predicates based on the explicit type information of an object. The 3CStore scheme is designed as a 3-column store, comprising a subset of triples from the VP table selected according to different join correlations, to reduce the number of join operations when SPARQL queries are executed as SQL in a distributed system. We evaluate these two RDF data storage approaches by comparing them with the vertical partitioning approach and the state-of-the-art RDF management system S2RDF. We also present an evaluation of the query performance of these systems built upon two popular distributed computation engines, namely Spark and Drill.
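To make the storage layouts named in the abstract more concrete, here is a minimal in-memory sketch in Python. It is an illustration only and not the authors' implementation: the function names, the `rdf:type` shorthand, the sample triples, and the rule of keying each table by the object's first declared type are all assumptions made for this example. The first function builds a plain vertical-partitioning layout (one two-column table per predicate). The second shows one possible reading of "splitting of predicates based on explicit type information of an object".

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # shorthand for the RDF type predicate (assumed notation)


def vertical_partition(triples):
    """Plain VP: one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def split_by_object_type(triples):
    """Hypothetical VPExp-style split: key each predicate table by the
    explicit rdf:type of the object (None when the object has no type)."""
    explicit_type = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            # Assumption: keep only the first declared type of each resource.
            explicit_type.setdefault(s, o)
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[(p, explicit_type.get(o))].append((s, o))
    return dict(tables)


# Made-up sample data for illustration.
triples = [
    ("alice", RDF_TYPE, "Person"),
    ("acme", RDF_TYPE, "Company"),
    ("bob", RDF_TYPE, "Person"),
    ("alice", "knows", "bob"),
    ("alice", "worksFor", "acme"),
    ("bob", "likes", "acme"),
    ("bob", "likes", "alice"),
]

vp = vertical_partition(triples)
vpexp = split_by_object_type(triples)

print(vp["likes"])                   # both 'likes' rows share one table
print(vpexp[("likes", "Company")])   # only rows whose object is a Company
```

Under this reading, a query pattern that constrains both the predicate and the object's type could scan the smaller typed table instead of joining the full predicate table with the `rdf:type` table. Whether the paper's VPExp works exactly this way is not stated in the abstract.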

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 323-330
Number of pages: 8
ISBN (Print): 9781538626597
DOIs: https://doi.org/10.1109/IRI.2018.00056
State: Published - Aug 2 2018
Event: 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018 - Salt Lake City, United States
Duration: Jul 7 2018 to Jul 9 2018

Keywords

  • Drill
  • Hadoop
  • In-memory processing engine
  • Information reuse
  • RDF data storage
  • Semantic web
  • Spark
  • SPARQL Querying

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
  • Artificial Intelligence
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
  • Public Administration

Cite this

Hassan, M., & Bansal, S. (2018). RDF data storage techniques for efficient SPARQL query processing using distributed computation engines. In Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018 (pp. 323-330). [8424727] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IRI.2018.00056

@inproceedings{c0346132dd0c4b7a932db688063a4b94,
title = "RDF data storage techniques for efficient SPARQL query processing using distributed computation engines",
keywords = "Drill, Hadoop, In-memory processing engine, Information reuse, RDF data storage, Semantic web, Spark, SPARQL Querying",
author = "Mahmudul Hassan and Srividya Bansal",
year = "2018",
month = "8",
day = "2",
doi = "10.1109/IRI.2018.00056",
language = "English (US)",
isbn = "9781538626597",
pages = "323--330",
booktitle = "Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - RDF data storage techniques for efficient SPARQL query processing using distributed computation engines

AU - Hassan, Mahmudul

AU - Bansal, Srividya

PY - 2018/8/2

Y1 - 2018/8/2

KW - Drill

KW - Hadoop

KW - In-memory processing engine

KW - Information reuse

KW - RDF data storage

KW - Semantic web

KW - Spark

KW - SPARQL Querying

UR - http://www.scopus.com/inward/record.url?scp=85052300159&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85052300159&partnerID=8YFLogxK

U2 - 10.1109/IRI.2018.00056

DO - 10.1109/IRI.2018.00056

M3 - Conference contribution

AN - SCOPUS:85052300159

SN - 9781538626597

SP - 323

EP - 330

BT - Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018

PB - Institute of Electrical and Electronics Engineers Inc.

ER -