Semantic data querying over NoSQL databases with Apache Spark

Mahmudul Hassan, Srividya Bansal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The rapid growth of semantic data in the form of RDF triples demands scalable distributed storage and an efficient query processing engine for its management and reuse. To overcome the limitations of native RDF stores and traditional relational database management systems, and to scale adequately with the exponential increase in the size of RDF datasets, Big Data processing infrastructure such as Hadoop with MapReduce has been used. This paper proposes storing large-scale RDF data in NoSQL databases such as HBase and Cassandra and using in-memory data processing with Apache Spark to execute SPARQL queries as SQL queries. It presents techniques for distributed RDF data storage and querying schemes for HBase and Cassandra clusters, along with a compiler that translates SPARQL queries into their Spark SQL equivalents for execution. An empirical comparison of the HBase and Cassandra systems is presented, using datasets and queries from the Berlin SPARQL Benchmark (BSBM) and the SPARQL Performance Benchmark (SP2Bench) on the Microsoft Azure cloud.
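
The abstract does not say how the compiler performs its translation, so the following is only a generic illustration of the idea, not the authors' method. It assumes a single triple table named `triples` with columns `s`, `p`, and `o` (all three names are assumptions), and a hypothetical helper `bgp_to_sql` that turns a SPARQL basic graph pattern into a SQL self-join, with shared variables becoming join conditions and constants becoming filters.

```python
from typing import Dict, List, Tuple

# One triple pattern: (subject, predicate, object). Terms starting with "?" are variables.
Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    """Return True if the term is a SPARQL variable such as ?product."""
    return term.startswith("?")


def bgp_to_sql(select_vars: List[str], patterns: List[Triple],
               table: str = "triples") -> str:
    """Translate a basic graph pattern into a SQL self-join (illustrative only).

    Assumes one table with columns s, p, o; this layout is an assumption,
    not the storage schema used in the paper.
    """
    first_seen: Dict[str, str] = {}  # variable -> first column that binds it
    conditions: List[str] = []

    for i, pattern in enumerate(patterns):
        for col, term in zip(("s", "p", "o"), pattern):
            ref = f"t{i}.{col}"
            if is_var(term):
                if term in first_seen:
                    # A repeated variable becomes an equality (join) condition.
                    conditions.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                # A constant becomes a filter; single quotes are doubled for SQL.
                escaped = term.replace("'", "''")
                conditions.append(f"{ref} = '{escaped}'")

    missing = [v for v in select_vars if v not in first_seen]
    if missing:
        raise ValueError(f"selected variables not bound by any pattern: {missing}")

    columns = ", ".join(f"{first_seen[v]} AS {v[1:]}" for v in select_vars)
    tables = ", ".join(f"{table} t{i}" for i in range(len(patterns)))
    sql = f"SELECT {columns} FROM {tables}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# Example pattern: products and their labels (prefixed names kept as plain strings).
example_sql = bgp_to_sql(
    ["?product", "?label"],
    [("?product", "rdf:type", "bsbm:Product"),
     ("?product", "rdfs:label", "?label")],
)
print(example_sql)
```

In a Spark setting, a string like `example_sql` could in principle be passed to `SparkSession.sql` once a DataFrame holding the triples has been registered as a temporary view named `triples`; how the paper maps HBase and Cassandra tables into Spark is not stated in this record.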

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 364-371
Number of pages: 8
ISBN (Print): 9781538626597
DOI: https://doi.org/10.1109/IRI.2018.00061
State: Published - Aug 2 2018
Event: 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018 - Salt Lake City, United States
Duration: Jul 7 2018 to Jul 9 2018

Keywords

  • Apache Spark
  • Hadoop
  • In-memory RDF processing
  • Information reuse
  • NoSQL
  • Semantic RDF data storage
  • SPARQL Querying

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
  • Artificial Intelligence
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
  • Public Administration

Cite this

Hassan, M., & Bansal, S. (2018). Semantic data querying over NoSQL databases with Apache Spark. In Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018 (pp. 364-371). [8424732] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IRI.2018.00061
