Semantic data querying over NoSQL databases with Apache Spark

Mahmudul Hassan, Srividya Bansal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

The rapid growth of semantic data in the form of RDF triples demands scalable distributed storage and an efficient query processing engine for its management and reuse. To overcome the limitations of native RDF stores and traditional relational database management systems, and to scale adequately with the exponential increase in the size of RDF datasets, Big Data processing infrastructure such as Hadoop with MapReduce has been used. This paper proposes storing large-scale RDF data in NoSQL databases such as HBase and Cassandra, and using in-memory data processing with Apache Spark to execute SPARQL queries as SQL queries. It presents techniques for distributed RDF data storage and querying schemes for HBase and Cassandra clusters. We also present a compiler that translates SPARQL queries into their Spark SQL equivalents for execution. An empirical comparison of the HBase and Cassandra systems, using datasets and queries from the Berlin SPARQL Benchmark (BSBM) and the SPARQL Performance Benchmark (SP2Bench) on the Microsoft Azure cloud, is presented.
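
To make the idea of running a SPARQL query as a SQL query concrete, the following is a minimal sketch, not the paper's compiler. It assumes a single three-column table named `triples` with columns `s`, `p`, `o` (the paper's actual HBase and Cassandra schemas are not described in this abstract), and the function name `bgp_to_sql` is hypothetical. It handles only a basic graph pattern: each triple pattern becomes one alias of the table, and a variable shared between patterns becomes a join condition.

```python
def bgp_to_sql(patterns, table="triples"):
    """Translate a list of (s, p, o) triple patterns into one SQL string.

    Terms starting with '?' are variables; anything else is a constant.
    Assumes a hypothetical three-column table (s, p, o); this is an
    illustration only, not the compiler described in the paper.
    """
    select, where, seen = [], [], {}
    for i, pattern in enumerate(patterns):
        alias = f"t{i}"
        for column, term in zip(("s", "p", "o"), pattern):
            ref = f"{alias}.{column}"
            if term.startswith("?"):
                if term in seen:
                    # Variable already bound elsewhere: emit a join condition.
                    where.append(f"{ref} = {seen[term]}")
                else:
                    # First occurrence: bind it and project it in SELECT.
                    seen[term] = ref
                    select.append(f"{ref} AS {term[1:]}")
            else:
                # Constant term: filter on it, escaping single quotes.
                escaped = term.replace("'", "''")
                where.append(f"{ref} = '{escaped}'")
    columns = ", ".join(select) if select else "*"
    from_clause = ", ".join(f"{table} t{i}" for i in range(len(patterns)))
    sql = f"SELECT {columns} FROM {from_clause}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


# Example: SELECT ?product ?label WHERE {
#   ?product rdf:type bsbm:Product . ?product rdfs:label ?label }
patterns = [
    ("?product", "rdf:type", "bsbm:Product"),
    ("?product", "rdfs:label", "?label"),
]
sql = bgp_to_sql(patterns)
print(sql)
```

In a Spark setting, a string like this could be passed to `spark.sql(...)` once a DataFrame loaded from the NoSQL store has been registered under the assumed name with `createOrReplaceTempView("triples")`. A real translator would also need to cover OPTIONAL, FILTER, UNION, and solution modifiers, which this sketch omits.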

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 364-371
Number of pages: 8
ISBN (Print): 9781538626597
State: Published - Aug 2 2018
Event: 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018 - Salt Lake City, United States
Duration: Jul 7 2018 to Jul 9 2018

Publication series

Name: Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018

Other

Other: 19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018
Country/Territory: United States
City: Salt Lake City
Period: 7/7/18 to 7/9/18

Keywords

  • Apache Spark
  • Hadoop
  • In-memory RDF processing
  • Information reuse
  • NoSQL
  • SPARQL Querying
  • Semantic RDF data storage

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
  • Artificial Intelligence
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
  • Public Administration
