Semantic data querying over NoSQL databases with Apache Spark

Mahmudul Hassan; Srividya Bansal

doi:10.1109/IRI.2018.00061

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

The rapid growth of semantic data in the form of RDF triples demands scalable distributed storage and an efficient query processing engine for its management and reuse. To overcome the limitations of native RDF stores and traditional relational database management systems, and to scale adequately with the exponential increase in the size of RDF datasets, Big Data processing infrastructure such as Hadoop with MapReduce has been used. This paper proposes using NoSQL databases such as HBase and Cassandra to store large-scale RDF data, and in-memory data processing with Apache Spark to execute SPARQL queries as SQL queries. It presents techniques for distributed RDF data storage and querying schemes for HBase and Cassandra clusters. We also present a compiler that translates SPARQL queries into their Spark SQL equivalents for execution. An empirical comparison of the HBase and Cassandra systems is presented, using datasets and queries from the Berlin SPARQL Benchmark (BSBM) and the SPARQL Performance Benchmark (SP²Bench) on the Microsoft Azure cloud.
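
The idea of running a SPARQL query as SQL can be illustrated with a small sketch. This is not the compiler described in the paper: it assumes a single hypothetical `triples(s, p, o)` table (the paper's actual HBase and Cassandra schemas are not given here), handles only a basic graph pattern, and uses Python's built-in `sqlite3` as a stand-in for Spark SQL so that the generated query can be run without a cluster. The function name `bgp_to_sql` and the sample `ex:` data are illustrative assumptions.

```python
import sqlite3


def bgp_to_sql(patterns, table="triples"):
    """Translate (s, p, o) triple patterns into one SQL SELECT.

    Terms starting with '?' are variables; anything else is a constant.
    Each pattern becomes one alias of the (assumed) triple table, and a
    variable that occurs more than once becomes a join condition.
    """
    first_seen = {}   # variable -> column reference of its first occurrence
    conditions = []   # WHERE clauses: constant filters and join equalities
    params = []       # constant values, bound as SQL parameters
    for i, pattern in enumerate(patterns):
        for column, term in zip(("s", "p", "o"), pattern):
            ref = f"t{i}.{column}"
            if term.startswith("?"):
                if term in first_seen:
                    conditions.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                conditions.append(f"{ref} = ?")
                params.append(term)
    # Project every variable; fall back to a constant if there are none.
    select = ", ".join(f"{col} AS {var[1:]}" for var, col in first_seen.items()) or "1"
    tables = ", ".join(f"{table} AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT {select} FROM {tables}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


# Hypothetical sample data in the assumed triple-table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:p1", "rdf:type", "ex:Product"),
    ("ex:p1", "rdfs:label", "Widget"),
    ("ex:p2", "rdf:type", "ex:Product"),
    ("ex:p2", "rdfs:label", "Gadget"),
    ("ex:v1", "rdf:type", "ex:Vendor"),
    ("ex:v1", "rdfs:label", "Acme"),
])

# SPARQL: SELECT ?item ?name
#         WHERE { ?item rdf:type ex:Product . ?item rdfs:label ?name }
patterns = [("?item", "rdf:type", "ex:Product"),
            ("?item", "rdfs:label", "?name")]
sql, params = bgp_to_sql(patterns)
rows = sorted(conn.execute(sql, params).fetchall())
print(sql)
print(rows)
```

Each triple pattern maps to one self-join of the triple table, which is why queries with many patterns become join-heavy. A real translator would also need to cover features this sketch omits, such as OPTIONAL, FILTER, and UNION.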

Original language	English (US)
Title of host publication	Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	364-371
Number of pages	8
ISBN (Print)	9781538626597
DOIs	https://doi.org/10.1109/IRI.2018.00061
State	Published - Aug 2 2018
Event	19th IEEE International Conference on Information Reuse and Integration for Data Science, IRI 2018 - Salt Lake City, United States; Duration: Jul 7 2018 → Jul 9 2018

Keywords

Apache Spark
Hadoop
In-memory RDF processing
Information reuse
NoSQL
SPARQL Querying
Semantic RDF data storage

ASJC Scopus subject areas

Computer Networks and Communications
Software
Artificial Intelligence
Information Systems and Management
Safety, Risk, Reliability and Quality
Public Administration

Cite this

Hassan, M., & Bansal, S. (2018). Semantic data querying over NoSQL databases with Apache Spark. In Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018 (pp. 364-371). Article 8424732 (Proceedings - 2018 IEEE 19th International Conference on Information Reuse and Integration for Data Science, IRI 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IRI.2018.00061

