S3QLRDF: distributed SPARQL query processing using Apache Spark—a comparative performance study

Mahmudul Hassan, Srividya Bansal

Research output: Contribution to journal › Article › peer-review

1 Scopus citations

Abstract

The proliferation of semantic data in the form of Resource Description Framework (RDF) triples demands efficient, scalable, distributed storage along with a highly available and fault-tolerant parallel processing strategy. Three open issues in distributed RDF data management systems are not addressed together in existing work. The first is query efficiency; the second is that solutions are optimized for certain query patterns and do not necessarily work well for all of them; the third is the cost of pre-processing. More precisely, the rapid growth of RDF data raises the need for an efficient partitioning strategy over distributed data management systems that improves SPARQL (SPARQL Protocol and RDF Query Language) query performance regardless of pattern shape while minimizing pre-processing time. In distributed RDF systems, both the data and the query processing are highly distributed. SPARQL workloads, meanwhile, are dynamic and structurally diverse, with varying degrees of complexity. A complex SPARQL query over a large RDF graph in a distributed system requires combining many distributed pieces of data through join operations. Designing an efficient data-partitioning schema and join strategy that minimizes data transfer is therefore the fundamental challenge in distributed RDF data management systems. In this context, we propose a new relational partitioning schema for RDF data called Property Table Partitioning (PTP), which further partitions the existing Property Table into multiple tables based on distinct properties (each comprising all subjects with non-null values for those properties) in order to minimize the input size and the number of join operations of a query. This paper proposes a distributed RDF data management system called S3QLRDF, which is built on top of Spark and uses SQL to execute SPARQL queries over the PTP schema. An experimental analysis of pre-processing costs and query performance on synthetic and real datasets shows that S3QLRDF outperforms state-of-the-art distributed RDF management systems.
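The partitioning idea in the abstract can be illustrated with a small, single-machine sketch. This is not the authors' implementation and does not use Spark. It is one possible reading of the abstract, under assumptions that the abstract does not confirm: properties are single-valued, each per-property partition keeps the full property-table row of every subject with a non-null value for that property, and the helper names (`build_property_table`, `partition_by_property`, `star_query`) and sample triples are hypothetical.

```python
from collections import defaultdict


def build_property_table(triples):
    """Pivot (subject, property, object) triples into one wide row per subject."""
    table = defaultdict(dict)
    for subject, prop, obj in triples:
        # Assumption: each subject has at most one value per property.
        table[subject][prop] = obj
    return dict(table)


def partition_by_property(property_table):
    """Build one partition per distinct property.

    Each partition holds only the subjects whose value for that property
    is non-null (here: present in the row). Assumption: the full row is kept.
    """
    partitions = defaultdict(dict)
    for subject, row in property_table.items():
        for prop in row:
            partitions[prop][subject] = row
    return dict(partitions)


def star_query(partitions, props):
    """Answer a subject-centred pattern over `props` without any join.

    Only the smallest relevant partition is scanned, which is how this
    sketch mimics "minimizing input size".
    """
    if any(p not in partitions for p in props):
        return []
    smallest = min(props, key=lambda p: len(partitions[p]))
    return sorted(
        (subject, tuple(row[p] for p in props))
        for subject, row in partitions[smallest].items()
        if all(p in row for p in props)
    )


# Hypothetical sample data.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "age", "30"),
    ("alice", "email", "a@example.org"),
    ("bob", "name", "Bob"),
    ("bob", "age", "25"),
    ("carol", "name", "Carol"),
]

ptp = partition_by_property(build_property_table(triples))
result = star_query(ptp, ["name", "email"])
print(result)
```

In this toy data the `email` partition contains a single subject, so a query asking for both `name` and `email` scans one row instead of the three rows in the full property table, and needs no join between per-property tables.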

Original language: English (US)
Pages (from-to): 191-231
Number of pages: 41
Journal: Distributed and Parallel Databases
Volume: 41
Issue number: 3
DOIs
State: Published - Sep 2023

Keywords

  • Data partitioning
  • RDF
  • SPARQL
  • Spark

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture
  • Information Systems and Management
