TY - GEN
T1 - S3QLRDF
T2 - 2020 IEEE International Conference on Smart Data Services, SMDS 2020
AU - Hassan, Mahmudul
AU - Bansal, Srividya K.
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/10
Y1 - 2020/10
AB - The proliferation of the semantic web in the form of Resource Description Framework (RDF) data demands efficient, scalable, distributed storage along with a highly available and fault-tolerant parallel processing strategy. More precisely, the rapid growth of RDF data raises the need for an efficient partitioning strategy over distributed data management systems that improves SPARQL query performance regardless of query pattern shape while minimizing preprocessing time. In this context, we propose a new relational partitioning scheme for RDF data called Property Table Partitioning (PTP), which further partitions the existing Property Table into multiple tables based on distinct properties (each comprising all subjects with non-null values for those properties) in order to minimize input data and join operations. We introduce a distributed RDF data management system called S3QLRDF, which is built on top of Spark and uses SQL to execute SPARQL queries over the PTP schema. We perform an extensive experimental evaluation of preprocessing costs and query performance using the Lehigh University Benchmark (LUBM) and Waterloo SPARQL Diversity Test Suite (WatDiv) datasets with up to 1.4 billion triples. Our results demonstrate that S3QLRDF outperforms state-of-the-art distributed RDF management systems.
KW - Resource Description Framework
KW - Semantic Web
KW - SPARQL Querying
KW - Data Partitioning
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85099258897&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099258897&partnerID=8YFLogxK
U2 - 10.1109/SMDS49396.2020.00023
DO - 10.1109/SMDS49396.2020.00023
M3 - Conference contribution
AN - SCOPUS:85099258897
T3 - Proceedings - 2020 IEEE International Conference on Smart Data Services, SMDS 2020
SP - 133
EP - 140
BT - Proceedings - 2020 IEEE International Conference on Smart Data Services, SMDS 2020
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 October 2020 through 24 October 2020
ER -