String similarity join with different similarity thresholds based on novel indexing techniques

Chuitian Rong, Yasin Silva, Chunqing Li

Research output: Contribution to journal (Article)

2 Scopus citations

Abstract

String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity using a similarity function. String pairs whose similarity is above a given threshold are returned as results. The current approach to the similarity join problem uses a single threshold value. There are, however, several scenarios that require multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate many more typos than shorter ones. Therefore, we proposed a solution for string similarity joins that supports different similarity thresholds in a single operator. To support different thresholds, we devised two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve join performance, we proposed new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
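
To make the problem statement concrete, the following is a minimal Python sketch of a similarity join in which the threshold depends on string length. It is a naive nested-loop baseline written for illustration only, not the partition based or similarity aware indexing described in the abstract. The normalized edit similarity, the `length_threshold` rule with its cut-off values, and all function names are assumptions introduced here; the abstract does not specify them.

```python
from typing import Callable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def length_threshold(a: str, b: str) -> float:
    """Hypothetical rule: short pairs must match more strictly than long ones."""
    return 0.9 if max(len(a), len(b)) < 10 else 0.8


def similarity_join(
    left: List[str],
    right: List[str],
    threshold_for: Callable[[str, str], float],
) -> List[Tuple[str, str, float]]:
    """Compare every pair and keep those that meet their own threshold."""
    results = []
    for s in left:
        for t in right:
            sim = edit_similarity(s, t)
            if sim >= threshold_for(s, t):
                results.append((s, t, sim))
    return results


left = ["kitten", "similarity join processing"]
right = ["sitten", "similarity joins procesing"]
matches = similarity_join(left, right, length_threshold)
```

In this sketch the short pair ("kitten", "sitten") is one edit apart but falls below the stricter threshold assumed for short strings, while the long pair is two edits apart and still qualifies. Comparing every pair costs time proportional to the product of the collection sizes, which is the cost that indexing and filtering techniques aim to avoid.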

Original language: English (US)
Pages (from-to): 1-13
Number of pages: 13
Journal: Frontiers of Computer Science
State: Accepted/In press - Oct 11 2016

Keywords

  • similarity aware index
  • similarity join
  • similarity thresholds

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)
