Panther: Fast top-k similarity search on large networks

Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, Juanzi Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

Estimating similarity between vertices is a fundamental issue in network analysis across various domains, such as social networks and biological networks. Methods based on common neighbors and structural contexts have received much attention. However, both categories of methods are difficult to scale up to handle large networks (with billions of nodes). In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices. The algorithm is based on a novel idea of random path. Specifically, given a network, we perform R random walks, each starting from a randomly picked vertex and walking T steps. Theoretically, the algorithm guarantees that the sampling size R = O(2ε⁻² log₂ T) depends on the error bound ε, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300x faster than the state-of-the-art methods. We also use two applications, identity resolution and structural hole spanner finding, to evaluate the accuracy of the estimated similarities. Our results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.
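
The sampling procedure described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The abstract does not define the similarity score, so the sketch assumes that the similarity of two vertices is the fraction of sampled paths on which both appear. The sample-size helper assumes a bound of the form R = (c/ε²)(log₂ C(T,2) + 1 + ln(1/δ)) with c = 0.5; both the form and the constant are assumptions that are not stated above. The names sample_size, random_path_similarity, and top_k are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def sample_size(eps, delta, T, c=0.5):
    """Number of random paths R under the assumed bound
    R = (c / eps^2) * (log2 C(T, 2) + 1 + ln(1 / delta))."""
    if T < 2:
        raise ValueError("T must be at least 2")
    bound = (c / eps ** 2) * (math.log2(math.comb(T, 2)) + 1 + math.log(1 / delta))
    return math.ceil(bound)


def random_path_similarity(adj, R, T, seed=0):
    """Sample R random walks of T steps each and score every vertex pair
    by the fraction of walks that visit both vertices (assumed definition).

    adj maps each vertex to a list of neighbors; vertices must be sortable.
    """
    rng = random.Random(seed)
    vertices = sorted(adj)
    counts = defaultdict(int)
    for _ in range(R):
        current = rng.choice(vertices)      # start from a randomly picked vertex
        visited = {current}
        for _ in range(T):                  # walk T steps
            neighbors = adj[current]
            if not neighbors:               # dead end: stop this walk early
                break
            current = rng.choice(neighbors)
            visited.add(current)
        # Count each unordered pair once per walk, keyed as (smaller, larger).
        for u, v in combinations(sorted(visited), 2):
            counts[(u, v)] += 1
    return {pair: n / R for pair, n in counts.items()}


def top_k(sim, v, k):
    """Return the k vertices most similar to v as (vertex, score) pairs."""
    scores = []
    for (a, b), s in sim.items():
        if a == v:
            scores.append((b, s))
        elif b == v:
            scores.append((a, s))
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:k]


if __name__ == "__main__":
    # Toy graph: two triangles joined by the edge 2-3.
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    sim = random_path_similarity(graph, R=2000, T=5)
    print(top_k(sim, 0, 3))
```

The sketch materializes every co-occurring pair, which is only practical for small graphs. A billion-edge network would need an index from vertices to the paths that contain them, so that a top-k query touches only the paths through the query vertex.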

Original language: English (US)
Title of host publication: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 1445-1454
Number of pages: 10
Volume: 2015-August
ISBN (Print): 9781450336642
DOI: https://doi.org/10.1145/2783258.2783267
State: Published - Aug 10 2015
Event: 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015 - Sydney, Australia
Duration: Aug 10 2015 to Aug 13 2015

Other

Other: 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015
Country: Australia
City: Sydney
Period: 8/10/15 to 8/13/15

Fingerprint

Sampling
Electric network analysis

Keywords

  • Random path
  • Social network
  • Vertex similarity

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Zhang, J., Tang, J., Ma, C., Tong, H., Jing, Y., & Li, J. (2015). Panther: Fast top-k similarity search on large networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Vol. 2015-August, pp. 1445-1454). Association for Computing Machinery. https://doi.org/10.1145/2783258.2783267

Panther: Fast top-k similarity search on large networks. / Zhang, Jing; Tang, Jie; Ma, Cong; Tong, Hanghang; Jing, Yu; Li, Juanzi.

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vol. 2015-August. Association for Computing Machinery, 2015. p. 1445-1454.

Zhang, J, Tang, J, Ma, C, Tong, H, Jing, Y & Li, J 2015, Panther: Fast top-k similarity search on large networks. in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. vol. 2015-August, Association for Computing Machinery, pp. 1445-1454, 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015, Sydney, Australia, 8/10/15. https://doi.org/10.1145/2783258.2783267

Zhang J, Tang J, Ma C, Tong H, Jing Y, Li J. Panther: Fast top-k similarity search on large networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vol. 2015-August. Association for Computing Machinery. 2015. p. 1445-1454. https://doi.org/10.1145/2783258.2783267

Zhang, Jing; Tang, Jie; Ma, Cong; Tong, Hanghang; Jing, Yu; Li, Juanzi. / Panther: Fast top-k similarity search on large networks. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Vol. 2015-August. Association for Computing Machinery, 2015. pp. 1445-1454.

@inproceedings{2d32aae20abb49e99f20adc54c25925c,
title = "Panther: Fast top-k similarity search on large networks",
abstract = "Estimating similarity between vertices is a fundamental issue in network analysis across various domains, such as social networks and biological networks. Methods based on common neighbors and structural contexts have received much attention. However, both categories of methods are difficult to scale up to handle large networks (with billions of nodes). In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices. The algorithm is based on a novel idea of random path. Specifically, given a network, we perform R random walks, each starting from a randomly picked vertex and walking T steps. Theoretically, the algorithm guarantees that the sampling size R = O(2ε⁻² log₂ T) depends on the error bound ε, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300x faster than the state-of-the-art methods. We also use two applications, identity resolution and structural hole spanner finding, to evaluate the accuracy of the estimated similarities. Our results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.",
keywords = "Random path, Social network, Vertex similarity",
author = "Jing Zhang and Jie Tang and Cong Ma and Hanghang Tong and Yu Jing and Juanzi Li",
year = "2015",
month = "8",
day = "10",
doi = "10.1145/2783258.2783267",
language = "English (US)",
isbn = "9781450336642",
volume = "2015-August",
pages = "1445--1454",
booktitle = "Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - Panther

T2 - Fast top-k similarity search on large networks

AU - Zhang, Jing

AU - Tang, Jie

AU - Ma, Cong

AU - Tong, Hanghang

AU - Jing, Yu

AU - Li, Juanzi

PY - 2015/8/10

Y1 - 2015/8/10

N2 - Estimating similarity between vertices is a fundamental issue in network analysis across various domains, such as social networks and biological networks. Methods based on common neighbors and structural contexts have received much attention. However, both categories of methods are difficult to scale up to handle large networks (with billions of nodes). In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices. The algorithm is based on a novel idea of random path. Specifically, given a network, we perform R random walks, each starting from a randomly picked vertex and walking T steps. Theoretically, the algorithm guarantees that the sampling size R = O(2ε⁻² log₂ T) depends on the error bound ε, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300x faster than the state-of-the-art methods. We also use two applications, identity resolution and structural hole spanner finding, to evaluate the accuracy of the estimated similarities. Our results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.

KW - Random path

KW - Social network

KW - Vertex similarity

UR - http://www.scopus.com/inward/record.url?scp=84954170138&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84954170138&partnerID=8YFLogxK

U2 - 10.1145/2783258.2783267

DO - 10.1145/2783258.2783267

M3 - Conference contribution

AN - SCOPUS:84954170138

SN - 9781450336642

VL - 2015-August

SP - 1445

EP - 1454

BT - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

PB - Association for Computing Machinery

ER -