TY - GEN
T1 - Panther: Fast top-k similarity search on large networks
T2 - 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015
AU - Zhang, Jing
AU - Tang, Jie
AU - Ma, Cong
AU - Tong, Hanghang
AU - Jing, Yu
AU - Li, Juanzi
N1 - Publisher Copyright: © 2015 ACM. Copyright 2018 Elsevier B.V., All rights reserved.
PY - 2015/8/10
Y1 - 2015/8/10
N2 - Estimating similarity between vertices is a fundamental issue in network analysis across various domains, such as social networks and biological networks. Methods based on common neighbors and structural contexts have received much attention. However, both categories of methods are difficult to scale up to handle large networks (with billions of nodes). In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices. The algorithm is based on the novel idea of a random path. Specifically, given a network, we perform R random walks, each starting from a randomly picked vertex and walking T steps. Theoretically, the algorithm guarantees that the sampling size R = O(2ε^{-2} log_2 T) depends on the error bound ε, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return the top-k similar vertices for any vertex in a network 300x faster than the state-of-the-art methods. We also use two applications, identity resolution and structural hole spanner finding, to evaluate the accuracy of the estimated similarities. Our results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.
AB - Estimating similarity between vertices is a fundamental issue in network analysis across various domains, such as social networks and biological networks. Methods based on common neighbors and structural contexts have received much attention. However, both categories of methods are difficult to scale up to handle large networks (with billions of nodes). In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices. The algorithm is based on the novel idea of a random path. Specifically, given a network, we perform R random walks, each starting from a randomly picked vertex and walking T steps. Theoretically, the algorithm guarantees that the sampling size R = O(2ε^{-2} log_2 T) depends on the error bound ε, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return the top-k similar vertices for any vertex in a network 300x faster than the state-of-the-art methods. We also use two applications, identity resolution and structural hole spanner finding, to evaluate the accuracy of the estimated similarities. Our results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.
KW - Random path
KW - Social network
KW - Vertex similarity
UR - http://www.scopus.com/inward/record.url?scp=84954170138&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84954170138&partnerID=8YFLogxK
U2 - 10.1145/2783258.2783267
DO - 10.1145/2783258.2783267
M3 - Conference contribution
AN - SCOPUS:84954170138
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 1445
EP - 1454
BT - KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
Y2 - 10 August 2015 through 13 August 2015
ER -