Fast and flexible top-k similarity search on large networks

Jing Zhang; Jie Tang; Cong Ma; Hanghang Tong; Yu Jing; Juanzi Li; Walter Luyten; Marie Francine Moens

doi:10.1145/3086695

Research output: Contribution to journal › Article › peer-review

19 Scopus citations

Abstract

Similarity search is a fundamental problem in network analysis and can be applied in many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method using random paths to estimate the similarities based on both common neighbors and structural contexts efficiently in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sampling size depends on the error-bound ϵ, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300× faster than the state-of-the-art methods. We develop a prototype system of recommending similar authors to demonstrate the effectiveness of our method.
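
The abstract only names the idea of estimating similarity from sampled random paths, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes an unweighted graph stored as adjacency lists, and it assumes that the similarity of two vertices is estimated as the fraction of sampled paths that contain both. The function names and the parameters num_paths and path_len are hypothetical. The number of paths is left as a plain argument because the abstract states only that it depends on ϵ, δ, and T, without giving the formula.

```python
import random
from collections import defaultdict


def sample_random_paths(adj, num_paths, path_len, rng):
    """Sample random walks that each take up to `path_len` steps.

    `adj` maps every vertex to a list of its neighbors (assumed format).
    """
    vertices = sorted(adj)
    paths = []
    for _ in range(num_paths):
        path = [rng.choice(vertices)]  # uniform random start vertex
        for _ in range(path_len):
            neighbors = adj[path[-1]]
            if not neighbors:  # dead end: stop this walk early
                break
            path.append(rng.choice(neighbors))
        paths.append(path)
    return paths


def top_k_similar(adj, query, k, num_paths, path_len, seed=0):
    """Return the k vertices that co-occur most often with `query` on sampled paths.

    The score of a vertex is the fraction of all sampled paths that contain
    both that vertex and `query` (an assumed estimator for illustration).
    """
    rng = random.Random(seed)
    paths = sample_random_paths(adj, num_paths, path_len, rng)
    counts = defaultdict(int)
    for path in paths:
        members = set(path)  # count each vertex at most once per path
        if query in members:
            for vertex in members:
                if vertex != query:
                    counts[vertex] += 1
    scores = {vertex: count / num_paths for vertex, count in counts.items()}
    # Highest score first; ties broken by vertex name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))[:k]


if __name__ == "__main__":
    # Toy coauthor-style graph with two disconnected groups.
    toy_graph = {
        "a": ["b", "c"],
        "b": ["a", "c"],
        "c": ["a", "b"],
        "x": ["y"],
        "y": ["x"],
    }
    for vertex, score in top_k_similar(toy_graph, "a", k=2, num_paths=2000, path_len=3):
        print(vertex, round(score, 3))
```

In this sketch, vertices that never share a sampled path with the query receive no score at all, so the result can contain fewer than k entries on sparse or disconnected graphs.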

Original language	English (US)
Article number	13
Journal	ACM Transactions on Information Systems
Volume	36
Issue number	2
DOIs	https://doi.org/10.1145/3086695
State	Published - Aug 2017

Keywords

Heterogeneous information network
Random path
Similarity search
Social network
Vertex similarity

ASJC Scopus subject areas

Information Systems
General Business, Management and Accounting
Computer Science Applications

Access to Document

10.1145/3086695

Cite this

@article{82762c4b34e54eaabd99f66d969195aa,

title = "Fast and flexible top-k similarity search on large networks",

abstract = "Similarity search is a fundamental problem in network analysis and can be applied in many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method using random paths to estimate the similarities based on both common neighbors and structural contexts efficiently in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sampling size depends on the error-bound ϵ, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300× faster than the state-of-the-art methods. We develop a prototype system of recommending similar authors to demonstrate the effectiveness of our method.",

keywords = "Heterogeneous information network, Random path, Similarity search, Social network, Vertex similarity",

author = "Jing Zhang and Jie Tang and Cong Ma and Hanghang Tong and Yu Jing and Juanzi Li and Walter Luyten and Moens, {Marie Francine}",

note = "Funding Information: The authors thank Pei Lee, Laks V. S. Lakshmanan, Jeffrey Xu Yu, Ruoming Jin, Victor E. Lee, Hui Xiong, Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina Eliassi-Rad, and Christos Faloutsos for sharing codes of the comparison methods. We thank Tina Eliassi-Rad for sharing the datasets. We thank Gang Fu, Bin Chen, Ying Ding, and Abhik Seal for sharing the datasets and baseline results. This work was supported by the National Basic Research Program of China (2014CB340506, 2014CB340402), the National Natural Science Foundation of China (61631013, 61561130160, 61532021), the National Social Science Foundation of China (13\&ZD190), the National Key Research and Develop Plan (2016YFB1000702), a research fund supported by MSRA, the Royal Society-Newton Advanced Fellowship Award, the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (15XNLQ06, 17XNLF09). Publisher Copyright: {\textcopyright} 2017 ACM.",

year = "2017",

month = aug,

doi = "10.1145/3086695",

language = "English (US)",

volume = "36",

journal = "ACM Transactions on Information Systems",

issn = "1046-8188",

publisher = "Association for Computing Machinery (ACM)",

number = "2",

}

TY - JOUR

T1 - Fast and flexible top-k similarity search on large networks

AU - Zhang, Jing

AU - Tang, Jie

AU - Ma, Cong

AU - Tong, Hanghang

AU - Jing, Yu

AU - Li, Juanzi

AU - Luyten, Walter

AU - Moens, Marie Francine

N1 - Funding Information: The authors thank Pei Lee, Laks V. S. Lakshmanan, Jeffrey Xu Yu, Ruoming Jin, Victor E. Lee, Hui Xiong, Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina Eliassi-Rad, and Christos Faloutsos for sharing codes of the comparison methods. We thank Tina Eliassi-Rad for sharing the datasets. We thank Gang Fu, Bin Chen, Ying Ding, and Abhik Seal for sharing the datasets and baseline results. This work was supported by the National Basic Research Program of China (2014CB340506, 2014CB340402), the National Natural Science Foundation of China (61631013, 61561130160, 61532021), the National Social Science Foundation of China (13&ZD190), the National Key Research and Develop Plan (2016YFB1000702), a research fund supported by MSRA, the Royal Society-Newton Advanced Fellowship Award, the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (15XNLQ06, 17XNLF09). Publisher Copyright: © 2017 ACM.

PY - 2017/8

Y1 - 2017/8

N2 - Similarity search is a fundamental problem in network analysis and can be applied in many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method using random paths to estimate the similarities based on both common neighbors and structural contexts efficiently in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sampling size depends on the error-bound ϵ, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300× faster than the state-of-the-art methods. We develop a prototype system of recommending similar authors to demonstrate the effectiveness of our method.

AB - Similarity search is a fundamental problem in network analysis and can be applied in many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method using random paths to estimate the similarities based on both common neighbors and structural contexts efficiently in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sampling size depends on the error-bound ϵ, the confidence level (1 - δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300× faster than the state-of-the-art methods. We develop a prototype system of recommending similar authors to demonstrate the effectiveness of our method.

KW - Heterogeneous information network

KW - Random path

KW - Similarity search

KW - Social network

KW - Vertex similarity

UR - http://www.scopus.com/inward/record.url?scp=85028562955&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85028562955&partnerID=8YFLogxK

U2 - 10.1145/3086695

DO - 10.1145/3086695

M3 - Article

AN - SCOPUS:85028562955

SN - 1046-8188

VL - 36

JO - ACM Transactions on Information Systems

JF - ACM Transactions on Information Systems

IS - 2

M1 - 13

ER -
