TY - GEN
T1 - On consistency of graph-based semi-supervised learning
AU - Du, Chengan
AU - Zhao, Yunpeng
AU - Wang, Feng
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
N2 - Graph-based semi-supervised learning is one of the most popular methods in machine learning. Some of its theoretical properties, such as bounds on the generalization error and the convergence of the graph Laplacian regularizer, have been studied in the computer science and statistics literature. However, a fundamental statistical property - consistency, that is, whether the algorithm's predictions can identify the underlying truth given unlimited data (not to be confused with the consistency of an equation system in algebra) - has not been proved. In this article, we study the consistency problem under a non-parametric framework. We obtain the following two results: 1) We prove that graph-based semi-supervised learning on the test data is consistent when the estimated scores are enforced to equal the observed responses for the labeled data (the hard criterion). In this result, the sample size of the unlabeled data is allowed to grow at a slower rate than that of the labeled data. 2) We give a counterexample demonstrating that the estimator can be inconsistent when the estimated scores are not required to equal the observed responses (the soft criterion), where a tuning parameter balances the loss function and the graph Laplacian regularizer. These somewhat surprising theoretical findings are supported by numerical studies on both synthetic and real datasets. Moreover, the numerical studies show that the hard criterion consistently outperforms the soft criterion even when the sample size of the unlabeled data is smaller than that of the labeled data. This suggests that practitioners can safely choose the hard criterion without the burden of selecting the tuning parameter in the soft criterion.
AB - Graph-based semi-supervised learning is one of the most popular methods in machine learning. Some of its theoretical properties, such as bounds on the generalization error and the convergence of the graph Laplacian regularizer, have been studied in the computer science and statistics literature. However, a fundamental statistical property - consistency, that is, whether the algorithm's predictions can identify the underlying truth given unlimited data (not to be confused with the consistency of an equation system in algebra) - has not been proved. In this article, we study the consistency problem under a non-parametric framework. We obtain the following two results: 1) We prove that graph-based semi-supervised learning on the test data is consistent when the estimated scores are enforced to equal the observed responses for the labeled data (the hard criterion). In this result, the sample size of the unlabeled data is allowed to grow at a slower rate than that of the labeled data. 2) We give a counterexample demonstrating that the estimator can be inconsistent when the estimated scores are not required to equal the observed responses (the soft criterion), where a tuning parameter balances the loss function and the graph Laplacian regularizer. These somewhat surprising theoretical findings are supported by numerical studies on both synthetic and real datasets. Moreover, the numerical studies show that the hard criterion consistently outperforms the soft criterion even when the sample size of the unlabeled data is smaller than that of the labeled data. This suggests that practitioners can safely choose the hard criterion without the burden of selecting the tuning parameter in the soft criterion.
KW - Consistency
KW - Graph Laplacian
KW - Semi-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85074844007&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074844007&partnerID=8YFLogxK
U2 - 10.1109/ICDCS.2019.00055
DO - 10.1109/ICDCS.2019.00055
M3 - Conference contribution
AN - SCOPUS:85074844007
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 483
EP - 491
BT - Proceedings - 2019 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 39th IEEE International Conference on Distributed Computing Systems, ICDCS 2019
Y2 - 7 July 2019 through 9 July 2019
ER -