TY - GEN
T1 - Using undiagnosed data to enhance computerized breast cancer analysis with a three stage data labeling method
AU - Sun, Wenqing
AU - Tseng, Tzu Liang
AU - Zheng, Bin
AU - Lure, Fleming
AU - Wu, Teresa
AU - Francia, Giulio
AU - Cabrera, Sergio
AU - Zhang, Jianying
AU - Vélez-Reyes, Miguel
AU - Qian, Wei
PY - 2014
Y1 - 2014
N2 - A novel three-stage Semi-Supervised Learning (SSL) approach is proposed for improving the performance of computerized breast cancer analysis with undiagnosed data. The three stages are: (1) instance selection, which is rarely used in SSL or computerized cancer analysis systems, (2) feature selection, and (3) a newly designed 'Divide Co-training' data labeling method. The study used 379 suspicious early breast cancer area samples from 121 mammograms. The proposed 'Divide Co-training' method generates two classifiers by splitting the original diagnosed dataset (labeled data), and labels the undiagnosed data (unlabeled data) when the two classifiers reach an agreement. The highest AUC (Area Under Curve, also called the Az value) using labeled data only was 0.832, and it increased to 0.889 when undiagnosed data were included. The results indicate that the instance selection module can eliminate atypical or noisy data and enhance the subsequent semi-supervised data labeling performance. An analysis of different data sizes shows that the AUC and accuracy increase as either the diagnosed data or the undiagnosed data increase, and that the best improvement (ΔAUC = 0.078, ΔAccuracy = 7.6%) is reached with 40 labeled and 300 unlabeled samples.
AB - A novel three-stage Semi-Supervised Learning (SSL) approach is proposed for improving the performance of computerized breast cancer analysis with undiagnosed data. The three stages are: (1) instance selection, which is rarely used in SSL or computerized cancer analysis systems, (2) feature selection, and (3) a newly designed 'Divide Co-training' data labeling method. The study used 379 suspicious early breast cancer area samples from 121 mammograms. The proposed 'Divide Co-training' method generates two classifiers by splitting the original diagnosed dataset (labeled data), and labels the undiagnosed data (unlabeled data) when the two classifiers reach an agreement. The highest AUC (Area Under Curve, also called the Az value) using labeled data only was 0.832, and it increased to 0.889 when undiagnosed data were included. The results indicate that the instance selection module can eliminate atypical or noisy data and enhance the subsequent semi-supervised data labeling performance. An analysis of different data sizes shows that the AUC and accuracy increase as either the diagnosed data or the undiagnosed data increase, and that the best improvement (ΔAUC = 0.078, ΔAccuracy = 7.6%) is reached with 40 labeled and 300 unlabeled samples.
KW - Computerized breast cancer analysis
KW - Semi-supervised learning
KW - Undiagnosed data
UR - http://www.scopus.com/inward/record.url?scp=84902106527&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84902106527&partnerID=8YFLogxK
U2 - 10.1117/12.2043708
DO - 10.1117/12.2043708
M3 - Conference contribution
AN - SCOPUS:84902106527
SN - 9780819498281
T3 - Progress in Biomedical Optics and Imaging - Proceedings of SPIE
BT - Medical Imaging 2014
PB - SPIE
T2 - Medical Imaging 2014: Computer-Aided Diagnosis
Y2 - 18 February 2014 through 20 February 2014
ER -