Transfer learning for bilingual content classification

Qian Sun; Mohammad S. Amin; Baoshi Yan; Craig Martell; Vita Markman; Anmol Bhasin; Jieping Ye

doi:10.1145/2783258.2788575

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Scopus citations

Abstract

LinkedIn Groups provide a platform on which professionals with similar backgrounds, goals, and specialties can share content, take part in discussions, and form opinions on industry topics. As in most online social communities, spam content in LinkedIn Groups degrades the user experience and could eventually lead to a substantial loss of active users. Building an intelligent and scalable spam detection system is highly desirable but faces difficulties such as a lack of labeled training data, particularly for languages other than English. In this paper, we take the detection of spam job postings in Spanish as the target problem and build a generic machine learning pipeline for multilingual spam detection. The main components are feature generation and knowledge migration via transfer learning. Specifically, in the feature generation phase, a relatively large labeled data set is generated via machine translation. Together with a large set of unlabeled human-written Spanish data, it is used to generate unigram features based on frequency. In the second phase, the machine-translated data are reweighted to capture their discrepancy from the human-written data, and classifiers are built on top of the reweighted data. To make effective use of the small amount of labeled human-written Spanish data available, an adaptive transfer learning algorithm is proposed to further improve performance. We evaluate the proposed method on LinkedIn's production data, and the promising results verify the efficacy of the proposed algorithm. The pipeline is ready for production.
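The two phases described in the abstract can be illustrated with a small sketch. The abstract does not specify how the reweighting is computed or which classifier is used, so everything below is an assumption made for illustration: the weights are a smoothed unigram likelihood ratio between the human-written and machine-translated corpora, the classifier is a weighted multinomial naive Bayes, and all function names (`importance_weights`, `train_weighted_nb`, `predict`) and the toy Spanish sentences are hypothetical. It is not the paper's algorithm, and it omits the adaptive transfer learning step.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer producing unigrams."""
    return text.lower().split()


def unigram_probs(docs, vocab, alpha=1.0):
    """Additively smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def importance_weights(translated_docs, human_docs):
    """Weight each machine-translated document by how human-like its
    unigrams are (assumed scheme: exponentiated mean per-token log ratio
    of human to translated unigram probability, normalized to mean 1)."""
    vocab = {t for d in translated_docs + human_docs for t in tokenize(d)}
    p_human = unigram_probs(human_docs, vocab)
    p_trans = unigram_probs(translated_docs, vocab)
    weights = []
    for doc in translated_docs:
        toks = tokenize(doc)
        log_ratio = sum(math.log(p_human[t] / p_trans[t]) for t in toks)
        weights.append(math.exp(log_ratio / max(len(toks), 1)))
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]


def train_weighted_nb(docs, labels, weights, alpha=1.0):
    """Multinomial naive Bayes in which each document counts `weight` times."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    prior = Counter()
    counts = {y: Counter() for y in set(labels)}
    for doc, y, w in zip(docs, labels, weights):
        prior[y] += w
        for t in tokenize(doc):
            counts[y][t] += w
    total_prior = sum(prior.values())
    model = {}
    for y, c in counts.items():
        denom = sum(c.values()) + alpha * len(vocab)
        log_lik = {t: math.log((c[t] + alpha) / denom) for t in vocab}
        model[y] = (math.log(prior[y] / total_prior), log_lik)
    return model


def predict(model, text):
    """Return the label with the highest log posterior; unseen tokens are skipped."""
    def score(y):
        log_prior, log_lik = model[y]
        return log_prior + sum(log_lik[t] for t in tokenize(text) if t in log_lik)
    return max(model, key=score)


if __name__ == "__main__":
    # Toy data: labeled machine-translated posts (1 = spam) and unlabeled human-written posts.
    translated = [
        "gane dinero rapido trabajo desde casa",
        "se busca ingeniero de software con experiencia",
        "dinero facil haga clic aqui",
        "vacante para analista de datos en madrid",
    ]
    labels = [1, 0, 1, 0]
    human = [
        "buscamos ingeniero de datos con experiencia en madrid",
        "gana dinero facil desde casa clic aqui",
        "vacante de analista de software",
    ]
    weights = importance_weights(translated, human)
    model = train_weighted_nb(translated, labels, weights)
    print(weights)
    print(predict(model, "dinero facil desde casa"))
```

Normalizing the weights to a mean of 1 keeps the effective training-set size unchanged, so the reweighting shifts emphasis toward translated documents that resemble human-written text without altering the overall scale of the counts.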

Original language	English (US)
Title of host publication	KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher	Association for Computing Machinery
Pages	2147-2156
Number of pages	10
ISBN (Electronic)	9781450336642
DOIs	https://doi.org/10.1145/2783258.2788575
State	Published - Aug 10 2015
Externally published	Yes
Event	21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015 - Sydney, Australia
Duration	Aug 10 2015 → Aug 13 2015

Publication series

Name	Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Volume	2015-August

Conference

Conference	21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015
Country/Territory	Australia
City	Sydney
Period	8/10/15 → 8/13/15

Keywords

Classification
NLP
Text mining
Transfer learning

ASJC Scopus subject areas

Software
Information Systems

Access to Document

10.1145/2783258.2788575

Cite this

Sun, Q., Amin, M. S., Yan, B., Martell, C., Markman, V., Bhasin, A., & Ye, J. (2015). Transfer learning for bilingual content classification. In KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 2147-2156). (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Vol. 2015-August). Association for Computing Machinery. https://doi.org/10.1145/2783258.2788575

Transfer learning for bilingual content classification. / Sun, Qian; Amin, Mohammad S.; Yan, Baoshi et al.
KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 2015. p. 2147-2156 (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Vol. 2015-August).

Sun, Q, Amin, MS, Yan, B, Martell, C, Markman, V, Bhasin, A & Ye, J 2015, Transfer learning for bilingual content classification. in KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 2015-August, Association for Computing Machinery, pp. 2147-2156, 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015, Sydney, Australia, 8/10/15. https://doi.org/10.1145/2783258.2788575

Sun Q, Amin MS, Yan B, Martell C, Markman V, Bhasin A et al. Transfer learning for bilingual content classification. In KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery. 2015. p. 2147-2156. (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining). doi: 10.1145/2783258.2788575

@inproceedings{1a10e14f074f4e97acc039b784115977,

title = "Transfer learning for bilingual content classification",

abstract = "LinkedIn Groups provide a platform on which professionals with similar background, target and specialities can share content, take part in discussions and establish opinions on industry topics. As in most online social communities, spam content in LinkedIn Groups poses great challenges to the user experience and could eventually lead to substantial loss of active users. Building an intelligent and scalable spam detection system is highly desirable but faces difficulties such as lack of labeled training data, particularly for languages other than English. In this paper, we take the spam (Spanish) job posting detection as the target problem and build a generic machine learning pipeline for multi-lingual spam detection. The main components are feature generation and knowledge migration via transfer learning. Specifically, in the feature generation phase, a relatively large labeled data set is generated via machine translation. Together with a large set of unlabeled human written Spanish data, unigram features are generated based on the frequency. In the second phase, machine translated data are properly reweighted to capture the discrepancy from human written ones and classifiers can be built on top of them. To make effective use of a small portion of labeled data available in human written Spanish, an adaptive transfer learning algorithm is proposed to further improve the performance. We evaluate the proposed method on LinkedIn's production data and the promising results verify the efficacy of our proposed algorithm. The pipeline is ready for production.",

keywords = "Classification, NLP, Text mining, Transfer learning",

author = "Qian Sun and Amin, {Mohammad S.} and Baoshi Yan and Craig Martell and Vita Markman and Anmol Bhasin and Jieping Ye",

note = "Publisher Copyright: {\textcopyright} 2015 ACM.; 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015 ; Conference date: 10-08-2015 Through 13-08-2015",

year = "2015",

month = aug,

day = "10",

doi = "10.1145/2783258.2788575",

language = "English (US)",

series = "Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",

publisher = "Association for Computing Machinery",

pages = "2147--2156",

booktitle = "KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining",

}

TY - GEN

T1 - Transfer learning for bilingual content classification

AU - Sun, Qian

AU - Amin, Mohammad S.

AU - Yan, Baoshi

AU - Martell, Craig

AU - Markman, Vita

AU - Bhasin, Anmol

AU - Ye, Jieping

PY - 2015/8/10

Y1 - 2015/8/10

AB - LinkedIn Groups provide a platform on which professionals with similar background, target and specialities can share content, take part in discussions and establish opinions on industry topics. As in most online social communities, spam content in LinkedIn Groups poses great challenges to the user experience and could eventually lead to substantial loss of active users. Building an intelligent and scalable spam detection system is highly desirable but faces difficulties such as lack of labeled training data, particularly for languages other than English. In this paper, we take the spam (Spanish) job posting detection as the target problem and build a generic machine learning pipeline for multi-lingual spam detection. The main components are feature generation and knowledge migration via transfer learning. Specifically, in the feature generation phase, a relatively large labeled data set is generated via machine translation. Together with a large set of unlabeled human written Spanish data, unigram features are generated based on the frequency. In the second phase, machine translated data are properly reweighted to capture the discrepancy from human written ones and classifiers can be built on top of them. To make effective use of a small portion of labeled data available in human written Spanish, an adaptive transfer learning algorithm is proposed to further improve the performance. We evaluate the proposed method on LinkedIn's production data and the promising results verify the efficacy of our proposed algorithm. The pipeline is ready for production.

KW - Classification

KW - NLP

KW - Text mining

KW - Transfer learning

UR - http://www.scopus.com/inward/record.url?scp=84954145980&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84954145980&partnerID=8YFLogxK

U2 - 10.1145/2783258.2788575

DO - 10.1145/2783258.2788575

M3 - Conference contribution

AN - SCOPUS:84954145980

T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

SP - 2147

EP - 2156

BT - KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining

PB - Association for Computing Machinery

T2 - 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015

Y2 - 10 August 2015 through 13 August 2015

ER -
