Transfer learning for bilingual content classification

Qian Sun, Mohammad S. Amin, Baoshi Yan, Craig Martell, Vita Markman, Anmol Bhasin, Jieping Ye

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Scopus citations

Abstract

LinkedIn Groups provide a platform on which professionals with similar backgrounds, goals, and specialties can share content, take part in discussions, and establish opinions on industry topics. As in most online social communities, spam content in LinkedIn Groups poses great challenges to the user experience and could eventually lead to a substantial loss of active users. Building an intelligent and scalable spam detection system is highly desirable but faces difficulties such as a lack of labeled training data, particularly for languages other than English. In this paper, we take spam detection in Spanish job postings as the target problem and build a generic machine learning pipeline for multi-lingual spam detection. The main components are feature generation and knowledge migration via transfer learning. Specifically, in the feature generation phase, a relatively large labeled data set is generated via machine translation. Together with a large set of unlabeled human-written Spanish data, it is used to generate unigram features based on frequency. In the second phase, the machine-translated data are reweighted to capture their discrepancy from the human-written data, and classifiers are built on top of the reweighted data. To make effective use of the small portion of labeled data available in human-written Spanish, an adaptive transfer learning algorithm is proposed to further improve performance. We evaluate the proposed method on LinkedIn's production data, and the promising results verify the efficacy of the proposed algorithm. The pipeline is ready for production.
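The first two phases described in the abstract (frequency-based unigram features, then reweighting of machine-translated examples toward human-written text) can be illustrated with a minimal sketch. The abstract does not specify the reweighting method, so the unigram likelihood-ratio weighting below is an assumption chosen for simplicity, not the authors' algorithm. The function names, the `min_count` threshold, the smoothing constant `alpha`, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def tokenize(doc):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return doc.lower().split()


def build_vocab(docs, min_count=2):
    """Keep unigrams whose total frequency across docs is at least min_count."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return sorted(tok for tok, c in counts.items() if c >= min_count)


def featurize(doc, vocab):
    """Unigram count vector for one document, in vocab order."""
    counts = Counter(tokenize(doc))
    return [counts[tok] for tok in vocab]


def unigram_probs(docs, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution restricted to vocab."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total = sum(counts[tok] for tok in vocab) + alpha * len(vocab)
    return {tok: (counts[tok] + alpha) / total for tok in vocab}


def importance_weights(translated, human, vocab):
    """Weight each machine-translated doc by how human-like its unigrams are.

    Assumed scheme: geometric mean over in-vocabulary tokens of
    p_human(token) / p_translated(token). Docs with no in-vocabulary
    tokens get the neutral weight 1.0.
    """
    p_human = unigram_probs(human, vocab)
    p_trans = unigram_probs(translated, vocab)
    weights = []
    for doc in translated:
        toks = [tok for tok in tokenize(doc) if tok in p_human]
        if not toks:
            weights.append(1.0)
            continue
        mean_log_ratio = sum(
            math.log(p_human[tok] / p_trans[tok]) for tok in toks
        ) / len(toks)
        weights.append(math.exp(mean_log_ratio))
    return weights


# Hypothetical toy data: labeled machine-translated posts and
# unlabeled human-written posts.
translated = [
    "oferta de trabajo urgente",
    "gane dinero rapido desde casa",
    "oferta de empleo en madrid",
]
human = [
    "buscamos ingeniero para empleo en madrid",
    "oferta de empleo en barcelona",
    "empleo en madrid para ingeniero",
]

vocab = build_vocab(translated + human, min_count=2)
features = [featurize(doc, vocab) for doc in translated]
weights = importance_weights(translated, human, vocab)
```

The resulting `weights` could be passed as per-example weights to any classifier that supports weighted training, so that machine-translated examples resembling human-written Spanish count more. The adaptive transfer learning step that uses the small labeled human-written set is not sketched here, because the abstract gives no detail about it.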

Original language: English (US)
Title of host publication: KDD 2015 - Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 2147-2156
Number of pages: 10
ISBN (Electronic): 9781450336642
State: Published - Aug 10 2015
Externally published: Yes
Event: 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015 - Sydney, Australia
Duration: Aug 10 2015 to Aug 13 2015

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Volume: 2015-August

Conference

Conference: 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2015
Country/Territory: Australia
City: Sydney
Period: 8/10/15 to 8/13/15

Keywords

  • Classification
  • NLP
  • Text mining
  • Transfer learning

ASJC Scopus subject areas

  • Software
  • Information Systems
