Beyond word2vec: Distance-graph tensor factorization for word and document embeddings

Suhang Wang; Charu Aggarwal; Huan Liu

doi:10.1145/3357384.3358051

Ira A. Fulton Schools of Engineering (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Word2vec methods such as Skip-gram and CBOW have attracted significant interest in recent years because of their ability to model semantic notions of word similarity and distances in sentences. A related methodology, referred to as doc2vec, is also able to embed sentences and paragraphs. These methodologies, however, lead to different embeddings that cannot be related to one another. In this paper, we present a tensor factorization methodology that simultaneously embeds words and sentences into latent representations in one shot. Furthermore, these latent representations are concretely related to one another via tensor factorization. Whereas word2vec and doc2vec depend on contextual windows to create the projections, our approach treats each document as a structural graph on words. All the documents in the corpus are then jointly factorized to simultaneously create an embedding for the individual documents and the words. Since the graphical representation of a document is much richer than a contextual window, the approach is capable of producing more powerful representations than the word2vec family of methods. We use a carefully designed negative sampling methodology to provide an efficient implementation of the approach. We also relate the approach to factorization machines, which provide an efficient alternative for its implementation. We present experimental results illustrating the effectiveness of the approach for document classification, information retrieval, and visualization.
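The abstract does not spell out how a document becomes a graph on words, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes one common definition of an order-k distance graph: a directed edge runs from word u to word v whenever v follows u within k positions, weighted by how often that happens. The function names `distance_graph` and `corpus_tensor`, the `order` parameter, and the sparse dictionary layout are all hypothetical choices made for illustration.

```python
from collections import Counter


def distance_graph(tokens, order=2):
    """Build a directed distance graph for one tokenized document.

    Assumed definition (not taken from the paper): an edge (src, dst)
    is counted once for every occurrence of dst that follows src by
    at most `order` positions.
    """
    edges = Counter()
    last = len(tokens) - 1
    for i, src in enumerate(tokens):
        # Look ahead at most `order` positions, without running off the end.
        for j in range(i + 1, min(i + order, last) + 1):
            edges[(src, tokens[j])] += 1
    return edges


def corpus_tensor(docs, order=2):
    """Stack per-document graphs into sparse tensor entries.

    Returns a dict mapping (doc_id, src, dst) -> weight, i.e. the
    non-zero cells of a document x word x word tensor that a joint
    factorization could take as input.
    """
    entries = {}
    for doc_id, tokens in enumerate(docs):
        for (src, dst), weight in distance_graph(tokens, order).items():
            entries[(doc_id, src, dst)] = weight
    return entries
```

Keeping the entries sparse reflects the fact that most word pairs never co-occur in a given document; how the zero cells are handled (for example through negative sampling, as the abstract mentions) is left out of this sketch.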

Original language	English (US)
Title of host publication	CIKM 2019 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management
Publisher	Association for Computing Machinery
Pages	1041-1050
Number of pages	10
ISBN (Electronic)	9781450369763
DOIs	https://doi.org/10.1145/3357384.3358051
State	Published - Nov 3 2019
Event	28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China (Nov 3 2019 → Nov 7 2019)

Publication series

Name	International Conference on Information and Knowledge Management, Proceedings

Conference

Conference	28th ACM International Conference on Information and Knowledge Management, CIKM 2019
Country/Territory	China
City	Beijing
Period	11/3/19 → 11/7/19

Keywords

Document Embedding
Pairwise Factorization
Word Embedding

ASJC Scopus subject areas

General Decision Sciences
General Business, Management and Accounting

Access to Document

10.1145/3357384.3358051

Cite this

Wang, S., Aggarwal, C., & Liu, H. (2019). Beyond word2vec: Distance-graph tensor factorization for word and document embeddings. In CIKM 2019 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management (pp. 1041-1050). (International Conference on Information and Knowledge Management, Proceedings). Association for Computing Machinery. https://doi.org/10.1145/3357384.3358051

@inproceedings{6244ec3709c9461dad0bcb1d36a3c839,

title = "Beyond word2vec: Distance-graph tensor factorization for word and document embeddings",

keywords = "Document Embedding, Pairwise Factorization, Word Embedding",

author = "Suhang Wang and Charu Aggarwal and Huan Liu",

note = "Funding Information: This material is based upon work supported by, or in part by, the National Science Foundation (NSF) under grants #1614576 and #1610282, and the Office of Naval Research (ONR) under grant N00014-17-1-2605. Publisher Copyright: {\textcopyright} 2019 Association for Computing Machinery.; 28th ACM International Conference on Information and Knowledge Management, CIKM 2019 ; Conference date: 03-11-2019 Through 07-11-2019",

year = "2019",

month = nov,

day = "3",

doi = "10.1145/3357384.3358051",

language = "English (US)",

series = "International Conference on Information and Knowledge Management, Proceedings",

publisher = "Association for Computing Machinery",

pages = "1041--1050",

booktitle = "CIKM 2019 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management",

}

TY - GEN

T1 - Beyond word2vec

T2 - 28th ACM International Conference on Information and Knowledge Management, CIKM 2019

AU - Wang, Suhang

AU - Aggarwal, Charu

AU - Liu, Huan

N1 - Funding Information: This material is based upon work supported by, or in part by, the National Science Foundation (NSF) under grants #1614576 and #1610282, and the Office of Naval Research (ONR) under grant N00014-17-1-2605. Publisher Copyright: © 2019 Association for Computing Machinery.

PY - 2019/11/3

Y1 - 2019/11/3

KW - Document Embedding

KW - Pairwise Factorization

KW - Word Embedding

UR - http://www.scopus.com/inward/record.url?scp=85075438955&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85075438955&partnerID=8YFLogxK

U2 - 10.1145/3357384.3358051

DO - 10.1145/3357384.3358051

M3 - Conference contribution

AN - SCOPUS:85075438955

T3 - International Conference on Information and Knowledge Management, Proceedings

SP - 1041

EP - 1050

BT - CIKM 2019 - Proceedings of the 28th ACM International Conference on Information and Knowledge Management

PB - Association for Computing Machinery

Y2 - 3 November 2019 through 7 November 2019

ER -
