Document clustering via matrix representation

Xufei Wang; Jiliang Tang; Huan Liu

doi:10.1109/ICDM.2011.59

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and easy to deal with computationally, but it also oversimplifies a document into a vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. A matrix representation of documents is proposed in this paper: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics which can be mapped to clustering structures. The latent topic extraction based on the matrix representation of documents is formulated as a constrained optimization problem in which each matrix (i.e., a document) A_i is factorized into a common base determined by non-negative matrices L and R^T, and a non-negative weight matrix M_i, such that the sum of the reconstruction errors over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with the vector representation, the matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets; and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods like LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at the segment level instead of the document level, which enables the return of more relevant documents in information retrieval tasks.
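The factorization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: it assumes a squared Frobenius reconstruction error, assumes every document matrix has the same number of terms and segments, and swaps in standard NMF-style multiplicative updates. The function names and the parameters `k_rows`, `k_cols`, `n_iter`, and `eps` are hypothetical.

```python
import numpy as np


def factorize_documents(docs, k_rows, k_cols, n_iter=200, eps=1e-9, seed=0):
    """Approximate each A_i in docs by L @ M_i @ R.T with shared L and R.

    Assumption: all A_i are non-negative and share the same shape
    (n_terms, n_segments). Multiplicative updates keep every factor
    non-negative; they are a generic heuristic, not the paper's solver.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_segments = docs[0].shape
    L = rng.random((n_terms, k_rows))              # shared term-side base
    R = rng.random((n_segments, k_cols))           # shared segment-side base
    Ms = [rng.random((k_rows, k_cols)) for _ in docs]  # per-document weights

    for _ in range(n_iter):
        LtL = L.T @ L
        RtR = R.T @ R
        # Update each per-document weight matrix M_i with L and R fixed.
        Ms = [M * (L.T @ A @ R) / (LtL @ M @ RtR + eps)
              for A, M in zip(docs, Ms)]
        # Update the shared base L, accumulating over all documents.
        num = sum(A @ R @ M.T for A, M in zip(docs, Ms))
        den = L @ sum(M @ RtR @ M.T for M in Ms) + eps
        L = L * num / den
        # Update the shared base R with the refreshed L.
        LtL = L.T @ L
        num = sum(A.T @ L @ M for A, M in zip(docs, Ms))
        den = R @ sum(M.T @ LtL @ M for M in Ms) + eps
        R = R * num / den
    return L, R, Ms


def reconstruction_error(docs, L, R, Ms):
    """Sum of squared Frobenius errors over all documents."""
    return sum(np.linalg.norm(A - L @ M @ R.T) ** 2 for A, M in zip(docs, Ms))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy data: 5 documents, each a 6-term by 4-segment non-negative matrix.
    docs = [rng.random((6, 4)) for _ in range(5)]
    L, R, Ms = factorize_documents(docs, k_rows=2, k_cols=2)
    print("total reconstruction error:", reconstruction_error(docs, L, R, Ms))
```

In a sketch like this, each M_i could serve as a compact per-document feature for a downstream clustering step; how the paper actually maps the factors to clusters is not stated in the abstract.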

Original language	English (US)
Title of host publication	Proceedings - 11th IEEE International Conference on Data Mining, ICDM 2011
Pages	804-813
Number of pages	10
DOIs	https://doi.org/10.1109/ICDM.2011.59
State	Published - 2011
Event	11th IEEE International Conference on Data Mining, ICDM 2011 - Vancouver, BC, Canada; Duration: Dec 11 2011 → Dec 14 2011

Publication series

Name	Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print)	1550-4786

Other

Other	11th IEEE International Conference on Data Mining, ICDM 2011
Country/Territory	Canada
City	Vancouver, BC
Period	12/11/11 → 12/14/11

Keywords

Document clustering
Document representation
Matrix representation
Non-negative matrix approximation

ASJC Scopus subject areas

General Engineering

Access to Document

10.1109/ICDM.2011.59

Cite this

Wang, X, Tang, J & Liu, H 2011, Document clustering via matrix representation. in Proceedings - 11th IEEE International Conference on Data Mining, ICDM 2011., 6137285, Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 804-813, 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, 12/11/11. https://doi.org/10.1109/ICDM.2011.59

@inproceedings{c88fc1b8ed104c69bf00f461b6461534,

title = "Document clustering via matrix representation",

abstract = "The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and easy to deal with computationally, but it also oversimplifies a document into a vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. A matrix representation of documents is proposed in this paper: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics which can be mapped to clustering structures. The latent topic extraction based on the matrix representation of documents is formulated as a constrained optimization problem in which each matrix (i.e., a document) $A_i$ is factorized into a common base determined by non-negative matrices $L$ and $R^T$, and a non-negative weight matrix $M_i$, such that the sum of the reconstruction errors over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with the vector representation, the matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement of up to 66\% on the studied datasets; and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods like LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at the segment level instead of the document level, which enables the return of more relevant documents in information retrieval tasks.",

keywords = "Document clustering, Document representation, Matrix representation, Non-negative matrix approximation",

author = "Xufei Wang and Jiliang Tang and Huan Liu",

year = "2011",

doi = "10.1109/ICDM.2011.59",

language = "English (US)",

isbn = "9780769544083",

series = "Proceedings - IEEE International Conference on Data Mining, ICDM",

pages = "804--813",

booktitle = "Proceedings - 11th IEEE International Conference on Data Mining, ICDM 2011",

note = "11th IEEE International Conference on Data Mining, ICDM 2011 ; Conference date: 11-12-2011 Through 14-12-2011",

}

TY - GEN

T1 - Document clustering via matrix representation

AU - Wang, Xufei

AU - Tang, Jiliang

AU - Liu, Huan

PY - 2011

Y1 - 2011

AB - The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and easy to deal with computationally, but it also oversimplifies a document into a vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. A matrix representation of documents is proposed in this paper: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics which can be mapped to clustering structures. The latent topic extraction based on the matrix representation of documents is formulated as a constrained optimization problem in which each matrix (i.e., a document) A_i is factorized into a common base determined by non-negative matrices L and R^T, and a non-negative weight matrix M_i, such that the sum of the reconstruction errors over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with the vector representation, the matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets; and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods like LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at the segment level instead of the document level, which enables the return of more relevant documents in information retrieval tasks.

KW - Document clustering

KW - Document representation

KW - Matrix representation

KW - Non-negative matrix approximation

UR - http://www.scopus.com/inward/record.url?scp=84863121707&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863121707&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2011.59

DO - 10.1109/ICDM.2011.59

M3 - Conference contribution

AN - SCOPUS:84863121707

SN - 9780769544083

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 804

EP - 813

BT - Proceedings - 11th IEEE International Conference on Data Mining, ICDM 2011

T2 - 11th IEEE International Conference on Data Mining, ICDM 2011

Y2 - 11 December 2011 through 14 December 2011

ER -
