TY - GEN
T1 - Document clustering via matrix representation
AU - Wang, Xufei
AU - Tang, Jiliang
AU - Liu, Huan
PY - 2011
Y1 - 2011
N2 - Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and easy to deal computationally, but it also oversimplifies a document into a vector, susceptible to noise, and cannot explicitly represent underlying topics of a document. A matrix representation of document is proposed in this paper: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics which can be mapped to clustering structures. The latent topic extraction based on the matrix representation of documents is formulated as a constraint optimization problem in which each matrix (i.e., a document) A i is factorized into a common base determined by non-negative matrices L and R T, and a non-negative weight matrix Mi such that the sum of reconstruction error on all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with vector representation, using matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement up to 66% on the studied datasets; and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements the state-of-the-art methods like LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at a segment level instead of at a document level, which enables the return of more relevant documents in information retrieval tasks.
AB - Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and easy to deal computationally, but it also oversimplifies a document into a vector, susceptible to noise, and cannot explicitly represent underlying topics of a document. A matrix representation of document is proposed in this paper: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics which can be mapped to clustering structures. The latent topic extraction based on the matrix representation of documents is formulated as a constraint optimization problem in which each matrix (i.e., a document) A i is factorized into a common base determined by non-negative matrices L and R T, and a non-negative weight matrix Mi such that the sum of reconstruction error on all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with vector representation, using matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement up to 66% on the studied datasets; and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements the state-of-the-art methods like LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at a segment level instead of at a document level, which enables the return of more relevant documents in information retrieval tasks.
KW - Document clustering
KW - Document representation
KW - Matrix representation
KW - Non-negative matrix approximation
UR - http://www.scopus.com/inward/record.url?scp=84863121707&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84863121707&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2011.59
DO - 10.1109/ICDM.2011.59
M3 - Conference contribution
AN - SCOPUS:84863121707
SN - 9780769544083
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 804
EP - 813
BT - Proceedings - 11th IEEE International Conference on Data Mining, ICDM 2011
T2 - 11th IEEE International Conference on Data Mining, ICDM 2011
Y2 - 11 December 2011 through 14 December 2011
ER -