TY - GEN

T1 - Mining discrete patterns via binary matrix factorization

AU - Shen, Bao Hong

AU - Ji, Shuiwang

AU - Ye, Jieping

PY - 2009/11/9

Y1 - 2009/11/9

N2 - Mining discrete patterns in binary data is important for subsampling, compression, and clustering. We consider rank-one binary matrix approximations that identify the dominant patterns of the data, while preserving its discrete property. A best approximation on such data has a minimum set of inconsistent entries, i.e., mismatches between the given binary data and the approximate matrix. Due to the hardness of the problem, previous accounts of such problems employ heuristics and the resulting approximation may be far away from the optimal one. In this paper, we show that the rank-one binary matrix approximation can be reformulated as a 0-1 integer linear program (ILP). However, the ILP formulation is computationally expensive even for small-size matrices. We propose a linear program (LP) relaxation, which is shown to achieve a guaranteed approximation error bound. We further extend the proposed formulations using the regularization technique, which is commonly employed to address overfitting. The LP formulation is restricted to medium-size matrices, due to the large number of variables involved for large matrices. Interestingly, we show that the proposed approximate formulation can be transformed into an instance of the minimum s-t cut problem, which can be solved efficiently by finding maximum flows. Our empirical study shows the efficiency of the proposed algorithm based on the maximum flow. Results also confirm the established theoretical bounds.

AB - Mining discrete patterns in binary data is important for subsampling, compression, and clustering. We consider rank-one binary matrix approximations that identify the dominant patterns of the data, while preserving its discrete property. A best approximation on such data has a minimum set of inconsistent entries, i.e., mismatches between the given binary data and the approximate matrix. Due to the hardness of the problem, previous accounts of such problems employ heuristics and the resulting approximation may be far away from the optimal one. In this paper, we show that the rank-one binary matrix approximation can be reformulated as a 0-1 integer linear program (ILP). However, the ILP formulation is computationally expensive even for small-size matrices. We propose a linear program (LP) relaxation, which is shown to achieve a guaranteed approximation error bound. We further extend the proposed formulations using the regularization technique, which is commonly employed to address overfitting. The LP formulation is restricted to medium-size matrices, due to the large number of variables involved for large matrices. Interestingly, we show that the proposed approximate formulation can be transformed into an instance of the minimum s-t cut problem, which can be solved efficiently by finding maximum flows. Our empirical study shows the efficiency of the proposed algorithm based on the maximum flow. Results also confirm the established theoretical bounds.

KW - Binary matrix factorization

KW - Integer linear program

KW - Maximum flow

KW - Minimum cut

KW - Rank-one

KW - Regularization

UR - http://www.scopus.com/inward/record.url?scp=70350623326&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70350623326&partnerID=8YFLogxK

U2 - 10.1145/1557019.1557103

DO - 10.1145/1557019.1557103

M3 - Conference contribution

AN - SCOPUS:70350623326

SN - 9781605584959

T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

SP - 757

EP - 765

BT - KDD '09

T2 - 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09

Y2 - 28 June 2009 through 1 July 2009

ER -