A study on the document zone content classification problem

Yalin Wang; Ihsin T. Phillips; Robert M. Haralick

doi:10.1007/3-540-45869-7_25

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for the determination of zone type of a given zone within an input document image. In our zone classification algorithm, zones are represented as feature vectors. Each feature vector consists of a set of 25 measurements of pre-defined properties. A probabilistic model, a decision tree, is used to classify each zone on the basis of its feature vector. Two methods are used to optimize the decision tree classifier to eliminate the data over-fitting problem. To enrich our probabilistic model, we incorporate context constraints for certain zones within their neighboring zones. We also model zone class context constraints as a Hidden Markov Model and use the Viterbi algorithm to obtain optimal classification results. The training, pruning and testing data sets for the algorithm include 1,600 images drawn from the UWCDROM-III document image database. With a total of 24,177 zones within the data set, the cross-validation method was used in the performance evaluation of the classifier. The classifier is able to classify each given scientific and technical document zone into one of nine classes: two text classes (font sizes 4-18 pt and 19-32 pt), math, table, halftone, map/drawing, ruling, logo, and others. A zone content classification performance evaluation protocol is proposed. Using this protocol, our algorithm accuracy is 98.45% with a mean false alarm rate of 0.50%.
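The context step mentioned in the abstract (zone classes modeled as a Hidden Markov Model and decoded with the Viterbi algorithm) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the three-class subset, the probability values, the use of per-zone classifier scores as emission terms, and the function name `viterbi`. The abstract does not say how the authors built their model.

```python
import math


def viterbi(emissions, transitions, priors):
    """Return the most probable class sequence for a run of zones.

    emissions[t][k]:   score the per-zone classifier gives class k for zone t
                       (assumed here to play the role of an emission term)
    transitions[j][k]: probability that class k follows class j
    priors[k]:         probability that the first zone has class k
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(priors)
    # best[k]: log-score of the best path that ends in class k at the current zone
    best = [log(priors[k]) + log(emissions[0][k]) for k in range(n)]
    backpointers = []
    for obs in emissions[1:]:
        new_best, pointers = [], []
        for k in range(n):
            prev = max(range(n), key=lambda j: best[j] + log(transitions[j][k]))
            new_best.append(best[prev] + log(transitions[prev][k]) + log(obs[k]))
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # Trace the best final class back to the first zone.
    state = max(range(n), key=lambda k: best[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical three-class subset and made-up numbers, for illustration only.
CLASSES = ["text", "math", "table"]
PRIORS = [0.8, 0.1, 0.1]
TRANSITIONS = [
    [0.8, 0.1, 0.1],  # after text
    [0.6, 0.3, 0.1],  # after math
    [0.6, 0.1, 0.3],  # after table
]
EMISSIONS = [
    [0.90, 0.05, 0.05],  # zone 0: clearly text
    [0.40, 0.45, 0.15],  # zone 1: ambiguous, slightly favors math
    [0.90, 0.05, 0.05],  # zone 2: clearly text
]

independent = [max(range(3), key=lambda k: row[k]) for row in EMISSIONS]
decoded = viterbi(EMISSIONS, TRANSITIONS, PRIORS)
print([CLASSES[k] for k in independent])  # ['text', 'math', 'text']
print([CLASSES[k] for k in decoded])      # ['text', 'text', 'text']
```

In this toy example, classifying each zone on its own labels the ambiguous middle zone as math, while the sequence decoding relabels it as text because the assumed transition probabilities favor text following text. Whether the paper's context model behaves this way depends on details the abstract does not give.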

Original language	English (US)
Title of host publication	Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings
Editors	Daniel Lopresti, Jianying Hu, Ramanujan Kashi
Publisher	Springer Verlag
Pages	212-223
Number of pages	12
ISBN (Print)	3540440682, 9783540440680
DOIs	https://doi.org/10.1007/3-540-45869-7_25
State	Published - 2002
Externally published	Yes
Event	5th International Workshop on Document Analysis Systems, DAS 2002 - Princeton, United States Duration: Aug 19 2002 → Aug 21 2002

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	2423
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	5th International Workshop on Document Analysis Systems, DAS 2002
Country/Territory	United States
City	Princeton
Period	8/19/02 → 8/21/02

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/3-540-45869-7_25

Cite this

Wang, Y., Phillips, I. T., & Haralick, R. M. (2002). A study on the document zone content classification problem. In D. Lopresti, J. Hu, & R. Kashi (Eds.), Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings (pp. 212-223). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2423). Springer Verlag. https://doi.org/10.1007/3-540-45869-7_25

A study on the document zone content classification problem. / Wang, Yalin; Phillips, Ihsin T.; Haralick, Robert M.
Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings. ed. / Daniel Lopresti; Jianying Hu; Ramanujan Kashi. Springer Verlag, 2002. p. 212-223 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2423).

Wang, Y, Phillips, IT & Haralick, RM 2002, A study on the document zone content classification problem. in D Lopresti, J Hu & R Kashi (eds), Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 2423, Springer Verlag, pp. 212-223, 5th International Workshop on Document Analysis Systems, DAS 2002, Princeton, United States, 8/19/02. https://doi.org/10.1007/3-540-45869-7_25

Wang Y, Phillips IT, Haralick RM. A study on the document zone content classification problem. In Lopresti D, Hu J, Kashi R, editors, Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings. Springer Verlag. 2002. p. 212-223. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/3-540-45869-7_25

Wang, Yalin ; Phillips, Ihsin T. ; Haralick, Robert M. / A study on the document zone content classification problem. Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings. editor / Daniel Lopresti ; Jianying Hu ; Ramanujan Kashi. Springer Verlag, 2002. pp. 212-223 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{c4afdc9762644204a999e62bfb781ec3,
title = "A study on the document zone content classification problem",
abstract = "A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for the determination of zone type of a given zone within an input document image. In our zone classification algorithm, zones are represented as feature vectors. Each feature vector consists of a set of 25 measurements of pre-defined properties. A probabilistic model, a decision tree, is used to classify each zone on the basis of its feature vector. Two methods are used to optimize the decision tree classifier to eliminate the data over-fitting problem. To enrich our probabilistic model, we incorporate context constraints for certain zones within their neighboring zones. We also model zone class context constraints as a Hidden Markov Model and use the Viterbi algorithm to obtain optimal classification results. The training, pruning and testing data sets for the algorithm include 1,600 images drawn from the UWCDROM-III document image database. With a total of 24,177 zones within the data set, the cross-validation method was used in the performance evaluation of the classifier. The classifier is able to classify each given scientific and technical document zone into one of nine classes: two text classes (font sizes 4-18 pt and 19-32 pt), math, table, halftone, map/drawing, ruling, logo, and others. A zone content classification performance evaluation protocol is proposed. Using this protocol, our algorithm accuracy is 98.45% with a mean false alarm rate of 0.50%.",
author = "Yalin Wang and Phillips, {Ihsin T.} and Haralick, {Robert M.}",
note = "Publisher Copyright: {\textcopyright} Springer-Verlag Berlin Heidelberg 2002.; 5th International Workshop on Document Analysis Systems, DAS 2002 ; Conference date: 19-08-2002 Through 21-08-2002",
year = "2002",
doi = "10.1007/3-540-45869-7_25",
language = "English (US)",
isbn = "3540440682",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "212--223",
editor = "Daniel Lopresti and Jianying Hu and Ramanujan Kashi",
booktitle = "Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings",
}

TY - GEN
T1 - A study on the document zone content classification problem
AU - Wang, Yalin
AU - Phillips, Ihsin T.
AU - Haralick, Robert M.
N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2002.
PY - 2002
Y1 - 2002
AB - A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for the determination of zone type of a given zone within an input document image. In our zone classification algorithm, zones are represented as feature vectors. Each feature vector consists of a set of 25 measurements of pre-defined properties. A probabilistic model, a decision tree, is used to classify each zone on the basis of its feature vector. Two methods are used to optimize the decision tree classifier to eliminate the data over-fitting problem. To enrich our probabilistic model, we incorporate context constraints for certain zones within their neighboring zones. We also model zone class context constraints as a Hidden Markov Model and use the Viterbi algorithm to obtain optimal classification results. The training, pruning and testing data sets for the algorithm include 1,600 images drawn from the UWCDROM-III document image database. With a total of 24,177 zones within the data set, the cross-validation method was used in the performance evaluation of the classifier. The classifier is able to classify each given scientific and technical document zone into one of nine classes: two text classes (font sizes 4-18 pt and 19-32 pt), math, table, halftone, map/drawing, ruling, logo, and others. A zone content classification performance evaluation protocol is proposed. Using this protocol, our algorithm accuracy is 98.45% with a mean false alarm rate of 0.50%.
UR - http://www.scopus.com/inward/record.url?scp=84947757232&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84947757232&partnerID=8YFLogxK
U2 - 10.1007/3-540-45869-7_25
DO - 10.1007/3-540-45869-7_25
M3 - Conference contribution
AN - SCOPUS:84947757232
SN - 3540440682
SN - 9783540440680
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 212
EP - 223
BT - Document Analysis Systems V - 5th International Workshop, DAS 2002, Proceedings
A2 - Lopresti, Daniel
A2 - Hu, Jianying
A2 - Kashi, Ramanujan
PB - Springer Verlag
T2 - 5th International Workshop on Document Analysis Systems, DAS 2002
Y2 - 19 August 2002 through 21 August 2002
ER -
