Document zone content classification and its performance evaluation

Yalin Wang, Ihsin T. Phillips, Robert M. Haralick

Research output: Contribution to journal (article)

51 Citations (Scopus)

Abstract

This paper describes an algorithm for determining the content type of a given zone within a document image. We take a statistics-based approach and represent each zone with a 25-dimensional feature vector. An optimized decision tree classifier assigns each zone to one of nine zone content classes. A performance evaluation protocol is proposed. The training and testing data sets together comprise 24,177 zones from the University of Washington English Document Image Database III. The algorithm's accuracy is 98.45%, with a mean false alarm rate of 0.50%.
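The abstract reports an accuracy and a mean false alarm rate but does not spell out how either is computed. The sketch below shows one common way to derive both figures from a confusion matrix. It is an assumption, not the paper's protocol: accuracy is taken as the fraction of correctly labelled zones, and the false alarm rate of a class as the fraction of zones from other classes that were assigned to it, averaged over classes. The function names and the three-class toy matrix are hypothetical and unrelated to the paper's nine classes or its data.

```python
from typing import List


def accuracy(confusion: List[List[int]]) -> float:
    """Fraction of zones whose assigned class equals the true class.

    confusion[t][p] counts zones of true class t assigned to class p.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def mean_false_alarm_rate(confusion: List[List[int]]) -> float:
    """Per-class false alarm rate averaged over classes.

    Assumed definition (not taken from the paper): for class c, the number
    of zones wrongly assigned to c divided by the number of zones whose
    true class is not c.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    rates = []
    for c in range(n):
        assigned_to_c = sum(confusion[t][c] for t in range(n))
        false_alarms = assigned_to_c - confusion[c][c]
        negatives = total - sum(confusion[c])
        rates.append(false_alarms / negatives if negatives else 0.0)
    return sum(rates) / n


# Hypothetical toy data: 3 classes, 200 zones in total.
toy = [
    [90, 5, 5],
    [2, 46, 2],
    [1, 1, 48],
]
toy_accuracy = accuracy(toy)
toy_mean_far = mean_false_alarm_rate(toy)
```

Other definitions of false alarm rate exist (for example, normalising by the number of zones assigned to the class), so the published figures should be read against the protocol in the paper itself.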

Original language: English (US)
Pages (from-to): 57-73
Number of pages: 17
Journal: Pattern Recognition
Volume: 39
Issue number: 1
DOIs: 10.1016/j.patcog.2005.06.009
State: Published - Jan 2006
Externally published: Yes

Fingerprint

Decision trees
Classifiers
Testing

Keywords

  • Background analysis
  • Decision tree classifier
  • Document image analysis
  • Document layout analysis
  • Pattern recognition
  • Viterbi algorithm
  • Zone content classification

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Document zone content classification and its performance evaluation. / Wang, Yalin; Phillips, Ihsin T.; Haralick, Robert M. In: Pattern Recognition, Vol. 39, No. 1, 01.2006, p. 57-73.

Wang, Yalin ; Phillips, Ihsin T. ; Haralick, Robert M. / Document zone content classification and its performance evaluation. In: Pattern Recognition. 2006 ; Vol. 39, No. 1. pp. 57-73.
@article{ab66b2b60d124a7581017cd3942701ed,
title = "Document zone content classification and its performance evaluation",
abstract = "This paper describes an algorithm for the determination of zone content type of a given zone within a document image. We take a statistical based approach and represent each zone with 25 dimensional feature vectors. An optimized decision tree classifier is used to classify each zone into one of nine zone content classes. A performance evaluation protocol is proposed. The training and testing data sets include a total of 24,177 zones from the University of Washington English Document Image database III. The algorithm accuracy is 98.45{\%} with a mean false alarm rate of 0.50{\%}.",
keywords = "Background analysis, Decision tree classifier, Document image analysis, Document layout analysis, Pattern recognition, Viterbi algorithm, Zone content classification",
author = "Wang, Yalin and Phillips, {Ihsin T.} and Haralick, {Robert M.}",
year = "2006",
month = "1",
doi = "10.1016/j.patcog.2005.06.009",
language = "English (US)",
volume = "39",
pages = "57--73",
journal = "Pattern Recognition",
issn = "0031-3203",
publisher = "Elsevier Limited",
number = "1",

}

TY - JOUR
T1 - Document zone content classification and its performance evaluation
AU - Wang, Yalin
AU - Phillips, Ihsin T.
AU - Haralick, Robert M.
PY - 2006/1
Y1 - 2006/1
AB - This paper describes an algorithm for the determination of zone content type of a given zone within a document image. We take a statistical based approach and represent each zone with 25 dimensional feature vectors. An optimized decision tree classifier is used to classify each zone into one of nine zone content classes. A performance evaluation protocol is proposed. The training and testing data sets include a total of 24,177 zones from the University of Washington English Document Image database III. The algorithm accuracy is 98.45% with a mean false alarm rate of 0.50%.
KW - Background analysis
KW - Decision tree classifier
KW - Document image analysis
KW - Document layout analysis
KW - Pattern recognition
KW - Viterbi algorithm
KW - Zone content classification
UR - http://www.scopus.com/inward/record.url?scp=27344433193&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=27344433193&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2005.06.009
DO - 10.1016/j.patcog.2005.06.009
M3 - Article
VL - 39
SP - 57
EP - 73
JO - Pattern Recognition
JF - Pattern Recognition
SN - 0031-3203
IS - 1
ER -