Statistical-based approach to word segmentation

Yalin Wang, Ihsin T. Phillips, Robert Haralick

Research output: Chapter in Book/Report/Conference proceeding › Chapter

10 Scopus citations


This paper presents a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words, using only the geometric information of the input glyphs. The algorithm is probability-based: an iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text word extraction algorithm, we used a 3-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages. An area-overlap measure was used to find the correspondence between the detected entities and the ground truth. Of a total of 827,433 ground-truth words, the algorithm identified and segmented 806,149 correctly, an accuracy of 97.43%.
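The evaluation step can be illustrated with a minimal sketch. The abstract does not give the exact form of the area-overlap measure, so the matching rule below is an assumption: a detected word counts as correct when its overlap with a ground-truth word covers at least a chosen fraction of both boxes. The names `Box`, `count_correct`, `accuracy` and the `threshold` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(b: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    return max(0.0, b.x1 - b.x0) * max(0.0, b.y1 - b.y0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def count_correct(detected: List[Box], truth: List[Box],
                  threshold: float = 0.9) -> int:
    """Count ground-truth words matched one-to-one by a detected word.

    Assumed rule (not from the paper): a pair matches when the overlap
    covers at least `threshold` of BOTH boxes.
    """
    used = set()  # indices of detected boxes that are already matched
    correct = 0
    for g in truth:
        for i, d in enumerate(detected):
            if i in used:
                continue
            inter = overlap_area(g, d)
            if (inter > 0
                    and inter >= threshold * area(g)
                    and inter >= threshold * area(d)):
                used.add(i)
                correct += 1
                break
    return correct


def accuracy(correct: int, total: int) -> float:
    """Fraction of ground-truth words that were correctly segmented."""
    return correct / total
```

With the counts reported in the abstract, `accuracy(806149, 827433)` evaluates to roughly 0.9743, which matches the stated 97.43%.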

Original language: English (US)
Title of host publication: Proceedings - International Conference on Pattern Recognition
Number of pages: 4
State: Published - 2000
Externally published: Yes


ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Computer Vision and Pattern Recognition
  • Hardware and Architecture

Cite this

Wang, Y., Phillips, I. T., & Haralick, R. (2000). Statistical-based approach to word segmentation. In Proceedings - International Conference on Pattern Recognition (4 ed., Vol. 15, pp. 555-558)