Statistical-based approach to word segmentation

Yalin Wang, Ihsin T. Phillips, Robert Haralick

Research output: Chapter in Book/Report/Conference proceeding › Chapter

9 Citations (Scopus)

Abstract

This paper presents a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words, using only the geometric information of the input glyphs. The algorithm is probability based. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text word extraction algorithm, we used a 3-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages. An area-overlap measure was used to find the correspondence between the detected entities and the ground-truth. For a total of 827,433 ground truth words, the algorithm identified and segmented 806,149 words correctly, an accuracy of 97.43%.
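The evaluation step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the area-overlap measure, so the matching rule below (a detected box and a ground-truth box correspond when their intersection covers at least a fixed fraction of both boxes) is an assumption, as are the function names, the box layout, and the 0.9 threshold. Only the final accuracy arithmetic uses figures from the abstract.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def count_correct(truth: List[Box], detected: List[Box],
                  threshold: float = 0.9) -> int:
    """Count ground-truth words matched one-to-one by a detected word.

    Assumed rule (not taken from the paper): a pair matches when the
    intersection covers at least `threshold` of BOTH boxes.
    """
    used = set()  # indices of detected boxes already matched
    correct = 0
    for t in truth:
        for i, d in enumerate(detected):
            if i in used:
                continue
            inter = overlap_area(t, d)
            if (inter > 0
                    and inter >= threshold * area(t)
                    and inter >= threshold * area(d)):
                used.add(i)
                correct += 1
                break
    return correct


def accuracy(correct: int, total: int) -> float:
    """Percentage of ground-truth words segmented correctly."""
    return 100.0 * correct / total


# Toy example: the second detection merges two words, so it is too large
# to match its ground-truth box under the assumed rule.
truth_boxes = [(0, 0, 10, 10), (20, 0, 30, 10)]
detected_boxes = [(0, 0, 10, 10), (20, 0, 40, 10)]
matched = count_correct(truth_boxes, detected_boxes)

# Figures reported in the abstract.
reported = round(accuracy(806_149, 827_433), 2)
```

With the counts reported in the abstract, `accuracy(806_149, 827_433)` rounds to 97.43, which agrees with the stated figure.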

Original language: English (US)
Title of host publication: Proceedings - International Conference on Pattern Recognition
Pages: 555-558
Number of pages: 4
Volume: 15
Edition: 4
State: Published - 2000
Externally published: Yes

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Computer Vision and Pattern Recognition
  • Hardware and Architecture

Cite this

Wang, Y., Phillips, I. T., & Haralick, R. (2000). Statistical-based approach to word segmentation. In Proceedings - International Conference on Pattern Recognition (4 ed., Vol. 15, pp. 555-558).

Statistical-based approach to word segmentation. / Wang, Yalin; Phillips, Ihsin T.; Haralick, Robert.

Proceedings - International Conference on Pattern Recognition. Vol. 15 4. ed. 2000. p. 555-558.

Wang, Y, Phillips, IT & Haralick, R 2000, Statistical-based approach to word segmentation. in Proceedings - International Conference on Pattern Recognition. 4 edn, vol. 15, pp. 555-558.
Wang Y, Phillips IT, Haralick R. Statistical-based approach to word segmentation. In Proceedings - International Conference on Pattern Recognition. 4 ed. Vol. 15. 2000. p. 555-558
Wang, Yalin ; Phillips, Ihsin T. ; Haralick, Robert. / Statistical-based approach to word segmentation. Proceedings - International Conference on Pattern Recognition. Vol. 15 4. ed. 2000. pp. 555-558
@inbook{0010700118d94abdbdf8fa1b1760b027,
title = "Statistical-based approach to word segmentation",
abstract = "This paper presents a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words, using only the geometric information of the input glyphs. The algorithm is probability based. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text word extraction algorithm, we used a 3-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages. An area-overlap measure was used to find the correspondence between the detected entities and the ground-truth. For a total of 827,433 ground truth words, the algorithm identified and segmented 806,149 words correctly, an accuracy of 97.43{\%}.",
author = "Yalin Wang and Phillips, {Ihsin T.} and Robert Haralick",
year = "2000",
language = "English (US)",
volume = "15",
pages = "555--558",
booktitle = "Proceedings - International Conference on Pattern Recognition",
edition = "4",

}

TY - CHAP

T1 - Statistical-based approach to word segmentation

AU - Wang, Yalin

AU - Phillips, Ihsin T.

AU - Haralick, Robert

PY - 2000

Y1 - 2000

N2 - This paper presents a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words, using only the geometric information of the input glyphs. The algorithm is probability based. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text word extraction algorithm, we used a 3-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages. An area-overlap measure was used to find the correspondence between the detected entities and the ground-truth. For a total of 827,433 ground truth words, the algorithm identified and segmented 806,149 words correctly, an accuracy of 97.43%.

UR - http://www.scopus.com/inward/record.url?scp=2442601780&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=2442601780&partnerID=8YFLogxK

M3 - Chapter

VL - 15

SP - 555

EP - 558

BT - Proceedings - International Conference on Pattern Recognition

ER -