Computational aspects of resilient data extraction from semistructured sources

Hasan Davulcu, Guizhen Yang, Michael Kifer, I. V. Ramakrishnan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

13 Scopus citations

Abstract

Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of the so-called 'shopbots' that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each of these sites. The main problem one has to contend with when designing data extraction techniques is that the contents of a web page change frequently, either because its data is generated dynamically in response to filling out a form, or because of changes to its presentation format. This makes the problem of data extraction particularly challenging, since a desirable requirement of any data extraction technique is that it be 'resilient', i.e., using it we should always be able to locate the object of interest in a page (such as a form or an element in a table generated by a form fill-out) in spite of changes to the page's content and layout. In this paper we propose a formal computation model for developing resilient data extraction techniques from semistructured sources. Specifically, we formalize the problem of data extraction as one of generating unambiguous extraction expressions, which are regular expressions with some additional structure. The problem of resilience is then formalized as one of generating a maximal extraction expression of this kind. We present characterization theorems for maximal extraction expressions, complexity results for testing them, and algorithms for synthesizing them.
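The idea of an unambiguous extraction expression can be illustrated with a small sketch. The abstract does not spell out the exact form of these expressions, so the code below assumes one plausible reading: a prefix regular expression, a single marked token, and a suffix regular expression, where a page is parsed if it splits as prefix + mark + suffix. Pages are modeled as strings over a toy alphabet ('h' for a header, 'f' for a form, 't' for a table), and the function name `extraction_points` is hypothetical. Ambiguity is only checked on sample pages here, not over all strings as a formal definition would require.

```python
import re

def extraction_points(prefix_re, mark, suffix_re, page):
    """Return every index i where page[i] == mark, page[:i] fully matches
    prefix_re, and page[i+1:] fully matches suffix_re.
    (Assumed form of an extraction expression; not taken from the paper.)"""
    return [
        i for i, tok in enumerate(page)
        if tok == mark
        and re.fullmatch(prefix_re, page[:i])
        and re.fullmatch(suffix_re, page[i + 1:])
    ]

# A brittle expression: tied to one exact layout, fails once the page changes.
print(extraction_points("h", "f", "t", "hft"))        # [1]
print(extraction_points("h", "f", "t", "hhtft"))      # []

# A more general expression: "the first form on the page".
# It still yields at most one extraction point per page.
print(extraction_points("[^f]*", "f", ".*", "htfhf"))  # [2]

# Too general: any form matches, so the expression is ambiguous on this page.
print(extraction_points(".*", "f", ".*", "htfhf"))     # [2, 4]
```

Under this reading, resilience corresponds to choosing an expression that accepts as many page variants as possible while never returning more than one extraction point for any page.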

Original language: English (US)
Title of host publication: Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
Place of Publication: New York, NY, United States
Publisher: ACM
Pages: 136-144
Number of pages: 9
State: Published - 2000
Externally published: Yes
Event: PODS 2000 - 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems - Dallas, TX, USA
Duration: May 15 2000 to May 17 2000

Other

Other: PODS 2000 - 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
City: Dallas, TX, USA
Period: 5/15/00 to 5/17/00

ASJC Scopus subject areas

  • Software

  • Cite this

    Davulcu, H., Yang, G., Kifer, M., & Ramakrishnan, I. V. (2000). Computational aspects of resilient data extraction from semistructured sources. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (pp. 136-144). ACM.