Narrative-based taxonomy distillation for effective indexing of text collections

Mario Cataldi; Kasim Candan; Maria Luisa Sapino

doi:10.1016/j.datak.2011.09.008

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Taxonomies embody formalized knowledge and define aggregations between concepts/categories in a given domain, facilitating the organization of the data and making the contents easily accessible to the users. Since taxonomies have significant roles in data annotation, search and navigation, they are often carefully engineered. However, especially in domains, such as news, where content dynamically evolves, they do not necessarily reflect the content knowledge. Thus, in this paper, we ask and answer, in the positive, the following question: "is it possible to efficiently and effectively adapt a given taxonomy to a usage context defined by a corpus of documents?" In particular, we recognize that the primary role of a taxonomy is to describe or narrate the natural relationships between concepts in a given document corpus. Therefore, a corpus-aware adaptation of a taxonomy should essentially distill the structure of the existing taxonomy by appropriately segmenting and, if needed, summarizing this narrative relative to the content of the corpus. Based on this key observation, we propose A Narrative Interpretation of Taxonomies for their Adaptation (ANITA) for re-structuring existing taxonomies to varying application contexts and we evaluate the proposed scheme using different text collections. Finally, we provide user studies that show that the proposed algorithm is able to adapt the taxonomy into a new compact and understandable structure.

Original language	English (US)
Pages (from-to)	103-125
Number of pages	23
Journal	Data and Knowledge Engineering
Volume	72
DOIs	https://doi.org/10.1016/j.datak.2011.09.008
State	Published - Feb 2012

Keywords

Information Retrieval and Filtering
Metadata
Taxonomy Classification
Taxonomy Summarization

ASJC Scopus subject areas

Information Systems and Management

Cite this

@article{f2de7e9b41aa444385841ee50ae73b98,

title = "Narrative-based taxonomy distillation for effective indexing of text collections",

abstract = "Taxonomies embody formalized knowledge and define aggregations between concepts/categories in a given domain, facilitating the organization of the data and making the contents easily accessible to the users. Since taxonomies have significant roles in data annotation, search and navigation, they are often carefully engineered. However, especially in domains, such as news, where content dynamically evolves, they do not necessarily reflect the content knowledge. Thus, in this paper, we ask and answer, in the positive, the following question: {"}is it possible to efficiently and effectively adapt a given taxonomy to a usage context defined by a corpus of documents?{"} In particular, we recognize that the primary role of a taxonomy is to describe or narrate the natural relationships between concepts in a given document corpus. Therefore, a corpus-aware adaptation of a taxonomy should essentially distill the structure of the existing taxonomy by appropriately segmenting and, if needed, summarizing this narrative relative to the content of the corpus. Based on this key observation, we propose A Narrative Interpretation of Taxonomies for their Adaptation (ANITA) for re-structuring existing taxonomies to varying application contexts and we evaluate the proposed scheme using different text collections. Finally we provide user studies that show that the proposed algorithm is able to adapt the taxonomy in a new compact and understandable structure.",

keywords = "Information Retrieval and Filtering, Metadata, Taxonomy Classification, Taxonomy Summarization",

author = "Cataldi, Mario and Candan, Kasim and Sapino, Maria Luisa",

note = "Funding Information: K. Selcuk Candan is a Professor of Computer Science and Engineering at the School of Computing, Informatics, and Decision Science Engineering at the Arizona State University and is leading the EmitLab research group. He joined the department in August 1997, after receiving his Ph.D. from the Computer Science Department at the University of Maryland at College Park. Prof. Candan{\textquoteright}s primary research interest is in the area of management of non-traditional, heterogeneous, and imprecise (such as multimedia, web, and scientific) data. His various research projects in this domain are funded by diverse sources, including the National Science Foundation, Department of Defense, Mellon Foundation, and DES/RSA (Rehabilitation Services Administration). He has published over 140 articles and many book chapters. He has also authored 9 patents. Recently, he coauthored a book titled ``Data Management for Multimedia Retrieval'' for the Cambridge University Press and co-edited ``New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud'' for Springer. Funding Information: This work is partially supported by an NSF Grant \#1043583 --- MiNC: NSDL Middleware for Network- and Context-aware Recommendations.",

year = "2012",

month = feb,

doi = "10.1016/j.datak.2011.09.008",

language = "English (US)",

volume = "72",

pages = "103--125",

journal = "Data and Knowledge Engineering",

issn = "0169-023X",

publisher = "Elsevier",

}

TY - JOUR

T1 - Narrative-based taxonomy distillation for effective indexing of text collections

AU - Cataldi, Mario

AU - Candan, Kasim

AU - Sapino, Maria Luisa

N1 - Funding Information: K. Selcuk Candan is a Professor of Computer Science and Engineering at the School of Computing, Informatics, and Decision Science Engineering at the Arizona State University and is leading the EmitLab research group. He joined the department in August 1997, after receiving his Ph.D. from the Computer Science Department at the University of Maryland at College Park. Prof. Candan’s primary research interest is in the area of management of non-traditional, heterogeneous, and imprecise (such as multimedia, web, and scientific) data. His various research projects in this domain are funded by diverse sources, including the National Science Foundation, Department of Defense, Mellon Foundation, and DES/RSA (Rehabilitation Services Administration). He has published over 140 articles and many book chapters. He has also authored 9 patents. Recently, he coauthored a book titled “Data Management for Multimedia Retrieval” for the Cambridge University Press and co-edited “New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud” for Springer. Funding Information: This work is partially supported by an NSF Grant #1043583 — MiNC: NSDL Middleware for Network- and Context-aware Recommendations.

PY - 2012/2

Y1 - 2012/2

AB - Taxonomies embody formalized knowledge and define aggregations between concepts/categories in a given domain, facilitating the organization of the data and making the contents easily accessible to the users. Since taxonomies have significant roles in data annotation, search and navigation, they are often carefully engineered. However, especially in domains, such as news, where content dynamically evolves, they do not necessarily reflect the content knowledge. Thus, in this paper, we ask and answer, in the positive, the following question: "is it possible to efficiently and effectively adapt a given taxonomy to a usage context defined by a corpus of documents?" In particular, we recognize that the primary role of a taxonomy is to describe or narrate the natural relationships between concepts in a given document corpus. Therefore, a corpus-aware adaptation of a taxonomy should essentially distill the structure of the existing taxonomy by appropriately segmenting and, if needed, summarizing this narrative relative to the content of the corpus. Based on this key observation, we propose A Narrative Interpretation of Taxonomies for their Adaptation (ANITA) for re-structuring existing taxonomies to varying application contexts and we evaluate the proposed scheme using different text collections. Finally we provide user studies that show that the proposed algorithm is able to adapt the taxonomy in a new compact and understandable structure.

KW - Information Retrieval and Filtering

KW - Metadata

KW - Taxonomy Classification

KW - Taxonomy Summarization

UR - http://www.scopus.com/inward/record.url?scp=84855243125&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84855243125&partnerID=8YFLogxK

U2 - 10.1016/j.datak.2011.09.008

DO - 10.1016/j.datak.2011.09.008

M3 - Article

AN - SCOPUS:84855243125

SN - 0169-023X

VL - 72

SP - 103

EP - 125

JO - Data and Knowledge Engineering

JF - Data and Knowledge Engineering

ER -
