EAGER: Enabling Collaboration in the Creation of Scientific Databases from the Published Literature

Project: Research project


The aim of the proposed research is to develop and deploy cyber-infrastructure that will enable and promote large-scale collaboration in building, maintaining, and using (towards discovery) scientific knowledge bases that are grounded in the published literature. Although our focus will be on the biology domain and on articles and abstracts available through PubMed, we will develop infrastructure that can be adapted to other scientific domains such as archaeology, materials science, and ecology. In the biology and life sciences domain there are 18 million abstracts in PubMed, and in 2008 alone 775,000 abstracts were added. A search for a gene name often returns several thousand articles, making it impossible for a researcher to read, or even browse, all the publications relevant to their research. Thus there is a crucial need for databases and knowledge bases that store the essential facts mentioned in published articles, such as interactions between particular protein pairs. Our main concern in this research is developing infrastructure that will help in the creation, maintenance, and use of such databases. In terms of creation, the earlier approach of hiring human curators, often post-doctoral researchers, to read articles and create the database is too expensive and does not scale. One such database, BIND, received CAD $23 million in funding in its final year but over its lifetime could cover only a small fraction of the published articles. Thus an alternative approach is needed. However, the alternative of developing computer programs for automatic extraction of such facts is not yet viable by itself because of low accuracy.
In terms of maintenance and usage, many existing databases, even when they cite the article from which a particular fact was obtained, do not identify the exact phrase or paragraph that supports it. This makes it hard for an interested scientist to verify the authenticity of a fact or to obtain additional information about it.
Effective start/end date: 9/1/09 to 8/31/12


  • National Science Foundation (NSF): $179,927.00


life sciences
data processing program