Working Backwards from the Proteome

Project: Research project

Project Details


The ENCODE (ENCyclopedia Of DNA Elements) consortium recently reported pervasive transcription of the human genome (1, 2). ENCODE analyzed 1% of the human genome (30 Mb) and demonstrated that introns and intergenic regions previously thought to be transcriptionally silent and non-protein-coding are transcribed into RNA. Since then, it has been estimated that 93% of the genome is transcribed into RNA (3, 4). What is the fate of all that RNA? The current view is that this RNA is regulatory or structural; the ENCODE report repeatedly refers to it as "non-protein-coding." We agree that some of the RNA ENCODE found is likely regulatory and/or structural, but we hypothesize that a significant fraction of it is translated into protein. Furthermore, since ENCODE analyzed only 1% of the genome, we hypothesize that the cell is making a very large amount of uncharacterized protein, some of which is functional and serves a purpose in the cell.

To test this hypothesis, we propose to work backwards from the proteome, using peptides and protein fragments derived from cells to find new genes. Because there is an enormous amount of sequence space in the genome between known genes (intergenic space), and often several kilobases between exons within genes (intronic space), we suggest that much of this "non-protein-coding RNA" does in fact code for proteins. While proteins encoded within introns are likely splice variants of known genes, we have preliminary evidence that a large number of undiscovered genes exist in intergenic space. If this is true, it would change the current paradigm about the content of intergenic space and lead to the discovery of hundreds of new genes.

Traditionally, genes are discovered using genomics techniques such as shotgun cloning and sequencing of contigs, followed by sequence annotation using specialized software (gene-finding programs) that looks for open reading frames (5).
In fact, closer examination of the sequences of different genes from the human genome has led to a reduction in the number of predicted genes due to gene duplication in different areas of the genome (6). Our hypothesis challenges the current paradigm by suggesting that many more genes exist than gene annotation algorithms have identified. Because of recent reports by the ENCODE consortium and others, the definition of a gene is changing (7). The ENCODE analysis revealed remarkable complexity in the transcriptome, finding many overlapping gene loci, genes extending into intergenic space, and transcripts and regulatory elements originating from within an intron of another gene.

The importance of our work is founded in early definitions of a gene: if a "gene" encodes a protein, we can identify the peptides contained within the protein by mass spectrometry and then work backwards to map them to the transcriptome and genome. We challenged the existing gene number by translating into amino acids the RNA that ENCODE identified from 1% of the genome. Peptide and protein sequences from this translation formed a database for searching mass spectra derived from proteins from 5 cell lines of different histological origin. Because we know the genomic coordinates of the ENCODE RNA transcripts, we also know where in the genome any identified protein or peptide is encoded. Based on preliminary experiments, we now have evidence that the transcripts encoding many of the cell-derived protein fragments are polyadenylated, suggesting they might be undiscovered genes.
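
The translate-then-search idea described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the three forward reading frames, the in-silico tryptic digestion (cleave after K or R, but not before P), the minimum peptide length, and all function names here are assumptions made for illustration. The sketch translates each transcript, cuts the translation into candidate peptides, and records where in the transcript each peptide is encoded, so that a peptide matched by mass spectrometry can be traced back to a transcript offset.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (T/U treated alike).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame):
    """Translate one forward frame (0, 1, or 2); '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def tryptic_peptides(protein, min_len=6):
    """Yield (aa_start, peptide) pairs from an assumed tryptic digest.

    Cleaves after K or R unless the next residue is P, and never lets a
    peptide span a stop codon.
    """
    start = 0
    for i, aa in enumerate(protein):
        if aa == "*":
            end, next_start = i, i + 1
        elif i == len(protein) - 1 or (aa in "KR" and protein[i + 1] != "P"):
            end, next_start = i + 1, i + 1
        else:
            continue
        if end - start >= min_len:
            yield start, protein[start:end]
        start = next_start


def build_database(transcripts, min_len=6):
    """Map each peptide to a list of (transcript_id, frame, nucleotide_offset)."""
    database = {}
    for transcript_id, seq in transcripts.items():
        for frame in range(3):
            protein = translate(seq, frame)
            for aa_start, peptide in tryptic_peptides(protein, min_len):
                hit = (transcript_id, frame, frame + 3 * aa_start)
                database.setdefault(peptide, []).append(hit)
    return database
```

Given a dictionary of transcript sequences keyed by a hypothetical identifier, `build_database` returns a peptide-to-location index. Adding the transcript's known genomic start coordinate to the nucleotide offset would place the peptide in the genome, ignoring splicing and strand, which a real pipeline would need to handle.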
Effective start/end date: 9/4/09 to 8/31/14


  • HHS-NIH: National Institute of General Medical Sciences (NIGMS): $1,199,008.00
