Predicting and annotating catalytic residues

An information theoretic approach

Beckett Sterner, Rohit Singh, Bonnie Berger

Research output: Contribution to journal › Article

24 Citations (Scopus)

Abstract

We introduce a computational method to predict and annotate the catalytic residues of a protein using only its sequence information, so that we describe both the residues' sequence locations (prediction) and their specific biochemical roles in the catalyzed reaction (annotation). While knowing the chemistry of an enzyme's catalytic residues is essential to understanding its function, the challenges of prediction and annotation have remained difficult, especially when only the enzyme's sequence and no homologous structures are available. Our sequence-based approach follows the guiding principle that catalytic residues performing the same biochemical function should have similar chemical environments; it detects specific conservation patterns near in sequence to known catalytic residues and accordingly constrains what combination of amino acids can be present near a predicted catalytic residue. We associate with each catalytic residue a short sequence profile and define a Kullback-Leibler (KL) distance measure between these profiles, which, as we show, effectively captures even subtle biochemical variations. We apply the method to the class of glycohydrolase enzymes. This class includes proteins from 96 families with very different sequences and folds, many of which perform important functions. In a cross-validation test, our approach correctly predicts the location of the enzymes' catalytic residues with a sensitivity of 80% at a specificity of 99.4%, and in a separate cross-validation we also correctly annotate the biochemical role of 80% of the catalytic residues. Our results compare favorably to existing methods. Moreover, our method is more broadly applicable because it relies on sequence and not structure information; it may, furthermore, be used in conjunction with structure-based methods.
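The abstract does not spell out how the profiles or the KL distance are constructed, so the following is only a minimal sketch of the general idea under stated assumptions: each profile is a list of per-position amino acid frequency distributions built from equal-length sequence windows, a pseudocount keeps every probability nonzero, and the distance is a symmetrized KL divergence summed over positions. The function names, the pseudocount, the symmetrization, and the example windows are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(windows, pseudocount=1.0):
    """Per-position amino acid frequencies for equal-length sequence windows.

    Assumption: a positive pseudocount is added to every amino acid so that
    no probability is zero (otherwise the KL divergence would be undefined).
    """
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        # Only the 20 standard amino acids contribute to the total.
        total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile


def kl_divergence(p, q):
    """KL divergence D(p || q) between two amino acid distributions."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in AMINO_ACIDS)


def profile_distance(prof_a, prof_b):
    """Symmetrized KL divergence summed over profile positions (an assumption)."""
    if len(prof_a) != len(prof_b):
        raise ValueError("profiles must have the same length")
    return sum(
        kl_divergence(p, q) + kl_divergence(q, p)
        for p, q in zip(prof_a, prof_b)
    )


if __name__ == "__main__":
    # Hypothetical windows centered on two different catalytic residues.
    site_a = build_profile(["GDEAW", "GDEGW", "ADESW"])
    site_b = build_profile(["KLHPT", "KLHPS", "RLHPT"])
    print(profile_distance(site_a, site_b))
```

Under these assumptions, a profile compared with itself has distance zero, and profiles drawn from dissimilar sequence environments have a larger distance, which is the property the abstract relies on when it groups catalytic residues by biochemical role.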

Original language: English (US)
Pages (from-to): 1058-1073
Number of pages: 16
Journal: Journal of Computational Biology
Volume: 14
Issue number: 8
DOIs: 10.1089/cmb.2007.0042
State: Published - Oct 1 2007
Externally published: Yes

Fingerprint

Enzymes
Proteins
Computational methods
Glycoside Hydrolases
Amino acids
Cross-validation
Conservation
Sequence Homology
Annotation
Kullback-Leibler Distance
Protein
Predict
Information Structure
Prediction
Distance Measure
Chemistry
Specificity
Fold

Keywords

  • Algorithms
  • Computational molecular biology
  • Information theory
  • Multiple sequence alignment
  • Protein folding

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

Cite this

Predicting and annotating catalytic residues: An information theoretic approach. / Sterner, Beckett; Singh, Rohit; Berger, Bonnie.

In: Journal of Computational Biology, Vol. 14, No. 8, 01.10.2007, p. 1058-1073.

@article{aa477b7fc5f944ddbcfd795e98aba1eb,
title = "Predicting and annotating catalytic residues: An information theoretic approach",
abstract = "We introduce a computational method to predict and annotate the catalytic residues of a protein using only its sequence information, so that we describe both the residues' sequence locations (prediction) and their specific biochemical roles in the catalyzed reaction (annotation). While knowing the chemistry of an enzyme's catalytic residues is essential to understanding its function, the challenges of prediction and annotation have remained difficult, especially when only the enzyme's sequence and no homologous structures are available. Our sequence-based approach follows the guiding principle that catalytic residues performing the same biochemical function should have similar chemical environments; it detects specific conservation patterns near in sequence to known catalytic residues and accordingly constrains what combination of amino acids can be present near a predicted catalytic residue. We associate with each catalytic residue a short sequence profile and define a Kullback-Leibler (KL) distance measure between these profiles, which, as we show, effectively captures even subtle biochemical variations. We apply the method to the class of glycohydrolase enzymes. This class includes proteins from 96 families with very different sequences and folds, many of which perform important functions. In a cross-validation test, our approach correctly predicts the location of the enzymes' catalytic residues with a sensitivity of 80{\%} at a specificity of 99.4{\%}, and in a separate cross-validation we also correctly annotate the biochemical role of 80{\%} of the catalytic residues. Our results compare favorably to existing methods. Moreover, our method is more broadly applicable because it relies on sequence and not structure information; it may, furthermore, be used in conjunction with structure-based methods.",
keywords = "Algorithms, Computational molecular biology, Information theory, Multiple sequence alignment, Protein folding",
author = "Beckett Sterner and Rohit Singh and Bonnie Berger",
year = "2007",
month = "10",
day = "1",
doi = "10.1089/cmb.2007.0042",
language = "English (US)",
volume = "14",
pages = "1058--1073",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "8",
}

TY - JOUR
T1 - Predicting and annotating catalytic residues
T2 - An information theoretic approach
AU - Sterner, Beckett
AU - Singh, Rohit
AU - Berger, Bonnie
PY - 2007/10/1
Y1 - 2007/10/1
AB - We introduce a computational method to predict and annotate the catalytic residues of a protein using only its sequence information, so that we describe both the residues' sequence locations (prediction) and their specific biochemical roles in the catalyzed reaction (annotation). While knowing the chemistry of an enzyme's catalytic residues is essential to understanding its function, the challenges of prediction and annotation have remained difficult, especially when only the enzyme's sequence and no homologous structures are available. Our sequence-based approach follows the guiding principle that catalytic residues performing the same biochemical function should have similar chemical environments; it detects specific conservation patterns near in sequence to known catalytic residues and accordingly constrains what combination of amino acids can be present near a predicted catalytic residue. We associate with each catalytic residue a short sequence profile and define a Kullback-Leibler (KL) distance measure between these profiles, which, as we show, effectively captures even subtle biochemical variations. We apply the method to the class of glycohydrolase enzymes. This class includes proteins from 96 families with very different sequences and folds, many of which perform important functions. In a cross-validation test, our approach correctly predicts the location of the enzymes' catalytic residues with a sensitivity of 80% at a specificity of 99.4%, and in a separate cross-validation we also correctly annotate the biochemical role of 80% of the catalytic residues. Our results compare favorably to existing methods. Moreover, our method is more broadly applicable because it relies on sequence and not structure information; it may, furthermore, be used in conjunction with structure-based methods.
KW - Algorithms
KW - Computational molecular biology
KW - Information theory
KW - Multiple sequence alignment
KW - Protein folding
UR - http://www.scopus.com/inward/record.url?scp=35948961101&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35948961101&partnerID=8YFLogxK
U2 - 10.1089/cmb.2007.0042
DO - 10.1089/cmb.2007.0042
M3 - Article
VL - 14
SP - 1058
EP - 1073
JO - Journal of Computational Biology
JF - Journal of Computational Biology
SN - 1066-5277
IS - 8
ER -