TY - JOUR
T1 - BLMT
T2 - Statistical sequence analysis using N-grams
AU - Ganapathiraju, Madhavi
AU - Manoharan, Vijayalaxmi
AU - Klein-Seetharaman, Judith
N1 - Funding Information: This research was supported by National Science Foundation grants NSF0225656 and NSF0225636, and the Sofya Kovalevskaya Program of the Alexander von Humboldt-Foundation/Zukunftsinvestitionsprogramm der Bundesregierung Deutschland.
PY - 2004
Y1 - 2004
AB - Statistical analysis of amino acid and nucleotide sequences, especially sequence alignment, is one of the most commonly performed tasks in modern molecular biology. However, for many tasks in bioinformatics, the requirement for the features in an alignment to be consecutive is restrictive and 'n-grams' (aka k-tuples) have been used as features instead. N-grams are usually short nucleotide or amino acid sequences of length n, but the unit for a gram may be chosen arbitrarily. The n-gram concept is borrowed from language technologies where n-grams of words form the fundamental units in statistical language models. Despite the demonstrated utility of n-gram statistics for the biology domain, there is currently no publicly accessible generic tool for the efficient calculation of such statistics. Most sequence analysis tools will disregard matches because of the lack of statistical significance in finding short sequences. This article presents the integrated Biological Language Modeling Toolkit (BLMT) that allows efficient calculation of n-gram statistics for arbitrary sequence datasets. Availability: BLMT can be downloaded from http://www.cs.cmu.edu/~blmt/source and installed for standalone use on any Unix platform or Unix shell emulation such as Cygwin on the Windows® platform. Specific tools and usage details are described in a 'readme' file. The n-gram computations carried out by the BLMT are part of a broader set of tools borrowed from language technologies and modified for statistical analysis of biological sequences; these are available at http://flan.blm.cs.cmu.edu/. Contact: Judith Klein-Seetharaman (judithks@cs.cmu.edu).
UR - http://www.scopus.com/inward/record.url?scp=22744438090&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=22744438090&partnerID=8YFLogxK
U2 - 10.2165/00822942-200403020-00013
DO - 10.2165/00822942-200403020-00013
M3 - Article
C2 - 15693744
AN - SCOPUS:22744438090
VL - 3
SP - 193
EP - 200
JO - Applied Bioinformatics
JF - Applied Bioinformatics
SN - 1175-5636
IS - 2-3
ER -