Evolutionary insights from suffix array-based genome sequence analysis

Anindya Poddar; Nagasuma Chandra; Madhavi Ganapathiraju; K. Sekar; Judith Klein-Seetharaman; Raj Reddy; N. Balakrishnan

doi:10.1007/s12038-007-0087-z

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Gene and protein sequence analyses, central components of studies in modern biology, are easily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are highly suited to sequence analysis too. Here we report an improvement to the construction of suffix arrays. The enhancement in versatility and scalability enabled by this approach is demonstrated through real-life examples. The scalability of the algorithm to whole genomes renders it suitable for addressing many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bi-grams and higher n-grams, which indicates that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for their coverage of the possible peptide space; this analysis indicates that as much as a quarter of the total space at the tetra-peptide level is left un-sampled in prokaryotic organisms, although almost all tri-peptides can be seen in one protein or another in a proteome. In addition, distinct patterns begin to emerge for the counts of particular tetra- and higher peptides, indicative of a 'meaning' for tetra- and higher n-grams. The toolkit has also been used to demonstrate the usefulness of efficiently identifying repeats in whole proteomes. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
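The n-gram counting and peptide-space coverage described in the abstract can be illustrated with a small sketch. It is not the construction reported in the paper or the BLMT implementation: the suffix array is built by naive sorting, and the function names (`build_suffix_array`, `ngram_counts`, `coverage`) and the 20-letter alphabet constant are assumptions made for illustration only.

```python
from itertools import groupby

# Assumption: the 20 standard amino acids define the peptide alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically.

    Fine for short examples only; a whole-genome tool would need a
    far more efficient construction than this.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def ngram_counts(text, suffix_array, n):
    """Count n-grams by walking the suffix array.

    Suffixes that share their first n characters are adjacent in the
    sorted order, so each run of equal prefixes is one n-gram.
    """
    prefixes = (text[i:i + n] for i in suffix_array if i + n <= len(text))
    return {gram: sum(1 for _ in run) for gram, run in groupby(prefixes)}


def coverage(counts, n, alphabet=AMINO_ACIDS):
    """Fraction of the possible n-gram space (|alphabet|**n) that is observed."""
    letters = set(alphabet)
    observed = sum(1 for gram in counts if set(gram) <= letters)
    return observed / len(alphabet) ** n


if __name__ == "__main__":
    peptide = "ACAC"  # toy sequence, not real data
    sa = build_suffix_array(peptide)
    counts = ngram_counts(peptide, sa, 2)
    print(counts)
    print(coverage(counts, 2))
```

Applied to a whole proteome with n = 4, `coverage` would give the sampled fraction of the tetra-peptide space that the abstract refers to, although the naive sort above would be far too slow at that scale.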

Original language	English (US)
Pages (from-to)	871-881
Number of pages	11
Journal	Journal of Biosciences
Volume	32
Issue number	5
DOIs	https://doi.org/10.1007/s12038-007-0087-z
State	Published - Aug 2007
Externally published	Yes

Keywords

Biological language modelling toolkit (BLMT)
Genome sequence analysis
N-grams
Pattern matching
Short peptide sequences genetic code bias
Suffix arrays
Suffix trees

ASJC Scopus subject areas

General Biochemistry, Genetics and Molecular Biology
General Agricultural and Biological Sciences

Cite this

@article{b978afec03494d2bb40b16a63c516db1,

title = "Evolutionary insights from suffix array-based genome sequence analysis",

abstract = "Gene and protein sequence analyses, central components of studies in modern biology are easily amenable to string matching and pattern recognition algorithms. The growing need of analysing whole genome sequences more efficiently and thoroughly, has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures, well known in many other areas and are highly suited for sequence analysis too. Here we report an improvement to the design of construction of suffix arrays. Enhancement in versatility and scalability, enabled by this approach, is demonstrated through the use of real-life examples. The scalability of the algorithm to whole genomes renders it suitable to address many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bi-grams and higher n-grams, indicating that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for the coverage of the possible peptide space, which indicate that as much as a quarter of the total space at the tetra-peptide level is left un-sampled in prokaryotic organisms, although almost all tri-peptides can be seen in one protein or another in a proteome. Besides, distinct patterns begin to emerge for the counts of particular tetra and higher peptides, indicative of a 'meaning' for tetra and higher n-grams. The toolkit has also been used to demonstrate the usefulness of identifying repeats in whole proteomes efficiently. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv have been found to contain a repeating sequence of 300 amino acids.",

keywords = "Biological language modelling toolkit (BLMT), Genome sequence analysis, N-grams, Pattern matching, Short peptide sequences genetic code bias, Suffix arrays, Suffix trees",

author = "Anindya Poddar and Nagasuma Chandra and Madhavi Ganapathiraju and K. Sekar and Judith Klein-Seetharaman and Raj Reddy and N. Balakrishnan",

year = "2007",

month = aug,

doi = "10.1007/s12038-007-0087-z",

language = "English (US)",

volume = "32",

pages = "871--881",

journal = "Journal of Biosciences",

issn = "0250-5991",

publisher = "Springer India",

number = "5",

}

TY - JOUR

T1 - Evolutionary insights from suffix array-based genome sequence analysis

AU - Poddar, Anindya

AU - Chandra, Nagasuma

AU - Ganapathiraju, Madhavi

AU - Sekar, K.

AU - Klein-Seetharaman, Judith

AU - Reddy, Raj

AU - Balakrishnan, N.

PY - 2007/8

Y1 - 2007/8

N2 - Gene and protein sequence analyses, central components of studies in modern biology are easily amenable to string matching and pattern recognition algorithms. The growing need of analysing whole genome sequences more efficiently and thoroughly, has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures, well known in many other areas and are highly suited for sequence analysis too. Here we report an improvement to the design of construction of suffix arrays. Enhancement in versatility and scalability, enabled by this approach, is demonstrated through the use of real-life examples. The scalability of the algorithm to whole genomes renders it suitable to address many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bi-grams and higher n-grams, indicating that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for the coverage of the possible peptide space, which indicate that as much as a quarter of the total space at the tetra-peptide level is left un-sampled in prokaryotic organisms, although almost all tri-peptides can be seen in one protein or another in a proteome. Besides, distinct patterns begin to emerge for the counts of particular tetra and higher peptides, indicative of a 'meaning' for tetra and higher n-grams. The toolkit has also been used to demonstrate the usefulness of identifying repeats in whole proteomes efficiently. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv have been found to contain a repeating sequence of 300 amino acids.

KW - Biological language modelling toolkit (BLMT)

KW - Genome sequence analysis

KW - N-grams

KW - Pattern matching

KW - Short peptide sequences genetic code bias

KW - Suffix arrays

KW - Suffix trees

UR - http://www.scopus.com/inward/record.url?scp=34548782284&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34548782284&partnerID=8YFLogxK

U2 - 10.1007/s12038-007-0087-z

DO - 10.1007/s12038-007-0087-z

M3 - Article

C2 - 17914229

AN - SCOPUS:34548782284

SN - 0250-5991

VL - 32

SP - 871

EP - 881

JO - Journal of Biosciences

JF - Journal of Biosciences

IS - 5

ER -