Attempts to classify living organisms by their physical characteristics are as old as biology itself. The advent of protein and DNA sequencing, most notably the use of 16S ribosomal RNA, defined a new level of classification that now forms our basic understanding of the history of life on Earth. High-throughput sequencing currently provides DNA sequences at an unprecedented rate, not only providing a wealth of information but also posing considerable analytical challenges. Here we present comparative genomics-based methods useful for automating evolutionary analysis among any number of species. As a practical example, we applied our method to the well-studied cyanobacterial lineage. The 24 cyanobacterial genomes compared here occupy a wide variety of environmental niches and play major roles in global carbon and nitrogen cycles. By integrating phylogenetic data inferred for upward of 1,000 protein-coding genes common to all or most cyanobacteria, we have reconstructed an evolutionary history of the phylum, establishing a framework for resolving key issues regarding the evolution of their metabolic and phenotypic diversity. Greater resolution on individual branches can be attained by telescoping inward to the larger set of proteins conserved among fewer taxa. The construction of all individual protein phylogenies allows for quantitative tree scoring, providing insight into the evolutionary history of each protein family as well as probing the limits of phylogenetic resolution. The tools incorporated here are fast, computationally tractable, and easily extendable to other phyla, and they provide a scalable framework for contrasting and integrating the information present in thousands of protein-coding genes within related genomes.
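The "telescoping" idea in the abstract (fewer taxa share a larger set of conserved proteins) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `shared_families`, the `min_fraction` parameter, and the toy genome and family labels are all hypothetical, and the sketch assumes protein families have already been assigned to each genome by some clustering step.

```python
from typing import Dict, Iterable, Set


def shared_families(
    presence: Dict[str, Set[str]],
    taxa: Iterable[str],
    min_fraction: float = 1.0,
) -> Set[str]:
    """Return families found in at least `min_fraction` of the given taxa.

    `presence` maps a genome name to the set of protein-family labels it
    contains (hypothetical input format, assumed for illustration).
    """
    taxa = list(taxa)
    if not taxa:
        return set()
    # Count how many of the selected genomes contain each family.
    counts: Dict[str, int] = {}
    for taxon in taxa:
        for family in presence[taxon]:
            counts[family] = counts.get(family, 0) + 1
    needed = min_fraction * len(taxa)
    return {family for family, n in counts.items() if n >= needed}


# Toy data: four hypothetical genomes and their protein families.
presence = {
    "genome_A": {"fam1", "fam2", "fam3", "fam4"},
    "genome_B": {"fam1", "fam2", "fam3"},
    "genome_C": {"fam1", "fam2", "fam5"},
    "genome_D": {"fam1", "fam5"},
}

# Families common to all four genomes (the smallest, most conserved set).
core_all = shared_families(presence, presence.keys())

# Families common to "all or most" genomes (here, at least 75%).
core_most = shared_families(presence, presence.keys(), min_fraction=0.75)

# Telescoping inward: restricting to two genomes enlarges the shared set.
core_subclade = shared_families(presence, ["genome_A", "genome_B"])
```

In this toy case the shared set grows from one family across all four genomes to three families within the two-genome subset, which is the trade-off the abstract describes: narrower taxon sampling yields more genes per comparison.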
Keywords
- Markov clustering
ASJC Scopus subject areas
- Ecology, Evolution, Behavior and Systematics
- Molecular Biology