Project Summary

Overview

DNA sequence data is the primary tool used to construct the phylogenetic trees that represent evolutionary processes; however, many phylogenetic studies select markers based on their utility in previously studied species or choose new markers haphazardly. Some studies have circumvented the difficulty of identifying phylogenetic markers by using whole-genome sequence data. However, the current approach to whole-genome sequencing produces millions of short sequence fragments, which must be assembled like a puzzle. This process is time-consuming and costly. Even when whole genomes are assembled, it is difficult to identify genes that occupy different places in the genome in different species.

This project will develop methods to rapidly identify phylogenetic markers directly from raw sequence fragments. These methods will be implemented in robust, easy-to-use software and made available to the scientific community. First, a new genome assembler will be developed to identify conserved regions of the genome from multiple taxa, regardless of the location of these regions in the genome. Second, we will develop new statistical models to call genotypes from multi-species datasets with sequencing error. Third, we will automate the identification of phylogenetic markers for different clades to reduce the impact of noisy data when constructing large phylogenies and to resolve previously undetermined evolutionary relationships.

Intellectual Merit

This project has the potential to transform phylogenetic research by taking advantage of information in genomic data while avoiding biased data. These methods will enable the rapid determination of phylogenetic markers, and thus evolutionary relationships, directly from next-generation sequence data. In contrast to previous methods, this approach does not require a reference genome, annotated genome assembly, or sequence alignment. Furthermore, by designing methods and software that take advantage of universally useful data, rather than designing phylogenetic markers for specific studies, we enable the most efficient use and reuse of data. Because we will implement our methods in a user-friendly software package, this approach will be available to any researcher interested in the evolutionary relationships of taxa, regardless of their computational skills.

Our novel method of genome assembly, in this case used to identify homologous regions of the genome for phylogenetic analyses, has several other potential applications. For example, conserved regions adjacent to non-conserved regions can be used to suggest genome rearrangements. Our novel method to identify genotypes retains information on genotype likelihoods given the distribution of reads and the quality score for each read, and is therefore applicable to any research where confidence in the genotype is essential for subsequent analyses. For example, identification of de novo mutations in humans (and linking de novo mutations to disease) relies on the genotypes of individuals in a pedigree.

Broader Impacts

Our software will facilitate numerous projects in phylogenetics. Because phylogenies are the basis for understanding evolutionary processes, this project has the potential to influence research across many areas of biology. In medicine, our collaborations using our software to study pathogen evolution have the potential to assist in rapid tracking of emerging pathogens. The identification of the origins and transmission pathways of emerging pathogens is critical to human health. This project will provide extensive bioinformatics training for 36 undergraduates; hundreds of additional undergraduates will be engaged using newly developed computer exercises. Research updates and explanations will be provided to the scientific community and general public through the popular science weblog, The Panda's Thumb.
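As an illustration of the genotype-likelihood idea mentioned under Intellectual Merit, the following is a minimal sketch of a textbook diploid likelihood calculation from read bases and Phred quality scores. It is not the model this project proposes: the function names, the assumption that errors are spread evenly over the three other bases, and the example reads are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def genotype_log_likelihoods(reads):
    """Return log10 likelihoods for the 10 unordered diploid genotypes.

    reads: list of (base, phred_quality) pairs observed at one site.
    Assumption (illustrative only): a miscalled base is equally likely
    to be any of the three other bases, and reads are independent.
    """
    result = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        total = 0.0
        for base, qual in reads:
            e = phred_to_error(qual)
            # Probability of the observed base given each allele.
            p1 = 1 - e if base == a1 else e / 3
            p2 = 1 - e if base == a2 else e / 3
            # Each read is drawn from either allele with probability 1/2.
            total += math.log10(0.5 * p1 + 0.5 * p2)
        result[a1 + a2] = total
    return result


# Hypothetical reads at a single site: three A's and two G's.
reads = [("A", 30), ("A", 30), ("G", 30), ("A", 20), ("G", 30)]
likelihoods = genotype_log_likelihoods(reads)
best = max(likelihoods, key=likelihoods.get)
print(best, round(likelihoods[best], 3))
```

For these made-up reads the heterozygous genotype AG scores highest. Keeping the full table of likelihoods, rather than only the best call, is what lets downstream analyses weigh how confident each genotype is.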
Effective start/end date: 8/1/14 → 7/31/18
- NSF: Directorate for Biological Sciences (BIO): $686,240.00