Prospects for building large timetrees using molecular data with incomplete gene coverage among species

Alan Filipski, Oscar Murillo, Anna Freydenzon, Koichiro Tamura, Sudhir Kumar

Research output: Contribution to journal › Article

23 Citations (Scopus)

Abstract

Scientists are assembling sequence data sets from increasing numbers of species and genes to build comprehensive timetrees. However, data are often unavailable for some species and gene combinations, and the proportion of missing data is often large for data sets containing many genes and species. Surprisingly, there has not been a systematic analysis of the effect of the degree of sparseness of the species-gene matrix on the accuracy of divergence time estimates. Here, we present results from computer simulations and empirical data analyses to quantify the impact of missing gene data on divergence time estimation in large phylogenies. We found that estimates of divergence times were robust even when sequences from a majority of genes for most of the species were absent. From the analysis of such extremely sparse data sets, we found that the most egregious errors occurred for nodes in the tree that had no common genes for any pair of species in the immediate descendant clades of the node in question. These problematic nodes can be easily detected prior to computational analyses based only on the input sequence alignment and the tree topology. We conclude that it is best to use larger alignments, because adding both genes and species to the alignment augments the number of genes available for estimating divergence events deep in the tree and improves their time estimates.
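The abstract notes that problematic nodes can be detected from the alignment and the tree topology alone. The following Python sketch illustrates one way such a check might look; it is not the authors' implementation. It assumes a strictly bifurcating tree written as nested 2-tuples of species names, and a hypothetical `coverage` mapping from each species to the set of genes sequenced for it. The function name `find_uncovered_nodes` is likewise an illustrative choice. A node is flagged when no gene is sampled on both sides of the split, which is equivalent to no pair of species drawn from its two child clades sharing a gene.

```python
from typing import Dict, List, Set, Tuple, Union

# Assumed representation: a leaf is a species name, an internal node is a
# 2-tuple of subtrees (strictly bifurcating tree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def find_uncovered_nodes(
    tree: Tree, coverage: Dict[str, Set[str]]
) -> List[Tuple[str, ...]]:
    """Return internal nodes whose two child clades share no gene.

    Each flagged node is identified by the sorted tuple of species below it.
    Species missing from `coverage` are treated as having no genes.
    """
    flagged: List[Tuple[str, ...]] = []

    def visit(node: Tree) -> Tuple[Set[str], Set[str]]:
        # Returns (species under node, union of genes sampled under node).
        if isinstance(node, str):
            return {node}, set(coverage.get(node, ()))
        left, right = node
        left_species, left_genes = visit(left)
        right_species, right_genes = visit(right)
        # A cross-clade species pair shares a gene exactly when the two
        # gene unions intersect.
        if not (left_genes & right_genes):
            flagged.append(tuple(sorted(left_species | right_species)))
        return left_species | right_species, left_genes | right_genes

    visit(tree)
    return flagged


# Hypothetical toy data: species A and B have no gene in common.
example_tree = ((("A", "B"), "C"), ("D", "E"))
example_coverage = {
    "A": {"g1"},
    "B": {"g2"},
    "C": {"g1", "g2"},
    "D": {"g3"},
    "E": {"g1", "g3"},
}
print(find_uncovered_nodes(example_tree, example_coverage))
```

In this toy input only the node joining A and B is flagged. The node above it is not, because C carries genes that overlap with both A and B.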

Original language: English (US)
Pages (from-to): 2542-2550
Number of pages: 9
Journal: Molecular Biology and Evolution
Volume: 31
Issue number: 9
DOI: 10.1093/molbev/msu200
State: Published - 2014

Fingerprint

Genes
Divergence
Sequence Alignment
Phylogeny
Computer Simulation
Topology
Alignment
Datasets
Analysis

Keywords

  • Divergence time
  • Incomplete data
  • Timetree

ASJC Scopus subject areas

  • Genetics
  • Molecular Biology
  • Ecology, Evolution, Behavior and Systematics

Cite this

Prospects for building large timetrees using molecular data with incomplete gene coverage among species. / Filipski, Alan; Murillo, Oscar; Freydenzon, Anna; Tamura, Koichiro; Kumar, Sudhir.

In: Molecular Biology and Evolution, Vol. 31, No. 9, 2014, p. 2542-2550.

@article{7f1eca5eb93649139cb24e7659f537fe,
title = "Prospects for building large timetrees using molecular data with incomplete gene coverage among species",
keywords = "Divergence time, Incomplete data, Timetree",
author = "Alan Filipski and Oscar Murillo and Anna Freydenzon and Koichiro Tamura and Sudhir Kumar",
year = "2014",
doi = "10.1093/molbev/msu200",
language = "English (US)",
volume = "31",
pages = "2542--2550",
journal = "Molecular Biology and Evolution",
issn = "0737-4038",
publisher = "Oxford University Press",
number = "9",
}

TY  - JOUR
T1  - Prospects for building large timetrees using molecular data with incomplete gene coverage among species
AU  - Filipski, Alan
AU  - Murillo, Oscar
AU  - Freydenzon, Anna
AU  - Tamura, Koichiro
AU  - Kumar, Sudhir
PY  - 2014
Y1  - 2014
KW  - Divergence time
KW  - Incomplete data
KW  - Timetree
UR  - http://www.scopus.com/inward/record.url?scp=84906045297&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84906045297&partnerID=8YFLogxK
U2  - 10.1093/molbev/msu200
DO  - 10.1093/molbev/msu200
M3  - Article
VL  - 31
SP  - 2542
EP  - 2550
JO  - Molecular Biology and Evolution
JF  - Molecular Biology and Evolution
SN  - 0737-4038
IS  - 9
ER  -