TY - JOUR
T1 - A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements
AU - Dasarathy, Gautam
AU - Mossel, Elchanan
AU - Nowak, Robert
AU - Roch, Sebastien
N1 - Funding Information: G. Dasarathy: Supported by the NSF grant CCF-2048223 (CAREER) and the NIH grant 1R01GM140468-01. E. Mossel: Supported by Simons-NSF Grant DMS-2031883 and by a Vannevar Bush Faculty Fellowship ONR-N00014-20-1-2826. S. Roch: Supported by NSF Grants DMS-1149312 (CAREER), DMS-1614242, CCF-1740707 (TRIPODS), DMS-1902892 and DMS-2023239 (TRIPODS Phase II), as well as a Simons Fellowship and a Vilas Associates Award. Publisher Copyright: © 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2022/4
Y1 - 2022/4
N2 - Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene (the gene trees) often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated with each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard but unsatisfactory assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error, or that avoid gene tree estimation altogether. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of 1/[f²√k]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we generalize this result beyond that assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with n ≥ 3 species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
KW - Coalescent
KW - Data requirement
KW - Distance methods
KW - Gene tree/species tree
KW - Phylogenetic reconstruction
UR - http://www.scopus.com/inward/record.url?scp=85127882214&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127882214&partnerID=8YFLogxK
U2 - 10.1007/s00285-022-01731-5
DO - 10.1007/s00285-022-01731-5
M3 - Article
C2 - 35394192
AN - SCOPUS:85127882214
SN - 0303-6812
VL - 84
JO - Journal of Mathematical Biology
JF - Journal of Mathematical Biology
IS - 5
M1 - 36
ER -