Statistics and truth in phylogenomics

Sudhir Kumar, Alan J. Filipski, Fabia U. Battistuzzi, Sergei L. Kosakovsky Pond, Koichiro Tamura

Research output: Contribution to journal, Article

102 Citations (Scopus)

Abstract

Phylogenomics refers to the inference of historical relationships among species using genome-scale sequence data and to the use of phylogenetic analysis to infer protein function in multigene families. With rapidly decreasing sequencing costs, phylogenomics is becoming synonymous with evolutionary analysis of genome-scale data sets that are densely sampled across taxa. In phylogenetic inference applications, this translates into very large data sets that yield evolutionary and functional inferences with extremely small variances and high statistical confidence (very small P values). However, highly significant P values are increasingly reported even for contrasting phylogenetic hypotheses, depending on the evolutionary model and inference method used, which makes it difficult to establish true relationships. We argue that assessing the robustness of results to biological factors that may systematically mislead (bias) the outcomes of statistical estimation will be key to avoiding incorrect phylogenomic inferences. In fact, there is a need for increased emphasis on the magnitude of differences (effect sizes) in addition to the P values of the statistical test of the null hypothesis. On the other hand, the amount of sequence data available will likely always remain inadequate for some phylogenomic applications, for example, those involving episodic positive selection at individual codon positions and in specific lineages. Here too, a focus on effect size and biological relevance, rather than on the P value, may be warranted. We present a theoretical overview and discuss practical aspects of the interplay between effect sizes, bias, and P values as it relates to the statistical inference of evolutionary truth in phylogenomics.
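The abstract's point about P values and effect sizes can be illustrated with a small numerical sketch. This is not taken from the article: it assumes a plain two-sided z-test on a mean with known standard deviation, and the names `two_sided_p`, `z_test`, `EFFECT`, and `SD`, as well as all numbers, are hypothetical. The idea is only that a fixed, tiny difference (which could just as well be a small systematic bias) becomes "highly significant" once the number of sites is large enough, while the effect size itself never changes.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    # 2 * (1 - Phi(|z|)) written with erfc, which stays accurate for large |z|.
    return math.erfc(abs(z) / math.sqrt(2.0))


def z_test(observed_diff: float, sd: float, n: int) -> tuple:
    """Return (z, P) for H0: true difference = 0, given n independent sites."""
    standard_error = sd / math.sqrt(n)  # variance shrinks as data grow
    z = observed_diff / standard_error
    return z, two_sided_p(z)


# Hypothetical values chosen for illustration only.
EFFECT = 0.01  # fixed per-site difference (or an equally small bias)
SD = 1.0       # per-site standard deviation

results = []
for n in (1_000, 100_000, 10_000_000):
    z, p = z_test(EFFECT, SD, n)
    effect_size = EFFECT / SD  # standardized effect, independent of n
    results.append((n, effect_size, z, p))
    print(f"n={n:>10,}  effect size={effect_size:.3f}  z={z:6.2f}  P={p:.3g}")
```

Under these assumptions the standardized effect stays at 0.01 in every row, while the P value moves from clearly non-significant at 1,000 sites to far below any conventional threshold at 10,000,000 sites. The P value here mostly reflects the amount of data, which is why reporting the effect size alongside it is informative.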

Original language: English (US)
Pages (from-to): 457-472
Number of pages: 16
Journal: Molecular Biology and Evolution
Volume: 29
Issue number: 2
DOIs: https://doi.org/10.1093/molbev/msr202
State: Published - Feb 2012

Fingerprint

statistics
genome
phylogenetics
phylogeny
biological factors
multigene family
codon
costs and cost analysis
statistical analysis
proteins
cost
datasets
effect
analysis
methodology

Keywords

  • evolutionary tree
  • molecular evolution
  • phylogenetics
  • statistical bias
  • statistical inference
  • variance

ASJC Scopus subject areas

  • Genetics
  • Molecular Biology
  • Ecology, Evolution, Behavior and Systematics

Cite this

Kumar, S., Filipski, A. J., Battistuzzi, F. U., Kosakovsky Pond, S. L., & Tamura, K. (2012). Statistics and truth in phylogenomics. Molecular Biology and Evolution, 29(2), 457-472. https://doi.org/10.1093/molbev/msr202

@article{8cd4dfdf482541fa806c13e818a522eb,
title = "Statistics and truth in phylogenomics",
keywords = "evolutionary tree, molecular evolution, phylogenetics, statistical bias, statistical inference, variance",
author = "Sudhir Kumar and Filipski, {Alan J.} and Battistuzzi, {Fabia U.} and {Kosakovsky Pond}, {Sergei L.} and Koichiro Tamura",
year = "2012",
month = "2",
doi = "10.1093/molbev/msr202",
language = "English (US)",
volume = "29",
pages = "457--472",
journal = "Molecular Biology and Evolution",
issn = "0737-4038",
publisher = "Oxford University Press",
number = "2"
}

TY - JOUR

T1 - Statistics and truth in phylogenomics

AU - Kumar, Sudhir

AU - Filipski, Alan J.

AU - Battistuzzi, Fabia U.

AU - Kosakovsky Pond, Sergei L.

AU - Tamura, Koichiro

PY - 2012/2

Y1 - 2012/2

KW - evolutionary tree

KW - molecular evolution

KW - phylogenetics

KW - statistical bias

KW - statistical inference

KW - variance

UR - http://www.scopus.com/inward/record.url?scp=84862928655&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862928655&partnerID=8YFLogxK

U2 - 10.1093/molbev/msr202

DO - 10.1093/molbev/msr202

M3 - Article

VL - 29

SP - 457

EP - 472

JO - Molecular Biology and Evolution

JF - Molecular Biology and Evolution

SN - 0737-4038

IS - 2

ER -