Population-genetic inference from pooled-sequencing data

Michael Lynch; Darius Bost; Sade Wilson; Takahiro Maruki; Scott Harrison

doi:10.1093/gbe/evu085

Research output: Contribution to journal › Article › peer-review

71 Scopus citations

Abstract

Although pooled-population sequencing has become a widely used approach for estimating allele frequencies, most work has proceeded in the absence of a proper statistical framework. We introduce a self-sufficient, closed-form, maximum-likelihood estimator for allele frequencies that accounts for errors associated with sequencing, and a likelihood-ratio test statistic that provides a simple means for evaluating the null hypothesis of monomorphism. Unbiased estimates of allele frequencies < 5/N (where N is the number of individuals sampled) appear to be unachievable, and near-certain identification of a polymorphism requires a minor-allele frequency > 10/N. A framework is provided for testing for significant differences in allele frequencies between populations, taking into account sampling at the levels of individuals within populations and sequences within pooled samples. Analyses that fail to account for the two tiers of sampling suffer from very large false-positive rates and can become increasingly misleading with increasing depths of sequence coverage. The power to detect significant allele-frequency differences between two populations is very limited unless both the number of sampled individuals and depth of sequencing coverage exceed 100.
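
The abstract's point about two tiers of sampling can be illustrated with a minimal sketch. This is not the authors' estimator or test. It only applies the law of total variance to a simplified model whose assumptions are ours: diploid individuals contributing equally to the pool, reads drawn binomially from the pool, and no sequencing error. The function names and parameters are hypothetical.

```python
def two_tier_variance(p, n_individuals, depth, ploidy=2):
    """Sampling variance of a pooled allele-frequency estimate (illustrative).

    Tier 1: ploidy * n_individuals chromosomes drawn binomially from the population.
    Tier 2: `depth` reads drawn binomially from the pooled chromosomes.
    Assumes equal contribution per individual and no sequencing error.
    """
    chromosomes = ploidy * n_individuals
    # Variance from sampling individuals out of the population.
    individual_tier = p * (1 - p) / chromosomes
    # Expected variance from sampling reads out of the pool.
    read_tier = p * (1 - p) * (1 - 1 / chromosomes) / depth
    return individual_tier + read_tier


def reads_only_variance(p, depth):
    """Naive variance that treats reads as if drawn directly from the population."""
    return p * (1 - p) / depth


def variance_ratio(p, n_individuals, depth):
    """Factor by which the naive calculation understates the sampling variance."""
    return two_tier_variance(p, n_individuals, depth) / reads_only_variance(p, depth)


if __name__ == "__main__":
    for depth in (100, 1000):
        print(depth, round(variance_ratio(0.2, 50, depth), 2))
```

Under these assumptions, with 50 individuals the naive variance is too small by a factor of about 1.99 at a depth of 100 reads and about 10.99 at a depth of 1000 reads. This is consistent with the abstract's statement that ignoring the individual tier becomes more misleading as coverage increases.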

Original language	English (US)
Pages (from-to)	1210-1218
Number of pages	9
Journal	Genome Biology and Evolution
Volume	6
Issue number	5
DOIs	https://doi.org/10.1093/gbe/evu085
State	Published - May 2014
Externally published	Yes

Keywords

Allele-frequency estimation
Population genomics
Population subdivision

ASJC Scopus subject areas

Ecology, Evolution, Behavior and Systematics
Genetics

Cite this

@article{aaba8b8932b64a99bc2daa53b158ce05,
title = "Population-genetic inference from pooled-sequencing data",
abstract = "Although pooled-population sequencing has become a widely used approach for estimating allele frequencies, most work has proceeded in the absence of a proper statistical framework. We introduce a self-sufficient, closed-form, maximum-likelihood estimator for allele frequencies that accounts for errors associated with sequencing, and a likelihood-ratio test statistic that provides a simple means for evaluating the null hypothesis of monomorphism. Unbiased estimates of allele frequencies < 5/N (where N is the number of individuals sampled) appear to be unachievable, and near-certain identification of a polymorphism requires a minor-allele frequency > 10/N. A framework is provided for testing for significant differences in allele frequencies between populations, taking into account sampling at the levels of individuals within populations and sequences within pooled samples. Analyses that fail to account for the two tiers of sampling suffer from very large false-positive rates and can become increasingly misleading with increasing depths of sequence coverage. The power to detect significant allele-frequency differences between two populations is very limited unless both the number of sampled individuals and depth of sequencing coverage exceed 100.",
keywords = "Allele-frequency estimation, Population genomics, Population subdivision",
author = "Michael Lynch and Darius Bost and Sade Wilson and Takahiro Maruki and Scott Harrison",
year = "2014",
month = may,
doi = "10.1093/gbe/evu085",
language = "English (US)",
volume = "6",
pages = "1210--1218",
journal = "Genome Biology and Evolution",
issn = "1759-6653",
publisher = "Oxford University Press",
number = "5",
}

TY  - JOUR
T1  - Population-genetic inference from pooled-sequencing data
AU  - Lynch, Michael
AU  - Bost, Darius
AU  - Wilson, Sade
AU  - Maruki, Takahiro
AU  - Harrison, Scott
PY  - 2014/5
Y1  - 2014/5
N2  - Although pooled-population sequencing has become a widely used approach for estimating allele frequencies, most work has proceeded in the absence of a proper statistical framework. We introduce a self-sufficient, closed-form, maximum-likelihood estimator for allele frequencies that accounts for errors associated with sequencing, and a likelihood-ratio test statistic that provides a simple means for evaluating the null hypothesis of monomorphism. Unbiased estimates of allele frequencies < 5/N (where N is the number of individuals sampled) appear to be unachievable, and near-certain identification of a polymorphism requires a minor-allele frequency > 10/N. A framework is provided for testing for significant differences in allele frequencies between populations, taking into account sampling at the levels of individuals within populations and sequences within pooled samples. Analyses that fail to account for the two tiers of sampling suffer from very large false-positive rates and can become increasingly misleading with increasing depths of sequence coverage. The power to detect significant allele-frequency differences between two populations is very limited unless both the number of sampled individuals and depth of sequencing coverage exceed 100.
AB  - Although pooled-population sequencing has become a widely used approach for estimating allele frequencies, most work has proceeded in the absence of a proper statistical framework. We introduce a self-sufficient, closed-form, maximum-likelihood estimator for allele frequencies that accounts for errors associated with sequencing, and a likelihood-ratio test statistic that provides a simple means for evaluating the null hypothesis of monomorphism. Unbiased estimates of allele frequencies < 5/N (where N is the number of individuals sampled) appear to be unachievable, and near-certain identification of a polymorphism requires a minor-allele frequency > 10/N. A framework is provided for testing for significant differences in allele frequencies between populations, taking into account sampling at the levels of individuals within populations and sequences within pooled samples. Analyses that fail to account for the two tiers of sampling suffer from very large false-positive rates and can become increasingly misleading with increasing depths of sequence coverage. The power to detect significant allele-frequency differences between two populations is very limited unless both the number of sampled individuals and depth of sequencing coverage exceed 100.
KW  - Allele-frequency estimation
KW  - Population genomics
KW  - Population subdivision
UR  - http://www.scopus.com/inward/record.url?scp=84902964148&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84902964148&partnerID=8YFLogxK
U2  - 10.1093/gbe/evu085
DO  - 10.1093/gbe/evu085
M3  - Article
C2  - 24787620
AN  - SCOPUS:84902964148
SN  - 1759-6653
VL  - 6
SP  - 1210
EP  - 1218
JO  - Genome Biology and Evolution
JF  - Genome Biology and Evolution
IS  - 5
ER  -
