Performance evaluation of six popular short-read simulators

Mark Milhaven; Susanne P. Pfeifer

doi:10.1038/s41437-022-00577-3

School of Life Sciences (SOLS)

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

High-throughput sequencing data enables the comprehensive study of genomes and the variation therein. Essential for the interpretation of this genomic data is a thorough understanding of the computational methods used for processing and analysis. Whereas “gold-standard” empirical datasets exist for this purpose in humans, synthetic (i.e., simulated) sequencing data can offer important insights into the capabilities and limitations of computational pipelines for any arbitrary species and/or study design—yet, the ability of read simulator software to emulate genomic characteristics of empirical datasets remains poorly understood. We here compare the performance of six popular short-read simulators—ART, DWGSIM, InSilicoSeq, Mason, NEAT, and wgsim—and discuss important considerations for selecting suitable models for benchmarking.

Original language	English (US)
Pages (from-to)	55-63
Number of pages	9
Journal	Heredity
Volume	130
Issue number	2
DOIs	https://doi.org/10.1038/s41437-022-00577-3
State	Published - Feb 2023

ASJC Scopus subject areas

Genetics
Genetics (clinical)

Cite this

@article{c1cd4ab5f06c4f3197a7df9edf4828e6,

title = "Performance evaluation of six popular short-read simulators",

abstract = "High-throughput sequencing data enables the comprehensive study of genomes and the variation therein. Essential for the interpretation of this genomic data is a thorough understanding of the computational methods used for processing and analysis. Whereas “gold-standard” empirical datasets exist for this purpose in humans, synthetic (i.e., simulated) sequencing data can offer important insights into the capabilities and limitations of computational pipelines for any arbitrary species and/or study design—yet, the ability of read simulator software to emulate genomic characteristics of empirical datasets remains poorly understood. We here compare the performance of six popular short-read simulators—ART, DWGSIM, InSilicoSeq, Mason, NEAT, and wgsim—and discuss important considerations for selecting suitable models for benchmarking.",

author = "Milhaven, Mark and Pfeifer, {Susanne P.}",

note = "Funding Information: This work was supported by a National Science Foundation CAREER grant to SPP (DEB-2045343). Computations were performed on Arizona State University{\textquoteright}s High-Performance Compute Cluster. Publisher Copyright: {\textcopyright} 2022, The Author(s).",

year = "2023",

month = feb,

doi = "10.1038/s41437-022-00577-3",

language = "English (US)",

volume = "130",

pages = "55--63",

journal = "Heredity",

issn = "0018-067X",

publisher = "Nature Publishing Group",

number = "2",

}

TY - JOUR

T1 - Performance evaluation of six popular short-read simulators

AU - Milhaven, Mark

AU - Pfeifer, Susanne P.

N1 - Funding Information: This work was supported by a National Science Foundation CAREER grant to SPP (DEB-2045343). Computations were performed on Arizona State University’s High-Performance Compute Cluster. Publisher Copyright: © 2022, The Author(s).

PY - 2023/2

Y1 - 2023/2

N2 - High-throughput sequencing data enables the comprehensive study of genomes and the variation therein. Essential for the interpretation of this genomic data is a thorough understanding of the computational methods used for processing and analysis. Whereas “gold-standard” empirical datasets exist for this purpose in humans, synthetic (i.e., simulated) sequencing data can offer important insights into the capabilities and limitations of computational pipelines for any arbitrary species and/or study design—yet, the ability of read simulator software to emulate genomic characteristics of empirical datasets remains poorly understood. We here compare the performance of six popular short-read simulators—ART, DWGSIM, InSilicoSeq, Mason, NEAT, and wgsim—and discuss important considerations for selecting suitable models for benchmarking.

UR - http://www.scopus.com/inward/record.url?scp=85143593608&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85143593608&partnerID=8YFLogxK

U2 - 10.1038/s41437-022-00577-3

DO - 10.1038/s41437-022-00577-3

M3 - Article

C2 - 36496447

AN - SCOPUS:85143593608

SN - 0018-067X

VL - 130

SP - 55

EP - 63

JO - Heredity

JF - Heredity

IS - 2

ER -
