SNaReSim

Synthetic Nanopore Read Simulator

Philippe C. Faucon, Parithi Balachandran, Sharon Crook

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Nanopores represent the first commercial technology in decades to present a significantly different technique for DNA sequencing, and one of the first technologies to propose direct RNA sequencing. Despite significant differences with previous sequencing technologies, read simulators to date make similar assumptions with respect to error profiles and their analysis, resulting in incorrect characterization of nanopore error. This is a great disservice to both nanopore sequencing and to computer scientists who seek to optimize their tools for the platform. Previous works have discussed the occurrence of some bias in the identifiability of certain k-mers, but this discussion has been focused on homopolymers, leaving unanswered the question of whether k-mer bias exists over all k-mers, the strength of the bias, how it occurs, and what can be done to reduce the effects. In this work, we demonstrate that current read simulators fail to accurately represent k-mer error distributions. We explore the sources of k-mer bias in nanopore basecalls, and we present a model for predicting k-mers that are difficult to identify. We also propose SNaReSim, a new state-of-the-art simulator, and demonstrate that it provides higher accuracy with respect to 6-mer accuracy biases.
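To make the notion of per-k-mer accuracy concrete, the following is a minimal sketch of how one might tabulate, for each reference 6-mer, the fraction of occurrences that a read reproduces exactly. It is not SNaReSim's method. The function name `kmer_accuracy` is hypothetical, and the sketch assumes the read has already been aligned to the reference with no insertions or deletions, so that positions correspond one to one. Real nanopore reads contain indels and would need a proper alignment step first.

```python
from collections import defaultdict


def kmer_accuracy(reference, read, k=6):
    """Return, per reference k-mer, the fraction of its occurrences read correctly.

    Assumption: `reference` and `read` are the same length and position-aligned
    (substitution errors only). This is an illustrative simplification.
    """
    if len(reference) != len(read):
        raise ValueError("sequences must be pre-aligned to equal length")

    totals = defaultdict(int)   # occurrences of each reference k-mer
    correct = defaultdict(int)  # occurrences where the read matches exactly

    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        totals[kmer] += 1
        if read[i:i + k] == kmer:
            correct[kmer] += 1

    return {kmer: correct[kmer] / totals[kmer] for kmer in totals}


if __name__ == "__main__":
    # Toy example: a single substitution (T -> A) at position 7 of the read.
    accuracy = kmer_accuracy("ACGTACGTAC", "ACGTACGAAC")
    for kmer, value in sorted(accuracy.items()):
        print(kmer, value)
```

Comparing such a table computed from real reads against one computed from simulated reads is one simple way to see whether a simulator reproduces k-mer-specific error rates rather than a single uniform error rate.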

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 338-344
Number of pages: 7
ISBN (Electronic): 9781509048816
DOIs: https://doi.org/10.1109/ICHI.2017.98
State: Published - Sep 8 2017
Event: 5th IEEE International Conference on Healthcare Informatics, ICHI 2017 - Park City, United States
Duration: Aug 23 2017 → Aug 26 2017

Other

Other: 5th IEEE International Conference on Healthcare Informatics, ICHI 2017
Country: United States
City: Park City
Period: 8/23/17 → 8/26/17

Fingerprint

Nanopores
Technology
RNA Sequence Analysis
DNA Sequence Analysis

Keywords

  • Feature Engineering
  • Nanopore Sequencing
  • Third Generation Sequencing
  • Unsupervised Learning

ASJC Scopus subject areas

  • Health Informatics

Cite this

Faucon, P. C., Balachandran, P., & Crook, S. (2017). SNaReSim: Synthetic Nanopore Read Simulator. In Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017 (pp. 338-344). [8031171] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICHI.2017.98

@inproceedings{96f892c88ff74ebbbca0601b5a327b35,
title = "SNaReSim: Synthetic Nanopore Read Simulator",
abstract = "Nanopores represent the first commercial technology in decades to present a significantly different technique for DNA sequencing, and one of the first technologies to propose direct RNA sequencing. Despite significant differences with previous sequencing technologies, read simulators to date make similar assumptions with respect to error profiles and their analysis, resulting in incorrect characterization of nanopore error. This is a great disservice to both nanopore sequencing and to computer scientists who seek to optimize their tools for the platform. Previous works have discussed the occurrence of some bias in the identifiability of certain k-mers, but this discussion has been focused on homopolymers, leaving unanswered the question of whether k-mer bias exists over all k-mers, the strength of the bias, how it occurs, and what can be done to reduce the effects. In this work, we demonstrate that current read simulators fail to accurately represent k-mer error distributions. We explore the sources of k-mer bias in nanopore basecalls, and we present a model for predicting k-mers that are difficult to identify. We also propose SNaReSim, a new state-of-the-art simulator, and demonstrate that it provides higher accuracy with respect to 6-mer accuracy biases.",
keywords = "Feature Engineering, Nanopore Sequencing, Third Generation Sequencing, Unsupervised Learning",
author = "Faucon, {Philippe C.} and Parithi Balachandran and Sharon Crook",
year = "2017",
month = "9",
day = "8",
doi = "10.1109/ICHI.2017.98",
language = "English (US)",
pages = "338--344",
booktitle = "Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",

}

TY  - GEN
T1  - SNaReSim
T2  - Synthetic Nanopore Read Simulator
AU  - Faucon, Philippe C.
AU  - Balachandran, Parithi
AU  - Crook, Sharon
PY  - 2017/9/8
Y1  - 2017/9/8
AB  - Nanopores represent the first commercial technology in decades to present a significantly different technique for DNA sequencing, and one of the first technologies to propose direct RNA sequencing. Despite significant differences with previous sequencing technologies, read simulators to date make similar assumptions with respect to error profiles and their analysis, resulting in incorrect characterization of nanopore error. This is a great disservice to both nanopore sequencing and to computer scientists who seek to optimize their tools for the platform. Previous works have discussed the occurrence of some bias in the identifiability of certain k-mers, but this discussion has been focused on homopolymers, leaving unanswered the question of whether k-mer bias exists over all k-mers, the strength of the bias, how it occurs, and what can be done to reduce the effects. In this work, we demonstrate that current read simulators fail to accurately represent k-mer error distributions. We explore the sources of k-mer bias in nanopore basecalls, and we present a model for predicting k-mers that are difficult to identify. We also propose SNaReSim, a new state-of-the-art simulator, and demonstrate that it provides higher accuracy with respect to 6-mer accuracy biases.
KW  - Feature Engineering
KW  - Nanopore Sequencing
KW  - Third Generation Sequencing
KW  - Unsupervised Learning
UR  - http://www.scopus.com/inward/record.url?scp=85032344212&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85032344212&partnerID=8YFLogxK
U2  - 10.1109/ICHI.2017.98
DO  - 10.1109/ICHI.2017.98
M3  - Conference contribution
SP  - 338
EP  - 344
BT  - Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  - 