TY - GEN
T1 - SNaReSim
T2 - 5th IEEE International Conference on Healthcare Informatics, ICHI 2017
AU - Faucon, Philippe C.
AU - Balachandran, Parithi
AU - Crook, Sharon
N1 - Publisher Copyright:
© 2017 IEEE.
Copyright:
Copyright 2017 Elsevier B.V., All rights reserved.
PY - 2017/9/8
Y1 - 2017/9/8
N2 - Nanopores represent the first commercial technology in decades to present a significantly different technique for DNA sequencing, and one of the first technologies to propose direct RNA sequencing. Despite significant differences with previous sequencing technologies, read simulators to date make similar assumptions with respect to error profiles and their analysis, resulting in incorrect characterization of nanopore error. This is a great disservice to both nanopore sequencing and to computer scientists who seek to optimize their tools for the platform. Previous works have discussed the occurrence of some bias in the identifiability of certain k-mers, but this discussion has been focused on homopolymers, leaving unanswered the question of whether k-mer bias exists over all k-mers, the strength of the bias, how it occurs, and what can be done to reduce the effects. In this work, we demonstrate that current read simulators fail to accurately represent k-mer error distributions. We explore the sources of k-mer bias in nanopore basecalls, and we present a model for predicting k-mers that are difficult to identify. We also propose SNaReSim, a new state-of-the-art simulator, and demonstrate that it provides higher accuracy with respect to 6-mer accuracy biases.
AB - Nanopores represent the first commercial technology in decades to present a significantly different technique for DNA sequencing, and one of the first technologies to propose direct RNA sequencing. Despite significant differences with previous sequencing technologies, read simulators to date make similar assumptions with respect to error profiles and their analysis, resulting in incorrect characterization of nanopore error. This is a great disservice to both nanopore sequencing and to computer scientists who seek to optimize their tools for the platform. Previous works have discussed the occurrence of some bias in the identifiability of certain k-mers, but this discussion has been focused on homopolymers, leaving unanswered the question of whether k-mer bias exists over all k-mers, the strength of the bias, how it occurs, and what can be done to reduce the effects. In this work, we demonstrate that current read simulators fail to accurately represent k-mer error distributions. We explore the sources of k-mer bias in nanopore basecalls, and we present a model for predicting k-mers that are difficult to identify. We also propose SNaReSim, a new state-of-the-art simulator, and demonstrate that it provides higher accuracy with respect to 6-mer accuracy biases.
KW - Feature Engineering
KW - Nanopore Sequencing
KW - Third Generation Sequencing
KW - Unsupervised Learning
UR - http://www.scopus.com/inward/record.url?scp=85032344212&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032344212&partnerID=8YFLogxK
U2 - 10.1109/ICHI.2017.98
DO - 10.1109/ICHI.2017.98
M3 - Conference contribution
AN - SCOPUS:85032344212
T3 - Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017
SP - 338
EP - 344
BT - Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017
A2 - Cummins, Mollie
A2 - Facelli, Julio
A2 - Meixner, Gerrit
A2 - Giraud-Carrier, Christophe
A2 - Nakajima, Hiroshi
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 23 August 2017 through 26 August 2017
ER -