TY - JOUR
T1 - ENTRNA
T2 - A framework to predict RNA foldability
AU - Su, Congzhe
AU - Weir, Jeffery D.
AU - Zhang, Fei
AU - Yan, Hao
AU - Wu, Teresa
N1 - Funding Information:
This work was supported by grants from the United States Transportation Command (USTRANSCOM) in concert with the Air Force Institute of Technology (AFIT) under an ongoing Memorandum of Agreement. The funding bodies did not play any role in the design of the study, data collection and analysis, or preparation of the manuscript.
Publisher Copyright:
© 2019 The Author(s).
PY - 2019/7/3
Y1 - 2019/7/3
N2 - Background: RNA molecules play many crucial roles in living systems. The spatial complexity of RNA structures determines their cellular functions. Therefore, understanding RNA folding conformations, in particular RNA secondary structures, is critical for elucidating biological functions. Existing literature has treated RNA design as either an RNA structure prediction problem or an RNA inverse folding problem, where free energy has played a key role. Results: In this research, we propose a Positive-Unlabeled data-driven framework termed ENTRNA. In addition to free energy and commonly studied sequence and structural features, we propose a new feature, Sequence Segment Entropy (SSE), to measure the diversity of RNA sequences. ENTRNA is trained and cross-validated separately on 1024 pseudoknot-free RNAs and 1060 pseudoknotted RNAs from the RNA STRAND database. To test the robustness of ENTRNA, the models are further blind-tested on 206 pseudoknot-free and 93 pseudoknotted RNAs from the PDB database. For pseudoknot-free RNAs, ENTRNA has 86.5% sensitivity on the training dataset and 80.6% sensitivity on the testing dataset. For pseudoknotted RNAs, ENTRNA shows 81.5% sensitivity on the training dataset and 71.0% on the testing dataset. To test the applicability of ENTRNA to long, structurally complex RNAs, we collect 5 laboratory-synthesized RNAs ranging from 1618 to 1790 nucleotides. ENTRNA correctly predicts the foldability of 4 of them. Conclusion: In this article, we reformulate the RNA design problem as a foldability prediction problem: predicting the likelihood that a sequence-structure pair co-exists. This new construct has potential for both RNA structure prediction and the inverse folding problem. In addition, it enables us to explore data-driven approaches in RNA research.
AB - Background: RNA molecules play many crucial roles in living systems. The spatial complexity of RNA structures determines their cellular functions. Therefore, understanding RNA folding conformations, in particular RNA secondary structures, is critical for elucidating biological functions. Existing literature has treated RNA design as either an RNA structure prediction problem or an RNA inverse folding problem, where free energy has played a key role. Results: In this research, we propose a Positive-Unlabeled data-driven framework termed ENTRNA. In addition to free energy and commonly studied sequence and structural features, we propose a new feature, Sequence Segment Entropy (SSE), to measure the diversity of RNA sequences. ENTRNA is trained and cross-validated separately on 1024 pseudoknot-free RNAs and 1060 pseudoknotted RNAs from the RNA STRAND database. To test the robustness of ENTRNA, the models are further blind-tested on 206 pseudoknot-free and 93 pseudoknotted RNAs from the PDB database. For pseudoknot-free RNAs, ENTRNA has 86.5% sensitivity on the training dataset and 80.6% sensitivity on the testing dataset. For pseudoknotted RNAs, ENTRNA shows 81.5% sensitivity on the training dataset and 71.0% on the testing dataset. To test the applicability of ENTRNA to long, structurally complex RNAs, we collect 5 laboratory-synthesized RNAs ranging from 1618 to 1790 nucleotides. ENTRNA correctly predicts the foldability of 4 of them. Conclusion: In this article, we reformulate the RNA design problem as a foldability prediction problem: predicting the likelihood that a sequence-structure pair co-exists. This new construct has potential for both RNA structure prediction and the inverse folding problem. In addition, it enables us to explore data-driven approaches in RNA research.
KW - Data-driven
KW - Foldability
KW - Sequence segment entropy
UR - http://www.scopus.com/inward/record.url?scp=85068603166&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068603166&partnerID=8YFLogxK
U2 - 10.1186/s12859-019-2948-5
DO - 10.1186/s12859-019-2948-5
M3 - Article
C2 - 31269893
AN - SCOPUS:85068603166
SN - 1471-2105
VL - 20
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - 373
ER -