TY - GEN
T1 - SyML
T2 - 24th International Symposium on Research in Attacks, Intrusions and Defenses, RAID 2021
AU - Ruaro, Nicola
AU - Zeng, Kyle
AU - Dresel, Lukas
AU - Polino, Mario
AU - Bao, Tiffany
AU - Continella, Andrea
AU - Zanero, Stefano
AU - Kruegel, Christopher
AU - Vigna, Giovanni
N1 - Funding Information:
We would like to thank our reviewers for their valuable comments and inputs to improve our paper. This material is based upon work supported by NSF under Award No. CNS-1704253. Research was also sponsored by DARPA under agreements number HR001118C0060 and FA8750-19-C-0003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, the U.S. Government, or the other sponsors.
Publisher Copyright:
© 2021 Owner/Author.
PY - 2021/10/6
Y1 - 2021/10/6
AB - Exploring many execution paths in a binary program is essential to discover new vulnerabilities. Dynamic Symbolic Execution (DSE) is useful to trigger complex input conditions and enables an accurate exploration of a program while providing extensive crash replayability and semantic insights. However, scaling this type of analysis to complex binaries is difficult. Current methods suffer from the path explosion problem, despite many attempts to mitigate this challenge (e.g., by merging paths when appropriate). Still, in general, this challenge is not yet surmounted, and most bugs discovered through such techniques are shallow. We propose a novel approach to address the path explosion problem: a smart triaging system that leverages supervised machine learning techniques to replicate human expertise, leading to vulnerable path discovery. Our approach monitors the execution traces in vulnerable programs and extracts relevant features - register and memory accesses, function complexity, system calls - to guide the symbolic exploration. We train models to learn the patterns of vulnerable paths from the extracted features, and we leverage their predictions to discover interesting execution paths in new programs. We implement our approach in a tool called SyML, and we evaluate it on the Cyber Grand Challenge (CGC) dataset - a well-known dataset of vulnerable programs - and on three real-world Linux binaries. We show that the knowledge collected from the analysis of vulnerable paths, without any explicit prior knowledge about vulnerability patterns, is transferable to unseen binaries, and that it outperforms prior work in path prioritization by triggering more, and different, unique vulnerabilities.
KW - Machine learning
KW - Symbolic execution
KW - Vulnerability discovery
UR - http://www.scopus.com/inward/record.url?scp=85117733433&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117733433&partnerID=8YFLogxK
U2 - 10.1145/3471621.3471865
DO - 10.1145/3471621.3471865
M3 - Conference contribution
AN - SCOPUS:85117733433
T3 - ACM International Conference Proceeding Series
SP - 456
EP - 468
BT - Proceedings of 2021 24th International Symposium on Research in Attacks, Intrusions and Defenses, RAID 2021
PB - Association for Computing Machinery
Y2 - 6 October 2021 through 8 October 2021
ER -