TY - GEN
T1 - DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats
AU - Myneni, Sowmya
AU - Chowdhary, Ankur
AU - Sabur, Abdulhakim
AU - Sengupta, Sailik
AU - Agrawal, Garima
AU - Huang, Dijiang
AU - Kang, Myong
N1 - Funding Information:
Acknowledgement. This research is supported in part by the following research grants: Naval Research Lab N0017319-1-G002, NSF DGE-1723440, OAC-1642031. Sailik Sengupta is supported by the IBM Ph.D. Fellowship.
Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Machine learning is being embraced by information security researchers and organizations alike for its potential to detect attacks that an organization faces, specifically attacks that go undetected by traditional signature-based intrusion detection systems. Along with the ability to process large amounts of data, machine learning brings the potential to detect contextual and collective anomalies, an essential attribute of an ideal threat detection system. Datasets play a vital role in developing machine learning models capable of detecting complex and sophisticated threats such as Advanced Persistent Threats (APTs). However, there is currently no APT dataset that can be used for modeling and detecting APT attacks. Characterized by the sophistication involved and the determined nature of APT attackers, these threats are difficult not only to detect but also to model. Generic intrusion datasets have three key limitations: (1) they capture attack traffic at the external endpoints, limiting their usefulness in the context of APTs, which comprise attack vectors within the internal network as well; (2) the difference between normal and anomalous behavior is quite distinguishable in these datasets, which thus fail to represent the sophisticated attackers behind APT attacks; and (3) the data imbalance in existing datasets does not reflect real-world settings, making them usable as benchmarks for supervised models but falling short for semi-supervised learning. To address these concerns, in this paper we propose a dataset, DAPT 2020, which consists of attacks that are part of Advanced Persistent Threats (APTs). These attacks (1) are hard to distinguish from normal traffic flows but investigate the raw feature space and (2) comprise traffic on both the public-to-private interface and the internal (private) network. Due to the existence of severe class imbalance, we benchmark the DAPT 2020 dataset on semi-supervised models and show that they perform poorly when trying to detect attack traffic in the various stages of an APT.
AB - Machine learning is being embraced by information security researchers and organizations alike for its potential to detect attacks that an organization faces, specifically attacks that go undetected by traditional signature-based intrusion detection systems. Along with the ability to process large amounts of data, machine learning brings the potential to detect contextual and collective anomalies, an essential attribute of an ideal threat detection system. Datasets play a vital role in developing machine learning models capable of detecting complex and sophisticated threats such as Advanced Persistent Threats (APTs). However, there is currently no APT dataset that can be used for modeling and detecting APT attacks. Characterized by the sophistication involved and the determined nature of APT attackers, these threats are difficult not only to detect but also to model. Generic intrusion datasets have three key limitations: (1) they capture attack traffic at the external endpoints, limiting their usefulness in the context of APTs, which comprise attack vectors within the internal network as well; (2) the difference between normal and anomalous behavior is quite distinguishable in these datasets, which thus fail to represent the sophisticated attackers behind APT attacks; and (3) the data imbalance in existing datasets does not reflect real-world settings, making them usable as benchmarks for supervised models but falling short for semi-supervised learning. To address these concerns, in this paper we propose a dataset, DAPT 2020, which consists of attacks that are part of Advanced Persistent Threats (APTs). These attacks (1) are hard to distinguish from normal traffic flows but investigate the raw feature space and (2) comprise traffic on both the public-to-private interface and the internal (private) network. Due to the existence of severe class imbalance, we benchmark the DAPT 2020 dataset on semi-supervised models and show that they perform poorly when trying to detect attack traffic in the various stages of an APT.
KW - Advanced Persistent Threat
KW - Anomaly detection
KW - Benchmark dataset
KW - Long Short-Term Memory (LSTM)
KW - Stacked Autoencoder (SAE)
UR - http://www.scopus.com/inward/record.url?scp=85096612402&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096612402&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-59621-7_8
DO - 10.1007/978-3-030-59621-7_8
M3 - Conference contribution
AN - SCOPUS:85096612402
SN - 9783030596200
T3 - Communications in Computer and Information Science
SP - 138
EP - 163
BT - Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings
A2 - Wang, Gang
A2 - Ciptadi, Arridhana
A2 - Ahmadzadeh, Ali
PB - Springer Science and Business Media Deutschland GmbH
T2 - 1st International Workshop on Deployable Machine Learning for Security Defense, MLHat 2020, collocated with the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020
Y2 - 24 August 2020 through 24 August 2020
ER -