DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats

Sowmya Myneni; Ankur Chowdhary; Abdulhakim Sabur; Sailik Sengupta; Garima Agrawal; Dijiang Huang; Myong Kang

doi:10.1007/978-3-030-59621-7_8

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

39 Scopus citations

Abstract

Machine learning is being embraced by information security researchers and organizations alike for its potential in detecting attacks that an organization faces, specifically attacks that go undetected by traditional signature-based intrusion detection systems. Along with the ability to process large amounts of data, machine learning brings the potential to detect contextual and collective anomalies, an essential attribute of an ideal threat detection system. Datasets play a vital role in developing machine learning models that are capable of detecting complex and sophisticated threats like Advanced Persistent Threats (APT). However, there is currently no APT dataset that can be used for modeling and detecting APT attacks. Characterized by the sophistication involved and the determined nature of the APT attackers, these threats are not only difficult to detect but also difficult to model. Generic intrusion datasets have three key limitations: (1) they capture attack traffic at the external endpoints, limiting their usefulness in the context of APTs, which comprise attack vectors within the internal network as well; (2) the difference between normal and anomalous behavior is quite distinguishable in these datasets, which thus fail to represent the sophisticated attackers behind APT attacks; (3) the data imbalance in existing datasets does not reflect real-world settings, rendering them usable as benchmarks for supervised models but falling short for semi-supervised learning. To address these concerns, in this paper, we propose a dataset, DAPT 2020, which consists of attacks that are part of Advanced Persistent Threats (APT). These attacks (1) are hard to distinguish from normal traffic flows but investigate the raw feature space and (2) comprise traffic on both the public-to-private interface and the internal (private) network. Due to the existence of severe class imbalance, we benchmark the DAPT 2020 dataset on semi-supervised models and show that they perform poorly when trying to detect attack traffic in the various stages of an APT.

Original language	English (US)
Title of host publication	Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings
Editors	Gang Wang, Arridhana Ciptadi, Ali Ahmadzadeh
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	138-163
Number of pages	26
ISBN (Print)	9783030596200
DOIs	https://doi.org/10.1007/978-3-030-59621-7_8
State	Published - 2020
Event	1st International Workshop on Deployable Machine Learning for Security Defense, MLHat 2020, collocated with the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020 - San Diego, United States; Duration: Aug 24 2020 → Aug 24 2020

Publication series

Name	Communications in Computer and Information Science
Volume	1271 CCIS
ISSN (Print)	1865-0929
ISSN (Electronic)	1865-0937

Conference

Conference	1st International Workshop on Deployable Machine Learning for Security Defense, MLHat 2020, collocated with the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020
Country/Territory	United States
City	San Diego
Period	8/24/20 → 8/24/20

Keywords

Advanced Persistent Threat
Anomaly detection
Benchmark dataset
Long Short-Term Memory (LSTM)
Stacked Autoencoder (SAE)

ASJC Scopus subject areas

General Computer Science
General Mathematics

Access to Document

10.1007/978-3-030-59621-7_8

Cite this

Myneni, S., Chowdhary, A., Sabur, A., Sengupta, S., Agrawal, G., Huang, D., & Kang, M. (2020). DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats. In G. Wang, A. Ciptadi, & A. Ahmadzadeh (Eds.), Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings (pp. 138-163). (Communications in Computer and Information Science; Vol. 1271 CCIS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-59621-7_8

DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats. / Myneni, Sowmya; Chowdhary, Ankur; Sabur, Abdulhakim et al.
Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings. ed. / Gang Wang; Arridhana Ciptadi; Ali Ahmadzadeh. Springer Science and Business Media Deutschland GmbH, 2020. p. 138-163 (Communications in Computer and Information Science; Vol. 1271 CCIS).

Myneni, S, Chowdhary, A, Sabur, A, Sengupta, S, Agrawal, G, Huang, D & Kang, M 2020, DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats. in G Wang, A Ciptadi & A Ahmadzadeh (eds), Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings. Communications in Computer and Information Science, vol. 1271 CCIS, Springer Science and Business Media Deutschland GmbH, pp. 138-163, 1st International Workshop on Deployable Machine Learning for Security Defense, MLHat 2020, collocated with the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020, San Diego, United States, 8/24/20. https://doi.org/10.1007/978-3-030-59621-7_8

Myneni S, Chowdhary A, Sabur A, Sengupta S, Agrawal G, Huang D et al. DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats. In Wang G, Ciptadi A, Ahmadzadeh A, editors, Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings. Springer Science and Business Media Deutschland GmbH. 2020. p. 138-163. (Communications in Computer and Information Science). doi: 10.1007/978-3-030-59621-7_8

Myneni, Sowmya ; Chowdhary, Ankur ; Sabur, Abdulhakim et al. / DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats. Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings. editor / Gang Wang ; Arridhana Ciptadi ; Ali Ahmadzadeh. Springer Science and Business Media Deutschland GmbH, 2020. pp. 138-163 (Communications in Computer and Information Science).

@inproceedings{e9ee5c3fe5374b61bbce3c1abcacd60b,

title = "DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats",

abstract = "Machine learning is being embraced by information security researchers and organizations alike for its potential in detecting attacks that an organization faces, specifically attacks that go undetected by traditional signature-based intrusion detection systems. Along with the ability to process large amounts of data, machine learning brings the potential to detect contextual and collective anomalies, an essential attribute of an ideal threat detection system. Datasets play a vital role in developing machine learning models that are capable of detecting complex and sophisticated threats like Advanced Persistent Threats (APT). However, there is currently no APT dataset that can be used for modeling and detecting APT attacks. Characterized by the sophistication involved and the determined nature of the APT attackers, these threats are not only difficult to detect but also difficult to model. Generic intrusion datasets have three key limitations: (1) they capture attack traffic at the external endpoints, limiting their usefulness in the context of APTs, which comprise attack vectors within the internal network as well; (2) the difference between normal and anomalous behavior is quite distinguishable in these datasets, which thus fail to represent the sophisticated attackers behind APT attacks; (3) the data imbalance in existing datasets does not reflect real-world settings, rendering them usable as benchmarks for supervised models but falling short for semi-supervised learning. To address these concerns, in this paper, we propose a dataset, DAPT 2020, which consists of attacks that are part of Advanced Persistent Threats (APT). These attacks (1) are hard to distinguish from normal traffic flows but investigate the raw feature space and (2) comprise traffic on both the public-to-private interface and the internal (private) network. Due to the existence of severe class imbalance, we benchmark the DAPT 2020 dataset on semi-supervised models and show that they perform poorly when trying to detect attack traffic in the various stages of an APT.",

keywords = "Advanced Persistent Threat, Anomaly detection, Benchmark dataset, Long Short-Term Memory (LSTM), Stacked Autoencoder (SAE)",

author = "Sowmya Myneni and Ankur Chowdhary and Abdulhakim Sabur and Sailik Sengupta and Garima Agrawal and Dijiang Huang and Myong Kang",

note = "Funding Information: Acknowledgement. This research is supported in part by the following research grants: Naval Research Lab N0017319-1-G002, NSF DGE-1723440, OAC-1642031. Sailik Sengupta is supported by the IBM Ph.D. Fellowship. Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG.; 1st International Workshop on Deployable Machine Learning for Security Defense, MLHat 2020, collocated with the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020 ; Conference date: 24-08-2020 Through 24-08-2020",

year = "2020",

doi = "10.1007/978-3-030-59621-7_8",

language = "English (US)",

isbn = "9783030596200",

series = "Communications in Computer and Information Science",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "138--163",

editor = "Gang Wang and Arridhana Ciptadi and Ali Ahmadzadeh",

booktitle = "Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings",

address = "Germany",

}

TY - GEN

T1 - DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats

AU - Myneni, Sowmya

AU - Chowdhary, Ankur

AU - Sabur, Abdulhakim

AU - Sengupta, Sailik

AU - Agrawal, Garima

AU - Huang, Dijiang

AU - Kang, Myong

N1 - Funding Information: Acknowledgement. This research is supported in part by the following research grants: Naval Research Lab N0017319-1-G002, NSF DGE-1723440, OAC-1642031. Sailik Sengupta is supported by the IBM Ph.D. Fellowship. Publisher Copyright: © 2020, Springer Nature Switzerland AG.

PY - 2020

Y1 - 2020

N2 - Machine learning is being embraced by information security researchers and organizations alike for its potential in detecting attacks that an organization faces, specifically attacks that go undetected by traditional signature-based intrusion detection systems. Along with the ability to process large amounts of data, machine learning brings the potential to detect contextual and collective anomalies, an essential attribute of an ideal threat detection system. Datasets play a vital role in developing machine learning models that are capable of detecting complex and sophisticated threats like Advanced Persistent Threats (APT). However, there is currently no APT dataset that can be used for modeling and detecting APT attacks. Characterized by the sophistication involved and the determined nature of the APT attackers, these threats are not only difficult to detect but also difficult to model. Generic intrusion datasets have three key limitations: (1) they capture attack traffic at the external endpoints, limiting their usefulness in the context of APTs, which comprise attack vectors within the internal network as well; (2) the difference between normal and anomalous behavior is quite distinguishable in these datasets, which thus fail to represent the sophisticated attackers behind APT attacks; (3) the data imbalance in existing datasets does not reflect real-world settings, rendering them usable as benchmarks for supervised models but falling short for semi-supervised learning. To address these concerns, in this paper, we propose a dataset, DAPT 2020, which consists of attacks that are part of Advanced Persistent Threats (APT). These attacks (1) are hard to distinguish from normal traffic flows but investigate the raw feature space and (2) comprise traffic on both the public-to-private interface and the internal (private) network. Due to the existence of severe class imbalance, we benchmark the DAPT 2020 dataset on semi-supervised models and show that they perform poorly when trying to detect attack traffic in the various stages of an APT.

KW - Advanced Persistent Threat

KW - Anomaly detection

KW - Benchmark dataset

KW - Long Short-Term Memory (LSTM)

KW - Stacked Autoencoder (SAE)

UR - http://www.scopus.com/inward/record.url?scp=85096612402&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85096612402&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-59621-7_8

DO - 10.1007/978-3-030-59621-7_8

M3 - Conference contribution

AN - SCOPUS:85096612402

SN - 9783030596200

T3 - Communications in Computer and Information Science

SP - 138

EP - 163

BT - Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings

A2 - Wang, Gang

A2 - Ciptadi, Arridhana

A2 - Ahmadzadeh, Ali

PB - Springer Science and Business Media Deutschland GmbH

T2 - 1st International Workshop on Deployable Machine Learning for Security Defense, MLHat 2020, collocated with the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020

Y2 - 24 August 2020 through 24 August 2020

ER -
