DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats

Sowmya Myneni, Ankur Chowdhary, Abdulhakim Sabur, Sailik Sengupta, Garima Agrawal, Dijiang Huang, Myong Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Machine learning is being embraced by information security researchers and organizations alike for its potential to detect attacks that an organization faces, specifically attacks that go undetected by traditional signature-based intrusion detection systems. Along with the ability to process large amounts of data, machine learning brings the potential to detect contextual and collective anomalies, an essential attribute of an ideal threat detection system. Datasets play a vital role in developing machine learning models that are capable of detecting complex and sophisticated threats like Advanced Persistent Threats (APT). However, there is currently no APT dataset that can be used for modeling and detecting APT attacks. Characterized by the sophistication involved and the determined nature of the APT attackers, these threats are difficult not only to detect but also to model. Generic intrusion datasets have three key limitations: (1) they capture attack traffic at the external endpoints, limiting their usefulness in the context of APTs, which comprise attack vectors within the internal network as well; (2) the difference between normal and anomalous behavior is quite distinguishable in these datasets, which thus fail to represent the sophisticated attackers behind APT attacks; and (3) the data imbalance in existing datasets does not reflect real-world settings, so they serve as benchmarks for supervised models but fall short for semi-supervised learning. To address these concerns, in this paper we propose a dataset, DAPT 2020, which consists of attacks that are part of Advanced Persistent Threats (APT). These attacks (1) are hard to distinguish from normal traffic flows in the raw feature space and (2) comprise traffic on both the public-to-private interface and the internal (private) network. Due to the existence of severe class imbalance, we benchmark the DAPT 2020 dataset on semi-supervised models and show that they perform poorly when trying to detect attack traffic in the various stages of an APT.

Original language: English (US)
Title of host publication: Deployable Machine Learning for Security Defense - 1st International Workshop, MLHat 2020, Proceedings
Editors: Gang Wang, Arridhana Ciptadi, Ali Ahmadzadeh
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 138-163
Number of pages: 26
ISBN (Print): 9783030596200
State: Published - 2020
Event: 1st International Workshop on Deployable Machine Learning for Security Defense, MLHat 2020, collocated with the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020 - San Diego, United States
Duration: Aug 24 2020 → Aug 24 2020

Publication series

Name: Communications in Computer and Information Science
Volume: 1271 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 1st International Workshop on Deployable Machine Learning for Security Defense, MLHat 2020, collocated with the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020
Country: United States
City: San Diego
Period: 8/24/20 → 8/24/20

Keywords

  • Advanced Persistent Threat
  • Anomaly detection
  • Benchmark dataset
  • Long Short-Term Memory (LSTM)
  • Stacked Autoencoder (SAE)

ASJC Scopus subject areas

  • Computer Science (all)
  • Mathematics (all)
