CyberGAN: Generating High-fidelity Cybersecurity Data With Generative Adversarial Networks

Zhang, Yuening; Viswanathan, Arun A; Le, Joie; Gonik, Julia

Machine learning for cyber defense offers the promise of detecting adversarial activity against the ground data systems that manage critical space assets. A fundamental challenge facing machine learning research in cybersecurity is the lack of high-fidelity, shareable datasets for robust evaluation and testing of machine learning-based solutions. High-fidelity, real-world datasets are necessary for reliable benchmarking of nominal system behavior and malicious activity. Unfortunately, data owners rarely share such realistic datasets of nominal and adversarial activity publicly because of security and privacy concerns. In addition, the adversarial data that is available is sparse, which makes training models on malicious activity much harder. This situation continues to impede research on, and successful adoption of, machine learning methods for cyber defense. Researchers have dealt with this problem by generating data within a low-fidelity lab environment, using classified and thus unshareable datasets, or downloading low-fidelity public datasets made available by others. We propose to address the problem by using machine learning methods to generate high-fidelity data. Specifically, we propose the use of Generative Adversarial Networks (GANs) to generate high-fidelity data for cybersecurity purposes. GANs have been applied successfully in image processing and natural language applications, but have not yet been investigated for cyber data generation. Our proposed approach first involves training the 'discriminator' network of the GAN with a sample of real-world data consisting of malicious and nominal samples. We then use the 'generator' network to generate new high-fidelity data samples consisting of an appropriate mix of malicious and nominal activity. We demonstrate applications of our architecture by generating high-fidelity cybersecurity data containing both malicious and nominal samples. We thoroughly evaluate the fidelity of our generated data using heuristics and evaluate its usefulness for machine learning applications using three different datasets. Overall, our approach results in high-fidelity, shareable datasets.
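
The adversarial training idea summarized in the abstract can be illustrated with a deliberately tiny sketch. This is not the authors' architecture, which this record does not describe. It assumes a single numeric feature, a linear generator G(z) = a*z + b, and a logistic discriminator D(x) = sigmoid(w*x + c). The function names, parameters, learning rate, and the stand-in "real" data are all hypothetical, and the malicious/nominal labels mentioned in the abstract are omitted for brevity.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(real_samples, steps=2000, lr=0.01, seed=0):
    """Alternate discriminator and generator updates on 1-D data.

    Hypothetical toy model (assumption, not from the paper):
      generator      G(z) = a*z + b, with z ~ N(0, 1)
      discriminator  D(x) = sigmoid(w*x + c)
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters
    for _ in range(steps):
        x_real = rng.choice(real_samples)
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b

        # Discriminator step: gradient ascent on
        # log D(x_real) + log(1 - D(x_fake)).
        s_real = sigmoid(w * x_real + c)
        s_fake = sigmoid(w * x_fake + c)
        w += lr * ((1.0 - s_real) * x_real - s_fake * x_fake)
        c += lr * ((1.0 - s_real) - s_fake)

        # Generator step: gradient ascent on log D(G(z))
        # (the common non-saturating generator objective).
        s_fake = sigmoid(w * x_fake + c)
        grad_x = (1.0 - s_fake) * w  # d/dx log D(x) at x = x_fake
        a += lr * grad_x * z
        b += lr * grad_x
    return (a, b), (w, c)


def generate(gen_params, n, seed=1):
    """Draw n synthetic samples from the trained generator."""
    a, b = gen_params
    rng = random.Random(seed)
    return [a * rng.gauss(0.0, 1.0) + b for _ in range(n)]


if __name__ == "__main__":
    # Stand-in for one real-world numeric feature (assumed for illustration).
    data_rng = random.Random(42)
    real = [data_rng.gauss(4.0, 1.0) for _ in range(500)]
    gen, disc = train_toy_gan(real)
    print(generate(gen, 5))
```

A model this small only shows the alternating update structure; realistic cybersecurity records would need neural networks for both players and some handling of categorical and labeled fields.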

Document ID

20220001546

Acquisition Source

Jet Propulsion Laboratory

Document Type

Preprint (Draft being sent to journal)

External Source(s)

hdl:2014/53255

Date Acquired

November 16, 2020

Publication Date

November 16, 2020

Publication Information

Publisher: Pasadena, CA: Jet Propulsion Laboratory, National Aeronautics and Space Administration, 2020

Distribution Limits

Public

Other

Technical Review
