Award Abstract # 2040675
NSF Convergence Accelerator Track - Track D - AI-Enabled, Privacy-Preserving Information Sharing for Securing Network Infrastructure

NSF Org: ITE
Innovation and Technology Ecosystems
Recipient: CARNEGIE MELLON UNIVERSITY
Initial Amendment Date: September 10, 2020
Latest Amendment Date: October 14, 2020
Award Number: 2040675
Award Instrument: Standard Grant
Program Manager: Mike Pozmantier
ITE Innovation and Technology Ecosystems
TIP Directorate for Technology, Innovation, and Partnerships
Start Date: September 15, 2020
End Date: May 31, 2022 (Estimated)
Total Intended Award Amount: $968,013.00
Total Awarded Amount to Date: $968,013.00
Funds Obligated to Date: FY 2020 = $968,013.00
History of Investigator:
  • Giulia Fanti (Principal Investigator)
    gfanti@andrew.cmu.edu
  • Michael Reiter (Co-Principal Investigator)
  • Nicholas Feamster (Co-Principal Investigator)
  • Vyas Sekar (Co-Principal Investigator)
  • Lior Strahilevitz (Co-Principal Investigator)
Recipient Sponsored Research Office: Carnegie-Mellon University
5000 FORBES AVE
PITTSBURGH
PA  US  15213-3890
(412)268-8746
Sponsor Congressional District: 12
Primary Place of Performance: Carnegie Mellon University
5000 Forbes Ave
Pittsburgh
PA  US  15213-3815
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): U3NKNFLNQ613
Parent UEI: U3NKNFLNQ613
NSF Program(s): Convergence Accelerator Resrch
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):
Program Element Code(s): 131Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.084

ABSTRACT

The NSF Convergence Accelerator supports use-inspired, team-based, multidisciplinary efforts that address challenges of national importance and will produce deliverables of value to society in the near future. Cyber attacks on enterprise networks pose a tremendous threat to business operations today. Defending networks as both threats and normal user traffic continually change is time-consuming and labor-intensive. To address this challenge, there is an ongoing effort across many sectors to adopt artificial intelligence and machine learning (AI/ML) models to automate security incident detection and response. In practice, however, there are two roadblocks to AI/ML-enabled workflows: (1) lack of sufficient data to train a reliable model to detect new attack campaigns or model normal behaviors; (2) lack of confidence in model outputs over a short timeframe, inducing undesirable tradeoffs between false positives (i.e., blocking legitimate users) and false negatives (i.e., missing attacks). Ideally, sharing data would help address both of these problems. However, this information is rarely shared (if at all) due to concerns about consumer or business privacy, and what is shared is in many cases anonymized in such a way that the data loses its value. This project will create new capabilities for sharing detailed yet privacy-preserving information about security incidents that will substantially alter the data-sharing pipeline, both within and across organizations, and accelerate the industry transition to AI-driven security workflows. Having better AI-driven cybersecurity tools will have an enormous impact in protecting critical infrastructure and networks across all sectors from cyber attacks.

This project will take an interdisciplinary approach spanning AI/ML, security, privacy, networked systems, law, and policy. It will tackle the fundamental tradeoffs among privacy, utility, and efficiency along three key thrusts: (1) Design and implement novel generative adversarial networks (GANs) by which an enterprise can model its network data to inform anomaly detection by others, and analyze the privacy implications of these GANs and their utility for use by others to detect malicious network activity. (2) Design and implement new cryptographic protocols and systems workflows for efficiently comparing hypotheses (suspicious identifiers, such as domain names, IP subnets, and program hashes) across enterprises to inform policy deployments. (3) Develop new legal and policy analyses on the implications of sharing such synthetic data, ML models, and hypotheses. By addressing these three critical areas and engaging key stakeholders, the tools developed by this project stand a high probability of gaining adoption and having tremendous value to the country by improving cybersecurity.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Huster, Todd and Cohen, Jeremy and Lin, Zinan and Chan, Kevin and Kamhoua, Charles and Leslie, Nandi and Chiang, Cho-Yu Jason and Sekar, Vyas "Pareto GAN: Extending the Representational Power of GANs to Heavy-Tailed Distributions" International Conference on Machine Learning, 2021
Lin, Zinan and Jain, Alankar and Wang, Chen and Fanti, Giulia and Sekar, Vyas "Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions" ACM Internet Measurement Conference, 2020 https://doi.org/10.1145/3419394.3423643
Lin, Zinan and Liang, Hao and Fanti, Giulia and Sekar, Vyas "RareGAN: Generating Samples for Rare Classes" AAAI, 2022
Lin, Zinan and Sekar, Vyas and Fanti, Giulia "On the Privacy Properties of GAN-generated Samples" 24th International Conference on Artificial Intelligence and Statistics, 2021
Severini, Joseph and Mysore, Radhika Niranjan and Sekar, Vyas and Banerjee, Sujata and Reiter, Michael K. "The Netivus Manifesto: making collaborative network management easier for the rest of us" ACM SIGCOMM Computer Communication Review, 2021
Wang, Ke Coby and Reiter, Michael K. "Using Amnesia to Detect Credential Database Breaches" USENIX Security Symposium, 2021
Wang, Ke Coby and Reiter, Michael K. "Detecting stuffing of a user's credentials at her own accounts" USENIX Security Symposium, 2020

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Our goal for this project was to implement new infrastructure within existing Information Sharing and Analysis Centers (ISACs), which are nonprofit organizations that facilitate the sharing of information regarding cybersecurity threats between enterprises. Our initial goal was to build software that could be integrated with ISAC infrastructure to increase the amount and flexibility of data that is shared within ISACs. Specifically, we wanted to facilitate the sharing of network traffic to detect emergent threats. This software would enable ISAC members to upload and share synthetic data (e.g., raw network traces) from their local data stores, which other members could use to train detection models. Additionally, our software would allow organizations to compare encrypted policies or alerts to identify similar policies across organizations.

Over the course of this grant, our focus shifted as a result of the PI curriculum, which involved meeting with potential customers (including ISAC operators and members). Although our technological goals remained the same, we turned our focus to data sharing in enterprise settings not mediated by ISACs. The project produced three major outcomes.


1) Synthetic data: We have developed new technologies for enabling data sharing across organizations via “synthetic data”, or data that is generated to mimic real data, without directly copying it. Our approach in this project was to generate synthetic data from deep generative models, which have recently gained traction in the machine learning community for their ability to model complex distributions, such as images. We have designed new architectures to model network traffic with deep generative models and demonstrated the efficacy of such approaches on various types of network traces, including traffic measurements, resource usage traces, and packet captures. Our work has resulted in academic publications in the networking, security, and machine learning domains. It has also resulted in an open source project (DataFuel), which allows users to generate synthetic data models on their own datasets. 


2) Network traffic analysis: Data representation plays a critical role in the performance of novelty detection (or “anomaly detection”) methods in machine learning. The data representation of network traffic often determines the effectiveness of these models as much as the model itself. Over the course of this project, we have developed a systematic framework, open-source toolkit, and public Python library that make it both possible and easy to extract and generate features from network traffic and to perform an end-to-end evaluation of these representations across the most prevalent modern novelty detection models. First, we developed and publicly released an open-source tool, an accompanying Python library (NetML), and an end-to-end pipeline for novelty detection in network traffic. Second, we applied this tool to five different novelty detection problems in networking, across a range of scenarios from attack detection to novel device detection. Our findings offer general insights and guidelines concerning which features appear to be more appropriate for particular situations.
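
To illustrate why the choice of representation matters, the following is a minimal sketch of turning flows into feature vectors and scoring new flows against a profile of normal traffic. It does not use the NetML API and does not reproduce the models evaluated in the project; the function names, the three features, and the z-score detector are all assumptions chosen for brevity.

```python
from statistics import mean, pstdev


def flow_features(packet_sizes, timestamps):
    """Summarize one flow as [mean packet size, packet count, duration].

    This feature set is a hypothetical choice for illustration only.
    """
    duration = timestamps[-1] - timestamps[0]
    return [mean(packet_sizes), float(len(packet_sizes)), duration]


def fit(normal_features):
    """Compute the per-feature mean and standard deviation of normal flows."""
    columns = list(zip(*normal_features))
    # A feature that never varies gets a deviation of 1.0 to avoid dividing by zero.
    return [(mean(c), pstdev(c) or 1.0) for c in columns]


def novelty_score(model, features):
    """Return the largest absolute z-score across all features."""
    return max(abs(x - mu) / sigma for (mu, sigma), x in zip(model, features))


# Made-up flows: (packet sizes in bytes, packet timestamps in seconds).
normal_flows = [
    ([100, 120, 110], [0.0, 0.1, 0.2]),
    ([90, 130, 105], [0.0, 0.1, 0.3]),
    ([110, 115, 100], [0.0, 0.2, 0.3]),
]
model = fit([flow_features(sizes, times) for sizes, times in normal_flows])

benign_like = flow_features([105, 115, 110], [0.0, 0.1, 0.25])
suspicious = flow_features([1500] * 20, [i * 0.01 for i in range(20)])

print(novelty_score(model, benign_like))
print(novelty_score(model, suspicious))
```

With a different feature set (for example, dropping packet size), the same detector would rank these flows differently, which is the effect the evaluation framework is meant to measure.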


3) Encrypted data comparison techniques: To enable organizations to compare encrypted policies or alerts to identify similar policies across organizations, we developed novel generalizations of private set intersection (PSI) protocols, which generally allow two mutually untrusting parties to compute an intersection of their sets without revealing information about items that are not in the intersection. In particular, we introduced a PSI variant called distance-aware PSI (DA-PSI) for sets whose elements lie in a metric space. DA-PSI returns pairs of items that are within a specified distance threshold of each other. We showed how this new construct can be used to compare lists of suspicious IP addresses when an entire subnetwork of IP addresses is malicious. In such cases, different organizations may not see the exact same suspicious IP addresses, but their lists of suspicious IP addresses may fall in the same range, or “subnetwork”. Existing PSI tools fail here; we showed that our relaxed construction succeeds on real honeypot data.
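
The output that DA-PSI is meant to compute can be sketched in plaintext as follows. This is only an illustration of the intended result, with no cryptography and no privacy protection; it is not the project's protocol. The distance metric (absolute difference of IPv4 addresses viewed as integers), the threshold, the function names, and the example addresses are all assumptions.

```python
import ipaddress


def ip_distance(a: str, b: str) -> int:
    """Absolute difference between two IPv4 addresses viewed as integers."""
    return abs(int(ipaddress.IPv4Address(a)) - int(ipaddress.IPv4Address(b)))


def distance_aware_intersection(set_a, set_b, threshold):
    """Return sorted pairs (a, b) whose distance is at most the threshold.

    Plaintext reference only: a real DA-PSI protocol would compute this
    without either party revealing its non-matching items.
    """
    return sorted(
        (a, b)
        for a in set_a
        for b in set_b
        if ip_distance(a, b) <= threshold
    )


# Two organizations observe different addresses from the same range.
org_a = {"203.0.113.7", "198.51.100.20"}
org_b = {"203.0.113.42", "192.0.2.99"}

exact = org_a & org_b  # ordinary set intersection finds nothing
near = distance_aware_intersection(org_a, org_b, threshold=255)

print(exact)  # set()
print(near)   # [('203.0.113.7', '203.0.113.42')]
```

The exact intersection is empty because no address appears in both lists, while the distance-aware version pairs the two addresses that lie close together.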


Last Modified: 09/12/2022
Modified by: Giulia Fanti
