
NSF Org: |
ITE Innovation and Technology Ecosystems |
Recipient: |
|
Initial Amendment Date: | September 10, 2020 |
Latest Amendment Date: | October 14, 2020 |
Award Number: | 2040675 |
Award Instrument: | Standard Grant |
Program Manager: |
Mike Pozmantier
ITE Innovation and Technology Ecosystems TIP Directorate for Technology, Innovation, and Partnerships |
Start Date: | September 15, 2020 |
End Date: | May 31, 2022 (Estimated) |
Total Intended Award Amount: | $968,013.00 |
Total Awarded Amount to Date: | $968,013.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
5000 FORBES AVE PITTSBURGH PA US 15213-3890 (412)268-8746 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
5000 Forbes Ave Pittsburgh PA US 15213-3815 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Convergence Accelerator Resrch |
Primary Program Source: |
|
Program Reference Code(s): | |
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.084 |
ABSTRACT
The NSF Convergence Accelerator supports use-inspired, team-based, multidisciplinary efforts that address challenges of national importance and will produce deliverables of value to society in the near future. Cyber attacks on enterprise networks pose a tremendous threat to business operations today. Keeping up with the ever-changing landscape of threats and normal user traffic is time-consuming and labor-intensive. To address this challenge, there is an ongoing effort across many sectors to adopt artificial intelligence and machine learning (AI/ML) models to automate security incident detection and response. In practice, however, there are two roadblocks to AI/ML-enabled workflows: (1) lack of sufficient data to train a reliable model to detect new attack campaigns or model normal behaviors; (2) lack of confidence in model outputs over a short timeframe, inducing undesirable tradeoffs between false positives (i.e., blocking legitimate users) and false negatives (i.e., missing attacks). Ideally, sharing data would help address both of these problems. However, this information is rarely shared (if at all) due to concerns about consumer or business privacy, and what is shared is in many cases anonymized in such a way that the data loses its value. This project will create new capabilities for sharing detailed yet privacy-preserving information about security incidents that will substantially alter the data-sharing pipeline, both within and across organizations, and accelerate the industry transition to AI-driven security workflows. Having better AI-driven cybersecurity tools will have an enormous impact in protecting critical infrastructure and networks across all sectors from cyber attacks.
This project will take an interdisciplinary approach spanning AI/ML, security, privacy, networked systems, law, and policy. It will tackle the fundamental tradeoffs among privacy, utility, and efficiency along three key thrusts: (1) Design and implement novel generative adversarial networks (GANs) by which an enterprise can model its network data to inform anomaly detection by others, and analyze the privacy implications of these GANs and their utility for use by others to detect malicious network activity. (2) Design and implement new cryptographic protocols and systems workflows for efficiently comparing hypotheses (suspicious identifiers, such as domain names, IP subnets, and program hashes) across enterprises to inform policy deployments. (3) Develop new legal and policy analyses on the implications of sharing such synthetic data, ML models, and hypotheses. By addressing these three critical areas and engaging key stakeholders, the tools developed by this project stand a high probability of gaining adoption and having tremendous value to the country by improving cybersecurity.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Our goal for this project was to implement new infrastructure within existing Information Sharing and Analysis Centers (ISACs), which are nonprofit organizations that facilitate the sharing of information regarding cybersecurity threats between enterprises. We initially set out to build software that could be integrated with ISAC infrastructure to increase the amount and flexibility of data that is shared within ISACs. Specifically, we wanted to facilitate the sharing of network traffic to detect emergent threats. This software would enable ISAC members to upload and share synthetic data (e.g., raw network traces) from their local data stores, which other members could use to train detection models. Additionally, our software would allow organizations to compare encrypted policies or alerts to identify similar policies across organizations.
Over the course of this grant, our focus shifted as a result of the PI curriculum, which involved meeting with potential customers (including ISAC operators and members). Although our technological goals remained the same, we turned our focus to data sharing in enterprise settings not mediated by ISACs. The project had three major outcomes.
1) Synthetic data: We have developed new technologies for enabling data sharing across organizations via “synthetic data”, or data that is generated to mimic real data, without directly copying it. Our approach in this project was to generate synthetic data from deep generative models, which have recently gained traction in the machine learning community for their ability to model complex distributions, such as images. We have designed new architectures to model network traffic with deep generative models and demonstrated the efficacy of such approaches on various types of network traces, including traffic measurements, resource usage traces, and packet captures. Our work has resulted in academic publications in the networking, security, and machine learning domains. It has also resulted in an open source project (DataFuel), which allows users to generate synthetic data models on their own datasets.
2) Network traffic analysis: Data representation plays a critical role in the performance of novelty detection (or “anomaly detection”) methods in machine learning. The data representation of network traffic often determines the effectiveness of these models as much as the model itself. Over the course of this project, we have developed a systematic framework, open-source toolkit, and public Python library that make it both possible and easy to extract and generate features from network traffic and to perform an end-to-end evaluation of these representations across the most prevalent modern novelty detection models. First, we developed and publicly released an open-source tool, an accompanying Python library (NetML), and an end-to-end pipeline for novelty detection in network traffic. Second, we applied this tool to five different novelty detection problems in networking, across a range of scenarios from attack detection to novel device detection. Our findings provide general insights and guidelines concerning which features appear to be more appropriate for particular situations.
3) Encrypted data comparison techniques: To enable organizations to compare encrypted policies or alerts to identify similar policies across organizations, we developed novel generalizations of private set intersection (PSI) protocols, which generally allow two mutually untrusting parties to compute an intersection of their sets, without revealing information about items that are not in the intersection. In particular, we introduced a PSI variant called distance-aware PSI (DA-PSI) for sets whose elements lie in a metric space. DA-PSI returns pairs of items that are within a specified distance threshold of each other. We showed how this new construct can be used to compare lists of suspicious IP addresses when an entire subnetwork of IP addresses is malicious. In such cases, different organizations may not see the exact same suspicious IP addresses, but their lists of suspicious IP addresses may fall in the same range, or “subnetwork”. Existing PSI tools fail in this setting because they match only identical items; we show that our relaxed construction succeeds on real honeypot data.
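To make the difference between exact matching and the distance-aware matching in outcome 3 concrete, the following Python sketch computes, in the clear, the output that a DA-PSI protocol is described as returning: pairs of items within a distance threshold of each other. It is not a cryptographic protocol and offers no privacy; it only illustrates the input/output behavior. The metric (absolute difference between IPv4 addresses viewed as integers), the function names, and the example addresses (taken from reserved documentation ranges) are assumptions made for illustration and are not taken from the project.

```python
import ipaddress
from typing import Iterable, List, Tuple


def ip_distance(a: str, b: str) -> int:
    """Assumed metric: absolute difference of two IPv4 addresses as integers."""
    return abs(int(ipaddress.IPv4Address(a)) - int(ipaddress.IPv4Address(b)))


def exact_intersection(set_a: Iterable[str], set_b: Iterable[str]) -> List[str]:
    """What ordinary PSI reveals: only items present in both sets."""
    return sorted(set(set_a) & set(set_b))


def distance_aware_intersection(
    set_a: Iterable[str], set_b: Iterable[str], threshold: int
) -> List[Tuple[str, str]]:
    """Plaintext stand-in for the DA-PSI output: all cross-set pairs
    whose distance is at most `threshold`."""
    list_b = list(set_b)
    return sorted(
        (a, b) for a in set_a for b in list_b if ip_distance(a, b) <= threshold
    )


# Hypothetical suspicious-IP lists from two organizations.
org_a = ["203.0.113.5", "198.51.100.20"]
org_b = ["203.0.113.77", "192.0.2.9"]

print(exact_intersection(org_a, org_b))
print(distance_aware_intersection(org_a, org_b, threshold=255))
```

In this example the two organizations share no identical address, so exact matching returns nothing, while the thresholded comparison pairs the two addresses that lie close together in the same range.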
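The feature-extraction and novelty-detection pipeline in outcome 2 can be sketched in a similarly minimal way. The Python below does not use the NetML library or reproduce its API. The packet representation, the three flow features, and the z-score detector are all assumptions chosen for illustration, and the z-score rule is a simple stand-in for the novelty detection models the report refers to.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Packet = Tuple[float, int]  # assumed format: (timestamp in seconds, size in bytes)


def flow_features(packets: Sequence[Packet]) -> List[float]:
    """Represent one flow as [packet count, mean packet size, duration]."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    return [float(len(packets)), float(mean(sizes)), max(times) - min(times)]


def fit_baseline(rows: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Per-feature (mean, std) over normal flows; a zero std is replaced by 1.0."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def novelty_score(features: Sequence[float], baseline) -> float:
    """Largest absolute z-score across features; higher means more unusual."""
    return max(abs(x - mu) / sigma for x, (mu, sigma) in zip(features, baseline))


# Hypothetical "normal" training flows.
normal_flows = [
    [(0.0, 100), (0.5, 120), (1.0, 110)],
    [(0.0, 90), (0.4, 130), (0.9, 100)],
    [(0.0, 105), (0.6, 115), (1.1, 95)],
]
baseline = fit_baseline([flow_features(f) for f in normal_flows])

typical_flow = [(0.0, 100), (0.5, 115), (1.0, 108)]
burst_flow = [(i * 0.002, 1500) for i in range(50)]  # many large packets, very fast

typical_score = novelty_score(flow_features(typical_flow), baseline)
burst_score = novelty_score(flow_features(burst_flow), baseline)
print(typical_score, burst_score)
```

Swapping `flow_features` for a different representation while keeping the detector fixed is one simple way to see the point made in outcome 2, that the choice of representation can matter as much as the model.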
Last Modified: 09/12/2022
Modified by: Giulia Fanti