
NSF Org: |
CNS Division Of Computer and Network Systems |
Recipient: |
|
Initial Amendment Date: | July 19, 2018 |
Latest Amendment Date: | August 2, 2021 |
Award Number: | 1801652 |
Award Instrument: | Continuing Grant |
Program Manager: |
Sara Kiesler
skiesler@nsf.gov (703)292-8643 CNS Division Of Computer and Network Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2018 |
End Date: | August 31, 2023 (Estimated) |
Total Intended Award Amount: | $300,000.00 |
Total Awarded Amount to Date: | $300,000.00 |
Funds Obligated to Date: |
FY 2019 = $150,083.00 FY 2021 = $74,040.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
506 S WRIGHT ST URBANA IL US 61801-3620 (217)333-2187 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
IL US 61801-2302 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Secure & Trustworthy Cyberspace |
Primary Program Source: |
01001920DB NSF RESEARCH & RELATED ACTIVIT 01002021DB NSF RESEARCH & RELATED ACTIVIT 01002122DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Unlawful online business often leaves behind human-readable text traces for interacting with its targets (e.g., defrauding victims, advertising illicit products to intended customers) or coordinating among the criminals involved. Such text content is valuable for detecting various types of cybercrimes and understanding how they happen, the perpetrator's strategies, capabilities and infrastructures and even the ecosystem of the underground business. Automatic discovery and analysis of such text traces, however, are challenging, due to their deceptive content that can easily blend into legitimate communication, and the criminal's extensive use of secret languages to hide their communication, even on public platforms (such as social media and forums). The project aims at systematically studying how to automatically discover such text traces and intelligently utilize them to fight against online crime. The research outcomes will contribute to more effective and timely control of online criminal activities, and the team's collaboration with industry also enables the team to get feedback and facilitate the transformation of new techniques to practical use.
This project focuses on both criminals' communication with their targets and the underground communications among miscreants. To discover and understand illicit online activities, the research looks for any semantic inconsistency between text content and its context (such as advertisements for selling illegal drugs on an .edu domain) and for inappropriate operations being triggered (such as a malware download). Inconsistencies are captured by the Natural Language Processing (NLP) techniques customized to various security settings. Further, based upon crime-related content discovered, the project will study various machine learning techniques that support automatic extraction and analysis of threat intelligence and criminal activities. The techniques are evaluated using data collected from various sources (public datasets, underground forums and others), and the findings they make are validated through a process that involves manual labeling, communication with affected parties, and collaborations with industry partners. This work will help create in-depth knowledge about underground ecosystems and lead to more effective control of illicit operations of these online businesses.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Unlawful online business often leaves behind human-readable text traces when coordinating among the criminals involved. Such text content is valuable for detecting various types of cybercrimes and understanding how they happen, the perpetrator's strategies, capabilities and infrastructures and even the ecosystem of the underground business. Automatic discovery and analysis of such text traces, however, are challenging, due to their deceptive content that can easily blend into legitimate communication, and the criminal's extensive use of secret languages to hide their communication, even on public platforms (such as social media and forums). The project aims at systematically studying how to automatically discover such text traces and intelligently utilize them to fight against online crime.
The major outcomes of this project include the following: 1) We have published over 10 research papers presenting a wide range of new algorithms for analyzing online information, both to understand the behaviors of cybersecurity attackers and to help people acquire useful knowledge about security from online sources. 2) We have developed three innovative systems that can facilitate education and online learning. 3) Over 1,000 graduate and undergraduate students have benefited directly or indirectly from the research results of this project by working on the project or taking courses taught by the PI at the University of Illinois at Urbana-Champaign.
We highlight a few specific papers to illustrate three representative lines of our research contributions:
First, in our ECIR 2021 paper, we proposed and studied a novel method for automatically identifying and interpreting dark jargons. Dark jargons are benign-looking words that have hidden, sinister meanings and are used by participants of underground forums for illicit behavior. For example, the dark term “rat” is often used in lieu of “Remote Access Trojan”. We formalized the problem as a mapping from dark words to “clean” words with no hidden meaning and addressed it by representing dark and clean words in an interpretable form, as probability distributions over a shared vocabulary. The intuition behind our approach is to leverage the similarity between the context words of a dark term and those of the corresponding “clean” word. Our experiments show that the method is effective at dark jargon identification and is able to detect dark jargons in a real-world underground forum dataset. The proposed methods are general and do not require manual effort from humans. Thus they can be used immediately to analyze the dark terms used by security attackers on many communication platforms.
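The intuition of comparing context-word distributions can be illustrated with a minimal Python sketch. This is not the implementation from the ECIR 2021 paper: the toy corpora, the window size, the additive smoothing, the choice of Jensen-Shannon divergence as the similarity measure, and all function names are assumptions made only for illustration, and the multi-word term “Remote Access Trojan” is stood in for by the single token "trojan".

```python
import math
from collections import Counter

def context_distribution(target, sentences, vocab, window=2, smoothing=0.01):
    """Smoothed probability distribution over `vocab` of words near `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in vocab:
                    counts[tokens[j]] += 1
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions on the same vocabulary."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    def kl(a, b):
        return sum(a[w] * math.log(a[w] / b[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def interpret(dark_term, dark_corpus, clean_corpus, candidates, vocab):
    """Return the clean candidate whose contexts are closest to the dark term's."""
    dark_dist = context_distribution(dark_term, dark_corpus, vocab)
    scored = [
        (js_divergence(dark_dist, context_distribution(c, clean_corpus, vocab)), c)
        for c in candidates
    ]
    return min(scored)[1]

# Hypothetical toy data: a "dark" forum corpus and a "clean" reference corpus.
dark_corpus = [
    "install the rat on the victim machine",
    "the rat gives remote control of the machine",
]
clean_corpus = [
    "a trojan gives remote control of the victim machine",
    "install the trojan on the machine",
    "the hamster eats seeds in the cage",
    "a hamster sleeps in the cage",
]
vocab = {w for s in dark_corpus + clean_corpus for w in s.lower().split()}

print(interpret("rat", dark_corpus, clean_corpus, ["trojan", "hamster"], vocab))
```

On this toy data the dark term "rat" is mapped to "trojan" rather than "hamster", because its neighboring words (such as "install", "gives", "remote") overlap far more with the contexts of "trojan" in the clean corpus.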
Second, we have developed multiple new foundational artificial intelligence (AI) algorithms for machine learning and natural language processing, which are important general techniques for understanding the semantics of online content and are particularly useful for combating cybersecurity threats. For example, in our ICLR 2019 paper, we proposed an innovative machine learning algorithm called multi-agent dual learning, which outperformed state-of-the-art algorithms and can be used in many machine learning applications. As of the time of writing this report, the paper had attracted over 60 citations by peer researchers. Another example is our ACM SIGIR 2020 paper, where we proposed general natural language processing methods that can better handle content not encountered in the training data, thus enabling effective semantic analysis of a wider scope of content on the Web.
Third, we have developed multiple new algorithms that enable computers to better understand social media content. One example is our AAAI ICWSM 2019 paper, where we proposed a novel neural network model for text normalization. Social media contains a great deal of informal language, such as “idk” for “I don’t know” and “2morrow” for “tomorrow”. Our algorithm can learn from training examples to automatically normalize such informal language into more regular English text, thus facilitating both human comprehension and computer analysis of content.
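To make the text normalization task concrete, the following Python sketch shows the input and output of the task using a plain lookup table. It is only a toy baseline and is not the neural network model proposed in the paper, which learns the mapping from training examples; the table contents beyond the two examples above and the function name are hypothetical.

```python
import re

# Hypothetical lookup table; a learned model would replace this mapping.
NORMALIZATION_TABLE = {
    "idk": "I don't know",
    "2morrow": "tomorrow",
}

def normalize(text, table=NORMALIZATION_TABLE):
    """Replace each known informal token with its standard form."""
    def replace_token(match):
        token = match.group(0)
        return table.get(token.lower(), token)
    # Tokens are runs of letters, digits, and apostrophes; everything else is kept.
    return re.sub(r"[A-Za-z0-9']+", replace_token, text)

print(normalize("idk if I can come 2morrow"))
```

A fixed table cannot handle unseen spellings or ambiguous tokens, which is the gap a model trained on examples is meant to close.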
In addition to research contributions on new algorithms, we have also developed innovative systems for supporting education and learning in general, which can help people learn about cybersecurity and thus help prevent cybercrimes from causing harm. Two examples are a dark jargon portal (open source at https://github.com/dom-s/dark-jargon) and TextData (https://textdata.org/about).
A large number of graduate and undergraduate students have directly or indirectly benefited from this project. One PhD student has been supported as an RA on this project and finished a dissertation on the topic of this project. Multiple graduate students have worked directly on this project as RAs and acquired skills and knowledge in related topics. Moreover, over 1,000 graduate and undergraduate students have benefited from the enriched course content of an undergraduate course (CS410) and a graduate course (CS510) that the PI regularly teaches at the University of Illinois at Urbana-Champaign.
Last Modified: 04/20/2024
Modified by: Chengxiang Zhai