Award Abstract # 1801652
SaTC: CORE: Medium: Collaborative: Understanding and Discovering Illicit Online Business Through Automatic Analysis of Online Text Traces

NSF Org: CNS
Division Of Computer and Network Systems
Recipient: UNIVERSITY OF ILLINOIS
Initial Amendment Date: July 19, 2018
Latest Amendment Date: August 2, 2021
Award Number: 1801652
Award Instrument: Continuing Grant
Program Manager: Sara Kiesler
skiesler@nsf.gov
(703)292-8643
CNS Division Of Computer and Network Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2018
End Date: August 31, 2023 (Estimated)
Total Intended Award Amount: $300,000.00
Total Awarded Amount to Date: $300,000.00
Funds Obligated to Date: FY 2018 = $75,877.00
FY 2019 = $150,083.00
FY 2021 = $74,040.00
History of Investigator:
  • ChengXiang Zhai (Principal Investigator)
    czhai@illinois.edu
Recipient Sponsored Research Office: University of Illinois at Urbana-Champaign
506 S WRIGHT ST
URBANA
IL  US  61801-3620
(217)333-2187
Sponsor Congressional District: 13
Primary Place of Performance: University of Illinois at Urbana-Champaign
IL  US  61801-2302
Primary Place of Performance Congressional District: 13
Unique Entity Identifier (UEI): Y8CWNJRCNN91
Parent UEI: V2PHZ2CSCH63
NSF Program(s): Secure & Trustworthy Cyberspace
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
01001920DB NSF RESEARCH & RELATED ACTIVIT
01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 025Z, 065Z, 7434, 7924
Program Element Code(s): 806000
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Unlawful online businesses often leave behind human-readable text traces when interacting with their targets (e.g., defrauding victims, advertising illicit products to intended customers) or coordinating among the criminals involved. Such text is valuable for detecting various types of cybercrime and for understanding how they happen, the perpetrators' strategies, capabilities, and infrastructure, and even the ecosystem of the underground business. Automatic discovery and analysis of such text traces, however, is challenging because the deceptive content can easily blend into legitimate communication and because criminals make extensive use of secret languages to hide their communication, even on public platforms (such as social media and forums). The project aims to systematically study how to automatically discover such text traces and use them intelligently to fight online crime. The research outcomes will contribute to more effective and timely control of online criminal activities, and the team's collaboration with industry enables it to get feedback and to facilitate the transfer of new techniques into practical use.

This project focuses on both criminals' communication with their targets and the underground communication among miscreants. To discover and understand illicit online activities, the research looks for any semantic inconsistency between text content and its context (such as advertisements for illegal drugs on an .edu domain) and for inappropriate operations being triggered (such as a malware download). Inconsistencies are captured by Natural Language Processing (NLP) techniques customized to various security settings. Further, based upon the crime-related content discovered, the project will study various machine learning techniques that support automatic extraction and analysis of threat intelligence and criminal activities. The techniques are evaluated using data collected from various sources (public datasets, underground forums, and others), and their findings are validated through a process that involves manual labeling, communication with affected parties, and collaboration with industry partners. This work will help create in-depth knowledge about underground ecosystems and lead to more effective control of the illicit operations of these online businesses.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 15)
Bhavya, Bhavya and Boughoula, Assma and Green, Aaron and Zhai, ChengXiang "Collective Development of Large Scale Data Science Products via Modularized Assignments: An Experience Report" Proceedings of the 51st ACM Technical Symposium on Computer Science Education, 2020 https://doi.org/10.1145/3328778.3366961
Bhavya, Bhavya and Chen, Si and Zhang, Zhilin and Li, Wenting and Zhai, Chengxiang and Angrave, Lawrence and Huang, Yun "Exploring collaborative caption editing to augment video-based learning" Educational technology research and development, 2022 https://doi.org/10.1007/s11423-022-10137-5
Boughoula, Assma and San, Aidan and Zhai, ChengXiang "Leveraging Book Indexes for Automatic Extraction of Concepts in MOOCs" L@S '20: Proceedings of the Seventh ACM Conference on Learning @ Scale, 2020 https://doi.org/10.1145/3386527.3406749
Dominic Seyler, Wei Liu "Towards Dark Jargon Interpretation in Underground Forums" Proceedings of 43rd European Conference on IR Research (ECIR 2021), 2021
Kuzi, Saar and Labhishetty, Sahiti and Karmaker Santu, Shubhra Kanti and Joshi, Prasad Pradip and Zhai, ChengXiang "Analysis of Adaptive Training for Learning to Rank in Information Retrieval" Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019 https://doi.org/10.1145/3357384.3358159
Lourentzou, Ismini and Manghnani, Kabir and Zhai, ChengXiang "Adapting Sequence to Sequence Models for Text Normalization in Social Media" Proceedings of the ... International AAAI Conference on Weblogs and Social Media, v.13, 2019
Messaoud, Safa and Lourentzou, Ismini and Boughoula, Assma and Zehni, Mona and Zhao, Zhizhen and Zhai, Chengxiang and Schwing, Alexander G. "DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization" SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021 https://doi.org/10.1145/3404835.3462959
Priyanka Dey and ChengXiang Zhai "Fine Grained Categorization of Drug Usage Tweets" Proceedings of the 14th International Conference on Social Computing and Social Media: Design, User Experience and Impact (SCSM 2022), 2022 https://doi.org/10.1007/978-3-031-05061-9_19
Ros, Kevin and Jin, Matthew and Levine, Jacob and Zhai, ChengXiang "Retrieving Webpages Using Online Discussions", 2023 https://doi.org/10.1145/3578337.3605139
Ros, Kevin and Zhai, ChengXiang "The CDL: An Online Platform for Creating Community-based Digital Libraries", 2023 https://doi.org/10.1145/3584931.3607495
Seyler, Dominic and Li, Lunan and Zhai, ChengXiang "Semantic Text Analysis for Detection of Compromised Accounts on Social Networks" Proceedings of 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020 https://doi.org/10.1109/ASONAM49781.2020.9381432

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Unlawful online businesses often leave behind human-readable text traces when coordinating among the criminals involved. Such text is valuable for detecting various types of cybercrime and for understanding how they happen, the perpetrators' strategies, capabilities, and infrastructure, and even the ecosystem of the underground business. Automatic discovery and analysis of such text traces, however, is challenging because the deceptive content can easily blend into legitimate communication and because criminals make extensive use of secret languages to hide their communication, even on public platforms (such as social media and forums). The project aims to systematically study how to automatically discover such text traces and use them intelligently to fight online crime.

The major outcomes of this project include the following: 1) We have published over 10 research papers with a wide range of new algorithms for analyzing online information to understand the behaviors of cybersecurity attackers and to help people acquire useful knowledge about security from online sources. 2) We have developed three innovative systems that can facilitate education and online learning. 3) Over 1,000 graduate and undergraduate students have benefited directly or indirectly from the research results of this project by working on the project or taking courses taught by the PI at the University of Illinois at Urbana-Champaign.

We highlight a few specific papers to illustrate three representative lines of our research contributions: 

First, in our ECIR 2021 paper, we proposed and studied a novel method for automatically identifying and interpreting dark jargon. Dark jargon terms are benign-looking words that have hidden, sinister meanings and are used by participants of underground forums for illicit behavior. For example, the dark term “rat” is often used in lieu of “Remote Access Trojan”. We formalized the problem as a mapping from dark words to “clean” words with no hidden meaning and addressed it by using interpretable representations of dark and clean words in the form of probability distributions over a shared vocabulary. The intuition behind our approach is to leverage the similarity between the context words of a dark term and those of the corresponding “clean” word. Our experiments show that the method is effective at dark jargon identification and is able to detect dark jargon in a real-world underground forum dataset. The proposed methods are general and do not require manual effort from humans. Thus they can be used immediately to analyze the dark terms used by security attackers on many communication platforms.
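The intuition of comparing context-word distributions can be illustrated with a minimal sketch. It is not the method from the ECIR 2021 paper: the fixed co-occurrence window, the cosine similarity measure, the toy corpora, and the function names `context_distribution`, `cosine`, and `interpret` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def context_distribution(sentences, target, window=2):
    """Probability distribution over words seen within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Count every neighbor in the window except the target position itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse distributions over a shared vocabulary."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def interpret(dark_term, dark_corpus, clean_corpus, candidates, window=2):
    """Rank candidate clean words by how closely their contexts match the dark term's."""
    dark_dist = context_distribution(dark_corpus, dark_term, window)
    scored = [
        (cosine(dark_dist, context_distribution(clean_corpus, c, window)), c)
        for c in candidates
    ]
    return sorted(scored, reverse=True)


# Toy corpora (hypothetical; real input would be forum posts and a clean reference corpus).
dark_corpus = [
    "selling a rat to control the victim machine remotely",
    "this rat gives remote control of the infected machine",
]
clean_corpus = [
    "a trojan gives attackers remote control of the infected machine",
    "the mouse ran across the kitchen floor",
    "the trojan lets attackers control the victim machine",
]
ranking = interpret("rat", dark_corpus, clean_corpus, ["trojan", "mouse"])
print(ranking)
```

On this toy data the contexts of “rat” overlap with those of “trojan” and not with those of “mouse”, so “trojan” is ranked first. A divergence measure between distributions could be substituted for cosine similarity without changing the overall structure.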

Second, we have developed multiple new foundational artificial intelligence (AI) algorithms for machine learning and natural language processing, which are important general techniques for understanding the semantics of online content and are particularly useful for combating cybersecurity threats. For example, in our ICLR 2019 paper, we proposed an innovative machine learning algorithm called multi-agent dual learning, which outperformed state-of-the-art algorithms and can be used in many machine learning applications. As of the time of writing this report, the paper had attracted over 60 citations from peer researchers. Another example is our ACM SIGIR 2020 paper, where we proposed general natural language processing methods that can better handle content not encountered in the training data, thus enabling effective semantic analysis of a wider scope of content on the Web.

Third, we have developed multiple new algorithms that enable computers to better understand social media content. One example is our AAAI ICWSM 2019 paper, where we proposed a novel neural network model for text normalization. Social media contains many informal expressions, such as “idk” for “I don’t know” and “2morrow” for “tomorrow”. Our algorithm can learn from training examples to automatically normalize such informal language into more regular English text, thus facilitating both human comprehension and computer analysis of content.
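To make the normalization task concrete, the sketch below learns a token-level lookup table from (informal, normalized) training pairs. This is a deliberately simple baseline and not the neural sequence-to-sequence model from the paper; the function names `learn_normalizer` and `normalize` and the training pairs are hypothetical.

```python
from collections import Counter, defaultdict


def learn_normalizer(pairs):
    """Map each informal token to the normalization it is most often paired with."""
    votes = defaultdict(Counter)
    for informal, normalized in pairs:
        votes[informal.lower()][normalized] += 1
    return {tok: counter.most_common(1)[0][0] for tok, counter in votes.items()}


def normalize(text, table):
    """Replace known informal tokens; leave every other token unchanged."""
    return " ".join(table.get(tok.lower(), tok) for tok in text.split())


# Hypothetical training examples in the spirit of those mentioned above.
training_pairs = [
    ("idk", "I don't know"),
    ("2morrow", "tomorrow"),
    ("idk", "I don't know"),
]
table = learn_normalizer(training_pairs)
normalized = normalize("idk if it ships 2morrow", table)
print(normalized)
```

A lookup table cannot handle informal spellings it has never seen, which is one reason a learned sequence model is attractive for this task.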

In addition to research contributions on new algorithms, we have also developed innovative systems for supporting education and learning in general, which can help people learn about cybersecurity and thus help prevent cybercrimes from causing harm. Two examples are a dark jargon portal (open source at https://github.com/dom-s/dark-jargon) and TextData (https://textdata.org/about).

A large number of graduate and undergraduate students have directly or indirectly benefited from this project. One PhD student was supported as an RA on this project and finished a dissertation on its topic. Multiple graduate students have worked directly on this project as RAs and acquired skills and knowledge in related topics. Moreover, over 1,000 graduate and undergraduate students have benefited from the enriched course content of an undergraduate course (CS410) and a graduate course (CS510) that the PI regularly teaches at the University of Illinois at Urbana-Champaign.


Last Modified: 04/20/2024
Modified by: Chengxiang Zhai

Please report errors in award information by writing to: awardsearch@nsf.gov.
