Award Abstract # 1717862

SaTC: CORE: Small: Collaborative: Data-driven Approaches for Large-scale Security Analysis of Mobile Applications

NSF Org:	CNS Division Of Computer and Network Systems
Recipient:	UNIVERSITY OF SOUTH FLORIDA
Initial Amendment Date:	July 18, 2017
Latest Amendment Date:	July 18, 2017
Award Number:	1717862
Award Instrument:	Standard Grant
Program Manager:	Sol Greenspan CNS Division Of Computer and Network Systems CSE Directorate for Computer and Information Science and Engineering
Start Date:	August 15, 2017
End Date:	July 31, 2021 (Estimated)
Total Intended Award Amount:	$200,000.00
Total Awarded Amount to Date:	$200,000.00
Funds Obligated to Date:	FY 2017 = $200,000.00
History of Investigator:	Xinming Ou (Principal Investigator) xou@usf.edu
Recipient Sponsored Research Office:	University of South Florida 4202 E FOWLER AVE TAMPA FL US 33620-5800 (813)974-2897
Sponsor Congressional District:	15
Primary Place of Performance:	University of South Florida Tampa FL US 33612-9446
Primary Place of Performance Congressional District:	15
Unique Entity Identifier (UEI):	NKAZLXLL7Z91
Parent UEI:
NSF Program(s):	Secure & Trustworthy Cyberspace
Primary Program Source:	01001718DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	025Z, 7434, 7923
Program Element Code(s):	806000
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

This project investigates how to apply big-data analysis techniques to mobile apps for the Android platform, with the goal of accurately identifying security problems in them. A major challenge is the scale of the problem, with thousands of new apps entering the online app markets on a daily basis. Current technologies cannot keep up with the pace of the threats, and malware is regularly found both in large-scale marketplaces such as the official Google Play market and in third-party markets. The project adopts a number of advanced machine learning and data mining techniques to tackle those challenges. The large number of apps in the markets allows an automated machine learning algorithm to better capture security-related patterns and trends in the data, so that it can predict with good accuracy which apps may have security problems. Those apps merit the more in-depth and expensive analysis that usually requires significant human effort. This creates an effective triage step that addresses the scale challenge and can be used by industry to scale the security vetting of mobile apps. Artifacts produced from the research are released as open source to benefit practitioners. New courses on mobile apps and their security are developed. Undergraduate students are involved in this research. Underrepresented groups, including female students, also participate in the research. The materials developed from the research are used to further enrich cybersecurity education opportunities through the PIs' multiple outreach platforms at their institutions, so that a large student body benefits from the project.

The project designs solutions to tackle the unique challenges in applying machine learning for mobile app security analysis, most of which are due to the big data nature of the problem. A key scientific challenge faced in mobile app security analysis is the difficulty in obtaining high-quality ground truth. Many times one has to rely upon imperfect data in training and evaluation. The research experiments with a number of approaches to deal with the noise due to the imperfect labels, including semi-supervised learning algorithms, which can learn from small amounts of labeled data, or even from positive data only, together with unlabeled data. The project also explores a novel approach that uses social media information to acquire additional information to improve the ground truth and/or the prediction accuracy.
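As an illustration of the semi-supervised idea described above, the following is a minimal sketch of self-training with a nearest-centroid classifier. It is not the project's method: the classifier, the `margin` confidence rule, and all function names are assumptions chosen to keep the example small. It only shows how a few labeled samples plus unlabeled data can be combined by repeatedly pseudo-labeling the most confident unlabeled points.

```python
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [mean(column) for column in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_centroids(labeled):
    """Centroids of the benign (0) and malicious (1) samples."""
    c0 = centroid([x for x, y in labeled if y == 0])
    c1 = centroid([x for x, y in labeled if y == 1])
    return c0, c1


def self_train(labeled, unlabeled, rounds=5, margin=1.0):
    """Hypothetical self-training loop (illustrative only).

    labeled:   list of (features, label) pairs, label in {0, 1},
               with at least one sample of each class
    unlabeled: list of feature vectors
    In each round, an unlabeled point is pseudo-labeled when its squared
    distances to the two centroids differ by at least `margin`.
    Returns a predict(features) -> label function.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        c0, c1 = class_centroids(labeled)
        confident, remaining = [], []
        for x in pool:
            d0, d1 = sq_dist(x, c0), sq_dist(x, c1)
            if abs(d0 - d1) >= margin:
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = remaining
    c0, c1 = class_centroids(labeled)
    return lambda x: 0 if sq_dist(x, c0) <= sq_dist(x, c1) else 1


# Toy data: two labeled apps and four unlabeled ones (made-up features).
seed = [([0.0, 0.0], 0), ([4.0, 4.0], 1)]
extra = [[0.5, 0.2], [3.8, 4.1], [1.0, 0.5], [3.0, 3.5]]
predict = self_train(seed, extra)
```

A real system would replace the nearest-centroid rule with a stronger classifier and would need to guard against pseudo-label errors reinforcing themselves, which is the noise problem the paragraph above describes.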

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Chaulagain, Dewan and Poudel, Prabesh and Pathak, Prabesh and Roy, Sankardas and Caragea, Doina and Liu, Guojun and Ou, Xinming "Hybrid Analysis of Android Apps for Security Vetting using Deep Learning" 2020 IEEE Conference on Communications and Network Security (CNS), 2020

Li, Yuping and Caragea, Doina and Hall, Lawrence and Ou, Xinming "Experimental Study of Machine Learning based Malware Detection Systems Practical Utility" HICSS Symposium on Cybersecurity Big Data Analytics, 2020

Li, Yuping and Jang, Jiyong and Hu, Xin and Ou, Xinming "Android Malware Clustering Through Malicious Payload Mining" International Symposium on Research in Attacks, Intrusions, and Defenses, 2017

Wei, Fengguo and "Amandroid: A Precise and General Inter-component Data Flow Analysis Framework for Security Vetting of Android Apps" ACM Transactions on Information and System Security, v.21, 2018

Wei, Fengguo and Lin, Xingwei and Ou, Xinming and Chen, Ting and Zhang, Xiaosong "JN-SAF: Precise and Efficient NDK/JNI-aware Inter-language Static Analysis Framework for Security Vetting of Android Applications with Native Code" Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018 10.1145/3243734.3243835

Yu, Xiaodong and Wei, Fengguo and Ou, Xinming and Becchi, Michela and Bicer, Tekin and Yao, Danfeng Daphne "GPU-Based Static Data-Flow Analysis for Fast and Scalable Android App Vetting" 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), v.1, 2020 10.1109/IPDPS47924.2020.00037

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project has investigated how to apply big-data analysis techniques to mobile apps for the Android platform, with the goal of accurately identifying security problems in them. A major challenge is the scale of the problem, with thousands of new apps entering the online app markets on a daily basis. Current technologies cannot keep up with the pace of the threats, and malware is regularly found both in large-scale marketplaces such as the official Google Play market and in third-party markets. The project has adopted advanced machine learning and deep learning techniques to tackle those challenges, as highlighted below.

We have studied machine learning algorithms on the task of detecting zero-day Android malware from noisy ground truth. We have performed extensive practical experimentation with different types of machine learning approaches, including traditional machine learning, transductive learning and modern recurrent neural networks. Experimental results have shown that all types of approaches can achieve verifiable zero-day malware detection, even when trained with a noisy ground truth dataset.

We have designed deep learning models based on a hybrid analysis technique, which combines the complementary strengths of the static and dynamic analysis paradigms to attain better accuracy. Using lightweight static and dynamic analysis procedures, we obtained multiple artifacts and used them to train deep learning models for identifying Android malware: long short-term memory (LSTM) networks, bidirectional LSTM (BiLSTM) networks, and attention-based BiLSTM networks. Experimental results have shown that the best results are obtained with an attention-based BiLSTM model that uses hybrid static and dynamic artifacts. Furthermore, a robustness analysis has indicated that the best model is fairly robust against imbalanced data and is scalable. In another work, we have shown how the attention mechanism can be used to find the API calls that are predictive of the maliciousness of Android apps. In turn, the information about predictive API calls can be used to improve the interpretability of the deep learning models.
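To make the interpretability point concrete, here is a minimal sketch of dot-product attention over a sequence of per-call hidden vectors, written in plain Python. It is an assumption-laden illustration, not the project's model: in a real attention-based BiLSTM the hidden vectors would come from the recurrent layers and the query vector would be learned, whereas here `hidden`, `query`, and the helper names are made up. The API call names are ordinary Android method names used only as labels.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Dot-product attention: one weight per sequence position.

    Returns (weights, context), where context is the weighted sum of the
    hidden states that a classifier layer would consume.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(len(query))]
    return weights, context


def top_calls(api_calls, weights, k=2):
    """API calls ranked by attention weight, highest first."""
    ranked = sorted(zip(api_calls, weights), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical inputs: stand-ins for BiLSTM outputs and a learned query.
api_calls = ["getDeviceId", "openConnection", "setText", "sendTextMessage"]
hidden = [[2.0, 0.1], [0.5, 0.3], [-1.0, 0.2], [3.0, 0.0]]
query = [1.0, 0.0]

weights, context = attention_pool(hidden, query)
most_attended = top_calls(api_calls, weights)
```

Reading off the highest-weighted positions, as `top_calls` does, is one simple way to surface which calls a model attended to for a given app.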

We have designed a novel deep learning model based on capsule graph neural networks (CapsGNN). Our model makes use of more precise program semantics through an Android app's inter-procedural control flow graphs (ICFGs), instead of the code sequences found in the app's code. To perform a thorough comparison between our CapsGNN model and a traditional machine learning approach with features engineered in prior work, we created a market-scale dataset by collecting about 240K apps from 2017 to 2020 from a number of Android app stores through AndroZoo. This dataset reflects the distribution of malicious apps in the real world. We labeled the apps in the dataset following best practices from the literature, using the latest scan results from VirusTotal. We organized the evaluation data by quarters; models trained on one quarter are tested on the data in the following quarter. This ensures that the evaluation results correspond to a realistic usage pattern of the models. Our experimental results showed that the deep learning and machine learning models generally perform similarly in terms of precision, recall, and F1 on our train/test subsets. The deep learning models, despite needing significantly more computational resources, do not appear to provide an obvious advantage over traditional machine learning when no changes occur in the Android development environment. However, the deep learning models perform better than the machine learning models when changes in the Android development environment are observed, likely because deep learning can extract features on the fly and can thus handle changes in data distribution.
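The quarter-by-quarter protocol and the reported metrics can be sketched as follows. This is a minimal illustration under stated assumptions: the quarter labels (strings such as "2017Q1") and both function names are hypothetical, and no model is trained here. The code only shows how consecutive quarters pair up into train/test splits and how precision, recall, and F1 are computed from binary predictions.

```python
def quarter_pairs(quarters):
    """Pair each quarter with the next one as (train_quarter, test_quarter).

    Assumes labels such as "2017Q1" that sort chronologically as strings.
    """
    ordered = sorted(quarters)
    return list(zip(ordered, ordered[1:]))


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Example: three quarters yield two train/test splits.
splits = quarter_pairs(["2017Q2", "2017Q1", "2017Q3"])

# Example: one true positive, one false positive, one false negative.
metrics = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

Testing only on the quarter after the training quarter keeps future apps out of the training data, which is what makes the split reflect how a deployed detector would be used.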

We will share the dataset with the research community (specifically, each app's SHA256 hash value and the URL of the VirusTotal report from which the app's label is determined). This dataset will enable other researchers to verify and reproduce our experimental results, and to compare against them in further research in this area. Other outcomes of this project have been shared through public code repositories and disseminated as peer-reviewed conference or journal articles.
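Because the dataset identifies apps by SHA256 hash, a researcher holding an APK file can compute the same hash locally to match it against a published list. The sketch below uses Python's standard `hashlib`; the function name and chunk size are illustrative choices, and the format of the shared dataset itself is not specified in this report.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex-encoded SHA256 digest of a file, read in chunks.

    Chunked reading keeps memory use flat for large APK files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned lowercase hex string can then be compared with the hash values in a shared list to confirm that two parties are analyzing the same app.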

Components of this research have been integrated in courses taught by the PIs at their respective institutions. The research involved the training of two PhD students, six MS students and four undergraduate students, including one female PhD student, two female MS students, and two female undergraduate students. These students participated in dataset construction, data analysis and preprocessing, in the design and implementation of algorithms for Android malware identification, and in paper/thesis/report writing. Thus, all students have gained experience and skills in research, in collaboration, and in written and verbal technical communication.

Last Modified: 12/25/2021
Modified by: Xinming Ou
