
NSF Org: | CNS Division Of Computer and Network Systems |
Recipient: |
|
Initial Amendment Date: | July 18, 2017 |
Latest Amendment Date: | July 18, 2017 |
Award Number: | 1717862 |
Award Instrument: | Standard Grant |
Program Manager: | Sol Greenspan, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 15, 2017 |
End Date: | July 31, 2021 (Estimated) |
Total Intended Award Amount: | $200,000.00 |
Total Awarded Amount to Date: | $200,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 4202 E FOWLER AVE TAMPA FL US 33620-5800 (813)974-2897 |
Sponsor Congressional District: |
|
Primary Place of Performance: | Tampa FL US 33612-9446 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Secure & Trustworthy Cyberspace |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This project investigates how to apply big-data analysis techniques to mobile apps for the Android platform in order to accurately identify security problems in them. A major challenge is the scale of the problem, with thousands of new apps entering the online app markets on a daily basis. Current technologies cannot keep up with the pace of the threats, and malware is regularly found both in large-scale marketplaces such as the official Google Play market and in third-party markets. The project adopts a number of advanced machine learning and data mining techniques to tackle those challenges. The large number of apps in the markets allows an automated machine learning algorithm to better capture security-related patterns and trends in the data, so that it can predict with good accuracy which apps may have security problems. Those apps then warrant the more in-depth and expensive analysis that usually requires significant human effort. This creates an effective triage process for dealing with the scale challenge, and it can be used by industry to scale the security vetting of mobile apps. Artifacts produced from the research are released as open source to benefit practitioners. New courses on mobile apps and their security are developed. Undergraduate students are involved in this research. Students from underrepresented groups, including female students, also participate in the research. The materials developed from the research are used to further enrich cybersecurity education opportunities through the PIs' outreach platforms at their institutions, so that a large student body can benefit from the project.
The project designs solutions to tackle the unique challenges in applying machine learning to mobile app security analysis, most of which are due to the big-data nature of the problem. A key scientific challenge in mobile app security analysis is the difficulty of obtaining high-quality ground truth. Often one has to rely on imperfect data in training and evaluation. The research experiments with a number of approaches to deal with the noise caused by imperfect labels, including semi-supervised learning algorithms, which can learn from small amounts of labeled data, or even from positive data only, together with unlabeled data. The project also explores a novel approach that uses social media to acquire additional information that can improve the ground truth and/or the prediction accuracy.
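As a rough illustration of learning from a small labeled set together with unlabeled data, the sketch below implements self-training (pseudo-labeling) with a nearest-centroid classifier. It is not the project's algorithm: the classifier, the `margin` confidence rule, and all function names are assumptions chosen to keep the example small, and the positive-only variant mentioned above is not shown.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Self-training with a nearest-centroid classifier (illustrative only).

    labeled:   list of (feature_vector, label) pairs, label 0 = benign, 1 = malicious;
               both classes must be present
    unlabeled: list of feature vectors
    margin:    hypothetical confidence rule: a pseudo-label is accepted only if
               the two centroid distances differ by at least this much
    Returns (benign_centroid, malicious_centroid).
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        c0 = centroid([x for x, y in labeled if y == 0])
        c1 = centroid([x for x, y in labeled if y == 1])
        confident, rest = [], []
        for x in pool:
            d0, d1 = dist(x, c0), dist(x, c1)
            if abs(d0 - d1) >= margin:
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new was pseudo-labeled, so stop early
        labeled.extend(confident)
        pool = rest
    c0 = centroid([x for x, y in labeled if y == 0])
    c1 = centroid([x for x, y in labeled if y == 1])
    return c0, c1


def predict(x, c0, c1):
    """Assign the label of the nearer centroid."""
    return 0 if dist(x, c0) <= dist(x, c1) else 1
```

In practice the feature vectors would come from app analysis, and a noisy label source would likely call for a stricter acceptance rule than a fixed distance margin.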
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project has investigated how to apply big-data analysis techniques to mobile apps for the Android platform in order to accurately identify security problems in them. A major challenge is the scale of the problem, with thousands of new apps entering the online app markets on a daily basis. Current technologies cannot keep up with the pace of the threats, and malware is regularly found both in large-scale marketplaces such as the official Google Play market and in third-party markets. The project has adopted advanced machine learning and deep learning techniques to tackle those challenges, as highlighted below.
We have studied machine learning algorithms on the task of detecting zero-day Android malware from noisy ground truth. We have performed extensive practical experimentation with different types of machine learning approaches, including traditional machine learning, transductive learning and modern recurrent neural networks. Experimental results have shown that all types of approaches can achieve verifiable zero-day malware detection, even when trained with a noisy ground truth dataset.
We have designed deep learning models based on a hybrid analysis technique, which combines the complementary strengths of the static and dynamic analysis paradigms to attain better accuracy. Using lightweight static and dynamic analysis procedures, we obtained multiple artifacts and used them to train deep learning models for identifying Android malware, specifically long short-term memory (LSTM) networks, bidirectional LSTM (BiLSTM) networks, and attention-based BiLSTM networks. Experimental results have shown that the best results are obtained with an attention-based BiLSTM model that uses hybrid static and dynamic artifacts. Furthermore, a robustness analysis has indicated that the best model is fairly robust against imbalanced data and is scalable. In another work, we have shown how the attention mechanism can be used to find the API calls that are predictive of the maliciousness of Android apps. In turn, the information about predictive API calls can be used to improve the interpretability of the deep learning models.
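To make the interpretability idea concrete, the following sketch shows how attention weights over a sequence of API calls could be computed and used to rank the calls. It uses plain dot-product attention against a context vector; the hidden states, the context vector, and the function names are illustrative assumptions, and the BiLSTM that would produce the hidden states in a real model is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(hidden_states, context):
    """Dot-product attention: score each time step against a context vector.

    hidden_states: one vector per API call in the sequence (in a real model
                   these would be the recurrent network's outputs)
    context:       a learned query vector (here simply supplied by the caller)
    """
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    return softmax(scores)


def top_api_calls(api_calls, weights, k=3):
    """Return the k (api_call, weight) pairs with the largest attention weight."""
    ranked = sorted(zip(api_calls, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up numbers for illustration; they are not results from the project.
calls = ["onCreate", "sendTextMessage", "setText", "getDeviceId"]
states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1], [1.0, 0.5]]
weights = attention_weights(states, context=[1.0, 1.0])
most_attended = top_api_calls(calls, weights, k=2)
```

The weights sum to one, so each can be read as the share of the model's attention that a given API call receives for one app.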
We have designed a novel deep learning model based on capsule graph neural networks (CapsGNN). Our model makes use of more precise program semantics through an Android app's inter-procedural control flow graphs (ICFGs), instead of the code sequences found in the app's code. To perform a thorough comparison between our CapsGNN model and a traditional machine learning approach with features engineered in prior work, we created a market-scale dataset by collecting about 240K apps from 2017 to 2020 from a number of Android app stores through AndroZoo. This dataset reflects the distribution of malicious apps in the real world. We labeled the apps in the dataset following best practices from the literature, using the latest scan results from VirusTotal. We organized the evaluation data by quarter; models trained on one quarter are tested on the data from the following quarter. This ensures that the evaluation results correspond to a realistic usage pattern for the models. Our experimental results showed that the deep learning and machine learning models generally have similar performance in terms of precision, recall, and F1 on our train/test subsets. The deep learning models, despite needing significantly more computational resources, do not appear to provide an obvious advantage over traditional machine learning when no changes occur in the Android development environment. However, the deep learning models perform better than the machine learning models when changes in the Android development environment are observed, likely because deep learning can extract features on the fly and can thus handle changes in data distribution.
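The quarter-by-quarter evaluation protocol described above can be sketched as follows. The record layout `(app_id, date, label)` and the choice of which date places an app in a quarter are assumptions made for illustration; the report does not specify them. The helper at the end computes the precision, recall, and F1 metrics named in the text.

```python
from collections import defaultdict
from datetime import date


def quarter_of(d):
    """Map a date to a (year, quarter) key, e.g. 2017-05-03 -> (2017, 2)."""
    return (d.year, (d.month - 1) // 3 + 1)


def next_quarter(q):
    """Return the (year, quarter) key that follows q."""
    year, n = q
    return (year + 1, 1) if n == 4 else (year, n + 1)


def quarterly_splits(apps):
    """Yield (train_quarter, train_apps, test_apps) for consecutive quarters.

    apps: iterable of (app_id, date, label) tuples (hypothetical layout).
    A quarter is used for training only if the following quarter has data.
    """
    by_quarter = defaultdict(list)
    for app in apps:
        by_quarter[quarter_of(app[1])].append(app)
    for q in sorted(by_quarter):
        nxt = next_quarter(q)
        if nxt in by_quarter:
            yield q, by_quarter[q], by_quarter[nxt]


def precision_recall_f1(y_true, y_pred):
    """Binary metrics with label 1 (malicious) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Splitting by time rather than at random keeps future apps out of the training set, which is what makes the reported numbers comparable to how a deployed model would actually be used.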
We will share the dataset with the research community (specifically, each app's SHA256 hash value and the URL of the VirusTotal report on which the app's label is based). This dataset will enable other researchers to verify and reproduce our experimental results, and to compare against our results in further research in this area. Other outcomes of this project have been shared through public code repositories and disseminated as peer-reviewed conference or journal articles.
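Because the dataset identifies apps by SHA256 hash, a researcher who holds an APK file can check whether it corresponds to a dataset entry by hashing the file. The sketch below uses Python's standard `hashlib`; the function name and chunk size are illustrative choices, and the format of the released dataset file is not assumed here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large APKs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hash strings are sometimes published in uppercase, so normalizing case before comparing is a sensible precaution.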
Components of this research have been integrated in courses taught by the PIs at their respective institutions. The research involved the training of two PhD students, six MS students and four undergraduate students, including one female PhD student, two female MS students, and two female undergraduate students. These students participated in dataset construction, data analysis and preprocessing, in the design and implementation of algorithms for Android malware identification, and in paper/thesis/report writing. Thus, all students have gained experience and skills in research, in collaboration, and in written and verbal technical communication.
Last Modified: 12/25/2021
Modified by: Xinming Ou