
NSF Org: | OAC Office of Advanced Cyberinfrastructure (OAC) |
Recipient: | |
Initial Amendment Date: | May 23, 2018 |
Latest Amendment Date: | May 23, 2018 |
Award Number: | 1755464 |
Award Instrument: | Standard Grant |
Program Manager: | Juan Li jjli@nsf.gov (703)292-2625 OAC Office of Advanced Cyberinfrastructure (OAC) CSE Directorate for Computer and Information Science and Engineering |
Start Date: | June 1, 2018 |
End Date: | May 31, 2022 (Estimated) |
Total Intended Award Amount: | $170,941.00 |
Total Awarded Amount to Date: | $170,941.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | 701 S 20TH STREET BIRMINGHAM AL US 35294-0001 (205)934-5266 |
Sponsor Congressional District: | |
Primary Place of Performance: | 1300 University Blvd Birmingham AL US 35233-1405 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | CRII CISE Research Initiation, EPSCoR Co-Funding |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Existing distributed graph and matrix analytics frameworks are designed with data-intensive workloads in mind, rendering them inefficient for compute-intensive applications such as graph mining and scientific computing. The goal of this project is to develop novel big data frameworks for two compute-intensive tasks: graph mining and matrix/tensor computations. The two frameworks advance the field of big data analytics by motivating future systems for compute-intensive analytics and by promoting their application in various scientific areas to improve research productivity. The two systems will be available for public use and can serve several cross-disciplinary projects in computer forensics, computational physics, and bioinformatics. The project includes mentoring graduate students and training K-12 students through summer internships, as well as related new course materials and outreach activities to help the public learn big data technologies. Thus, the project aligns with the NSF's mission to promote the progress of science and to advance the national health and prosperity.
The graph mining system and the matrix/tensor platform share the design of (i) a tailor-made storage subsystem providing efficient and flexible data access, and (ii) a computation subsystem with fine-grained task control for data-reuse-aware task assignment and load balancing. The graph mining system, called G-thinker, aims to facilitate the writing of distributed programs that mine from a big graph those subgraphs that satisfy certain requirements. Such mining problems are useful in many applications, such as community detection and subgraph matching. These problems usually have high computational complexity, and existing serial algorithms tackle them by backtracking in a duplication-free vertex-set enumeration tree, which recursively partitions the search space. G-thinker adopts an intuitive programming interface that minimizes the effort of adapting an existing serial subgraph mining algorithm for distributed execution. The subgraphs to mine are spawned from individual vertices and grow their frontiers as needed, and memory overflow is avoided by spilling subgraphs to disk when necessary. In each machine, vertices and edges shared by multiple subgraphs need only be transmitted and cached once, which minimizes communication (and hence data waiting) so that CPU cores are better utilized. To address the load-balancing problem of power-law graphs, G-thinker explores recursive decomposition and work stealing to allow idle machines to steal subgraphs for mining from heavily loaded machines. The project also explores a distributed matrix/tensor storage and computing framework, where matrix/tensor partitions are stored in multiple replicas using different storage schemes to efficiently support all kinds of submatrix access operations. This flexible storage scheme offers the upper-layer computations many more opportunities for fine-grained optimizations, including smarter task scheduling and in-situ updates. The use of this framework is exemplified by matrix multiplication and LU factorization. Both of the proposed frameworks can help build a cyberinfrastructure for collaborations with researchers in science, medicine, and industry.
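To make the per-vertex task style above concrete, the following minimal single-process C++ sketch (names such as SeedTask are illustrative, not the actual G-thinker API) spawns one task per vertex, restricts each task to its seed's neighborhood, and runs a small self-contained computation, here counting triangles through the seed. The distributed aspects described above (per-machine caching, disk spilling, work stealing) are omitted.

#include <cstdio>
#include <set>
#include <vector>

using Graph = std::vector<std::set<int>>;    // adjacency sets, vertex IDs 0..n-1

struct SeedTask {
    int seed;                                 // the vertex this task is spawned from
    long run(const Graph& g) const {          // user-defined compute step
        long triangles = 0;
        for (int u : g[seed])
            for (int v : g[seed])
                if (u < v && g[u].count(v)) ++triangles;   // seed-u-v forms a triangle
        return triangles;
    }
};

int main() {
    Graph g = {{1, 2, 3}, {0, 2}, {0, 1, 3}, {0, 2}};      // toy graph
    long total = 0;
    for (int v = 0; v < (int)g.size(); ++v)                // spawn one task per vertex
        total += SeedTask{v}.run(g);
    std::printf("triangles = %ld\n", total / 3);           // each triangle is counted from 3 seeds
    return 0;
}

In G-thinker itself, such tasks additionally request remote vertices on demand and share a per-machine cache, as described above.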
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project supported the PI and a number of PhD students in investigating novel parallel and distributed programming models and systems for accelerating compute-intensive analytics of graph data and matrix/tensor data. Through this investigation, a suite of new systems has been designed and implemented, in particular:
(1) G-thinker (https://bit.ly/gthinker), a distributed system for mining subgraphs that satisfy user-defined requirements (e.g., subgraph matching, quasi-cliques);
(2) PrefixFPM (https://github.com/wenwenQu/PrefixFPM), a general-purpose parallel programming framework for mining all kinds of frequent or closed patterns (subsequences, subtrees, subgraphs, and submatrices);
(3) TreeServer (https://github.com/yanlab19870714/TreeServer), a system to train decision tree ensemble models, including deep forests, which often contain a large number of trees.
The systems developed are all based on the T-thinker programming paradigm (i.e., "think like a task," where "T" stands for "task"), proposed by the PI in a poster at PPoPP 2019 and highlighted by the CCC as a Great Innovative Idea (https://cra.org/ccc/great-innovative-ideas/t-thinker-a-task-centric-framework-to-revolutionize-big-data-systems-research/).
Put simply, the idea of the T-thinker programming paradigm is that for compute-intensive problems, the problem space can be divided recursively until a sufficiently small unit, called a task, is obtained. A worker machine can then obtain the subset of data needed for the task (linear communication cost) and compute on it afterwards (often with high time complexity). Since computation is the bottleneck, the data-movement cost can mostly be overlapped with concurrent computation so that all CPU cores are fully utilized. This is in contrast to existing cyberinfrastructure that targets data-intensive problems: when such systems are used to solve compute-intensive problems, their performance can be comparable to that of a single-threaded program, as noted by Frank McSherry (https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md).
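As a concrete illustration of this divide-until-small idea, below is a minimal C++ sketch (hypothetical names; not the actual T-thinker code): the search space, here all subsets of {0, ..., n-1}, is recursively split into tasks until the undecided remainder is small enough, and the tasks are then drained from a queue and computed independently.

#include <cstdio>
#include <queue>
#include <vector>

struct Task { std::vector<int> prefix; int next; };   // elements chosen so far + next element to decide

int main() {
    const int n = 5;        // problem size: enumerate subsets of {0, ..., n-1}
    const int kSmall = 3;   // a task with <= kSmall undecided elements is "small enough"
    std::queue<Task> tasks;
    tasks.push({{}, 0});

    long subsets = 0;
    while (!tasks.empty()) {
        Task t = tasks.front(); tasks.pop();
        if (n - t.next > kSmall) {
            // still too big: decide one more element, producing two subtasks
            Task with = t, without = t;
            with.prefix.push_back(t.next);
            with.next = without.next = t.next + 1;
            tasks.push(with); tasks.push(without);
        } else {
            // small enough: compute the task locally (here, count the subsets it covers)
            subsets += 1L << (n - t.next);
        }
    }
    std::printf("subsets enumerated = %ld (expected %ld)\n", subsets, 1L << n);
    return 0;
}

In a distributed setting, each such task would carry only the small slice of data it needs, so idle workers can steal queued tasks without heavy data movement.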
In addition, tools for matrix and tensor factorization have been investigated and developed. A number of papers have been published in high-impact venues, including VLDB, ICDE, VLDB Journal, ACM TODS, KDD, and ICDM. The results have been disseminated through paper presentations at the respective conferences, a tutorial at IEEE BigData 2020, an invited keynote at the LSGDA workshop held in conjunction with VLDB 2020, Dagstuhl Seminars (19491, 19051, 22031), and invited talks at various universities.
Regarding broader impacts: while data-intensive cyberinfrastructure is widely needed and receives the largest investment, there are applications whose computations have very high time complexity, and having only data-intensive cyberinfrastructure available leads to misuse and hence a large waste of resources and energy. System paradigms that are highly efficient for such costly computations are urgently needed to alleviate this problem, but they require non-trivial design and development effort and have not received enough attention. During the project period, the PI's team has advocated the design of compute-intensive programming paradigms and systems, and T-thinker is a viable paradigm along this line. The research outcomes have attracted considerable attention, and a productive research team has been built around T-thinker research. Besides a number of PhD students (including one who defended and obtained a tenure-track faculty position), the project also trained Master's students at UAB advised through Directed Readings courses, high-school summer interns, and visiting scholars. The trainees will contribute to a next-generation Big Data workforce proficient in systems and HPC research as well as parallel and distributed algorithms.
Last Modified: 06/11/2022
Modified by: Da Yan