NSF Award Search: Award # 1642385

Award Abstract # 1642385

SI2-SSE: Collaborative Research: High Performance Low Rank Approximation for Scalable Data Analytics

NSF Org:	OAC Office of Advanced Cyberinfrastructure (OAC)
Recipient:	WAKE FOREST UNIVERSITY
Initial Amendment Date:	September 8, 2016
Latest Amendment Date:	September 8, 2016
Award Number:	1642385
Award Instrument:	Standard Grant
Program Manager:	Amy Walton, awalton@nsf.gov, (703) 292-4538, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering
Start Date:	November 1, 2016
End Date:	October 31, 2021 (Estimated)
Total Intended Award Amount:	$167,713.00
Total Awarded Amount to Date:	$167,713.00
Funds Obligated to Date:	FY 2016 = $167,713.00
History of Investigator:	Grey Ballard (Principal Investigator) ballard@wfu.edu
Recipient Sponsored Research Office:	Wake Forest University 1834 WAKE FOREST RD WINSTON SALEM NC US 27109-6000 (336)758-5888
Sponsor Congressional District:	05
Primary Place of Performance:	Wake Forest University NC US 27109-8758
Primary Place of Performance Congressional District:	05
Unique Entity Identifier (UEI):	MBU6HCLNZ431
Parent UEI:
NSF Program(s):	Software & Hardware Foundation, Software Institutes
Primary Program Source:	01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7433, 7942, 8004, 8005
Program Element Code(s):	779800, 800400
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Big Data analytics is at the core of discovery in areas as varied as medical informatics, business analytics, national security, and materials science. This project aims to model some of the key data analytics problems and to design, verify, and deploy scalable methods for knowledge extraction. The algorithms developed will be able to handle data sets of extreme sizes and will be deployable on advanced computer hardware. The goal is to realize orders-of-magnitude improvements over existing data analytics technologies by developing algorithms that are robust to incompleteness, noise, ambiguity, and high dimension in the data. A particular focus will be on parallel and distributed algorithms that can efficiently solve large problems and produce accurate solutions. The proposed research and software development will allow domain experts to tackle Big Data sets requiring large parallel systems. The improved performance will enable fast and scalable data analysis across applications, from social network analysis for studying citizens' attitudes toward sustainability-related issues to computational marketing techniques that refine customers' shopping experiences. The proposed work will help bridge the gap between the computational science and data analytics ecosystems, two fields that stand to make great advances from cross-fertilization. The education and outreach plan includes graduate course creation, engagement of under-represented groups through both undergraduate and graduate research experiences, and community building through the organization of workshops and mini-symposia.

With the advent of internet-scale data, the data mining and machine learning community has adopted Nonnegative Matrix Factorization (NMF) for numerous tasks such as topic modeling, background separation from video data, hyper-spectral imaging, web-scale clustering, and community detection. The goals of this proposal are to develop efficient parallel algorithms for computing nonnegative matrix and tensor factorizations (NMF and NTF) and their variants using a unified framework, and to produce a software package called Parallel Low-rank Approximation with Nonnegative Constraints (PLANC) that delivers the high performance, flexibility, and scalability necessary to tackle the ever-growing size of today's data sets. The algorithms will be generalized to NTF problems, extending the class of algorithms that can be efficiently parallelized; the software framework will allow end-users to use and extend these techniques. Rather than developing separate software for each problem domain and mathematical technique, flexibility will be achieved by characterizing nearly all of the current NMF and NTF algorithms in the context of a block coordinate descent framework. Within this framework, the shared computational kernels, which usually dominate run time, can be separated from the algorithm-specific computations. Finally, the usability and practicality of the proposed software will be maintained by being application driven, by establishing collaborations with early end-users, and by incrementally generalizing the framework in terms of both algorithms and problems.
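The separation of shared kernels from algorithm-specific computations described above can be illustrated with a minimal single-node sketch. It is not the project's software or API: the function names (shared_kernels, hals_update, nmf_bcd) are hypothetical, NumPy stands in for the distributed MPI setting, and HALS is chosen only as one familiar example of a block coordinate descent update rule.

```python
import numpy as np

def shared_kernels(A, F):
    """Matrix products needed by any update of the other factor.

    For fixed F (m x k) and data A (m x n), returns the Gram matrix
    F^T F (k x k) and the cross product F^T A (k x n).
    """
    return F.T @ F, F.T @ A

def hals_update(G, C, X):
    """Algorithm-specific step (HALS, an assumed example rule).

    Updates the rows of X (k x n) one at a time; each row has a
    closed-form nonnegative minimizer when the other rows are fixed.
    """
    for j in range(X.shape[0]):
        # Residual term for row j, excluding row j's own contribution.
        num = C[j] - G[j] @ X + G[j, j] * X[j]
        X[j] = np.maximum(num / max(G[j, j], 1e-12), 0.0)
    return X

def nmf_bcd(A, k, iters=100, seed=0, update=hals_update):
    """Alternate between the two factor blocks W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # Update H with W fixed.
        G, C = shared_kernels(A, W)
        H = update(G, C, H)
        # Update W with H fixed, by working on the transposed problem.
        G, C = shared_kernels(A.T, H.T)
        W = update(G, C, W.T.copy()).T
    return W, H

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.random((20, 3)) @ rng.random((3, 15))  # nonnegative, rank 3
    W, H = nmf_bcd(A, k=3, iters=200)
    print("relative error:", np.linalg.norm(A - W @ H) / np.linalg.norm(A))
```

In this sketch, swapping in a different update rule only requires passing another function as the update argument; the matrix products in shared_kernels stay the same, which is the part a parallel implementation would distribute.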

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Ballard, Grey and Hayashi, Koby and Kannan, Ramakrishnan "Parallel Nonnegative CP Decomposition of Dense Tensors" 25th IEEE International Conference on High Performance Computing, 2018 https://doi.org/10.1109/HiPC.2018.00012

Ballard, Grey and Knight, Nicholas and Rouse, Kathryn "Communication Lower Bounds for Matricized Tensor Times Khatri-Rao Product" 2018 IEEE International Parallel and Distributed Processing Symposium, 2018 https://doi.org/10.1109/IPDPS.2018.00065

Ballard, Grey and Rouse, Kathryn "General Memory-Independent Lower Bound for MTTKRP" SIAM Conference on Parallel Processing for Scientific Computing, 2020 https://doi.org/10.1137/1.9781611976137.1

Eswar, Srinivas and Hayashi, Koby and Ballard, Grey and Kannan, Ramakrishnan and Matheson, Michael A. and Park, Haesun "PLANC: Parallel Low-rank Approximation with Nonnegativity Constraints" ACM Transactions on Mathematical Software, v.47, 2021 https://doi.org/10.1145/3432185

Eswar, Srinivas and Hayashi, Koby and Ballard, Grey and Kannan, Ramakrishnan and Vuduc, Richard and Park, Haesun "Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization" SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, v.1, 2020 https://doi.org/10.1109/SC41405.2020.00078

Hayashi, Koby and Ballard, Grey and Jiang, Yujie and Tobia, Michael J. "Shared-memory parallelization of MTTKRP for dense tensors" 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018 https://doi.org/10.1145/3178487.3178522

Kannan, Ramakrishnan and Ballard, Grey and Park, Haesun "MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization" IEEE Transactions on Knowledge and Data Engineering, v.30, 2018 https://doi.org/10.1109/TKDE.2017.2767592

Kaya, Oguz and Kannan, Ramakrishnan and Ballard, Grey "Partitioning and Communication Strategies for Sparse Non-negative Matrix Factorization" 47th International Conference on Parallel Processing, 2018 https://doi.org/10.1145/3225058.3225127

Manning, Lawton and Ballard, Grey and Kannan, Ramakrishnan and Park, Haesun "Parallel Hierarchical Clustering using Rank-Two Nonnegative Matrix Factorization" 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2020 https://doi.org/10.1109/HiPC50609.2020.00028

Mokhtari, Fatemeh and Laurienti, Paul J. and Rejeski, W. Jack and Ballard, Grey "Dynamic Functional Magnetic Resonance Imaging Connectivity Tensor Decomposition: A New Approach to Analyze and Interpret Dynamic Brain Connectivity" Brain Connectivity, v.9, 2019 https://doi.org/10.1089/brain.2018.0605

Tobia, Michael J. and Hayashi, Koby and Ballard, Grey and Gotlib, Ian H. and Waugh, Christian E. "Dynamic functional connectivity and individual differences in emotions during social stress: Stress and Brain Synchrony" Human Brain Mapping, v.38, 2017 https://doi.org/10.1002/hbm.23821

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In data analytics, a data set is often represented either as a feature-data matrix, where each data item is a vector whose elements are the features of that item, or as a similarity or adjacency matrix, where a relationship between every pair of data items is encoded. Matrix low rank approximation (LRA) provides a powerful tool for analyzing data sets represented in either of these matrix forms. In a constrained matrix low rank approximation (CLRA), conditions such as nonnegativity and sum-to-one constraints are added, which makes the mathematical formulation follow the problem setting more closely and the computational results more interpretable. Clustering is one of the major tasks in data analytics, and it can be understood as a special case of a constrained low rank matrix approximation. Nonnegative Matrix Factorization (NMF) and Nonnegative Tensor Factorization (NTF) are two of the most commonly used CLRAs.
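For readers who prefer a formula, NMF as a constrained low rank approximation is commonly written as the optimization problem below. The symbols are assumptions of this note and do not appear in the report: A is the m-by-n nonnegative data matrix, k is the target rank, and W and H are the two factors.

```latex
\min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{k \times n}}
  \; \| A - W H \|_F^2
\quad \text{subject to} \quad W \ge 0, \; H \ge 0
```

Dropping the two inequality constraints gives the unconstrained low rank approximation problem, which the truncated singular value decomposition solves. The nonnegativity constraints are what make the factors readable as additive parts, for example topics in a document collection.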

The major activities of this project include the development of new algorithms for NMF and of efficient software for distributed computation of NMF and NTF, which is available in an open-source MPI library developed in collaboration with Oak Ridge National Laboratory. The NMF and NTF algorithm research and software development results were published in refereed journals and conference proceedings that detail the software, the parallel algorithms, their scalability, and the use of NTF on an application data set. Another major activity was the development of a semi-supervised clustering technique based on NMF and symmetric NMF (SymNMF) that can use prior partial cluster information, such as sets of items that should be grouped together or pairwise constraints stating that two items must link or cannot link in their cluster membership.

The HierNMF2 method is one of the fastest NMF algorithms. It is based on a very fast recursive rank-2 NMF algorithm and on a decision rule for tree traversal that determines which leaf node to split in two next. A parallel implementation of an algorithm for computing the rank-2 NMF of a general matrix was developed. This is the key computation within the divide-and-conquer NMF algorithm that hierarchically clusters the data items in a nonnegative data set. In SymNMF, the input is a similarity (adjacency) matrix that is nonnegative and symmetric. SymNMF is theoretically related to the well-known spectral clustering method. Three algorithms, based on ANLS/BPP, vector-based block coordinate descent, and a variant of Gauss-Newton, were selected, and a parallel algorithm for each of the three was developed with the goal of scaling to very large problems. Some of the main applications for these parallel algorithms and software include large-scale topic modeling for text analysis, community detection, and image segmentation. The algorithms and software developed with the support of this project have contributed significantly to these important data analytics tasks.
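The SymNMF problem mentioned above is usually stated as follows. The notation is an assumption of this note rather than the report's: S is the n-by-n symmetric nonnegative similarity matrix, k is the number of clusters, and H is the single n-by-k factor.

```latex
\min_{H \in \mathbb{R}^{n \times k}} \; \| S - H H^{\mathsf{T}} \|_F^2
\quad \text{subject to} \quad H \ge 0
```

Under this formulation, a common way to read off a clustering is to assign item i to the column holding the largest entry in row i of H. Because H appears twice in the product, the objective is quartic in H rather than quadratic in each factor, which is one reason SymNMF calls for its own algorithms instead of reusing standard NMF updates directly.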

Last Modified: 07/20/2021
Modified by: Grey M Ballard

Please report errors in award information by writing to: awardsearch@nsf.gov.
