
NSF Org: | Office of Advanced Cyberinfrastructure (OAC) |
Recipient: |
|
Initial Amendment Date: | September 8, 2016 |
Latest Amendment Date: | September 8, 2016 |
Award Number: | 1642385 |
Award Instrument: | Standard Grant |
Program Manager: | Amy Walton, awalton@nsf.gov, (703)292-4538, Office of Advanced Cyberinfrastructure (OAC), Directorate for Computer and Information Science and Engineering (CSE) |
Start Date: | November 1, 2016 |
End Date: | October 31, 2021 (Estimated) |
Total Intended Award Amount: | $167,713.00 |
Total Awarded Amount to Date: | $167,713.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 1834 WAKE FOREST RD, WINSTON SALEM, NC, US 27109-6000, (336)758-5888 |
Sponsor Congressional District: |
|
Primary Place of Performance: | NC US 27109-8758 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Software & Hardware Foundation, Software Institutes |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Big Data analytics is at the core of discovery in areas as varied as medical informatics, business analytics, national security, and materials science. This project aims to model some of the key data analytics problems and to design, verify, and deploy scalable methods for knowledge extraction. The algorithms developed will be able to handle data sets of extreme sizes and will be deployable on advanced computer hardware. The goal is to realize orders-of-magnitude improvements over existing data analytics technologies by developing algorithms that are robust to incompleteness, noise, ambiguity, and high dimension in the data. A particular focus will be on parallel and distributed algorithms that can efficiently solve large problems and produce accurate solutions. The proposed research and software development will allow domain experts to tackle Big Data sets requiring large parallel systems. The improved performance will enable fast and scalable data analysis across applications, from social network analysis for studying citizens' attitudes toward sustainability-related issues to computational marketing techniques that refine customers' shopping experiences. The proposed work will help bridge the gap between the computational science and data analytics ecosystems, two fields that stand to make great advances from cross-fertilization. The education and outreach plan includes graduate course creation, engagement of under-represented groups via both undergraduate and graduate research experiences, and community-building efforts through the organization of workshops and mini-symposia.
With the advent of internet-scale data, the data mining and machine learning community has adopted Nonnegative Matrix Factorization (NMF) for numerous tasks such as topic modeling, background separation from video data, hyperspectral imaging, web-scale clustering, and community detection. The goals of this proposal are to develop efficient parallel algorithms for computing nonnegative matrix and tensor factorizations (NMF and NTF) and their variants using a unified framework, and to produce a software package called Parallel Low-rank Approximation with Nonnegative Constraints (PLANCK) that delivers the high performance, flexibility, and scalability necessary to tackle the ever-growing size of today's data sets. The algorithms will be generalized to NTF problems, extending the class of algorithms that can be efficiently parallelized; the software framework will allow end-users to use and extend these techniques. Rather than developing separate software for each problem domain and mathematical technique, flexibility will be achieved by characterizing nearly all of the current NMF and NTF algorithms in the context of a block coordinate descent framework. Within this framework, the shared computational kernels, which usually account for most of the run time, can be separated from the algorithm-specific computations. Finally, the usability and practicality of the proposed software will be maintained by being application driven, by establishing collaborations with early end-users, and by incrementally generalizing the framework in terms of both algorithms and problems.
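The separation between shared kernels and algorithm-specific updates described above can be illustrated with a minimal, serial sketch. It is not the project's software: the function names (`nmf_bcd`, `hals_update`, `rel_error`), the choice of a HALS-style row update as the pluggable step, and the use of plain Python lists instead of a linear algebra library are all assumptions made for illustration. The driver computes the matrix products that every update rule needs, then hands them to an interchangeable update function.

```python
import random

def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def hals_update(F, gram, cross):
    """Algorithm-specific step (assumed here: a HALS-style update).

    F is a k x n factor, gram is k x k, cross is k x n. Each row of F is
    replaced by its exact nonnegative minimizer with the other rows fixed.
    """
    k, n = len(F), len(F[0])
    for r in range(k):
        d = gram[r][r]
        if d <= 0.0:
            continue  # degenerate direction: leave this row unchanged
        for j in range(n):
            s = sum(gram[r][q] * F[q][j] for q in range(k))
            F[r][j] = max(0.0, F[r][j] + (cross[r][j] - s) / d)
    return F

def nmf_bcd(A, k, update=hals_update, iters=50, seed=0):
    """Two-block coordinate descent for A ~ W H with W >= 0 and H >= 0."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    At = transpose(A)
    Wt = [[rng.random() for _ in range(m)] for _ in range(k)]  # stores W^T
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # Shared kernels for the H block: W^T W and W^T A.
        W = transpose(Wt)
        H = update(H, matmul(Wt, W), matmul(Wt, A))
        # Shared kernels for the W block: H H^T and H A^T.
        Wt = update(Wt, matmul(H, transpose(H)), matmul(H, At))
    return transpose(Wt), H

def rel_error(A, W, H):
    """Relative Frobenius-norm error ||A - W H|| / ||A||."""
    WH = matmul(W, H)
    num = sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))
    den = sum(a * a for ra in A for a in ra)
    return (num / den) ** 0.5
```

Swapping in a different rule, for example a multiplicative update, only requires another function with the same `(F, gram, cross)` signature; the kernel computations in `nmf_bcd` stay untouched. In a distributed setting those kernels are presumably where communication and most of the arithmetic would concentrate, which is why isolating them is attractive.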
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
In data analytics, a data set is often represented either as a feature-data matrix, where each data item is a vector whose elements are the features of that item, or as a similarity or adjacency matrix, where a relationship between every pair of data items is encoded. Matrix low rank approximation (LRA) provides a powerful tool for analyzing data sets represented in either of these matrix forms. In a constrained matrix low rank approximation (CLRA), conditions such as nonnegativity and sum-to-one constraints are added, which makes the mathematical formulation follow the problem setting more closely and the computational results more interpretable. Clustering is one of the major tasks in data analytics, and it can be understood as a special case of a constrained low rank matrix approximation. Nonnegative Matrix Factorization (NMF) and Nonnegative Tensor Factorization (NTF) are two of the most commonly used CLRAs.
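For concreteness, the standard way NMF is usually written is shown below. The report itself gives no formulas, so the symbols are assumptions: A is the nonnegative m x n data matrix, k is the target rank (typically much smaller than m and n), and W and H are the two factors. Dropping the nonnegativity constraints gives the unconstrained LRA problem.

```latex
\min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{k \times n}}
  \; \lVert A - W H \rVert_F^2
\quad \text{subject to} \quad W \ge 0,\; H \ge 0
```

With nonnegative factors, each column of A is approximated by an additive combination of the columns of W, which is one reason the results tend to be easier to interpret, for instance as topics or clusters.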
The major activities of this project include the development of new algorithms for NMF and of efficient software for distributed computation of NMF and NTF, which is available in an open-source MPI library developed in collaboration with Oak Ridge National Laboratory. The NMF and NTF algorithm research and software development results were published in refereed journals and conferences; these publications detail the software, the parallel algorithms, their scalability, and the use of NTF on an application data set. Another major activity was the development of a semi-supervised clustering technique based on NMF and SymNMF that can utilize partial prior knowledge of the clusters, such as items that should be grouped together, or pairwise constraints stating that two items must link or cannot link in their cluster membership.
HierNMF2 is one of the fastest NMF methods. It is based on a very fast recursive rank-2 NMF algorithm and on a decision tree design that determines the traversal rule for choosing the next leaf node to split in two. A parallel implementation of an algorithm for computing the rank-2 NMF of a general matrix was developed. This is the key computation within the divide-and-conquer NMF algorithm that hierarchically clusters data items in a nonnegative data set. In SymNMF, the input is a similarity (adjacency) matrix that is nonnegative and symmetric. SymNMF is theoretically related to the well-known spectral clustering method. Three algorithms, based on ANLS/BPP, vector-based block coordinate descent, and a variant of Gauss-Newton, were selected, and a parallel version of each was developed with the goal of scaling to very large problems. Some of the main applications for these parallel algorithms and software include large-scale topic modeling for text analysis, community detection, and image segmentation. The algorithms and software developed with the support of this project have contributed significantly to these important data analytics tasks.
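The divide-and-conquer idea can be sketched as follows. This is an illustration under stated assumptions, not the HierNMF2 implementation: `rank2_nmf` and `leaf_score` are hypothetical callables standing in for the rank-2 NMF solver and the leaf selection rule, neither of which is specified in this report, and assigning each item to the side with the larger coefficient is assumed as the splitting rule.

```python
def split_by_rank2(H):
    """Split item positions into two groups from a 2 x n nonnegative factor H.

    Assumed rule: item j goes to the side whose coefficient in column j of H
    is larger; ties go to the first group.
    """
    left, right = [], []
    for j, (h0, h1) in enumerate(zip(H[0], H[1])):
        (left if h0 >= h1 else right).append(j)
    return left, right

def build_hierarchy(items, rank2_nmf, leaf_score, num_leaves):
    """Repeatedly split the highest-scoring leaf until num_leaves leaves exist.

    rank2_nmf(ids)  -> 2 x len(ids) nonnegative factor (hypothetical solver)
    leaf_score(ids) -> float, higher means "split this leaf next" (hypothetical)
    """
    leaves = [list(items)]
    while len(leaves) < num_leaves:
        candidates = [leaf for leaf in leaves if len(leaf) > 1]
        if not candidates:
            break  # nothing left that can be split
        target = max(candidates, key=leaf_score)
        left, right = split_by_rank2(rank2_nmf(target))
        if not left or not right:
            break  # degenerate split; stop rather than loop forever
        leaves.remove(target)
        leaves.append([target[j] for j in left])
        leaves.append([target[j] for j in right])
    return leaves
```

Because every split needs only a rank-2 factorization of the items in one leaf, the cost of each step depends on the leaf's size rather than on the final number of clusters, which is a plausible reason the recursive approach is fast.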
Last Modified: 07/20/2021
Modified by: Grey M Ballard