Award Abstract # 1247489
BIGDATA: Mid-Scale: DA: Collaborative Research: Big Tensor Mining: Theory, Scalable Algorithms and Applications

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: CARNEGIE MELLON UNIVERSITY
Initial Amendment Date: September 13, 2012
Latest Amendment Date: September 13, 2012
Award Number: 1247489
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
(703)292-7347
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: December 1, 2012
End Date: September 30, 2018 (Estimated)
Total Intended Award Amount: $894,892.00
Total Awarded Amount to Date: $894,892.00
Funds Obligated to Date: FY 2012 = $894,892.00
History of Investigator:
  • Christos Faloutsos (Principal Investigator)
    christos@cs.cmu.edu
  • Tom Mitchell (Co-Principal Investigator)
Recipient Sponsored Research Office: Carnegie-Mellon University
5000 FORBES AVE
PITTSBURGH
PA  US  15213-3815
(412)268-8746
Sponsor Congressional District: 12
Primary Place of Performance: Carnegie-Mellon University
PA  US  15213-3890
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): U3NKNFLNQ613
Parent UEI: U3NKNFLNQ613
NSF Program(s): Big Data Science & Engineering
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
01001213RB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 170E, 7433, 7924, 8083
Program Element Code(s): 808300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Tensors are multi-dimensional generalizations of matrices and, like matrices, can have non-numeric entries. Extremely large and sparse coupled tensors arise in numerous important applications that require the analysis of large, diverse, and partially related data. The effective analysis of coupled tensors requires algorithms and associated software that can identify the core relations among the different tensor modes and scale to extremely large datasets. The objective of this project is to develop theory and algorithms for (coupled) sparse and low-rank tensor factorization, together with scalable software toolkits that make such analysis possible. The research in the project is centered on three major thrusts. The first is designed to make novel theoretical contributions in the area of coupled tensor factorization by developing multi-way compressed sensing methods for dimensionality reduction with perfect latent model reconstruction. Methods to handle missing values, noisy input, and coupled data will also be developed. The second thrust focuses on algorithms and scalability on modern architectures, which will enable the efficient analysis of coupled tensors with millions and billions of non-zero entries, using the map-reduce paradigm as well as hybrid multicore architectures. An open-source coupled tensor factorization toolbox (HTF: Hybrid Tensor Factorization) will be developed that will provide robust and high-performance implementations of these algorithms. Finally, the third thrust focuses on evaluating and validating the effectiveness of these coupled factorization algorithms on a NeuroSemantics application, whose goal is to understand how human brain activity correlates with reading and understanding text, by analyzing fMRI and MEG brain image datasets obtained while subjects read various text passages.
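
As a rough illustration of what "low-rank tensor factorization" means, the sketch below fits a CANDECOMP/PARAFAC (CP) model to a small dense 3-mode tensor with plain alternating least squares. It is a minimal teaching example under stated assumptions, not the project's HTF toolbox or any of its algorithms: the names `khatri_rao` and `cp_als` are hypothetical, the tensor is dense and tiny, and there is no coupling, sparsity handling, or missing-value support.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (I x R) and (J x R) -> (I*J x R)."""
    R = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, R)

def cp_als(X, rank, n_iter=200, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] by alternating least squares.

    Hypothetical helper for illustration only (dense input, no stopping rule).
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode unfoldings, laid out to match the row order of khatri_rao above.
    X1 = X.reshape(I, J * K)                      # columns indexed by (j, k)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)   # columns indexed by (i, k)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)   # columns indexed by (i, j)
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem for one factor
        # while the other two are held fixed.
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

if __name__ == "__main__":
    # Build a synthetic rank-2 tensor and try to recover it.
    rng = np.random.default_rng(1)
    A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (4, 5, 6))
    X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
    A, B, C = cp_als(X, rank=2)
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```

A coupled factorization would additionally share one factor (for example `A`) with a second dataset and fit both at once; that step is left out here to keep the sketch short.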

Given triplets of facts (subject-verb-object), like ('Washington', 'is the capital of', 'USA'), can we find patterns, new objects, new verbs, and anomalies? Can we correlate these with brain scans of people reading these words, to discover which parts of the brain get activated, say, by tool-like nouns ('hammer') or action-like verbs ('run')?
We propose a unified "coupled tensor" factorization framework to systematically mine such datasets. Unique challenges in these settings include:
(a) tera- and peta-byte scaling issues,
(b) distributed fault-tolerant computation,
(c) large proportions of missing data, and
(d) insufficient theory and methods for big sparse tensors.
The Intellectual Merit of this effort is exactly the solution to the above four challenges.
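
To make the subject-verb-object setting concrete, the sketch below shows one way such triplets could be stored as a sparse 3-mode tensor in coordinate form, keeping only the non-zero entries. The function name `triples_to_sparse_tensor` and the second example fact are assumptions made for illustration; the abstract does not specify a storage format.

```python
from collections import Counter

def triples_to_sparse_tensor(triples):
    """Encode (subject, verb, object) triplets as a sparse count tensor.

    Returns (entries, vocabularies, shape), where entries maps an
    (i, j, k) coordinate to the number of times that triplet occurred.
    Hypothetical helper for illustration only.
    """
    vocabularies = ({}, {}, {})  # one string -> index map per tensor mode
    entries = Counter()
    for triple in triples:
        # Assign each new term the next free index in its mode's vocabulary.
        coord = tuple(
            vocab.setdefault(term, len(vocab))
            for vocab, term in zip(vocabularies, triple)
        )
        entries[coord] += 1
    shape = tuple(len(vocab) for vocab in vocabularies)
    return entries, vocabularies, shape

if __name__ == "__main__":
    facts = [
        ("Washington", "is the capital of", "USA"),
        ("Paris", "is the capital of", "France"),   # illustrative, not from the abstract
        ("Washington", "is the capital of", "USA"),
    ]
    entries, vocabularies, shape = triples_to_sparse_tensor(facts)
    total_cells = shape[0] * shape[1] * shape[2]
    print("shape:", shape)
    print("non-zeros:", len(entries), "of", total_cells, "cells")
```

Only the observed coordinates are stored, which is what keeps very large, very sparse tensors manageable: memory grows with the number of distinct facts rather than with the product of the three vocabulary sizes.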

The Broader Impact is the derivation of new scientific hypotheses on how the brain works and how it processes language (from the never-ending language learning (NELL) and NeuroSemantics projects) and the development of scalable open source software for coupled tensor factorization. Our tensor analysis methods can also be used in many other settings, including recommendation systems and computer-network intrusion/anomaly detection.

KEYWORDS:
Data mining; map/reduce; read-the-web; neuro-semantics; tensors.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Evangelos E. Papalexakis, Alona Fyshe, Nicholas Sidiropoulos, Partha Pratim Talukdar, Tom Mitchell, Christos Faloutsos, "Good-Enough Brain Model: Challenges, Algorithms and Discoveries in Multi-Subject Experiments" Big Data Journal, 2014 https://doi.org/10.1089/big.2014.0044
Evangelos E. Papalexakis, Christos Faloutsos, Nicholas D. Sidiropoulos, "Tensors for Data Mining and Data Fusion: Models, Applications, and Scalable Algorithms" ACM Trans. Intell. Syst. Technol., v.8, 2016, article 1 https://doi.org/10.1145/2915921
Evangelos E. Papalexakis, Christos Faloutsos, Nicholas D. Sidiropoulos, "ParCube: Sparse Parallelizable CANDECOMP-PARAFAC Tensor Decomposition" ACM Transactions on Knowledge Discovery from Data (TKDD), v.10, 2015, p.3 https://doi.org/10.1145/2729980
Evangelos E. Papalexakis, U Kang, Christos Faloutsos, Nicholas D. Sidiropoulos, Abhay Harpale, "Large Scale Tensor Decompositions: Algorithmic Developments and Applications" IEEE Data Engineering Bulletin - Special Issue on Social Media, v.36, 2013, p.59
Evangelos Papalexakis, Tom Mitchell, Nicholas Sidiropoulos, Christos Faloutsos, Partha Pratim Talukdar, Brian Murphy, "Turbo-SMT: Fast and Parallel Coupled Sparse Matrix-Tensor Factorizations and Applications" Statistical Analysis and Data Mining Journal, v.9, 2016 https://doi.org/10.1002/sam.11315
Faisal M. Almutairi, Fan Yang, Hyun Ah Song, Christos Faloutsos, Nicholas D. Sidiropoulos, Vladimir Zadorozhny, "HomeRun: Scalable Sparse-Spectrum Reconstruction of Aggregated Historical Data" Proc. of VLDB (PVLDB), v.11, 2018, p.1496 https://doi.org/10.14778/3236187.3236201
Miguel Araujo, Stephan Günnemann, Spiros Papadimitriou, Christos Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos E. Papalexakis, Danai Koutra, "Discovery of "comet" communities in temporal and labeled graphs (Com²)" Knowledge and Information Systems, v.46, 2016 https://doi.org/10.1007/s10115-015-0847-2

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Suppose you see the word "apple": which parts of your brain get activated? Are they the same when you read the word "hammer"? How can we find which parts of the brain get activated for each different concept? The goal is to correlate multiple sources, such as fMRI brain-scan signals and properties of concepts ("apple" is edible; "hammer" is a tool), to better understand how our brain reacts to different concepts (food, tools, etc.).

We developed fast, scalable algorithms to process correlated (i.e., "coupled") data. The main driving application is the analysis of brain-scan data in conjunction with data from the web, so that we can understand which words activate which parts of the brain. In the course of the research, we discovered that our algorithms can also be applied in diverse other settings, such as epidemiology time series and power-grid measurements.

Specifically, our results are the following:

1) scalable algorithms for tensor decompositions and correlated-data analysis.

2) application of these algorithms in multiple real-world settings, detecting anomalies, patterns, and communities in brain-scan signals, power-grid measurements, and epidemiology time series.

3) a model of functional brain connectivity, as part of the NeuroSemantics task.

The work resulted in multiple dissertations from CMU and the Univ. of Minnesota, one of which was runner-up for the KDD doctoral dissertation award.


Last Modified: 06/15/2019
Modified by: Christos Faloutsos
