Award Abstract # 1251110
BIGDATA: Small: DA: Collaborative Research: From Data To Users: Providing Interpretable and Verifiable Explanations in Data Mining

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF MASSACHUSETTS
Initial Amendment Date: September 13, 2013
Latest Amendment Date: September 13, 2013
Award Number: 1251110
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 15, 2013
End Date: August 31, 2017 (Estimated)
Total Intended Award Amount: $250,000.00
Total Awarded Amount to Date: $250,000.00
Funds Obligated to Date: FY 2013 = $250,000.00
History of Investigator:
  • Andrew McGregor (Principal Investigator)
    mcgregor@cs.umass.edu
Recipient Sponsored Research Office: University of Massachusetts Amherst
101 COMMONWEALTH AVE
AMHERST
MA  US  01003-9252
(413)545-0698
Sponsor Congressional District: 02
Primary Place of Performance: University of Massachusetts Amherst
Amherst
MA  US  01003-9264
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): VGJHK59NMPK9
Parent UEI: VGJHK59NMPK9
NSF Program(s): Information Technology Research
Primary Program Source: 01001314DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7433, 7923, 8083
Program Element Code(s): 164000
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

The fruits of data mining pervade every aspect of our lives. Books and movies are recommended to us; we are offered differential pricing for insurance, screened for potential terror threats, diagnosed with various diseases, and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power. And yet there is still a gap between what such tools can deliver and what the users of data mining really need. It is often hard to interpret the answers produced by a learning algorithm, due to its sophistication and the use of large data sets to build models. The results of mining are often "one-size-fits-all", and convincing a user that the results are actually relevant to them is difficult. Finally, there is the important problem of validation: as the results of data mining affect more and more of our lives, it becomes increasingly crucial that users be able to validate the decisions made on their behalf. The common theme tying these issues together is a user-centric perspective on the problems of data mining. Rather than asking "What patterns can be found in this mountain of data?", this work instead asks "What structures in this data affect me?" These issues arise precisely because of the vast amounts of data we now have the ability to mine, and the sophisticated methods at our disposal to analyze this data. In this research, the PIs develop a computational framework and key tools for user-centric data mining. A central theme in this research is the idea of interaction: in both machine learning and the foundations of complexity theory, interaction has been used to allow a weaker entity to probe a much more powerful system and determine answers that it lacks the resources to compute directly itself. The PIs use formal interaction mechanisms, both from the perspective of a user interacting with a powerful algorithm and of a client interacting with a computing source with access to large data, in order to enable the user to interpret and validate the results of data mining.

The goal of this project is to develop a computational framework for user-centric data mining that enables existing users to tailor data analysis to their needs and facilitates the use of data mining in new areas where existing tools fall short. For example, tools for analyzing patient medical data have become more sophisticated, and individual medical profiles play a far more significant role in diagnosis and treatment. The framework rests on three principles:

1. Interpretability: Answers you can understand. The team proposes interactive mechanisms that start with the results of a learning process and, via interaction with the user, produce an explanation expressed in terms of meaningful features, drawing on ideas from active learning, feature selection, and domain adaptation.

2. Locality: Answers that are relevant. Here, the focus is on providing information that depends more on a user's local neighborhood, achieved via a new local notion of stability.

3. Verifiability: Answers you can check. The team proposes a framework for the validation of computationally-intensive data mining by the computationally-weak user, with ideas from interactive proof theory and streaming algorithms.

The research examines user-centric data mining via three core primitives (classification, regression, and clustering) and studies the three problems of interpreting results, providing local explanations, and validating the results of data mining. First, the research draws on ideas from active learning, feature selection, and domain adaptation to build interpretable results via interaction with users. Second, it introduces local notions of stability as a way of validating predictions for a specific user. Finally, it develops a general framework for validation of an analysis by a computationally-weak user, by drawing on ideas from the theory of interactive proofs and streaming algorithms.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Amit Chakrabarti, Graham Cormode, Andrew McGregor, and Justin Thaler. "Annotations in Data Streams." ACM Transactions on Algorithms, v.11, 2014, p.7. doi:10.1145/2636924
Amit Chakrabarti, Graham Cormode, Andrew McGregor, Justin Thaler, and Suresh Venkatasubramanian. "Verifiable Stream Computation and Arthur-Merlin Communication." 30th Conference on Computational Complexity (CCC 2015), Portland, Oregon, USA, June 17-19, 2015, p.217-243. doi:10.4230/LIPIcs.CCC.2015.217
Amit Chakrabarti, Graham Cormode, Andrew McGregor, Justin Thaler, and Suresh Venkatasubramanian. "Verifiable Stream Computation and Arthur-Merlin Communication." Electronic Colloquium on Computational Complexity (ECCC), v.21, 2014, p.86
Andrew McGregor and Hoa T. Vu. "Better Streaming Algorithms for the Maximum Coverage Problem." 20th International Conference on Database Theory (ICDT), 2017
Andrew McGregor and Hoa T. Vu. "Evaluating Bayesian Networks via Data Streams." Computing and Combinatorics: 21st International Conference (COCOON 2015), Beijing, China, August 4-6, 2015, p.731-743. doi:10.1007/978-3-319-21398-9_57
Kook Jin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, and Anthony Wirth. "Correlation Clustering in Data Streams." Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, July 6-11, 2015, p.2237-224

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The availability of massive amounts of data in various learning and data mining tasks has both advantages and disadvantages. With more data, we can hope to perform more sophisticated analysis and generate more accurate predictions. This could result in anything from better book recommendations from an online retailer to more detailed medical diagnoses and even to more accurate screening of potential terror threats. However, with more data comes more responsibility along with new computational challenges. Since the results of data mining can have negative consequences, e.g., an inappropriate treatment due to a misdiagnosis, we want to be able to validate the computations that led to the final results and understand the properties of the data set that may be relevant to the outcome.

The intellectual merit of this project revolved around (a) developing techniques for validating computation on large data sets and (b) designing efficient algorithms for analyzing properties of data sets that are relevant to various data mining tasks. In particular, we developed the annotated data stream framework, which enables the validation of computationally-intensive data mining by a computationally-weak user who has limited memory and is only able to take a single pass through the data. We then extended this framework to support protocols with multiple rounds of interaction, allowing the user to verify the computation more efficiently. We also designed new data stream algorithms with provable accuracy guarantees for various data mining tasks, including clustering (grouping together similar entities while ensuring dissimilar entities are in different groups), estimating the clustering coefficient of a network (a measure of how transitive the relationships in the data are), and constructing Bayesian networks that model dependencies in the data. Finally, we presented new results on reducing the dimension of high-dimensional data sets in such a way that information-theoretic distances are approximately preserved.
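The core idea behind annotated data streams can be illustrated with a toy polynomial-fingerprint check. This is a minimal sketch, not the protocol from the publications above: a verifier with O(1) memory folds each stream item into a fingerprint, and an untrusted prover later supplies the full frequency vector as an annotation, which the verifier checks against the fingerprint at a random evaluation point.

```python
import random

P = (1 << 61) - 1  # Mersenne prime; all fingerprint arithmetic is done mod P

def fingerprint(freqs, r):
    """Evaluate sum_i freqs[i] * r**i (mod P), a polynomial fingerprint
    of the frequency vector."""
    acc, power = 0, 1
    for f in freqs:
        acc = (acc + f * power) % P
        power = (power * r) % P
    return acc

def streaming_verifier(stream, claimed_freqs, r):
    """O(1)-memory verifier: each stream item contributes r**item to the
    fingerprint; the prover's claimed frequency vector must match it."""
    acc = 0
    for item in stream:
        acc = (acc + pow(r, item, P)) % P
    return acc == fingerprint(claimed_freqs, r)

stream = [2, 0, 2, 3, 2]   # items drawn from the universe {0, 1, 2, 3}
honest = [1, 0, 3, 1]      # annotation: true frequency of each item
cheat  = [1, 1, 2, 1]      # mismatched annotation

r = random.randrange(1, P)  # random evaluation point, hidden from the prover
print(streaming_verifier(stream, honest, r))  # True
print(streaming_verifier(stream, cheat, r))   # False, except w/ prob < 4/P
```

A dishonest annotation is caught unless the random point r happens to be a root of the (low-degree) difference polynomial, which occurs with probability at most 4/P, i.e., essentially never.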

In terms of broader impact, the grant supported the training of numerous graduate students. In addition to publishing the results at conferences and in journals, we organized a tutorial at the leading machine learning conference. Finally, some of the results were included in a graduate course at the University of Massachusetts, and the related teaching material has been made public for the benefit of other students and instructors.

Last Modified: 05/22/2018
Modified by: Andrew McGregor

Please report errors in award information by writing to: awardsearch@nsf.gov.
