
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | September 13, 2013 |
Latest Amendment Date: | September 13, 2013 |
Award Number: | 1251110 |
Award Instrument: | Standard Grant |
Program Manager: |
Sylvia Spengler
sspengle@nsf.gov (703)292-7347 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 15, 2013 |
End Date: | August 31, 2017 (Estimated) |
Total Intended Award Amount: | $250,000.00 |
Total Awarded Amount to Date: | $250,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
101 COMMONWEALTH AVE AMHERST MA US 01003-9252 (413)545-0698 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
Amherst MA US 01003-9264 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Information Technology Researc |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The fruits of data mining pervade every aspect of our lives. We have books and movies recommended; we are given differential pricing for insurance; screened for potential terror threats; diagnosed with various diseases; and targeted for political advertising. The ability to sift through massive data sets with sophisticated algorithms has resulted in applications with impressive predictive power. And yet there is still a gap between what such tools can deliver, and what the users of data mining really need. It is often hard to interpret the answers produced by a learning algorithm, due to its sophistication and the use of large data sets to build models. The results of mining are often "one-size-fits-all", and convincing a user that results are actually relevant to them is difficult. Finally, there is the important problem of validation. As the results of data mining affect more and more of our lives, the more crucial it is that the user be able to validate decisions made on their behalf and that affect them. The common theme tying these issues together is a user-centric perspective on the problems of data mining. Rather than asking "What patterns can be found in this mountain of data?" this work instead asks "What structures in this data affect me?" These issues arise precisely because of the vast amounts of data we now have the ability to mine, and the sophisticated methods at our disposal to analyze this data. In this research, the PIs develop a computational framework and key tools for user-centric data mining. A central theme in this research is the idea of interaction. In both machine learning and in the foundations of complexity theory, interaction has been used to allow a (weaker) entity to probe a much more powerful system and determine answers that it lacks the resources to compute directly itself. The PIs use formal interaction mechanisms both from the perspective of a user interacting with a powerful algorithm, as well as a client interacting with a computing source with access to large data, in order to enable the user to interpret and validate the results of data mining.
The goal of this project is to develop a computational framework for user-centric data mining that enables existing users to tailor data analysis to their needs and facilitates the use of data mining in new areas where existing The team proposes interactive mechanisms that start with the results of a learning process and, via interaction with the user, produce an explanation expressed in terms of meaningful features, drawing on ideas from active learning, feature selection, and domain adaptation. 2. Locality: Answers that are relevant. Here, the focus is on providing information that depends more on a user?s local neighborhood, achieved via a new local notion of stability. 3. Verifiability: Answers you can check. The team proposes a framework for the validation of computationally-intensive data mining by the computationally-weak user, with ideas from interactive proof theory and stream algorithms. Tools for analyzing patient medical data have become more sophisticated and individual medical profiles play a far more significant role in diagnosis and treatment.The research examines user-centric data mining via three core primitives (classification, regression and clustering), and studies the three problems of interpreting results, providing local explanations, and validating the results of data mining. Firstly, the research draws on ideas from active learning, feature selection and domain adaptation to build interpretable results via interaction with users. Secondly, it introduces local notions of stability as a way of validating predictions for a specific user. Finally, it develops a general framework for validation of an analysis by a computationally-weak user, by drawing on ideas from the theory of interactive proofs and streaming algorithms.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The availability of massive amounts of data in various learning and data mining tasks has both advantages and disadvantages. With more data, we can hope to perform more sophisticated analysis and generate more accurate predictions. This could result in anything from better book recommendations from an online retailer to more detailed medical diagnoses and even to more accurate screening of potential terror threats. However, with more data comes more responsibility along with new computational challenges. Since the results of data mining can have negative consequences, e.g., an inappropriate treatment due to a misdiagnosis, we want to be able to validate the computations that led to the final results and understand the properties of the data set that may be relevant to the outcome.
The intellectual merit of this project revolved around a) developing techniques for validation of computation on large data sets and b) designing efficient algorithms for analyzing properties of the data sets that are relevant to various data mining tasks. In particular, we developed the annotated data stream framework which enables the validation of computationally-intensive data mining by a computationally-weak user that has limited memory and is only able to take a single pass through the data. We then extended this to also support protocols that could have multiple rounds of interaction that would allow the user to verify the computation more efficiently. We then designed new data stream algorithms with provable accuracy guarantees for various data mining tasks including clustering (where we try to group together similar entities while ensuring dissimilar entities are in different groups), estimating the correlation coefficient of a network (a measure of how transitive the relationships in the data are), and constructing Bayesian networks that model dependencies in the data. Finally, we presented new results on reducing the dimension of high dimensional data sets in such a way that information-theoretic distances are approximately preserved.
In terms of broader impact, the training of numerous graduate students was supported on this grant. In addition to publishing the results at conferences and journals, we also organized a tutorial at the leading machine learning conference. Finally, some of the results were included in a graduate course at the University of Massachusetts and the related teaching material has been made public for the benefit of other students and instructors.
Last Modified: 05/22/2018
Modified by: Andrew Mcgregor
Please report errors in award information by writing to: awardsearch@nsf.gov.