
NSF Org: | IIS Division of Information & Intelligent Systems |
Initial Amendment Date: | August 12, 2015 |
Latest Amendment Date: | August 12, 2015 |
Award Number: | 1526674 |
Award Instrument: | Standard Grant |
Program Manager: | Maria Zemankova, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 1, 2015 |
End Date: | January 31, 2019 (Estimated) |
Total Intended Award Amount: | $304,725.00 |
Total Awarded Amount to Date: | $304,725.00 |
Recipient Sponsored Research Office: | 10 W 35TH ST CHICAGO IL US 60616-3717 (312)567-3035 |
Primary Place of Performance: | IL US 60616-3793 |
NSF Program(s): | Info Integration & Informatics |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Social media creates a new opportunity for public health research, giving greater reach at lower cost than traditional survey methods. Online content offers several potential advantages over traditional survey data; for example, one can measure in real time how behaviors and attitudes change in response to rare events such as legal changes, new products, and marketing campaigns. Machine learning techniques for classification can be used to tailor interventions that improve health outcomes while minimizing costs. However, online content is not a random sample of the population, which can bias the outcomes. This proposal develops techniques to overcome this problem, enabling effective use of publicly available social media data for public health research. The approaches are compared against a traditional survey-based approach to assess end-to-end effectiveness in a real-world public health scenario: determining the effectiveness of smoking cessation campaigns.
The project builds on well-grounded statistical approaches to eliminate classifier bias. Key innovations are the extension of these approaches to the high-dimensional, noisy domain of textual social media data (specifically Twitter), robustness to confounding variables, and scalable methods to identify comparison groups. Noisy data will be addressed by advancing multiple imputation techniques. The project will develop a model-based approach to identifying comparison groups that addresses confounding variable issues. The methods will be evaluated in the context of an actual public health study of smoking cessation, based on historical Twitter data and traditional surveys conducted before and after a CDC campaign, as well as a survey of smokers on perceived risk factors of e-cigarettes.
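As an illustration of the kind of bias correction the abstract refers to, the sketch below applies a standard misclassification adjustment to a classifier-based prevalence estimate. It is a minimal example under stated assumptions, not the project's own method: the function name `adjusted_prevalence` and its parameters are hypothetical, and the classifier's true and false positive rates are assumed to be known from a labeled validation set.

```python
def adjusted_prevalence(predicted_positive_rate, tpr, fpr):
    """Correct a classifier's raw positive rate for its known error rates.

    If a fraction p of documents is truly positive, the classifier is
    expected to flag p * tpr + (1 - p) * fpr of them. Solving that
    equation for p gives the corrected estimate below.

    All three arguments are proportions in [0, 1]; tpr and fpr are assumed
    (hypothetically) to have been measured on held-out labeled data.
    """
    if tpr <= fpr:
        raise ValueError("correction requires tpr > fpr")
    estimate = (predicted_positive_rate - fpr) / (tpr - fpr)
    # Sampling noise can push the estimate slightly outside [0, 1].
    return min(1.0, max(0.0, estimate))
```

For instance, a classifier with a true positive rate of 0.8 and a false positive rate of 0.1 that flags 24% of documents yields a corrected prevalence of about 0.20, since 0.2 * 0.8 + 0.8 * 0.1 = 0.24.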
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The objective of our project was to investigate a suite of bias reduction techniques to improve the validity and robustness of public health studies conducted by applying statistical classifiers to social Internet data. Our overarching goal was to develop improved classification tools for the public health community while simultaneously providing fundamental advances in machine learning and natural language processing technology. The project produced several significant results:
- We defined a new type of dataset shift, confounding shift, and observed many settings in which the accuracy of traditional supervised learning methods degrades rapidly under such shift.
- We developed a suite of algorithms to identify and control for confounding shift, when confounds are observed or unobserved, resulting in significantly more robust and accurate classification algorithms.
- We developed a number of learning-from-label-proportions approaches, which can make it much easier to train classifiers for social media data. In many settings, the resulting accuracy is equal to or better than that of traditional fully supervised classification.
- We developed a method to sample queries from a classifier in order to identify samples that are both relevant to the target class and representative with respect to the larger population of documents. This methodology will make it easier for computational social scientists to identify data relevant to variables of interest, while reducing the bias introduced by sampling methodologies.
- We successfully applied our methodology to estimate the effect of CDC smoking cessation campaigns using Twitter data. Using a combination of machine learning and econometrics, we estimate that the national campaign resulted in an increase of about 10% in smoking cessation attempts, based on self-reports automatically identified on Twitter. An additional qualitative analysis allows us to investigate these cessation attempts further, which would be harder and more time-consuming with traditional survey-based approaches.
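The idea of controlling for an observed confounder, mentioned in the results above, can be sketched with a standard back-door adjustment. This is a minimal illustration under stated assumptions and is not presented as the project's algorithm: the function names, the `(x, z, y)` record layout (a binary feature `x`, an observed confounder `z`, and a binary label `y`), and the count-based estimates are all hypothetical choices made for the example.

```python
from collections import Counter


def naive_rate(records, x):
    """Estimate P(y = 1 | x), ignoring the confounder z."""
    outcomes = [y for xi, _, y in records if xi == x]
    if not outcomes:
        raise ValueError("no records with this feature value")
    return sum(outcomes) / len(outcomes)


def backdoor_adjusted_rate(records, x):
    """Estimate sum_z P(y = 1 | x, z) * P(z) by counting.

    Averaging over the marginal distribution of z, rather than the
    distribution of z among records with this x, removes the part of the
    x-y association that is explained by z alone.
    """
    n = len(records)
    z_counts = Counter(z for _, z, _ in records)
    total = 0.0
    for z, count in z_counts.items():
        outcomes = [y for xi, zi, y in records if xi == x and zi == z]
        if not outcomes:
            raise ValueError("no records for this (x, z) combination")
        total += (sum(outcomes) / len(outcomes)) * (count / n)
    return total
```

When `x` and `z` are correlated in the training data, `naive_rate` attributes part of the effect of `z` to `x`; the adjusted estimate stays stable if that correlation changes at test time, which is the situation described above as confounding shift.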
Last Modified: 03/02/2019
Modified by: Aron Culotta