Award Abstract # 1526674
III: Small: Collaborative Research: Reducing Classifier Bias in Social Media Studies of Public Health

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: ILLINOIS INSTITUTE OF TECHNOLOGY
Initial Amendment Date: August 12, 2015
Latest Amendment Date: August 12, 2015
Award Number: 1526674
Award Instrument: Standard Grant
Program Manager: Maria Zemankova
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2015
End Date: January 31, 2019 (Estimated)
Total Intended Award Amount: $304,725.00
Total Awarded Amount to Date: $304,725.00
Funds Obligated to Date: FY 2015 = $304,725.00
History of Investigator:
  • Aron Culotta (Principal Investigator)
    aculotta@tulane.edu
Recipient Sponsored Research Office: Illinois Institute of Technology
10 W 35TH ST
CHICAGO
IL  US  60616-3717
(312)567-3035
Sponsor Congressional District: 01
Primary Place of Performance: Illinois Institute of Technology
IL  US  60616-3793
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): E2NDENMDUEG8
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7923
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Social media creates a new opportunity for public health research, offering greater reach at lower cost than traditional survey methods. Online content also allows researchers to measure, in real time, how behaviors and attitudes change in response to rare events such as legal changes, new products, and marketing campaigns. Machine learning techniques for classification can be used to tailor interventions that improve health outcomes while minimizing costs. However, online content is not a random sample of the population, which can bias the outcomes. This proposal develops techniques to overcome this problem, enabling effective use of publicly available social media data for public health research. The techniques are compared against a traditional survey-based approach to assess end-to-end effectiveness in a real-world public health scenario: determining the effectiveness of smoking cessation campaigns.

The project builds on well-grounded statistical approaches to eliminate classifier bias. Key innovations are extending these approaches to the high-dimensional, noisy domain of textual social media data (specifically Twitter), making them robust to confounding variables, and developing scalable methods to identify comparison groups. Noisy data will be addressed by advancing multiple imputation techniques. The project will develop a model-based approach to identifying comparison groups that addresses confounding variable issues. The methods will be evaluated in the context of an actual public health study of smoking cessation, based on historical Twitter data, traditional surveys conducted before and after a CDC campaign, and a survey of smokers on perceived risk factors of e-cigarettes.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Aron Culotta "Towards identifying leading indicators of smoking cessation attempts from social media" IEEE International Conference on Healthcare Informatics, Workshop on Health Data Science: Creation, Presentation, Analysis and Interpretation, 2016
Aron Culotta "Training a text classifier with a single word using Twitter Lists and domain adaptation" Social Network Analysis and Mining, v.6, 2016 10.1007/s13278-016-0317-1
Aron Culotta and Nirmal Kumar Ravi and Jennifer Cutler "Predicting Twitter User Demographics using Distant Supervision from Website Traffic Data" Journal of Artificial Intelligence Research, v.55, 2016, p.389-408
Ehsan Mohammady Ardehaly and Aron Culotta "Co-training for Demographic Classification Using Deep Learning from Label Proportions" ICDM'17 Workshop, 2017
Ehsan Mohammady Ardehaly and Aron Culotta "Domain Adaptation for Learning from Label Proportions Using Self-Training" IJCAI'16, 2016, p.3670
Ehsan Mohammady Ardehaly and Aron Culotta "Learning from noisy label proportions for classifying online social data" Social Network Analysis and Mining, v.8, 2018
Shreesh Kumara Bhat and Aron Culotta "Identifying Leading Indicators of Product Recalls from Online Reviews using Positive Unlabeled Learning and Domain Adaptation" ICWSM'17, 2017
Virgile Landeiro and Aron Culotta "Back-door Adjustment for Robust Text Classification under Confounding Shift" Journal of Artificial Intelligence Research, 2018
Virgile Landeiro and Aron Culotta "Controlling for Unobserved Confounds in Classification Using Correlational Constraints" ICWSM'17, 2017

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The objective of our project was to investigate a suite of bias reduction techniques to improve the validity and robustness of public health studies conducted by applying statistical classifiers to social Internet data. Our overarching goal was to develop improved classification tools for the public health community while simultaneously providing fundamental advances in machine learning and natural language processing technology. The project has several significant results:

- We defined a new type of dataset shift, confounding shift, and observed many settings in which traditional supervised learning methods can degrade rapidly under such shift.

- We developed a suite of algorithms to identify and control for confounding shift, when confounds are observed or unobserved, resulting in significantly more robust and accurate classification algorithms.

- We developed several approaches to learning from label proportions, which can make it much easier to train classifiers for social media data. In many settings, the resulting accuracy is equal to or better than that of traditional fully supervised classification.

- We developed a method to sample queries from a classifier in order to identify samples that are both relevant to the target class and representative with respect to the larger population of documents. This methodology will make it easier for computational social scientists to identify data relevant to variables of interest, while reducing the bias introduced by sampling methodologies.

- We successfully applied our methodology to estimate the effect of CDC smoking cessation campaigns using Twitter data. Using a combination of machine learning and econometrics, we estimate that the national campaign resulted in an increase of about 10% in smoking cessation attempts, based on self-reports automatically identified on Twitter. An additional qualitative analysis allows us to further investigate these cessation attempts, which would be harder and more time-consuming to do with traditional survey data approaches.
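
As an illustration of the confounding-shift results above, the following is a minimal Python sketch of back-door adjustment for classification: the prediction for an input x averages p(y | x, z) over the training distribution of a confounder z, rather than conditioning on x alone. It is not the project's implementation. The count-based estimator, the function names, the single binary feature, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def fit_counts(rows):
    """Tabulate counts from (x, z, y) rows, where y is 0 or 1.

    x is a feature value, z is an observed confounder, y is the label.
    """
    n_xz = Counter()    # number of rows with each (x, z) pair
    n_xzy1 = Counter()  # number of those rows with y == 1
    n_z = Counter()     # number of rows with each z value
    for x, z, y in rows:
        n_xz[(x, z)] += 1
        n_xzy1[(x, z)] += y
        n_z[z] += 1
    return n_xz, n_xzy1, n_z


def naive_predict(x, tables):
    """Estimate p(y=1 | x), ignoring the confounder."""
    n_xz, n_xzy1, _ = tables
    total = sum(c for (xv, _), c in n_xz.items() if xv == x)
    positives = sum(c for (xv, _), c in n_xzy1.items() if xv == x)
    return positives / total


def backdoor_predict(x, tables, alpha=1.0):
    """Estimate sum over z of p(y=1 | x, z) * p(z).

    alpha is an add-alpha smoothing constant; keep it above zero
    if some (x, z) cells may be empty.
    """
    n_xz, n_xzy1, n_z = tables
    total = sum(n_z.values())
    p = 0.0
    for z, count_z in n_z.items():
        p_y_given_xz = (n_xzy1[(x, z)] + alpha) / (n_xz[(x, z)] + 2 * alpha)
        p += p_y_given_xz * (count_z / total)
    return p


# Hypothetical toy data: x = 1 if a message mentions quitting,
# z = a binary confounder, y = 1 if the message is a true cessation attempt.
ROWS = (
    [(1, 1, 1)] * 3 + [(1, 1, 0)] * 1 +
    [(0, 1, 1)] * 2 + [(0, 1, 0)] * 2 +
    [(1, 0, 1)] * 1 + [(1, 0, 0)] * 1 +
    [(0, 0, 1)] * 2 + [(0, 0, 0)] * 8
)
TABLES = fit_counts(ROWS)
```

In this toy data the confounder is correlated with both the feature and the label, so `naive_predict(1, TABLES)` is higher than `backdoor_predict(1, TABLES, alpha=0.0)`. The adjusted estimate does not rely on the training-time association between x and z, which is the property that matters when that association changes at test time.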


Last Modified: 03/02/2019
Modified by: Aron Culotta
