Award Abstract # 1657338
CRII: III: Real-World Machine Learning: Adaptation Methods for Addressing Temporal, Geographic, and Demographic Confounds in User-Generated Content

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: THE REGENTS OF THE UNIVERSITY OF COLORADO
Initial Amendment Date: February 28, 2017
Latest Amendment Date: February 28, 2017
Award Number: 1657338
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2017
End Date: August 31, 2020 (Estimated)
Total Intended Award Amount: $174,117.00
Total Awarded Amount to Date: $174,117.00
Funds Obligated to Date: FY 2017 = $174,117.00
History of Investigator:
  • Michael Paul (Principal Investigator)
    mpaul@colorado.edu
Recipient Sponsored Research Office: University of Colorado at Boulder
3100 MARINE ST
Boulder
CO  US  80309-0001
(303)492-6221
Sponsor Congressional District: 02
Primary Place of Performance: University of Colorado at Boulder
CO  US  80303-1058
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): SPVKK1RC2MZ3
Parent UEI:
NSF Program(s): CRII CISE Research Initiation
Primary Program Source: 01001718DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 8228
Program Element Code(s): 026Y00
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

There is a rapidly growing body of research that uses user-generated content from the web, e.g., social media messages, to draw conclusions about the world. Using machine learning and natural language processing methods, it is possible to estimate public opinion, consumer sentiment, and population health based on what people are publicly sharing about their thoughts and actions online. For example, if someone writes that they have a fever, we might infer that they have the flu; if we aggregate all messages like this, we can track the prevalence and spread of the flu at a population level. However, a challenge with applying machine learning to user-generated content is that the characteristics of the content are highly dependent on the Who, When, and Where of the users. Online discussions evolve rapidly; a system built in one year might not work well in the next, and a system built for one community of users might not work for another. The proposed project seeks to create machine learning methods that are robust to variations in time, geography, and demographics of content and content creators. Building on domain adaptation techniques in machine learning, the PI proposes methods that learn to generalize across these various content attributes. The general goal is to create robust, open-source tools that can be easily adopted by other researchers. One particular outcome of the project will be to improve the machine learning classifiers used in prior work on social media-based disease surveillance. The output of the PI's health analysis systems will be integrated into HealthTweets.org, a publicly accessible website that shares daily estimates of disease prevalence for other researchers and health officials.


The project will create hierarchical Bayesian models for training classifiers that can be adapted across different content attributes. The specific attributes of interest include time, geography, and demographic group of the author, but the proposed models do not depend on the specific attributes, and can be broadly applied to other machine learning settings. As a starting point, a predictive model (classification or regression) will be constructed that can be adapted across one attribute at a time. The PI will then create novel extensions to the model that can adapt across conjunctions of multiple attributes, such as time AND location. These extensions are related to the PI's prior work on building structured topic models that learn relationships between different features of content. Finally, in addition to creating predictive models, the PI will also build models of content that can be used to infer missing attributes (e.g., the location of a user if it is unknown), which can be combined with the predictive models to jointly perform inference and classification. The PI will evaluate classification performance in new settings on a variety of datasets and examine the effects of, and sensitivity to, different parameters. Specific deliverables include improving a classifier for detecting influenza infection on Twitter and integrating the classifier into the website HealthTweets.org.
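
As an illustration only, the idea of a classifier that adapts across one attribute at a time can be sketched as a logistic regression whose weights for each attribute value (here, a year) are a shared weight vector plus a group-specific offset. Placing zero-mean Gaussian priors on both parts and fitting by MAP gives a simple hierarchical model. This is not the PI's model: the function names, toy features, labels, and hyperparameters below are all hypothetical assumptions chosen to keep the sketch self-contained.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(examples, n_features, lam_shared=0.01, lam_group=0.1, lr=0.5, epochs=2000):
    """Gradient descent on the MAP objective for weights w_g = shared + offset_g.

    examples: list of (group, features, label) with label in {0, 1}.
    Zero-mean Gaussian priors on `shared` and on each offset become L2 penalties.
    """
    groups = sorted({g for g, _, _ in examples})
    shared = [0.0] * n_features
    offsets = {g: [0.0] * n_features for g in groups}
    n = len(examples)
    for _ in range(epochs):
        # Start each gradient with the prior (L2) term.
        grad_shared = [lam_shared * w for w in shared]
        grad_off = {g: [lam_group * w for w in offsets[g]] for g in groups}
        for g, x, y in examples:
            z = sum((s + o) * xi for s, o, xi in zip(shared, offsets[g], x))
            err = sigmoid(z) - y  # derivative of the log loss w.r.t. z
            for j, xi in enumerate(x):
                grad_shared[j] += err * xi / n
                grad_off[g][j] += err * xi / n
        shared = [w - lr * d for w, d in zip(shared, grad_shared)]
        for g in groups:
            offsets[g] = [w - lr * d for w, d in zip(offsets[g], grad_off[g])]
    return shared, offsets

def predict(shared, offsets, group, x):
    """Probability of the positive class; unseen groups use the shared weights only."""
    off = offsets.get(group, [0.0] * len(shared))
    return sigmoid(sum((s + o) * xi for s, o, xi in zip(shared, off, x)))

# Hypothetical toy features: [bias, mentions "fever", mentions "sick"].
# Assumed drift: in 2017, "sick" on its own no longer signals illness.
X = [[1, 1, 0], [1, 0, 1], [1, 0, 0], [1, 1, 1]]
data = ([("2016", x, y) for x, y in zip(X, [1, 1, 0, 1])]
        + [("2017", x, y) for x, y in zip(X, [1, 0, 0, 1])])

shared, offsets = train(data, n_features=3)
p_2016 = predict(shared, offsets, "2016", [1, 0, 1])
p_2017 = predict(shared, offsets, "2017", [1, 0, 1])
p_new = predict(shared, offsets, "2018", [1, 0, 1])  # year not seen in training
```

In this sketch the same message ("sick" without "fever") receives a higher score under the 2016 weights than under the 2017 weights, while a year absent from training falls back to the shared weights. Treating a conjunction such as (year, region) as its own group key would be one naive extension under the same assumptions; the structured extensions described above are more involved.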

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Huang, Xiaolei and Paul, Michael J. "Neural Temporality Adaptation for Document Classification: Diachronic Word Embeddings and Domain Adaptation Models" Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Huang, Xiaolei and Paul, Michael J. "Examining Temporality in Document Classification" Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), v.2, 2018
Huang, Xiaolei and Paul, Michael J. "Neural User Factor Adaptation for Text Classification: Learning to Generalize Across Author Demographics" Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), 2019. DOI: 10.18653/v1/S19-1015
Huang, Xiaolei and Smith, Michael C. and Jamison, Amelia M. and Broniatowski, David A. and Dredze, Mark and Quinn, Sandra Crouse and Cai, Justin and Paul, Michael J. "Can online self-reports assist in real-time identification of influenza vaccination uptake? A cross-sectional study of influenza vaccine-related tweets in the USA, 2013–2017" BMJ Open, v.9, 2019. DOI: 10.1136/bmjopen-2018-024018

Please report errors in award information by writing to: awardsearch@nsf.gov.
