
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | February 28, 2017 |
Latest Amendment Date: | February 28, 2017 |
Award Number: | 1657338 |
Award Instrument: | Standard Grant |
Program Manager: |
Sylvia Spengler
sspengle@nsf.gov (703)292-7347 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2017 |
End Date: | August 31, 2020 (Estimated) |
Total Intended Award Amount: | $174,117.00 |
Total Awarded Amount to Date: | $174,117.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
3100 MARINE ST Boulder CO US 80309-0001 (303)492-6221 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
CO US 80303-1058 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | CRII CISE Research Initiation |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
There is a rapidly growing body of research that uses user-generated content from the web, e.g., social media messages, to draw conclusions about the world. Using machine learning and natural language processing methods, it is possible to estimate public opinion, consumer sentiment, and population health based on what people are publicly sharing about their thoughts and actions online. For example, if someone writes that they have a fever, we might infer that they have the flu; if we aggregate all messages like this, we can track the prevalence and spread of the flu at a population level. However, a challenge with applying machine learning to user-generated content is that the characteristics of the content are highly dependent on the Who, When, and Where of the users. Online discussions evolve rapidly; a system built in one year might not work well in the next, and a system built for one community of users might not work for another. The proposed project seeks to create machine learning methods that are robust to variations in time, geography, and demographics of content and content creators. Related to domain adaptation techniques in machine learning, the PI proposes methods that learn to generalize across these various content attributes. The general goal is to create robust, open source tools that can be easily adopted by other researchers. One particular outcome of the project will be to improve the machine learning classifiers used in prior work on social media-based disease surveillance. The output of the PI's health analysis systems will be integrated into HealthTweets.org, a publicly accessible website that shares daily estimates of disease prevalence for other researchers and health officials.
The project will create hierarchical Bayesian models for training classifiers that can be adapted across different content attributes. The specific attributes of interest include time, geography, and demographic group of the author, but the proposed models do not depend on the specific attributes, and can be broadly applied to other machine learning settings. As a starting point, a predictive model (classification or regression) will be constructed that can be adapted across one attribute at a time. The PI will then create novel extensions to the model that can adapt across conjunctions of multiple attributes, such as time AND location. These extensions are related to the PI's prior work on building structured topic models that learn relationships between different features of content. Finally, in addition to creating predictive models, the PI will also build models of content that can be used to infer missing attributes (e.g., the location of a user if it is unknown), which can be combined with the predictive models to jointly perform inference and classification. Classification performance in new settings on a variety of datasets and exploration of the effects of, and sensitivity to, different parameters will be tested. Specific deliverables include the improvement a classifier for detecting influenza infection on Twitter, and integrating the classifier into the website, HealthTweets.org.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
Please report errors in award information by writing to: awardsearch@nsf.gov.