
NSF Org: | IIS Division of Information & Intelligent Systems |
Initial Amendment Date: | July 27, 2017 |
Latest Amendment Date: | April 24, 2019 |
Award Number: | 1718262 |
Award Instrument: | Standard Grant |
Program Manager: | Jie Yang, jyang@nsf.gov, (703) 292-4768, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 1, 2017 |
End Date: | July 31, 2021 (Estimated) |
Total Intended Award Amount: | $449,978.00 |
Total Awarded Amount to Date: | $449,978.00 |
Recipient Sponsored Research Office: | 4200 Fifth Avenue, Pittsburgh, PA, US 15260-0001, (412) 624-7400 |
Primary Place of Performance: | Pittsburgh, PA, US 15213-2303 |
NSF Program(s): | Robust Intelligence |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This project develops systems for analyzing and inferring the non-literal messages conveyed in the media through persuasive images and text. Computational representations of two persuasive strategies are devised to model the mapping between observable information and underlying messages. First, this project models "vividness": through analyses of how human subjects perceive images and text, the system aims to identify relevant regions in which creative techniques were used to draw the viewer's attention. Second, this project models "symbolism": through analyses of semantic relationships between concrete objects and abstract concepts, the system aims to decode symbolic associations that humans make. The ability to automatically understand vividness and symbolism is key to building computational intelligence that can make inferences about what the media implies. This interdisciplinary project also has an educational component aimed at increasing the media literacy of school students and involving college students from diverse backgrounds in computational research. The work can be used to discover patterns in how the visual rhetoric in the media has evolved over time or how it differs across cultures.
This research pursues three directions. First, a framework for judging vividness (i.e., to what degree an image as a whole is vivid; what part of an image is vivid; and whether a text snippet is vivid) is developed. Data about the vividness of a variety of images and text is collected from human annotators. Cues and techniques such as saliency, attention, sentiment, memorability and abnormality are used to build prediction models for vividness. Second, two pipelines for detecting symbolic references are developed. One pipeline hypothesizes potential signifiers from an image, then uses textual resources to map these to signifieds. The other pipeline directly hypothesizes what the signifieds might be, and obtains training data for these from web resources. The outputs from these pipelines are combined to generate the signifier-signified pairs. Third, a method for generating explanations of the strategies is developed, using the vividness and symbolism outputs. Numerous resources to be shared with the research community are developed over the course of the project.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project transformed the computer vision field in two ways. First, it emphasized the importance of analyzing persuasive media, e.g. advertisements and images in political articles. We developed a variety of techniques for visual reasoning that tackled the specific challenges of persuasive imagery. Second, the project allowed us to develop methods that examine and leverage the complementarity of visual and textual information, both in persuasive media and in general multimodal data. As a related problem, we also examined how visual reasoning methods overfit to specific aspects of the multimodal inputs (e.g. text) and how robust reasoning methods are to shallow changes in the input.
We focused on five separate research thrusts. The first line of research, resulting in a recent acceptance to ICCV 2021, was directly related to symbolism and focused on detecting persuasive atypicality in advertisement images. In earlier work, we annotated images in our advertisement dataset (CVPR 2017, TPAMI 2019) as showing objects in a typical or atypical (abnormal) manner. In this project, we developed a new technique to detect atypicality, based on the intuition that the relative position of objects with respect to one another is a strong indicator of atypicality, and can be learned by modeling context. We are currently working on evaluating how well language models can parse symbolism, i.e. infer what is being symbolized by a symbol (e.g. "dragon symbolizes ___" where ___ is "danger").
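As an illustrative toy sketch only (not the published ICCV 2021 method, which learns context with a model), the relative-position intuition can be expressed as comparing an object pair's observed offset against statistics from typical scenes; all object names, boxes, and statistics below are invented for illustration:

```python
import math

# Hypothetical "typical context" statistics: mean relative offset (dx, dy)
# between object-pair centers in normalized image coordinates (y grows
# downward, so a negative dy means the first object sits above the second).
# All numbers are invented for illustration.
TYPICAL_OFFSET = {
    ("cup", "table"): (0.0, -0.3),
    ("shoe", "floor"): (0.0, -0.1),
}

def center(box):
    """Center of a box given as (x1, y1, x2, y2) in normalized coordinates."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def atypicality_score(label_a, box_a, label_b, box_b):
    """Distance between the observed relative offset of a's center from b's
    center and the typical offset; larger means a more atypical placement."""
    typical = TYPICAL_OFFSET.get((label_a, label_b))
    if typical is None:
        return 0.0  # no statistics available for this object pair
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    dx, dy = ax - bx, ay - by
    return math.hypot(dx - typical[0], dy - typical[1])

# A cup resting on the table scores near zero; a cup floating far above
# the table scores higher, i.e. looks atypical under this cue.
resting = atypicality_score("cup", (0.4, 0.3, 0.6, 0.5), "table", (0.2, 0.5, 0.8, 0.9))
floating = atypicality_score("cup", (0.4, 0.0, 0.6, 0.1), "table", (0.2, 0.5, 0.8, 0.9))
```

The actual technique models context rather than relying on hand-collected pair statistics; this sketch only shows why relative position is an informative signal.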
In a second line of research (published in BMVC 2018), we examined the relationship between an ad's visuals and the slogan appearing in the image. We studied the distinct ways in which these reinforce each other and jointly make a single argument, without necessarily making the exact same point or being redundant. The latter, i.e. literal alignment between image and text, has been commonly studied in vision-language tasks such as image captioning. In contrast, we find that the image and text of a single ad complement each other in more creative ways. For example, either the image or the text can be purposefully ambiguous, engaging the viewer's attention as they decode the ambiguity; the image and text can even individually appear to contradict each other, yet make a unified argument when viewed together. In a follow-up work, appearing in a CVPR 2020 workshop, we tackled decoding the allusions that narratives make. Advertisements are a type of narrative, so as a preliminary exploration, we first focused on text-only narratives, specifically the task of choosing the correct ending for a story. We examined the connection between context and endings by looking at relationships between context and ending words according to a knowledge base resource.
The third line of research focused on understanding the relation between images and text in multimodal political articles. We extended our NeurIPS 2019 conference paper into an IJCV 2021 journal version. That work's goal was to infer political bias from images, using text as an auxiliary modality. A follow-up work (presented at ECCV 2020) developed a cross-modal retrieval method which relied on within-modality constraints to help deal with the complementarity of image-text pairs and the diverse appearance of imagery that corresponds to the same topic.
A fourth line of work, inspired initially by our work on advertisements, was to train a scene graph generation model from the weak supervision contained in captions. This was a follow-up to our prior work, which trained object detection algorithms from captions. It was presented at CVPR 2021.
Our fifth line of work was to examine the robustness of visual question-answering (VQA) models. In one work, appearing in WACV 2021, we investigated how to make use of external knowledge base information for performing a visual reasoning task on our ads dataset. We discovered that because of the way the evaluation task is set up, it is easy for the model to find shallow "shortcuts" and ignore knowledge pieces, which are needed for reasoning on less-common brands. We tackled the problem through three stochastic masking techniques. In a follow-up work, appearing in AAAI 2021, we showed that the shortcut problem exists in other reasoning datasets and that reasoning methods (including the recent transformer models) suffer greatly from simple input changes that should not change the meaning of the question and answer. We proposed masking on a curriculum to ameliorate the issue. In our final work, appearing in CVPR 2021, we examined how robust VQA models are to training and testing on different datasets.
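A minimal sketch of the general idea behind stochastic masking (the published WACV 2021 techniques differ in detail; the function name, tokens, and example here are hypothetical): randomly hide "shortcut" tokens, such as a brand name, during training so the model is forced to rely on the accompanying knowledge text instead of memorizing the shortcut:

```python
import random

def mask_tokens(tokens, shortcut_tokens, p_mask=0.5, mask_symbol="[MASK]", rng=None):
    """Randomly replace shortcut tokens (e.g. a brand name that lets the
    model bypass the knowledge base) with a mask symbol. Applied during
    training, this forces the model to attend to the remaining knowledge
    text rather than exploiting the shortcut."""
    rng = rng or random.Random(0)
    return [
        mask_symbol if tok in shortcut_tokens and rng.random() < p_mask else tok
        for tok in tokens
    ]

# Hypothetical knowledge snippet; with p_mask=1.0 the brand name is always hidden.
knowledge = "nike is a brand known for athletic shoes".split()
masked = mask_tokens(knowledge, shortcut_tokens={"nike"}, p_mask=1.0)
```

In training, `p_mask` would be set below 1.0 (or varied on a curriculum, as in the AAAI 2021 follow-up) so the model sees both masked and unmasked inputs.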
Our work funded four graduate students for multiple semesters; two of them graduated, one is female, and one will pursue a career in academia. It also resulted in two publicly released datasets.
Last Modified: 10/29/2021
Modified by: Adriana Kovashka
Please report errors in award information by writing to: awardsearch@nsf.gov.