NSF Award Search: Award # 2046853

Award Abstract # 2046853

CAREER: Natural Narratives and Multimodal Context as Weak Supervision for Learning Object Categories

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	UNIVERSITY OF PITTSBURGH - OF THE COMMONWEALTH SYSTEM OF HIGHER EDUCATION
Initial Amendment Date:	February 24, 2021
Latest Amendment Date:	February 15, 2025
Award Number:	2046853
Award Instrument:	Continuing Grant
Program Manager:	Jie Yang jyang@nsf.gov (703)292-4768 IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering
Start Date:	May 1, 2021
End Date:	April 30, 2026 (Estimated)
Total Intended Award Amount:	$547,138.00
Total Awarded Amount to Date:	$547,138.00
Funds Obligated to Date:	FY 2021 = $90,374.00 FY 2022 = $134,355.00 FY 2023 = $117,916.00 FY 2024 = $98,966.00 FY 2025 = $105,527.00
History of Investigator:	Adriana Kovashka (Principal Investigator) kovashka@cs.pitt.edu
Recipient Sponsored Research Office:	University of Pittsburgh 4200 FIFTH AVENUE PITTSBURGH PA US 15260-0001 (412)624-7400
Sponsor Congressional District:	12
Primary Place of Performance:	University of Pittsburgh - of the Commonwealth System of Higher 300 MURDC, 3420 Forbes Avenue Pittsburgh PA US 15213-3203
Primary Place of Performance Congressional District:	12
Unique Entity Identifier (UEI):	MKAGLD59JRL1
Parent UEI:
NSF Program(s):	Robust Intelligence
Primary Program Source:	01002122DB NSF RESEARCH & RELATED ACTIVIT 01002223DB NSF RESEARCH & RELATED ACTIVIT 01002324DB NSF RESEARCH & RELATED ACTIVIT 01002425DB NSF RESEARCH & RELATED ACTIVIT 01002526DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	1045, 7495
Program Element Code(s):	749500
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

This project develops a framework to train computer vision models for detection of objects from weak, naturally-occurring supervision of language (text or speech) and additional multimodal signals. It considers dynamic settings, where humans interact with their visual environment and refer to the encountered objects, e.g., ?Carefully put the tomato plants in the ground? and ?Please put the phone down and come set the table,? and captions written for a human audience to complement an image, e.g., news article captions. The challenge of using such language-based supervision for training detection systems is that along with useful signal, the speech contains many irrelevant tokens. The project will benefit society by exploring novel avenues for overcoming this challenge and reducing the need for expensive and potentially unnatural crowdsourced labels for training. It has the potential to make object detection systems more scalable and thus more usable by a broad user base in a variety of settings. The resources and tools developed would allow natural, lightweight learning in different environments, e.g., different languages or types of imagery where the well-known object categories are not useful or where there is a shift in both the pixels as well as the way in which humans refer to objects (different cultures, medicine, art). This project opens possibilities for learning in vivo rather than in vitro; while the focus here is on object categories, multimodal weak supervision is useful for a larger variety of tasks. Research and education are integrated through local community outreach and research mentoring for students from lesser-known universities, new programs for student training including honing graduate students' writing skills, and development of interactive educational modules and demos based on research findings.

This project creatively connects two domains, vision-and-language, and object detection, and pioneers training of object detection models with weak language supervision and a large vocabulary of potential classes. The impact of noise in the language channel will be mitigated through three complementary techniques that model visual concreteness of words, to what extent the text refers to the visual environment it appears with, and whether the weakly-supervised models that are learned are logically consistent. Two complementary word-region association mechanisms will be used (metric learning and cross-modal transformers), whose application is novel for weakly-supervised detection. Importantly, to make detection feasible, not only the semantics of image-text pairs, but their discourse relationship, will be captured. To facilitate and disambiguate the association of words to a physical environment, the latter will be represented through additional modalities, namely sound, motion, depth and touch, which are either present in the data or estimated. This project advances knowledge of how multimodal cues contextualize the relation between image and text; no prior work has modeled image-text relationships along multiple channels (sound, depth, touch, motion). Finally, to connect the appearance of objects to the purpose and use of these objects, relationships between objects, properties and actions will be semantically organized in a graph, and grammars to represent activities involving objects will be extracted, still maintaining the weakly-supervised setting.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Gungor, Cagri and Kovashka, Adriana "Boosting Weakly Supervised Object Detection using Fusion and Priors from Hallucinated Depth" Winter Applications of Computer Vision (WACV) , 2024 Citation Details

Gungor, Cagri and Kovashka, Adriana "Complementary Cues from Audio Help Combat Noise in Weakly-Supervised Object Detection" 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , 2023 https://doi.org/10.1109/WACV56688.2023.00222 Citation Details

Nebbia, Giacomo and Kovashka, Adriana "Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining" ACM International Conference on Multimedia Retrieval , 2023 https://doi.org/10.1145/3591106.3592223 Citation Details

Rai, Arushi and Kovashka, Adriana "Improving language-supervised object detection with linguistic structure analysis" IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023 https://doi.org/10.1109/cvprw59228.2023.00588 Citation Details

Rai, Arushi and Kovashka, Adriana "VEIL: Vetting Extracted Image Labels from In-the-Wild Captions for Weakly-Supervised Object Detection" , 2024 Citation Details

Unal, Mesut Erhan and Ye, Keren and Zhang, Mingda and Thomas, Christopher and Kovashka, Adriana and Li, Wei and Qin, Danfeng and Berent, Jesse "Learning to Overcome Noise in Weak Caption Supervision for Object Detection" IEEE Transactions on Pattern Analysis and Machine Intelligence , 2022 https://doi.org/10.1109/TPAMI.2022.3187350 Citation Details

Ye, Keren and Kovashka, Adriana "Weakly-Supervised Action Detection Guided by Audio Narration" 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) , 2022 https://doi.org/10.1109/CVPRW56347.2022.00159 Citation Details

Please report errors in award information by writing to: awardsearch@nsf.gov.

Success

Error