
NSF Org: IIS Division of Information & Intelligent Systems
Recipient: Carnegie Mellon University
Initial Amendment Date: January 25, 2021
Latest Amendment Date: April 23, 2024
Award Number: 2040942
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler, sspengle@nsf.gov, (703) 292-7347, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: February 1, 2021
End Date: January 31, 2025 (Estimated)
Total Intended Award Amount: $625,000.00
Total Awarded Amount to Date: $681,000.00
Funds Obligated to Date: FY 2022 = $8,000.00; FY 2023 = $16,000.00; FY 2024 = $16,000.00
History of Investigator:
Recipient Sponsored Research Office: 5000 Forbes Ave, Pittsburgh, PA, US 15213-3890, (412) 268-8746
Sponsor Congressional District:
Primary Place of Performance: 5000 Forbes Ave, WQED Building, Pittsburgh, PA, US 15213-3890
Primary Place of Performance Congressional District:
Unique Entity Identifier (UEI):
Parent UEI:
NSF Program(s): Fairness in Artificial Intelligence, Info Integration & Informatics, IIS Special Projects
Primary Program Source: 01002324DB NSF RESEARCH & RELATED ACTIVITIES; 01002425DB NSF RESEARCH & RELATED ACTIVITIES; 01002122DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s):
Program Element Code(s):
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Machine learning (ML) development teams often struggle to detect and mitigate harmful stereotypes due to their own blind spots, particularly when ML systems are deployed globally. These kinds of representation harms cannot be easily quantified using today's automated techniques or fairness metrics, and they require knowledge of specific social, cultural, and historical contexts. The research team will develop a crowd audit service that harnesses the power of volunteers and crowd workers to identify specific cases of bias and unfairness in machine learning systems, generalize those cases to systematic failures, and synthesize and prioritize the findings in a form that is readily actionable by development teams. Success in this work will lead to new ways to identify bias and unfairness in machine learning systems, thus improving trust in and the reliability of these systems. The research team's work will be shared through a public website that will make it easy for journalists, policy makers, researchers, and the public at large to understand algorithmic bias and to participate in finding unfair behaviors in machine learning systems.
This project will explore three major research questions. The first is investigating new techniques for recruiting and incentivizing participation from a diverse crowd. The second is developing new and effective forms of guidance that help crowd workers find instances of bias and generalize from them. The third is designing new ways of synthesizing findings from the crowd so that development teams can understand and productively act on them. The outputs of this research will include a taxonomy of harms; the design and evaluation of new tools that help the crowd tag, discuss, and generalize representation harms; new design practices for algorithmic socio-technical platforms that let users identify and report observed unfair system behaviors through the platform itself; and new datasets of unfair ML system behaviors identified by the crowd. These datasets will support future research into the design of crowd auditing systems and the nature of representation harms in ML systems, and they will serve future ML teams working on similar kinds of systems.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
AI systems are being used in education, healthcare, finance, art, social programs, and more. However, a great deal of past work has found that AI systems can inadvertently exhibit algorithmic bias against people based on demographics like race, gender, and age. This bias can result in negative stereotypes, unfair allocation of resources, and poor outcomes due to skewed decisions.
Interestingly, people are already organically coming together to audit AI systems for fairness and bias. The primary goals of this project were to (a) better understand how everyday people audit these kinds of systems today, and (b) use this understanding to develop new tools to help people evaluate and audit AI systems. As part of this second thrust, we built WeAudit.org to teach the general public about algorithmic bias and offer a suite of tools to facilitate collective crowd audits, with a focus on text-to-image Generative AI.
We conducted numerous interviews, diary studies, and workshops; analyzed thousands of tweets from existing organic audits of AI systems; and ran controlled experiments to better understand how people naturally think about bias in AI systems and how they work together to find and report problems. This work resulted in a theoretical model of how audits start and progress, as well as a better understanding of the strengths and weaknesses of crowd audits. For example, we found that participants from marginalized gender or sexual orientation groups were more likely to rate images biased against their own groups as severely harmful, but we did not observe a similar pattern for participants from marginalized racial groups.
We also interviewed many commercial AI developers to understand how they audit their systems and to learn what concerns they have about crowd audits. One example concern is that a preliminary public crowd audit of a system could make the company look bad even if the system has not yet been formally deployed.
Lastly, we studied how frontline workers use existing algorithms for prioritizing housing for homeless individuals. We used comic-book-style storyboards to solicit feedback from frontline workers and from homeless individuals about who should be prioritized for housing. This work helped prompt some states to investigate their use of algorithms for human services.
To help developers evaluate AI systems, we built and evaluated Deblinder, a tool that lets an analyst gather failure reports for AI systems from the crowd (similar to bug reports but for AI systems), cluster similar failure reports together, and hypothesize about more general failures. We also developed Zeno, which embodies a philosophy of behavior-driven AI development, helping developers evaluate their AI models on specific behaviors (subsets of cases) rather than just a single overall metric. For example, for a computer vision AI model, instead of a single accuracy metric, one could use Zeno to test how well the model works for people with dark skin, light skin, glasses, long hair, and more. Zeno is open source and deployed at https://zenoml.com, and has been used by over 500 people auditing over 15,000 models in machine learning courses and in industry.
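To make the slice-based idea concrete, here is a minimal sketch of behavior-driven evaluation in Python. It is an illustration only, not Zeno's actual API; the dataframe columns, slice definitions, and file name are assumptions for the example.

    import pandas as pd

    def accuracy(df: pd.DataFrame) -> float:
        # Fraction of rows where the model's prediction matches the label.
        return float((df["prediction"] == df["label"]).mean()) if len(df) else float("nan")

    def evaluate_by_slice(df: pd.DataFrame, slices: dict) -> pd.DataFrame:
        # Report overall accuracy plus accuracy on each named behavioral slice.
        rows = [{"slice": "overall", "n": len(df), "accuracy": accuracy(df)}]
        for name, mask_fn in slices.items():
            subset = df[mask_fn(df)]
            rows.append({"slice": name, "n": len(subset), "accuracy": accuracy(subset)})
        return pd.DataFrame(rows)

    # Hypothetical metadata columns for a face-analysis model.
    slices = {
        "dark skin":  lambda d: d["skin_tone"] == "dark",
        "light skin": lambda d: d["skin_tone"] == "light",
        "glasses":    lambda d: d["wears_glasses"],
        "long hair":  lambda d: d["hair_length"] == "long",
    }

    predictions = pd.read_csv("predictions_with_metadata.csv")  # assumed file layout
    print(evaluate_by_slice(predictions, slices))

Reporting per-slice counts alongside accuracy, as above, helps flag behaviors where the model does poorly as well as slices that are too small to evaluate reliably.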
We developed WeAudit (http://weaudit.org) to facilitate crowd audits of text-to-image generative AI. WeAudit offers several tools for auditing. TAIGA (https://taiga.weaudit.org) generates a set of images from the same prompt and lets users compare them against images generated from a different prompt. For example, a user can compare "kindergarten teacher with students" (which often generates many images of young white women teachers) with "professors with students" (which often generates many images of older white men), and users can discuss such comparisons on our discussion boards. Another tool is Ouroboros, which uses AI to help audit AI: a user can generate a dozen images from the same prompt, and Ouroboros then uses computer vision to show the quantitative distributions of age, gender, and skin color across those images. Lastly, MIRAGE (https://mirage.weaudit.org) lets people compare the images that different AI models generate from the exact same prompt.
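As an illustration of the kind of audit TAIGA and Ouroboros support, the sketch below generates a batch of images for a prompt and estimates one attribute distribution with an off-the-shelf vision model. The model choices (Stable Diffusion v1.5 for generation, CLIP zero-shot labels for classification) are assumptions for the example, not the models the WeAudit tools actually use.

    import torch
    from collections import Counter
    from diffusers import StableDiffusionPipeline
    from transformers import pipeline

    # Text-to-image generator and a zero-shot image classifier (illustrative choices).
    generator = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    labeler = pipeline("zero-shot-image-classification",
                       model="openai/clip-vit-base-patch32")

    def audit_prompt(prompt: str, n_images: int = 12) -> Counter:
        # Generate n images for one prompt and tally the top CLIP label per image.
        images = generator(prompt, num_images_per_prompt=n_images).images
        counts = Counter()
        for image in images:
            scores = labeler(image, candidate_labels=["a photo of a man",
                                                      "a photo of a woman"])
            counts[scores[0]["label"]] += 1  # results are sorted by score
        return counts

    # Compare two occupations phrased the same way, as in the TAIGA example above.
    for prompt in ["a kindergarten teacher with students", "a professor with students"]:
        print(prompt, dict(audit_prompt(prompt)))

Automated tallies like this are only approximations, which is one reason the WeAudit tools pair such summaries with the images themselves and with human discussion.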
Results of this research have been used in several courses at Carnegie Mellon University. Over 200 students have learned about AI bias and participated in AI auditing using WeAudit in courses the PIs have taught. This research has helped support 4 PhD students, 6 REU students, and many master's and undergraduate students from computer science, psychology, business, design, and more. We have been working with researchers at the University of Notre Dame's Technology Ethics Lab (joint with IBM) on WeAudit, developing more features and testing with students. We are continuing this work on crowd auditing with Seoul National University (SNU) as part of a new joint Human-Centered AI Research Center.
Last Modified: 03/14/2025
Modified by: Jason Hong