
NSF Org: |
OAC Office of Advanced Cyberinfrastructure (OAC) |
Recipient: |
|
Initial Amendment Date: | July 9, 2021 |
Latest Amendment Date: | July 9, 2021 |
Award Number: | 2135446 |
Award Instrument: | Standard Grant |
Program Manager: | Plato Smith, plsmith@nsf.gov, (703)292-4278, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 15, 2021 |
End Date: | June 30, 2023 (Estimated) |
Total Intended Award Amount: | $299,858.00 |
Total Awarded Amount to Date: | $299,858.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
104 AIRPORT DR STE 2200 CHAPEL HILL NC US 27599-5023 (919)966-3411 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
Manning Hall Chapel Hill NC US 27599-3360 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | NSF Public Access Initiative |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The Computer-Assisted Redaction and Anonymization of Scholarly Communications and Products (CARASCAP) project will produce a proof-of-concept open-source application stack to assist research teams and individual scholars in identifying, documenting, and redacting sensitive and personally identifying information within their research products. The potential presence of personally identifying information (PII) and other sensitive information is a significant inhibitor to public access to datasets and other products of publicly funded research. Without reliable and cost-effective processes for identifying such information, the default response is most often to indefinitely prevent the public from accessing entire collections of research products. By developing new components and tools for iterative redaction functions incorporated into workflows to prepare datasets for public dissemination, this project will foster a stronger ecosystem of research data publishing efforts.
The software will be developed primarily in Python, released under the MIT License, and packaged for distribution on the Python Package Index (PyPI). Independent modules will interpret and modify the source material data structures. For this prototype phase, the project will focus on formats likely to be of interest to a broad range of collections, including open text scraped from web pages at specific URLs, text formats, and modern office formats (e.g., PDF, .odt, .docx, .pst, .ost).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Computer-Assisted Redaction and Anonymization of Scholarly Communications and Products (CARASCAP) developed and tested new software for identification and redaction/filtering of sensitive information within scholarly communications and products (SCAP). Most modern redaction software is built using the same set of core technologies: typically some combination of a document parser, optical character recognition, natural language processing (NLP) to identify entities of interest, pattern libraries to match common private and personally identifying information (PII), and custom string matching. Improving the performance of these products is generally iterative, e.g., increasing document format coverage, expanding pattern libraries, or using improved NLP models. While performance and coverage are important factors for institutional workflows, they do not answer more fundamental questions about how and why certain redaction actions are performed. CARASCAP introduced a new approach that adds explainability to the redaction process by recording metadata that links redaction actions to specific rules and model behaviors. One can then use this information to validate software behaviors, compare those behaviors to actions performed by humans redacting manually, and train models tuned to specific redaction behaviors for collections of similar documents.
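The idea of linking each redaction action to the rule that triggered it can be illustrated with a minimal Python sketch. This is not the CARASCAP implementation: the pattern library, the `RedactionRecord` class, the `redact` function, and the mask format are all hypothetical names chosen for illustration, and the two regular expressions are deliberately simplistic.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern library (an assumption for this sketch): rule id -> regex.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}-\d{4}"),
}


@dataclass(frozen=True)
class RedactionRecord:
    """Metadata explaining one redaction: which rule fired, and on what span."""
    rule_id: str
    start: int      # offset in the ORIGINAL text
    end: int        # exclusive end offset in the ORIGINAL text
    original: str   # the text that was masked


def redact(text, patterns=PATTERNS, mask="[REDACTED:{rule}]"):
    """Mask every pattern match and return (redacted_text, records)."""
    # Collect every match from every rule, tagged with the rule that produced it.
    hits = []
    for rule_id, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append(
                RedactionRecord(rule_id, match.start(), match.end(), match.group())
            )
    hits.sort(key=lambda record: record.start)

    pieces, kept, cursor = [], [], 0
    for record in hits:
        if record.start < cursor:
            # Overlaps a span that was already masked; keep the earlier one.
            continue
        pieces.append(text[cursor:record.start])
        pieces.append(mask.format(rule=record.rule_id))
        kept.append(record)
        cursor = record.end
    pieces.append(text[cursor:])
    return "".join(pieces), kept


if __name__ == "__main__":
    redacted, records = redact("Contact jane@example.org or (919)555-0100.")
    print(redacted)
    for record in records:
        print(record)
```

Because each record stores the rule identifier and the offsets into the original text, a reviewer could audit why a span was masked or compare the records against a manually redacted copy of the same document.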
Products of the project (available at https://github.com/carascap) include software to support text analysis, reporting, and redaction workflows. We have also developed Jupyter notebooks and sample data to demonstrate how the software can be integrated into various workflows. Applying NLP often requires switching between models, which was challenging with existing software, so we also produced a command-line tool to view, install, and upgrade models for users of spaCy, a widely used library for named-entity recognition.
Last Modified: 01/22/2024
Modified by: Christopher Lee