Award Abstract # 2135446
EAGER: Computer-Assisted Redaction and Anonymization of Scholarly Communications and Products (CARASCAP)

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
Initial Amendment Date: July 9, 2021
Latest Amendment Date: July 9, 2021
Award Number: 2135446
Award Instrument: Standard Grant
Program Manager: Plato Smith
plsmith@nsf.gov
(703) 292-4278
OAC Office of Advanced Cyberinfrastructure (OAC)
CSE Directorate for Computer and Information Science and Engineering
Start Date: July 15, 2021
End Date: June 30, 2023 (Estimated)
Total Intended Award Amount: $299,858.00
Total Awarded Amount to Date: $299,858.00
Funds Obligated to Date: FY 2021 = $299,858.00
History of Investigator:
  • Christopher Lee (Principal Investigator)
    callee@ils.unc.edu
Recipient Sponsored Research Office: University of North Carolina at Chapel Hill
104 AIRPORT DR STE 2200
CHAPEL HILL
NC  US  27599-5023
(919)966-3411
Sponsor Congressional District: 04
Primary Place of Performance: University of North Carolina at Chapel Hill
Manning Hall
Chapel Hill
NC  US  27599-3360
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): D3LHU66KBLD5
Parent UEI: D3LHU66KBLD5
NSF Program(s): NSF Public Access Initiative
Primary Program Source: 01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7916
Program Element Code(s): 741400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

The Computer-Assisted Redaction and Anonymization of Scholarly Communications and Products (CARASCAP) project will produce a proof-of-concept open-source application stack to assist research teams and individual scholars in identifying, documenting, and redacting sensitive and personally identifying information within their research products. The potential presence of personally identifying information (PII) and other sensitive information is a significant inhibitor to public access to datasets and other products of publicly funded research. Without reliable and cost-effective processes for identifying such information, the default response is most often to indefinitely prevent the public from accessing entire collections of research products. By developing new components and tools for iterative redaction functions incorporated into workflows to prepare datasets for public dissemination, this project will foster a stronger ecosystem of research data publishing efforts.

The software will be developed primarily in Python, released under the MIT License, and packaged for distribution on the Python Package Index (PyPI). Independent modules will interpret and modify the source material data structures. For this prototype phase, the project will focus on formats likely to be of interest to a broad range of collections, including open text scraped from web pages at specific URLs, text formats, and modern office formats (e.g., PDF, .odt, .docx, .pst, .ost).

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Computer-Assisted Redaction and Anonymization of Scholarly Communications and Products (CARASCAP) developed and tested new software for identification and redaction/filtering of sensitive information within scholarly communications and products (SCAP). Most modern redaction software is built using the same set of core technologies: typically some combination of a document parser, optical character recognition, natural language processing (NLP) to identify entities of interest, pattern libraries to match common private and individually identifying information (PII), and custom string matching. Improving the performance of these products is generally iterative, e.g., increasing document format coverage, expanding pattern libraries, or using improved NLP models. While performance and coverage are important factors for institutional workflows, they do not answer the more fundamental questions of how and why certain redaction actions are performed. CARASCAP introduced a new approach that adds explainability to the redaction process by recording metadata that links redaction actions to specific rules and model behaviors. One can then use this information to validate software behaviors, compare those behaviors to actions performed by humans redacting manually, and train models tuned to specific redaction behaviors for collections of similar documents.
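
To make the idea of linking redaction actions to specific rules more concrete, the following is a minimal Python sketch written for illustration only. It is not taken from the CARASCAP codebase: the rule names, the regular expressions, the RedactionRecord class, and the redact function are all hypothetical, and it covers only the pattern-library case (no NLP models). Each replacement is logged with the identifier of the rule that triggered it and the character offsets in the original text, which is the kind of metadata that could later be compared against manual redactions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactionRecord:
    """Provenance for one redaction: which rule fired and where (hypothetical schema)."""
    rule_id: str
    start: int          # offset into the ORIGINAL text
    end: int            # exclusive end offset into the ORIGINAL text
    replacement: str


# Hypothetical pattern library; a real one would be far larger and more careful.
RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
}


def redact(text, rules=RULES):
    """Return (redacted_text, records), one record per replacement made."""
    # Gather every match from every rule, tagged with the rule that produced it.
    matches = []
    for rule_id, pattern in rules.items():
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), rule_id))
    matches.sort()

    pieces, records, cursor = [], [], 0
    for start, end, rule_id in matches:
        if start < cursor:
            continue  # skip a match that overlaps one already redacted
        replacement = f"[{rule_id.upper()}]"
        pieces.append(text[cursor:start])
        pieces.append(replacement)
        records.append(RedactionRecord(rule_id, start, end, replacement))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), records


if __name__ == "__main__":
    redacted, log = redact("Contact jane@example.org or (555)010-0000.")
    print(redacted)
    for record in log:
        print(record)
```

The records deliberately store offsets rather than the matched text, so the log itself does not leak the information being redacted; whether that trade-off suits a given workflow is a design choice this sketch does not settle.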

Products of the project (available at https://github.com/carascap) include software to support text analysis, reporting, and redaction workflows. We have also developed Jupyter notebooks and sample data to demonstrate how the software can be integrated into various workflows. Application of NLP often requires switching between models, which was challenging with existing software. We therefore also produced a command-line tool to view, install, and upgrade models for those using spaCy, a powerful tool for named-entity recognition.


Last Modified: 01/22/2024
Modified by: Christopher Lee

Please report errors in award information by writing to: awardsearch@nsf.gov.
