Award Abstract # 1451380
EAGER: Example-based Audio Editing

NSF Org: IIS, Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF ILLINOIS
Initial Amendment Date: August 21, 2014
Latest Amendment Date: August 21, 2014
Award Number: 1451380
Award Instrument: Standard Grant
Program Manager: Ephraim Glinert
IIS, Division of Information & Intelligent Systems
CSE, Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2014
End Date: May 31, 2018 (Estimated)
Total Intended Award Amount: $150,000.00
Total Awarded Amount to Date: $150,000.00
Funds Obligated to Date: FY 2014 = $150,000.00
History of Investigator:
  • Paris Smaragdis (Principal Investigator)
    paris@illinois.edu
Recipient Sponsored Research Office: University of Illinois at Urbana-Champaign
506 S WRIGHT ST
URBANA
IL  US  61801-3620
(217)333-2187
Sponsor Congressional District: 13
Primary Place of Performance: University of Illinois at Urbana-Champaign
IL  US  61820-7473
Primary Place of Performance Congressional District: 13
Unique Entity Identifier (UEI): Y8CWNJRCNN91
Parent UEI: V2PHZ2CSCH63
NSF Program(s): HCC-Human-Centered Computing
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7367, 7916
Program Element Code(s): 736700
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Contemporary users of technology interact with photos and video by editing them, but they still use audio only passively, by capturing, storing, transmitting, and playing it back. These two different ways of interacting with contemporary media persist because current software tools make it very difficult for general users to manipulate audio. This project will develop novel technologies that make audio editing and manipulation accessible to non-experts. These tools will allow a user to guide the software by vocalizing the desired edits, providing before/after examples of the desired effects, or presenting other recordings that exhibit the desired audio manipulations. For example, a user might ask the software to equalize a sound by using a booming voice to indicate more bass or a nasal tone for the middle frequencies; to add echoes by mimicking the desired effect, uttering "hello, hello, hello ..." with each successive "hello" at a lower volume; or to add reverb by providing example recordings with the desired reverb. Making it easier for general computer users to manipulate and edit audio recordings can impact many fields, such as medical bioacoustics, seismic signal analysis, underwater monitoring, audio forensics, surveillance applications, oil exploration probing, conversational data gathering, and mechanical vibration measurement. The goal of this project is to provide novel, practical audio tools that allow non-expert practitioners from these fields to easily achieve the audio manipulations they require.

The project will exploit modern signal processing and machine learning techniques to produce more intuitive interfaces that help people accomplish what are currently difficult audio editing tasks. This will include developing novel estimators to extract editing-intent parameters directly from audio recordings. The project will focus on three different editing operations: equalization, noise control, and echo/reverberation. A number of different approaches will be explored for each operation. For example, for equalization, one approach will have users select before and after sounds to identify their desired modification, and the system will then use spectral deconvolution estimations to directly compute the transfer function that maps the spectrum of the before sound to that of the after sound, and apply that function to the audio recording that the user is editing. For noise control, one approach will have users vocalize what types of noise to remove, and then match the user's input with the corresponding component in the recording that is being edited by using low-rank spectral decomposition. For reverb and echo, one approach will have users utter "one, two, three, ..." to illustrate the desired number of repetitions, temporal spacing, and attenuation between echoes, and then use voice detection measurements to extract the echo parameters, while correcting for vocalization errors such as random inconsistency in the echo spacing. The project will create new theories of how human guidance and automated audio-intelligent processing can work in tandem to solve fundamental and practical problems.
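To make the equalization-by-example approach concrete, the following is a minimal sketch (in Python with NumPy/SciPy, an assumption; the function names are illustrative and not taken from the project) of how a before/after example pair could yield a per-band gain curve that is then applied to the recording being edited:

import numpy as np
from scipy.signal import stft, istft

def long_term_spectrum(x, fs, nfft=2048):
    # Average magnitude spectrum across STFT frames.
    _, _, X = stft(x, fs=fs, nperseg=nfft)
    return np.mean(np.abs(X), axis=1)

def match_equalization(target, before, after, fs, nfft=2048, eps=1e-8):
    # Estimate the per-band gain that maps the "before" spectrum to the
    # "after" spectrum (a magnitude-domain deconvolution), then apply it
    # to every frame of the target recording.
    gain = long_term_spectrum(after, fs, nfft) / (long_term_spectrum(before, fs, nfft) + eps)
    _, _, T = stft(target, fs=fs, nperseg=nfft)
    _, y = istft(T * gain[:, None], fs=fs, nperseg=nfft)
    return y

# Usage, assuming mono float arrays at a shared sample rate fs:
# edited = match_equalization(recording, before_example, after_example, fs)

In practice the estimated transfer function would be smoothed across frequency and the output level renormalized, but the sketch shows the core idea: the edit is specified entirely by the pair of example sounds rather than by expert-set parameters.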

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Prem Seetharaman, Gautham J. Mysore, Paris Smaragdis, Bryan Pardo, "Blind Estimation of the Speech Transmission Index for Speech Quality Prediction," IEEE ICASSP, 2018.
Shrikant Venkataramani, Paris Smaragdis, Gautham Mysore, "AutoDub: Automatic Redubbing for Voiceover Editing," UIST, 2017.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Making an audio or speech recording sound good, or even acceptable, is not easy. From YouTube bloggers, to politicians speaking on stage, to local TV stations, we often hear poor-quality audio that is either hard to parse or just bad enough to prompt us to change the channel.

Being able to make clean and appealing audio recordings is key to good communication, especially today, when anyone can produce content and instantly share it globally. Producing appealing recordings allows scientists to better disseminate their results, journalists to maintain an interested audience, and budding artists and media companies to get started, and it serves the vast number of people who share their thoughts online, whether they are teaching courses, talking to family, or producing content for mass consumption.

Making a recording of palatable quality requires a strong background in recording technology, an experienced ear, and plenty of time. Unfortunately, these are luxuries available only to high-end productions, leaving the rest of us struggling to obtain the sound we want.

In this project we examined alternatives to existing audio recording technology that can increase recording quality without requiring the user to be a recording expert. Our approach lets the user provide example recordings that point an intelligent system toward how they want their recording to sound.

For example, if a voice recording sounds too nasal, the user can point our system to a recording of James Earl Jones and have the recording automatically equalized to match his booming voice. Providing an example of a richly reverberant recording from a cathedral would give the recording the same lush quality. So instead of requiring that a user have the expertise to properly position a microphone, equalize a recording, and adjust for noise and echoes, we let the user provide examples of the desired sound quality and automatically match them.

We show that this idea carries over to various types of operations. We can use examples to apply global effects, e.g. to remove or match noise and reverberation or to match the equalization of a recording, or we can use the same approach for local editing. We have successfully shown how a user can easily replace a small mistake in a long recording by simply speaking the sentence they want fixed; our system automatically finds the right part and replaces it.
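As a rough illustration of the matching step in that local edit, here is a sketch under the assumption of a simple energy-envelope correlation; the project's actual matching is more sophisticated, and every name below is hypothetical:

import numpy as np
from scipy.signal import stft

def frame_envelope(x, fs, nfft=1024):
    # Per-frame log energy computed from an STFT magnitude spectrogram.
    _, _, X = stft(x, fs=fs, nperseg=nfft)
    return np.log(np.sum(np.abs(X), axis=0) + 1e-8)

def locate_match(recording, correction, fs, nfft=1024):
    # Slide the correction's envelope over the recording's envelope and
    # return the sample index of the best normalized-correlation match.
    env_r = frame_envelope(recording, fs, nfft)
    env_c = frame_envelope(correction, fs, nfft)
    n = len(env_c)
    scores = [
        np.dot(env_r[i:i + n] - env_r[i:i + n].mean(), env_c - env_c.mean())
        / (np.std(env_r[i:i + n]) * np.std(env_c) * n + 1e-8)
        for i in range(len(env_r) - n + 1)
    ]
    return int(np.argmax(scores)) * (nfft // 2)  # scipy's default STFT hop

# start_sample = locate_match(long_recording, spoken_fix, fs)

Once the matching region is located, the newly spoken sentence can be spliced in, after its level and equalization are matched to the surrounding audio as described above.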


In the process of designing these features we also needed to characterize multiple attributes of recording quality. In the final year of this project we developed a deep learning algorithm that characterizes various attributes of a recording and provides real-time feedback, allowing the user to quickly address problems. For example, the system can detect that the user is too far from the microphone and prompt them to move closer.

What we have shown in this project is that there is a lot of room to rethink how users make recordings and to bypass the need for technical expertise. Just as cameras are becoming smart enough to take good pictures even when misused by novices, we have shown that we can do the same with audio recordings, and we have provided a glimpse of new ways to design audio recording systems. Thanks to a fruitful collaboration with industry partners, we expect some of this technology to become broadly available within the next couple of years.


Last Modified: 09/15/2018
Modified by: Paris Smaragdis

