
NSF Org: | IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 21, 2014 |
Latest Amendment Date: | August 21, 2014 |
Award Number: | 1451380 |
Award Instrument: | Standard Grant |
Program Manager: | Ephraim Glinert, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2014 |
End Date: | May 31, 2018 (Estimated) |
Total Intended Award Amount: | $150,000.00 |
Total Awarded Amount to Date: | $150,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 506 S WRIGHT ST URBANA IL US 61801-3620 (217)333-2187 |
Sponsor Congressional District: |
|
Primary Place of Performance: | IL US 61820-7473 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | HCC-Human-Centered Computing |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Contemporary users of technology interact with photos and video by editing them, but still use audio only passively, by capturing, storing, transmitting, and playing it back. This difference persists because current software tools make it very difficult for general users to manipulate audio. This project will develop novel technologies that will make audio editing and manipulation accessible to non-experts. These tools will allow a user to guide the software by vocalizing the desired edits, by providing before/after examples of the desired effects, or by presenting other recordings that exhibit the desired audio manipulations. For example, a user might ask the software to equalize sounds by using a booming voice for more bass, or a nasal tone for middle frequencies; to add echoes by uttering "hello, hello, hello ..." with each successive "hello" at a lower volume; or to add reverb by providing examples of recordings with the desired reverb. Making it easier for general computer users to manipulate and edit audio recordings can impact many fields, such as medical bioacoustics, seismic signal analysis, underwater monitoring, audio forensics, surveillance applications, oil exploration probing, conversational data gathering, and mechanical vibration measuring. The goal of this project is to provide novel and practical audio tools that will allow non-expert practitioners from these fields to easily achieve the audio manipulations they require.
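As an illustration of the echo example above, the following minimal sketch shows what the "hello, hello, hello ..." request amounts to once its parameters are known: a number of repetitions, a spacing between them, and a volume drop from one repetition to the next. The function name `add_echoes` and its parameters are hypothetical and chosen for this sketch only; the award text does not describe the project's actual implementation.

```python
import numpy as np

def add_echoes(signal, sample_rate, spacing_s, num_echoes, attenuation):
    """Mix delayed, progressively quieter copies of `signal` into one output.

    Hypothetical helper for illustration only:
      spacing_s   - time between successive echoes, in seconds
      num_echoes  - number of repetitions after the original sound
      attenuation - amplitude ratio between one echo and the next (0..1)
    """
    signal = np.asarray(signal, dtype=float)
    delay = int(round(spacing_s * sample_rate))  # spacing in samples
    out = np.zeros(len(signal) + delay * num_echoes)
    for k in range(num_echoes + 1):  # k = 0 is the original sound
        start = k * delay
        out[start:start + len(signal)] += (attenuation ** k) * signal
    return out
```

For instance, calling `add_echoes(x, 16000, 0.3, 3, 0.5)` on a recording `x` sampled at 16 kHz would add three echoes 0.3 seconds apart, each at half the amplitude of the previous one.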
The project will exploit modern signal processing and machine learning techniques to produce more intuitive interfaces that help people accomplish what are currently difficult audio editing tasks. This will include developing novel estimators to extract editing-intent parameters directly from audio recordings. The project will focus on three different editing operations: equalization, noise control, and echo/reverberation. A number of different approaches will be explored for each operation. For example, for equalization, one approach will have users select before and after sounds to identify their desired modification, and the system will then use spectral deconvolution estimations to directly compute the transfer function that maps the spectrum of the before sound to that of the after sound, and apply that function to the audio recording that the user is editing. For noise control, one approach will have users vocalize what types of noise to remove, and then match the user's input with the corresponding component in the recording that is being edited by using low-rank spectral decomposition. For reverb and echo, one approach will have users utter "one, two, three, ..." to illustrate the desired number of repetitions, temporal spacing, and attenuation between echoes, and then use voice detection measurements to extract the echo parameters, while correcting for vocalization errors such as random inconsistency in the echo spacing. The project will create new theories of how human guidance and automated audio-intelligent processing can work in tandem to solve fundamental and practical problems.
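The before/after equalization approach described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's method: it estimates only a per-frequency magnitude gain (the ratio of the average spectra of the "after" and "before" sounds) rather than a full transfer function, it assumes both example sounds are at least `n_fft` samples long, and the names `estimate_eq_gain` and `apply_eq` are hypothetical.

```python
import numpy as np

def estimate_eq_gain(before, after, n_fft=1024, eps=1e-8):
    """Per-frequency gain that maps the average spectrum of `before`
    onto that of `after` (magnitude only, a simplifying assumption)."""
    def average_magnitude(x):
        x = np.asarray(x, dtype=float)
        hop = n_fft // 2
        window = np.hanning(n_fft)
        # Windowed, half-overlapping frames; requires len(x) >= n_fft.
        frames = [x[i:i + n_fft] * window
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.mean(np.abs(np.fft.rfft(frames, axis=1)), axis=0)
    return average_magnitude(after) / (average_magnitude(before) + eps)

def apply_eq(x, gain):
    """Apply the gain curve to a whole recording in the frequency domain."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    # Interpolate the gain curve to the resolution of the full-length FFT.
    src = np.linspace(0.0, 1.0, len(gain))
    dst = np.linspace(0.0, 1.0, len(spectrum))
    return np.fft.irfft(spectrum * np.interp(dst, src, gain), n=len(x))
```

A user-facing tool would call `estimate_eq_gain` once on the example pair and then `apply_eq` on the recording being edited. Processing the whole recording with a single FFT keeps the sketch short; a practical system would more likely filter in short overlapping blocks.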
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Making an audio or speech recording sound good, or even OK, is not easy. From YouTube bloggers, to politicians speaking on stage, to local TV stations, we often hear poor-quality audio that is either hard to understand or just bad enough to make us change the channel.
Being able to make clean and appealing audio recordings is key to good communication, especially today, when anyone can produce content and instantly share it globally. Producing appealing recordings helps scientists disseminate their results, journalists keep an audience interested, and budding artists and media companies get started. It also serves the large number of people who share their thoughts online, whether by teaching courses, talking to family, or producing content for mass consumption.
Making a recording of palatable quality requires a strong background in recording technology, an experienced ear, and plenty of time. Unfortunately, these are luxuries available only to high-end productions, leaving the rest of us struggling to obtain the sound we want.
In this project we examined alternatives to existing audio recording technology that can improve recording quality without requiring the user to be a recording expert. Our approach lets the user supply example recordings that point an intelligent system toward how they want their own recording to sound.
For example, if a voice recording sounds too nasal, the user can point our system to a recording of James Earl Jones and have it automatically equalized to match his boomy voice. Providing an example of a richly reverberant recording from a cathedral would give the recording the same lush quality. So instead of requiring the user to have the expertise to properly position a microphone, equalize a recording, and adjust for noise and echoes, we allow the user to provide examples of the desired sound quality and automatically match them.
We show that this idea carries over to various types of operations. We can use examples to apply global effects, e.g., to remove or match noise and reverberation and to match the equalization of a recording, or we can use this approach for local editing. We have successfully shown how a user can easily replace a small mistake in a long recording by simply speaking the sentence they want fixed; our system automatically matches it to the right part of the recording and replaces that part.
In the process of designing these features we also needed to characterize multiple recording quality attributes. In the final year of this project we developed a deep learning algorithm that can characterize various attributes of a recording and provide real-time feedback to the user. This allows a user to act quickly to address problems in a recording. For example, the system can detect if the user is too far from the microphone and prompt them to move closer.
What we have shown in this project is that there is a lot of room to rethink how users make recordings and to bypass the need for technical expertise. Just as cameras are becoming smart enough to take good pictures even when misused by novices, we have shown that we can do the same with audio recordings, and we have provided a glimpse of new ways to design audio recording systems. Thanks to a fruitful collaboration with industry partners, we expect to see some of this technology become broadly available within the next couple of years.
Last Modified: 09/15/2018
Modified by: Paris Smaragdis