Award Abstract # 1420971
CHS: Small: Robust Interactive Audio Source Separation

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: NORTHWESTERN UNIVERSITY
Initial Amendment Date: August 1, 2014
Latest Amendment Date: June 4, 2015
Award Number: 1420971
Award Instrument: Standard Grant
Program Manager: Ephraim Glinert
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2014
End Date: September 30, 2018 (Estimated)
Total Intended Award Amount: $498,736.00
Total Awarded Amount to Date: $514,261.00
Funds Obligated to Date: FY 2014 = $498,736.00
FY 2015 = $15,525.00
History of Investigator:
  • Bryan Pardo (Principal Investigator)
    pardo@northwestern.edu
Recipient Sponsored Research Office: Northwestern University
633 CLARK ST
EVANSTON
IL  US  60208-0001
(312)503-7955
Sponsor Congressional District: 09
Primary Place of Performance: Northwestern University
2145 Sheridan Road, Tech
Evanston
IL  US  60208-3109
Primary Place of Performance Congressional District: 09
Unique Entity Identifier (UEI): EXZVPWZBLUE8
Parent UEI:
NSF Program(s): HCC-Human-Centered Computing
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVITIES
01001516DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7367, 7923, 9251
Program Element Code(s): 736700
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Algorithms that separate audio sources have many potential uses, such as extracting important audio data from historic recordings or helping people with hearing impairments select what to amplify and what to suppress in their hearing aids. Computer processing of audio content can potentially isolate the sound sources of interest and improve audio clarity whenever the content exhibits interference from multiple sound sources, for example extracting a single voice of interest from a room full of voices. However, current sound source identification and separation methods are reliable only when there is a single predominant sound. This project will develop the science and technology needed to more easily isolate a single sound source from audio content with multiple competing sources, and to build interactive computer systems that guide users through an interactive source separation process, permitting the separation and recombination of sound sources in a manner that is beyond the reach of existing audio software. The outcomes of the project will improve speech recognition in environments with multiple talkers, will be useful for scientific inquiries such as biodiversity monitoring through automated analysis of field recordings, and will be broadly useful whenever manual tagging of audio data is not practical.

While many computational auditory scene analysis algorithms have been proposed to separate audio scenes into individual sources, current methods are brittle and difficult to use, and as a result they have not been broadly adopted by potential users. The methods are brittle because each algorithm relies on a single cue to separate sources; if that cue is not reliable, the method fails. The methods are difficult to use because there is no way to predict which audio scenes a given algorithm is likely to work on, so the user does not know which method to apply in any given case, and because their control parameters are hard to understand for users who lack expertise in signal processing. This project will research how to integrate multiple source separation algorithms into a single framework, and how to improve ease of use by exploring interfaces that let users interactively define what they wish to isolate in an audio scene and that let the system guide users in selecting a tool and setting its parameters. The project will produce an open-source audio source separation tool that embodies these research outcomes.
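As a hedged illustration of the guidance idea described above (not the project's actual implementation), one way to recommend a separation method is to train a regression model that predicts each algorithm's expected separation quality from simple features of the audio scene, then apply the algorithm with the highest prediction. The feature set, algorithm names, and training data below are made-up placeholders.

    # Hypothetical sketch: predict per-algorithm separation quality from scene features
    # and recommend the best-scoring algorithm. All data here is synthetic.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    algorithms = ["repetition_cue", "pitch_cue", "spatial_cue"]      # assumed cue-based methods
    X_train = rng.random((200, 4))                 # e.g., beat strength, harmonicity, channel correlation, SNR
    y_train = rng.random((200, len(algorithms)))   # measured quality (e.g., SDR) of each algorithm on each scene

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

    x_new = rng.random((1, 4))                     # features of a new, unseen audio scene
    predicted_quality = model.predict(x_new)[0]
    print("recommended method:", algorithms[int(np.argmax(predicted_quality))])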

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


E. Manilow, P. Seetharaman, and B. Pardo, "The Northwestern University Source Separation Library," Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), Paris, France, September 23-27, 2018.
E. Manilow and B. Pardo, "Leveraging Repetition to Do Audio Imputation," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2017), New Paltz, NY, USA, October 15-18, 2017.
E. Manilow, P. Seetharaman, F. Pishdadian, and B. Pardo, "Predicting Algorithm Efficacy for Adaptive Multi-Cue Source Separation," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2017), New Paltz, NY, USA, October 15-18, 2017.
F. Pishdadian, B. Pardo, and A. Liutkus, "A Multi-Resolution Approach to Common Fate-Based Audio Separation," Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. doi:10.1109/ICASSP.2017.7952219
J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, "VocalSet: A Singing Voice Dataset," Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), Paris, France, September 23-27, 2018.
M. Cartwright, B. Pardo, and G. Mysore, "Fast and Easy Crowdsourced Perceptual Audio Evaluation," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
M. Cartwright, B. Pardo, G. Mysore, and M. Hoffman, "Fast and Easy Crowdsourced Perceptual Audio Evaluation," Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. doi:10.1109/ICASSP.2016.7471749
M. Cartwright, B. Pardo, and G. Mysore, "Crowdsourced Pairwise Comparison for Source Separation Evaluation," Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, April 15-20, 2018.
P. Seetharaman and B. Pardo, "Simultaneous Separation and Segmentation in Layered Music," Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), 2016.
P. Seetharaman, F. Pishdadian, and B. Pardo, "Music/Voice Separation Using the 2D Fourier Transform," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2017), New Paltz, NY, USA, October 15-18, 2017.
Z. Rafii, Z. Duan, and B. Pardo, "Combining Rhythm-Based and Pitch-Based Methods for Background and Melody Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, 2014. doi:10.1109/TASLP.2014.2354242

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Overview: A fundamental problem in Computer Audition is that of audio source separation. This is the process of extracting elements of interest (like individual voices) from an audio scene (like a cocktail party).  While many algorithms have been proposed to separate audio scenes into individual sources, these methods are brittle and difficult to use. Because of this, potential users have not broadly adopted the technology. Audio source separation methods are brittle because each algorithm relies on a single cue to separate sources. When the cue is not reliable, the method fails. Methods are difficult to use because algorithms cannot predict which audio scenes they are likely to work on. Therefore, the user does not know which method to apply in any given case. They are also difficult to use because their control parameters are hard to understand for those not expert in signal processing. 

In this work we addressed algorithm brittleness by developing methods to integrate multiple source separation algorithms into a single framework. We also developed new interfaces that let users easily and interactively define what they wish to separate from an audio scene, and methods by which algorithms automatically learn evaluation measures for audio quality, so that a system can guide the user on tool selection and parameter settings.
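The following is a minimal numpy sketch, under assumed shapes and weights, of the single-framework idea: several cue-based algorithms each contribute a soft time-frequency mask for the target source, and the framework blends them with per-cue confidence weights before applying the result to the mixture spectrogram. It is illustrative only, not code from the project.

    # Illustrative multi-cue mask combination; masks and confidences are placeholders.
    import numpy as np

    freq_bins, frames = 513, 400
    mixture_stft = (np.random.randn(freq_bins, frames)
                    + 1j * np.random.randn(freq_bins, frames))     # stand-in for a mixture STFT

    mask_repetition = np.random.rand(freq_bins, frames)    # mask produced by a repetition cue
    mask_pitch = np.random.rand(freq_bins, frames)          # mask produced by a pitch/harmonicity cue
    confidence = np.array([0.7, 0.3])                       # learned or user-set reliability of each cue

    combined_mask = (confidence[0] * mask_repetition + confidence[1] * mask_pitch) / confidence.sum()
    target_stft = combined_mask * mixture_stft               # estimate of the user's chosen source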

The outcomes of this research were: (1) Audio source separation algorithms that adaptively combine multiple cues to robustly separate sounds in cases where single-cue approaches fail; (2) Interfaces that let the user guide the separation process toward a goal, without having to understand the complex internals of the algorithms; (3) Methods to automatically learn evaluation measures from past user interactions so systems can suggest approaches and settings likely to work on the current interaction; and (4) Open-source audio source separation tools that embody these outcomes.

This work is intellectually transformative in bringing together techniques from the disparate fields of signal processing and human-computer interaction in an integrated and synergistic manner. The work advanced knowledge by developing a new unified framework for multi-approach source separation, as well as new approaches that deeply integrate cutting-edge signal processing with interactive interfaces that learn from users. This is of interest to researchers in signal processing, artificial intelligence, speech recognition, multimedia processing, and human-computer interaction.

This work has broad impact because current sound tagging methods are reliable only when there is a single predominant sound. Robust source separation can facilitate segmentation of audio into meaningful chunks that can be recognized and transcribed with existing techniques. This can be transformative for speech recognition in environments with multiple talkers, for biodiversity monitoring (automatic identification of species in field recordings containing multiple concurrent species), and for search through existing audio/video collections where manual tagging of the data is not practical.

Algorithms that separate audio sources could be used to remix existing legacy audio content, upmix stereo to surround sound, or help the hearing-impaired select what to amplify and what to suppress in audio. Musicians could remix and edit recordings without needing an individual microphone on each musician. More broadly, source separation algorithms can be applied anywhere signals exhibit interference from multiple sources, including biomedical imaging and telecommunications.

The work funded by this grant has resulted in 15 peer-reviewed publications presented in internationally respected journals and conferences. These publications can be found on the Northwestern University Interactive Audio Lab website: music.cs.northwestern.edu

In addition to publications, this grant has funded a number of software products. The reader can find our software on the GitHub open-source software repository (www.github.com). The three primary repositories are the Web Unmixing Toolbox (WUT), the Northwestern University Source Separation Library (nussl), and the Crowdsourced Audio Quality Evaluation (CAQE) Toolkit.

The Web Unmixing Toolbox (WUT) is an open-source, browser-based, interactive source separation application for end users. It is the first application that lets the end user edit audio using the two-dimensional Fourier transform of the spectrogram.
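To make the representation concrete, here is a rough numpy/scipy sketch of the kind of editing WUT exposes: take the two-dimensional Fourier transform of a magnitude spectrogram, attenuate a selected region, and resynthesize with the mixture's phase. The STFT settings and the edited region are arbitrary assumptions, not WUT's actual defaults.

    # Rough sketch of editing audio via the 2D Fourier transform of its spectrogram.
    import numpy as np
    from scipy.signal import stft, istft

    rate = 16000
    audio = np.random.randn(rate * 5)                      # stand-in for a loaded mixture
    _, _, mixture_stft = stft(audio, fs=rate, nperseg=1024)

    magnitude = np.abs(mixture_stft)
    fourier_2d = np.fft.fft2(magnitude)                    # 2D Fourier transform of the spectrogram

    edit_mask = np.ones_like(fourier_2d)                   # region the user "paints" to suppress
    edit_mask[:, :5] = 0.0                                 # placeholder selection
    edited_magnitude = np.real(np.fft.ifft2(fourier_2d * edit_mask))

    # Re-impose the mixture phase and resynthesize the edited audio.
    _, edited_audio = istft(edited_magnitude * np.exp(1j * np.angle(mixture_stft)),
                            fs=rate, nperseg=1024)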

The Northwestern University Source Separation Library (nussl) is a flexible, object-oriented Python audio source separation library created by the PI's lab. It provides implementations of common source separation algorithms as well as an easy-to-use framework for prototyping and adding new algorithms.
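A short usage sketch follows, paraphrased from memory of the library's early interface; class and method names may differ in current releases, so the GitHub repository should be treated as the authoritative reference, and 'mixture.wav' is a placeholder file.

    # Assumed nussl usage (early API); verify names against the current documentation.
    import nussl

    mixture = nussl.AudioSignal('mixture.wav')           # load a mixture to separate
    repet = nussl.Repet(mixture)                         # repetition-based background/foreground separation
    repet.run()
    background, foreground = repet.make_audio_signals()
    foreground.write_audio_to_file('foreground_estimate.wav')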

The Crowdsourced Audio Quality Evaluation (CAQE) Toolkit is a software package that enables researchers to easily run perceptual audio quality evaluations over the web.
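As one illustration of how pairwise-comparison data of the kind CAQE collects can be summarized (this is not CAQE's own code), the sketch below fits a simple Bradley-Terry model to made-up listener preferences and ranks the compared systems.

    # Toy Bradley-Terry fit over invented pairwise preference counts.
    import numpy as np

    systems = ["algorithm_A", "algorithm_B", "algorithm_C"]
    # wins[i, j] = number of listeners who preferred system i over system j
    wins = np.array([[0., 12., 18.],
                     [8.,  0., 15.],
                     [2.,  5.,  0.]])

    scores = np.ones(len(systems))
    for _ in range(200):                                  # fixed-point (minorization-maximization) updates
        for i in range(len(systems)):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (scores[i] + scores[j])
                        for j in range(len(systems)) if j != i)
            scores[i] = total_wins / denom
        scores /= scores.sum()

    for name, score in sorted(zip(systems, scores), key=lambda p: -p[1]):
        print(f"{name}: {score:.3f}")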

 


Last Modified: 12/30/2018
Modified by: Bryan A Pardo
